
Add GPU implementation of QR factorization [wip] #975

Draft · wants to merge 2 commits into main from nicolov-qr-gpu

Conversation

@nicolov (Contributor) commented Apr 9, 2024

Proposed changes

Add a GPU implementation of QR factorization using the blocked Householder reflection algorithm; see:

  • Andrew Kerr, Dan Campbell, Mark Richards, QR Decomposition on GPUs
  • Jan Priessnitz, GPU acceleration of matrix factorization

Here is the reference code in numpy for the algorithm.
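
For orientation, here is a minimal sketch of the unblocked Householder QR in plain C++ (an illustration only, not this PR's code; the GPU implementation uses the blocked variant from the papers above). It uses column-major storage to match the indexing in this diff:

#include <cmath>
#include <vector>

// In-place Householder QR: on return the upper triangle of `a` (m x n,
// column-major) holds R, and `q` (passed in as the m x m identity) holds Q.
void householder_qr(std::vector<float>& a, std::vector<float>& q, int m, int n) {
  auto at = [m](std::vector<float>& x, int i, int j) -> float& {
    return x[i + static_cast<size_t>(j) * m]; // column-major index
  };
  std::vector<float> v(m, 0.0f);
  for (int k = 0; k < n && k < m; k++) {
    // Householder vector v = x - alpha * e_k, chosen to zero a[k+1.., k].
    float norm = 0;
    for (int i = k; i < m; i++) norm += at(a, i, k) * at(a, i, k);
    norm = std::sqrt(norm);
    float alpha = at(a, k, k) > 0 ? -norm : norm; // sign avoids cancellation
    float vnorm2 = 0;
    for (int i = k; i < m; i++) {
      v[i] = at(a, i, k) - (i == k ? alpha : 0.0f);
      vnorm2 += v[i] * v[i];
    }
    if (vnorm2 == 0) continue; // column already zero below the diagonal
    // Apply H_k = I - 2 v v^T / (v^T v) to A from the left.
    for (int j = k; j < n; j++) {
      float dot = 0;
      for (int i = k; i < m; i++) dot += v[i] * at(a, i, j);
      float s = 2 * dot / vnorm2;
      for (int i = k; i < m; i++) at(a, i, j) -= s * v[i];
    }
    // Accumulate Q <- Q H_k, so that Q = H_0 H_1 ... and A = Q R.
    for (int i = 0; i < m; i++) {
      float dot = 0;
      for (int j = k; j < m; j++) dot += at(q, i, j) * v[j];
      float s = 2 * dot / vnorm2;
      for (int j = k; j < m; j++) at(q, i, j) -= s * v[j];
    }
  }
}

A quick sanity check is to verify that Q R reproduces the input and that Q is orthogonal, e.g. against numpy.linalg.qr.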

Left todo

  • clean up handling of batched inputs: slice the inputs/outputs and only pass the slice to the algorithm. Temporaries need only be sized for a single input matrix.
  • share some constants between the kernel and the driver function.
  • consider merging the two kernels to compute W.
  • benchmark and optimize grid/block sizes.

Checklist

Put an x in the boxes that apply.

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

@nicolov marked this pull request as draft on April 11, 2024 at 14:32
@nicolov force-pushed the nicolov-qr-gpu branch 2 times, most recently from 67816c3 to 5c205fb on April 11, 2024 at 14:44
mlx/backend/metal/qrf.cpp (outdated)
Comment on lines 122 to 200
auto compute_encoder =
    metal::CommandEncoder(command_buffer->computeCommandEncoder());
Member: Remove this, see above.

Comment on lines 127 to 194
compute_encoder.set_input_array(betas, 0);
compute_encoder.set_input_array(Y, 1);
compute_encoder.set_input_array(a, 2);
compute_encoder.set_input_array(Wp, 3);
Member: Thanks for rebasing, this is good!

const MTL::Size threads_per_threadgroup(8, 1, 1);
compute_encoder->dispatchThreads(
    threads_per_grid, threads_per_threadgroup);
compute_encoder->endEncoding();
Member: Don't end the encoding. MLX device will handle this.

Comment on lines 189 to 240
auto command_buffer = device.new_command_buffer(stream.index);

auto compute_encoder =
    metal::CommandEncoder(command_buffer->computeCommandEncoder());
Member: Same comments as above.

Comment on lines 211 to 260
compute_encoder->endEncoding();

device.commit_command_buffer(stream.index);
Member: Don't end encoding or commit here. MLX device will handle it.

@nicolov (Contributor, Author) commented Apr 15, 2024

@awni I tried to apply your comments and pushed 501b889 to avoid creating a new command buffer for each kernel, but I get:

-[AGXG13XFamilyCommandBuffer tryCoalescingPreviousComputeCommandEncoderWithConfig:nextEncoderClass:]:1015: failed assertion `A command encoder is already encoding to this command buffer'

@awni (Member) commented Apr 15, 2024

Did you manually make a command encoder from the command buffer? MLX manages an active command encoder, so you should not make one directly. Rather, call device.get_command_encoder() to get the active encoder.
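
A hedged sketch of that pattern, built from the calls already visible in this diff (the kernel variable and stream handling are placeholder assumptions, not MLX's exact API surface):

auto& d = metal::device(s.device);
// The active encoder is owned and managed by MLX; don't construct one manually.
auto& compute_encoder = d.get_command_encoder(s.index);
compute_encoder->setComputePipelineState(kernel); // kernel obtained elsewhere
compute_encoder.set_input_array(betas, 0);
compute_encoder.set_input_array(Y, 1);
compute_encoder->dispatchThreads(threads_per_grid, threads_per_threadgroup);
// No computeCommandEncoder(), endEncoding(), or commit_command_buffer():
// the MLX device ends the encoding and commits the buffer itself.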

@nicolov (Contributor, Author) commented Apr 15, 2024

> Rather, call device.get_command_encoder() to get the active encoder.

I also tried doing that in b979ccf, which just produces the wrong result.

@nicolov (Contributor, Author) commented Apr 15, 2024

I also tried tracing, and Xcode complains about redundant bindings. Should I somehow refactor how I bind buffers to the encoder?

[Screenshot: Xcode trace showing redundant buffer binding warnings]

@nicolov force-pushed the nicolov-qr-gpu branch 2 times, most recently from 6aecb32 to 729e011 on April 15, 2024 at 19:12
Comment on lines +585 to +581
for (int k = 0; k < batch_size; k++) {
  for (int i = 0; i < m; i++) {
    for (int j = 0; j < m; j++) {
      const auto batch_offset = m * n * k;
      const auto loc = batch_offset + colmajor_idx(i, j, m);
      q.data<float>()[loc] = i == j ? 1 : 0;
    }
  }
}
Member: That should probably be a kernel.
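
For illustration, a hedged Metal-kernel sketch of that suggestion (the kernel name, buffer indices, and launch shape are assumptions, not part of this PR), dispatched with one thread per element over an (m, m, batch_size) grid:

#include <metal_stdlib>
using namespace metal;

[[kernel]] void init_identity(
    device float* q [[buffer(0)]],
    const constant int& m [[buffer(1)]],
    const constant int& n [[buffer(2)]],
    uint3 tid [[thread_position_in_grid]]) {
  // Mirrors the host loop above: tid.z is the batch index k, and elements
  // are laid out column-major within each batch slice.
  const uint batch_offset = m * n * tid.z;
  const uint loc = batch_offset + tid.x + tid.y * m; // colmajor_idx(i, j, m)
  q[loc] = (tid.x == tid.y) ? 1.0f : 0.0f;
}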

@@ -172,7 +172,7 @@ inline size_t elem_to_loc(

template <typename T>
void print_subarray(std::ostream& os, const array& a, size_t index, int dim) {
int num_print = 3;
Member: Why did you change that?

Member: Just for debugging?

@nicolov (Contributor, Author) commented Apr 17, 2024

I fixed the code (needed to introduce one more kernel to ensure the atomics were synchronized properly across different threadgroups). It's a bit slow, so I'll try to improve it now:

device     n  time_ms
   cpu  2000    99.39
   gpu  2000   283.36
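
Background on the synchronization fix mentioned above: Metal's threadgroup_barrier only synchronizes threads within a single threadgroup, and there is no grid-wide barrier, so results accumulated across threadgroups via device-scope atomics are only guaranteed visible once the dispatch finishes. Splitting the work into two kernels uses the dispatch boundary as the grid-wide sync point. A hedged sketch of that pattern with the encoder calls used elsewhere in this PR (kernel names are placeholders), assuming dispatches on the encoder execute in order:

// Kernel 1: each threadgroup accumulates its partial result into a
// device-memory atomic.
compute_encoder->setComputePipelineState(accumulate_kernel);
compute_encoder->dispatchThreads(threads_per_grid, threads_per_threadgroup);
// The dispatch boundary is the grid-wide synchronization point: by the
// time kernel 2 runs, all of kernel 1's atomic writes are visible.
compute_encoder->setComputePipelineState(consume_kernel);
compute_encoder->dispatchThreads(threads_per_grid, threads_per_threadgroup);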

@awni (Member) commented Apr 25, 2024

@nicolov are you planning to come back to this?
