Reorder load and scaling code to allow latency hiding for block-wise scaled GEMMs #2600

htyu · 2024-05-17T02:13:58Z

Summary:
The compiler does not always reorder instructions well enough to hide latency, so I'm tweaking the kernel code here.

Previously, in the block-wise scaled GEMM kernel, the scaling logic followed `tl.load`, and the compiler was not able to move that logic ahead of the loads once the loads were pipelined. As a result, the scaling logic was stalled behind the load barriers, which is unnecessary because the two are independent. Since the barrier is only needed by the `dot` operation, I'm moving the scaling logic before the loads.
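A minimal sketch of the dependency argument, in plain Python rather than Triton. It is not the FBGEMM kernel: the helper names (`compute_scale`, `load_tiles`, `scaled_gemm`) are hypothetical, and the scaling scheme (one row scale and one column scale per K-block, multiplied into the partial product) is an assumption made for illustration. It only shows that the scale computation reads no tile data, so issuing it before or after the loads gives the same result. It does not model latency or barriers.

```python
def compute_scale(a_scale, b_scale, kb):
    """Combine row and column scales for K-block `kb`. Reads no tile data."""
    return [[sa[kb] * sb for sb in b_scale[kb]] for sa in a_scale]


def load_tiles(a, b, kb, block_k):
    """Stand-in for the pipelined loads of the A and B tiles for K-block `kb`."""
    k0 = kb * block_k
    a_tile = [row[k0:k0 + block_k] for row in a]
    b_tile = b[k0:k0 + block_k]
    return a_tile, b_tile


def scaled_gemm(a, b, a_scale, b_scale, block_k, scale_first):
    """Toy K-loop: a is M x K, b is K x N, scales are given per K-block.

    Returns the accumulator and the order in which the steps were issued.
    """
    m_dim, n_dim = len(a), len(b[0])
    acc = [[0.0] * n_dim for _ in range(m_dim)]
    order = []
    for kb in range(len(b) // block_k):
        if scale_first:
            # New ordering: scaling is issued ahead of the loads.
            scale = compute_scale(a_scale, b_scale, kb)
            order.append("scale")
            a_tile, b_tile = load_tiles(a, b, kb, block_k)
            order.append("load")
        else:
            # Old ordering: scaling sits behind the loads.
            a_tile, b_tile = load_tiles(a, b, kb, block_k)
            order.append("load")
            scale = compute_scale(a_scale, b_scale, kb)
            order.append("scale")
        # The dot is the only step that actually needs the loaded tiles.
        order.append("dot")
        for i in range(m_dim):
            for j in range(n_dim):
                partial = sum(a_tile[i][k] * b_tile[k][j] for k in range(block_k))
                acc[i][j] += partial * scale[i][j]
    return acc, order
```

Calling `scaled_gemm` with `scale_first=True` and `scale_first=False` on the same inputs returns identical accumulators; only the recorded issue order differs.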

{F1640448911}

While we should fix the compiler to be more robust, I'm making a source change as a workaround.

Differential Revision: D57473133

facebook-github-bot · 2024-05-17T02:14:06Z

This pull request was exported from Phabricator. Differential Revision: D57473133

netlify · 2024-05-17T02:14:14Z

✅ Deploy Preview for pytorch-fbgemm-docs ready!

Name	Link
🔨 Latest commit	`64df84e`
🔍 Latest deploy log	https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/664bb373158f600008dbb278
😎 Deploy Preview	https://deploy-preview-2600--pytorch-fbgemm-docs.netlify.app

facebook-github-bot · 2024-05-17T02:14:49Z

This pull request was exported from Phabricator. Differential Revision: D57473133

facebook-github-bot · 2024-05-20T22:49:54Z

This pull request has been merged in 3f4d79e.

facebook-github-bot added the cla signed label May 17, 2024

facebook-github-bot added the fb-exported label May 17, 2024

htyu force-pushed the export-D57473133 branch from bb81803 to 62b52f2 May 17, 2024 02:14

htyu force-pushed the export-D57473133 branch from 62b52f2 to 64df84e May 20, 2024 20:32

facebook-github-bot closed this in 3f4d79e May 20, 2024

facebook-github-bot added the Merged label May 20, 2024
