Reorder load and scaling code to allow latency hiding for block-wise scaled GEMMs #2600

Closed

@htyu commented May 17, 2024

Summary:
The compiler does not always reorder instructions well enough to hide latency, so I'm adjusting the kernel code directly.

Previously, in the block-wise scaled GEMM kernel, the scaling logic followed the `tl.load` calls, and the compiler could not move it ahead of the loads once the loads were pipelined. As a result, the scaling logic was blocked by the load barriers, which is unnecessary because the scaling logic and the loads are independent. Since only the `dot` operation needs the barrier, I'm moving the scaling logic before the loads.

{F1640448911}

While we should fix the compiler to be more robust, I'm making a source change as a workaround.
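
To illustrate why the reordering is safe, here is a minimal pure-Python model of one output tile of a block-wise scaled GEMM. It is a sketch under stated assumptions, not the FBGEMM kernel: `load_tile`, `dot`, and `scaled_gemm` are hypothetical stand-ins for `tl.load`, the `dot` operation, and the kernel's K loop, and each K block is assumed to carry a single scale per operand. The `trace` list records the issue order so the two variants can be compared.

```python
def load_tile(mat, rows, cols):
    """Hypothetical stand-in for tl.load: copy a sub-block of a nested list."""
    return [[mat[r][c] for c in cols] for r in rows]


def dot(a_tile, b_tile):
    """Hypothetical stand-in for the dot operation: plain tile matmul."""
    inner = len(b_tile)
    return [[sum(a_tile[i][k] * b_tile[k][j] for k in range(inner))
             for j in range(len(b_tile[0]))]
            for i in range(len(a_tile))]


def scaled_gemm(a, b, a_scale, b_scale, block_k, scale_first):
    """Accumulate one output tile, applying one scale per K block.

    scale_first=False models the old order (loads, then scaling logic).
    scale_first=True models the new order (scaling logic, then loads).
    Assumes K is a multiple of block_k.
    """
    m, n = len(a), len(b[0])
    acc = [[0.0] * n for _ in range(m)]
    trace = []  # issue order, for inspection only
    for kb in range(len(b) // block_k):
        ks = range(kb * block_k, (kb + 1) * block_k)
        if scale_first:
            # The scale depends only on the block index, not on the tiles.
            scale = a_scale[kb] * b_scale[kb]
            trace.append("scale")
        a_tile = load_tile(a, range(m), ks)
        b_tile = load_tile(b, ks, range(n))
        trace.append("load")
        if not scale_first:
            scale = a_scale[kb] * b_scale[kb]
            trace.append("scale")
        # Only the dot consumes the loaded tiles.
        partial = dot(a_tile, b_tile)
        trace.append("dot")
        for i in range(m):
            for j in range(n):
                acc[i][j] += partial[i][j] * scale
    return acc, trace


A = [[1, 2, 3, 4], [5, 6, 7, 8]]
B = [[1, 0], [0, 1], [1, 1], [2, 1]]
A_SCALE = [0.5, 2.0]
B_SCALE = [0.25, 1.0]

old_result, old_trace = scaled_gemm(A, B, A_SCALE, B_SCALE, 2, scale_first=False)
new_result, new_trace = scaled_gemm(A, B, A_SCALE, B_SCALE, 2, scale_first=True)
```

Both variants produce the same accumulator because the scale computation never reads the loaded tiles. Only the issue order differs, which is what lets the scaling work proceed without waiting on the load barrier in the real kernel.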

Differential Revision: D57473133

@facebook-github-bot
This pull request was exported from Phabricator. Differential Revision: D57473133

netlify bot commented May 17, 2024

Deploy Preview for pytorch-fbgemm-docs ready!

Name Link
🔨 Latest commit 64df84e
🔍 Latest deploy log https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/664bb373158f600008dbb278
😎 Deploy Preview https://deploy-preview-2600--pytorch-fbgemm-docs.netlify.app

htyu added a commit to htyu/FBGEMM that referenced this pull request May 17, 2024
… scaled GEMMs (pytorch#2600)

htyu added a commit to htyu/FBGEMM that referenced this pull request May 20, 2024
… scaled GEMMs (pytorch#2600)

@facebook-github-bot
This pull request has been merged in 3f4d79e.
