[inductor][cpp] support bf16/fp16 gemm template epilogue fusion #126545

Open

jgong5 wants to merge 24 commits into gh/jgong5/48/base from gh/jgong5/48/head

Collaborator

jgong5 commented May 17, 2024

Stack from ghstack (oldest at bottom):

As part of #125683, this PR adds epilogue fusion support for bf16/fp16 gemms. The key changes are as follows:

1. bf16 linear with epilogue fusion of some ops was originally supported via the ATen oneDNN linear pointwise ops. To match the ATen op semantics, in-template epilogue support is added to the cpp gemm template, giving "gemm + in-template epilogues -> template buffer". If the template is chosen for codegen, the in-template epilogues are concatenated with the out-of-template epilogues that are appended during scheduling.
2. Support bf16/fp16 legalization for `codegen_loop_bodies`, which is used to generate the epilogue loops.
3. We used to rely on the in-place buffer mechanism to handle buffer reuse in the epilogue codegen, in particular the reuse among the output buffers of the GEMM, the template and the epilogues. This is not correct, since the output buffer is an "output" buffer, not an "in-place" buffer, of the template kernel itself. Now a dedicated "aliases" dict manages such buffer reuse, and the intermediate aliasing buffers are removed after codegen.
4. Add a `localize_buffer` method to `LocalBufferScope` to allow replacing a global buffer with a local one in the given inductor IR nodes. This helps the fused loops work on smaller local buffers for better data locality.
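
For readers unfamiliar with the bf16/fp16 legalization mentioned above, here is a minimal pure-Python sketch of the general idea: low-precision values are widened to fp32, the epilogue math runs in fp32, and the result is rounded back to bf16 on store. This is an illustration only, not the code this PR generates. The helper names (`f32_to_bf16_bits`, `bf16_bits_to_f32`, `legalized_bias_relu`) are hypothetical, and the bias-plus-ReLU epilogue is an assumed example.

```python
import struct


def f32_to_bf16_bits(x: float) -> int:
    """Round a float32 value to its bfloat16 bit pattern (round-to-nearest-even).

    NaN handling is omitted to keep the sketch short.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add the rounding bias: 0x7FFF plus the lowest bit that will be kept.
    bits += 0x7FFF + ((bits >> 16) & 1)
    return (bits >> 16) & 0xFFFF


def bf16_bits_to_f32(b: int) -> float:
    """Widen a bfloat16 bit pattern to float32 (exact, no rounding needed)."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


def legalized_bias_relu(acc_bf16, bias_bf16):
    """Hypothetical epilogue: relu(acc + bias) on bf16 inputs.

    Each element is upcast, computed in higher precision, then downcast
    once at the end, which is the shape of a legalized loop body.
    """
    out = []
    for a, b in zip(acc_bf16, bias_bf16):
        a32 = bf16_bits_to_f32(a)       # load + upcast
        b32 = bf16_bits_to_f32(b)       # load + upcast
        y32 = max(a32 + b32, 0.0)       # compute in higher precision
        out.append(f32_to_bf16_bits(y32))  # downcast + store
    return out
```

The point of the sketch is that rounding happens once, at the store, instead of after every intermediate op.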

cc @voznesenskym @penguinwu @EikanWang @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang


          [inductor][cpp] support bf16/fp16 gemm template epilogue fusion

[ghstack-poisoned]

pytorch-bot bot commented May 17, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126545

⏳ 9 Pending, 4 Unrelated Failures

As of commit 391c86d with merge base e629259:

UNSTABLE - The following jobs failed, likely due to flakiness present on trunk, and have been marked as unstable:

inductor / cuda12.1-py3.10-gcc9-sm86 / test (dynamic_inductor_timm, 2, 2, linux.g5.4xlarge.nvidia.gpu, unstable) (gh) (#127438)
sebotnet33ts_256
inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_timm, 1, 2, linux.g5.4xlarge.nvidia.gpu, unstable) (gh) (#126884)
cspdarknet53
inductor / cuda12.4-py3.10-gcc9-sm86 / test (dynamic_inductor_timm, 1, 2, linux.g5.4xlarge.nvidia.gpu, unstable) (gh) (#127680)
gluon_inception_v3
pull / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 2, 5, linux.g5.4xlarge.nvidia.gpu, unstable) (gh) ()
inductor/test_efficient_conv_bn_eval.py::EfficientConvBNEvalCudaTests::test_basic_cuda

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added ciflow/inductor module: inductor labels

jgong5 added a commit that referenced this pull request


          [inductor][cpp] support bf16/fp16 gemm template epilogue fusion

2fac304

ghstack-source-id: 06252d53ae18a72140a38eed8f5259c34b59259e
Pull Request resolved: #126545

jgong5 marked this pull request as draft

May 17, 2024 15:27

pytorchbot added the open source label


          Update on "[inductor][cpp] support bf16/fp16 gemm template epilogue fusion"

a40ed05

[ghstack-poisoned]

jgong5 added a commit that referenced this pull request


          [inductor][cpp] support bf16/fp16 gemm template epilogue fusion

65599cb

ghstack-source-id: 60ca4955d3e0800b8142de4e4c9f0e9675abf3ea
Pull Request resolved: #126545


          Update on "[inductor][cpp] support bf16/fp16 gemm template epilogue fusion"

796a84b

[ghstack-poisoned]

jgong5 added a commit that referenced this pull request


          [inductor][cpp] support bf16/fp16 gemm template epilogue fusion

4b4957e

ghstack-source-id: 12e3fa4960f5661ba92df52819b58bcea9a20a15
Pull Request resolved: #126545


          Update on "[inductor][cpp] support bf16/fp16 gemm template epilogue fusion"

c56342c

[ghstack-poisoned]

jgong5 added a commit that referenced this pull request


          [inductor][cpp] support bf16/fp16 gemm template epilogue fusion

78bf6c1

ghstack-source-id: fc6030e7b3706b0bf44bd114eabfec024fc40dbb
Pull Request resolved: #126545


          Update on "[inductor][cpp] support bf16/fp16 gemm template epilogue fusion"

e8098ad

[ghstack-poisoned]

jgong5 added a commit that referenced this pull request


          [inductor][cpp] support bf16/fp16 gemm template epilogue fusion

d240903

ghstack-source-id: 88f7e5628da38239f6f6463f932dfbe1bd790a7b
Pull Request resolved: #126545


          Update on "[inductor][cpp] support bf16/fp16 gemm template epilogue fusion"

bb65c57

[ghstack-poisoned]

jgong5 added a commit that referenced this pull request


          [inductor][cpp] support bf16/fp16 gemm template epilogue fusion

b31b980

ghstack-source-id: 58eab0cb80c06f5fce00a8b5f0b79c691b8349ad
Pull Request resolved: #126545

jgong5 marked this pull request as ready for review

May 19, 2024 08:54

jgong5 requested review from jansel, lezcano and peterbell10

May 19, 2024 08:54

jgong5 commented

torch/_inductor/codegen/cpp_gemm_template.py

@@ -217,26 +222,30 @@ def add_choices(
                       input_nodes,
                       beta=1,
                       alpha=1,
+                      has_bias=False,

Collaborator Author

jgong5 May 19, 2024

Originally we used the number of input nodes to decide whether there is a bias (2: no bias, 3: with bias). Now that inputs from epilogue nodes are part of the template, there can be more than two inputs even when there is no bias, so we use a dedicated flag instead.
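
To make the failure mode concrete, here is a small sketch that contrasts the count-based check with an explicit flag. The function names and the string stand-ins for input nodes are hypothetical and are not taken from `cpp_gemm_template.py`.

```python
from typing import Sequence


def has_bias_by_count(input_nodes: Sequence[str]) -> bool:
    """Old heuristic: 2 inputs means no bias, 3 inputs means bias."""
    return len(input_nodes) == 3


def has_bias_by_flag(input_nodes: Sequence[str], has_bias: bool = False) -> bool:
    """New approach: the caller states explicitly whether a bias is present."""
    return has_bias


# Stand-ins for input nodes (hypothetical names).
plain = ["x", "w"]
with_bias = ["x", "w", "bias"]
# No bias, but an in-template epilogue contributes an extra input.
with_epilogue_input = ["x", "w", "residual"]

# The count heuristic is fine for the first two cases...
assert has_bias_by_count(plain) is False
assert has_bias_by_count(with_bias) is True
# ...but misreads the epilogue input as a bias.
assert has_bias_by_count(with_epilogue_input) is True
# The explicit flag does not depend on how many inputs there are.
assert has_bias_by_flag(with_epilogue_input, has_bias=False) is False
```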

torch/_inductor/codegen/cpp_gemm_template.py

                       trans_w=False,
                       input_indices=None,
+                      epilogue_creator: Optional[Callable[[ir.Buffer], ir.Pointwise]] = None,

Collaborator Author

jgong5 May 19, 2024

It is used to create the in-template epilogue nodes.

torch/_inductor/codegen/cpp_gemm_template.py

+                      epilogues: List[ir.IRNode] = []
+                      if self.epilogue_creator is not None:
+                          gemm_output_name = "GemmOut"

Collaborator Author

jgong5 May 19, 2024

With in-template epilogue nodes, the gemm output can differ from the template output, i.e., gemm out -> in-template epilogues -> template output -> fused out-of-template epilogues. So we create a dedicated buffer for the gemm output.
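
The chain described here can be sketched in plain Python, with lists standing in for buffers and callables standing in for epilogue nodes. `run_chain` and its parameters are hypothetical names for illustration. In the PR itself, `epilogue_creator` builds an `ir.Pointwise` node from an `ir.Buffer` and does not transform values directly.

```python
from typing import Callable, List, Optional, Sequence

Vec = List[float]
Epilogue = Callable[[Vec], Vec]


def run_chain(
    gemm_out: Vec,
    in_template_epilogue: Optional[Epilogue] = None,
    fused_epilogues: Sequence[Epilogue] = (),
) -> Vec:
    """Apply in-template epilogues first, then the out-of-template ones."""
    epilogues: List[Epilogue] = []
    if in_template_epilogue is not None:
        # gemm out -> in-template epilogue -> template output
        epilogues.append(in_template_epilogue)
    # template output -> fused out-of-template epilogues
    epilogues.extend(fused_epilogues)

    buf = list(gemm_out)  # dedicated copy, so the gemm output stays distinct
    for ep in epilogues:
        buf = ep(buf)
    return buf


def relu(v: Vec) -> Vec:
    return [max(x, 0.0) for x in v]


def add_one(v: Vec) -> Vec:
    return [x + 1.0 for x in v]


result = run_chain([-1.0, 2.0], in_template_epilogue=relu, fused_epilogues=[add_one])
```

Without an in-template epilogue, the gemm output and the template output coincide, which is why a separate gemm-out buffer is only needed when `in_template_epilogue` is set.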

jgong5 mentioned this pull request

[RFC] Add Cpp Template for GEMM related ops via max-autotune for Inductor CPU #125683

Open

15 tasks


          Update on "[inductor][cpp] support bf16/fp16 gemm template epilogue fusion"

f1dc19d

[ghstack-poisoned]

This was referenced May 20, 2024

[inductor][cpp] GEMM template (infra and fp32) #124021

Closed

[inductor][cpp] epilogue support for gemm template #126019

Closed

[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion #126068

Open

jgong5 added a commit that referenced this pull request


          [inductor][cpp] support bf16/fp16 gemm template epilogue fusion

458b91d

ghstack-source-id: f0c6e594e1ce72e02a3147d927b3078f56c729f0
Pull Request resolved: #126545


          Update on "[inductor][cpp] support bf16/fp16 gemm template epilogue fusion"

7a216e5

[ghstack-poisoned]

jgong5 added a commit that referenced this pull request


          [inductor][cpp] support bf16/fp16 gemm template epilogue fusion

e8434a5

ghstack-source-id: 97caae5901418a5c32a8c3e7e26c1997c6cb2cc5
Pull Request resolved: #126545

jgong5 added a commit that referenced this pull request


          [inductor][cpp] support bf16/fp16 gemm template epilogue fusion

e33f0d7

ghstack-source-id: ecff738862b475558b7c05e50d7e22f1b984eb72
Pull Request resolved: #126545


          Update on "[inductor][cpp] support bf16/fp16 gemm template epilogue fusion"

32da249

[ghstack-poisoned]

jgong5 added a commit that referenced this pull request


          [inductor][cpp] support bf16/fp16 gemm template epilogue fusion

eda575e

ghstack-source-id: dee863e83e3d2d82c175bbc72d217efe9a189ad7
Pull Request resolved: #126545

Collaborator Author

jgong5 commented May 24, 2024

@pytorchbot merge

pytorchmergebot added the merging label

Collaborator

pytorchmergebot commented May 24, 2024

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot closed this in

43baabe

pytorchmergebot removed the merging label

Collaborator

pytorchmergebot commented May 27, 2024

@jgong5 your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request


          Revert "[inductor][cpp] support bf16/fp16 gemm template epilogue fusion (#126545)"

ed9951a

This reverts commit 43baabe.

Reverted #126545 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](#124021 (comment)))

pytorchmergebot reopened this


          Update

521e47b

[ghstack-poisoned]

jgong5 mentioned this pull request

[inductor][cpp] BF16 AMX micro-gemm support #127195

Open


          Update

72f6d15

[ghstack-poisoned]

titaiwangms pushed a commit to titaiwangms/pytorch that referenced this pull request


          [inductor][cpp] support bf16/fp16 gemm template epilogue fusion (pytorch#126545)

93da7b0

Pull Request resolved: pytorch#126545
Approved by: https://github.com/jansel
ghstack dependencies: pytorch#124021, pytorch#126019, pytorch#126068

titaiwangms pushed a commit to titaiwangms/pytorch that referenced this pull request


          Revert "[inductor][cpp] support bf16/fp16 gemm template epilogue fusion (pytorch#126545)"

9004a71

This reverts commit 43baabe.

Reverted pytorch#126545 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](pytorch#124021 (comment)))

jgong5 added 4 commits

May 28, 2024 18:04


          Update

0022e38

[ghstack-poisoned]


          Update

348dbb9

[ghstack-poisoned]


          Update

771be54

[ghstack-poisoned]


          Update

40daff8

[ghstack-poisoned]

jgong5 mentioned this pull request

[cpuinfo] bump cpuinfo to the latest to support amx isa check #127505

Open

Aidyn-A pushed a commit to tinglvv/pytorch that referenced this pull request


          Revert "[inductor][cpp] support bf16/fp16 gemm template epilogue fusion (pytorch#126545)"

c2255a8

This reverts commit 43baabe.

Reverted pytorch#126545 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](pytorch#124021 (comment)))

jgong5 added 7 commits

May 30, 2024 16:34


          Update on "[inductor][cpp] support bf16/fp16 gemm template epilogue fusion"

2c3f373

[ghstack-poisoned]


          Update on "[inductor][cpp] support bf16/fp16 gemm template epilogue fusion"

248b807

[ghstack-poisoned]


          Update on "[inductor][cpp] support bf16/fp16 gemm template epilogue fusion"

7cf0f68

[ghstack-poisoned]


          Update on "[inductor][cpp] support bf16/fp16 gemm template epilogue fusion"

c41ee73

[ghstack-poisoned]


          Update on "[inductor][cpp] support bf16/fp16 gemm template epilogue fusion"

f4a8cd1

[ghstack-poisoned]


          Update on "[inductor][cpp] support bf16/fp16 gemm template epilogue fusion"

b685ea5

[ghstack-poisoned]


          Update on "[inductor][cpp] support bf16/fp16 gemm template epilogue fusion"

391c86d

[ghstack-poisoned]
