#8536: Allow in0 and output to be sharded on different grids for mcast 1D in0 #8614

TT-BrianLiu · 2024-05-17T18:31:22Z

- Fully decouple the in0 sender grid (i.e. cores that hold in0 data) from the receiver grid (i.e. cores that produce output work)
- This means the user can now width shard in0 along K and specify a per_core_N that divides the output width across an arbitrary number of cores
- See tests/ttnn/sweep_tests/sweeps/sweeps/matmul/short/matmul_user_program_config_mcast_1d.py for examples
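To make the decoupling concrete, here is a minimal sketch in plain Python (not the ttnn API) of how the number of output cores follows from N and per_core_N, independent of how many cores hold in0 shards. The helper names `num_output_cores` and `removed_assert_would_pass` are hypothetical, and the example values are illustrative only; N and per_core_N just need to be in the same units.

```python
def div_up(a: int, b: int) -> int:
    """Ceiling division, mirroring the div_up call in the removed assert."""
    return (a + b - 1) // b


def num_output_cores(n: int, per_core_n: int) -> int:
    """Number of cores that produce output work (hypothetical helper)."""
    return div_up(n, per_core_n)


def removed_assert_would_pass(n: int, per_core_n: int, in0_shard_cores: int) -> bool:
    """True when the old constraint (output cores == in0 shard cores) holds."""
    return num_output_cores(n, per_core_n) == in0_shard_cores


if __name__ == "__main__":
    n = 64                 # output width (illustrative value)
    in0_shard_cores = 8    # cores holding in0 shards (illustrative value)
    for per_core_n in (8, 4, 2):
        cores = num_output_cores(n, per_core_n)
        ok = removed_assert_would_pass(n, per_core_n, in0_shard_cores)
        print(f"per_core_N={per_core_n}: {cores} output cores, old assert passes: {ok}")
```

With the assert removed, the configurations where the two core counts differ are the ones this PR is meant to enable.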

Changes:

- Remove this assert: TT_FATAL(div_up(N, per_core_N) == input_tensor_a.shard_spec().value().grid.num_cores());
- Separate in0 sender/receiver cores into three kernel groups so that all new logic is resolved at compile time
  - in0_mcast_cores_with_work_and_in_receiver_grid
  - in0_mcast_cores_without_work_and_in_receiver_grid
  - in0_mcast_cores_without_work_and_not_in_receiver_grid
- Only load compute and writer kernels onto cores that produce output work
- For interleaved in0, likewise mcast only to cores with work; if there is only a single core, skip the mcast
- Add a new short matmul sweep to test mcast 1D matmul with different in0 and output grids
- Fork tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in0_sender_receiver_padding_block_sharded.cpp to *width_sharded.cpp
- TODO: Merge these kernels back once mcast 2D matmul is uplifted to support this feature
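The three core groups listed above can be illustrated with a small, self-contained Python model. This is only a sketch under stated assumptions, not the code in this PR: it assumes cores are assigned in row-major order on a grid of fixed width, that the receiver grid is the bounding box of the cores with work, and that the hypothetical helpers `classify_in0_senders` and `in0_mcast_needed` stand in for logic that the real op resolves when setting up kernels.

```python
from typing import Dict, List, NamedTuple


class Core(NamedTuple):
    x: int
    y: int


def row_major_cores(num_cores: int, grid_width: int) -> List[Core]:
    """Assumption: cores are handed out row by row across the grid."""
    return [Core(i % grid_width, i // grid_width) for i in range(num_cores)]


def classify_in0_senders(
    num_sender_cores: int, num_work_cores: int, grid_width: int
) -> Dict[str, List[Core]]:
    """Split in0 sender cores into the three groups named in the PR."""
    senders = row_major_cores(num_sender_cores, grid_width)
    work = set(row_major_cores(num_work_cores, grid_width))
    # Assumption: the receiver grid is the bounding box of cores with work.
    x0, x1 = min(c.x for c in work), max(c.x for c in work)
    y0, y1 = min(c.y for c in work), max(c.y for c in work)

    groups: Dict[str, List[Core]] = {
        "in0_mcast_cores_with_work_and_in_receiver_grid": [],
        "in0_mcast_cores_without_work_and_in_receiver_grid": [],
        "in0_mcast_cores_without_work_and_not_in_receiver_grid": [],
    }
    for core in senders:
        in_receiver_grid = x0 <= core.x <= x1 and y0 <= core.y <= y1
        if core in work:
            groups["in0_mcast_cores_with_work_and_in_receiver_grid"].append(core)
        elif in_receiver_grid:
            groups["in0_mcast_cores_without_work_and_in_receiver_grid"].append(core)
        else:
            groups["in0_mcast_cores_without_work_and_not_in_receiver_grid"].append(core)
    return groups


def in0_mcast_needed(num_work_cores: int) -> bool:
    """Hypothetical check: with a single core there is nothing to mcast to."""
    return num_work_cores > 1


if __name__ == "__main__":
    groups = classify_in0_senders(num_sender_cores=24, num_work_cores=10, grid_width=8)
    for name, cores in groups.items():
        print(name, len(cores))
```

Because each sender core falls into exactly one group, a separate kernel variant can be selected per group up front, which matches the stated goal of keeping the new logic at compile time.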

TT-BrianLiu · 2024-05-17T18:32:55Z

This is a current draft with width sharded (mcast in0) working. Remaining TODOs:

- ~~2d mcast needs to work since it uses the same in0 sharded reader kernel~~ (will do in a separate PR)
- Move all new logic in the reader kernels to compile time
- Clean up and add more tests

TT-BrianLiu · 2024-05-23T16:38:40Z

Passing pipelines:

post commit all: https://github.com/tenstorrent/tt-metal/actions/runs/9210782867
post commit models tests: https://github.com/tenstorrent/tt-metal/actions/runs/9210784488
post commit ttnn unit tests: https://github.com/tenstorrent/tt-metal/actions/runs/9210785818
device perf: https://github.com/tenstorrent/tt-metal/actions/runs/9210791946
models perf: https://github.com/tenstorrent/tt-metal/actions/runs/9210793801

TT-BrianLiu requested review from eyonland, arakhmati, cfjchu, xanderchin and bbradelTT as code owners May 17, 2024 18:31

TT-BrianLiu force-pushed the jedi branch 3 times, most recently from 4c0c512 to 04af587 Compare May 22, 2024 20:02

TT-BrianLiu changed the title ~~#8536: Allow in0 and output to be sharded on different grids for mcast 1D in0 and mcast 2D matmuls~~ #8536: Allow in0 and output to be sharded on different grids for mcast 1D in0 May 22, 2024

TT-BrianLiu force-pushed the jedi branch 3 times, most recently from a675d69 to 9e755f9 Compare May 22, 2024 21:32

arakhmati approved these changes May 22, 2024

TT-BrianLiu force-pushed the jedi branch from 9e755f9 to a237d77 Compare May 22, 2024 21:37

TT-BrianLiu force-pushed the jedi branch from f23491b to 392a717 Compare May 23, 2024 16:35

TT-BrianLiu merged commit 392a717 into main May 23, 2024
5 checks passed

