support non contiguous sharding #1950
base: main
Conversation
This pull request was exported from Phabricator. Differential Revision: D55577262
…torch#1952) Summary: This allows use of `from torch.testing._internal.common_distributed import spawn_threads_and_init_comms`, aka the threaded process group, for lightweight "distributed" tests. That avoids heavier process-based tests when they are unnecessary. Reviewed By: henrylhtsang Differential Revision: D56960671
Summary: PP requires non-contiguous DMP sharding. In today's torchrec planner, ranks are assumed to be contiguous in various places, which prevents intra-host pipeline parallelism from utilizing NVLink. This set of changes basically: 1. introduces `device_ranks` in `Topology`, defaulting to `list(range(world_size))`, which matches today's behavior, while letting the caller pass in a specific topology instead; 2. changes lists to dicts in various places, since the contiguity assumption no longer holds. Differential Revision: D55482028
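As a hedged sketch of the idea in this summary (the names `Topology` and `device_ranks` come from the text above, but the class shown here is a simplified stand-in, not the actual torchrec signature), a planner topology that supports non-contiguous ranks might look like:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Topology:
    """Simplified stand-in for a planner topology supporting
    non-contiguous device ranks (illustrative, not the real torchrec class)."""

    world_size: int
    device_ranks: List[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Default preserves today's behavior: contiguous ranks 0..world_size-1.
        if not self.device_ranks:
            self.device_ranks = list(range(self.world_size))

    def per_rank_state(self) -> Dict[int, int]:
        # Keyed by actual rank (a dict, not a list), so a non-contiguous
        # rank set such as [0, 4, 8, 12] works without index gaps.
        return {rank: 0 for rank in self.device_ranks}


# Contiguous default, same as today.
default_topo = Topology(world_size=4)
# Non-contiguous placement: one GPU out of every NVSwitch-connected group.
sparse_topo = Topology(world_size=4, device_ranks=[0, 4, 8, 12])
```

The dict keyed by rank is the crux: list-indexed state silently assumes ranks `0..n-1`, which is exactly the assumption this change removes.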
Summary:
More of an RFC diff.
depends on #1837 (see diagram there)
The high-level idea is that we want to disaggregate the dense and sparse tower placement in recommendation model distributed training.
Let's say we have 2 DGX hosts with 16 GPUs.
Today:
We flat-shard DMP/FSDP onto all 16 GPUs, so A2A/AG/RS collectives run at a world size of 16.
This poses a scalability challenge: the model quickly becomes comm-bound above 128 GPUs.
After:
We allow for logically segregated placement.
E.g. for the same 16 GPUs, we can do a 1:3 split, placing sparse onto 4 GPUs and dense onto 12.
To leverage the intra-host NVSwitch connectivity, we can use the placement

```
[
  [0 | 1 2 3], [4 | 5 6 7],
  [8 | 9 10 11], [12 | 13 14 15],
]
```
That way, the world sizes become 4 and 12 respectively, and between the two groups we use P2P comm.
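A minimal sketch of the rank split described above (pure Python, illustrative only; the `split_placement` helper and its `sparse_per_group` parameter are assumptions for this sketch, not part of the PR):

```python
from typing import List, Tuple


def split_placement(groups: List[List[int]],
                    sparse_per_group: int = 1) -> Tuple[List[int], List[int]]:
    """Split each NVSwitch-connected group of ranks into sparse and dense
    sub-lists, e.g. [0 | 1 2 3] -> sparse rank 0, dense ranks 1-3.
    (Illustrative helper; not part of the actual PR.)"""
    sparse_ranks: List[int] = []
    dense_ranks: List[int] = []
    for group in groups:
        sparse_ranks.extend(group[:sparse_per_group])
        dense_ranks.extend(group[sparse_per_group:])
    return sparse_ranks, dense_ranks


# The 16-GPU example from the summary: a 1:3 split per group of four.
groups = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
sparse, dense = split_placement(groups)
# sparse world size 4 ([0, 4, 8, 12]); dense world size 12 (the rest).
```

With real process groups, `sparse` and `dense` would each be passed to something like `torch.distributed.new_group(ranks=...)` (an existing PyTorch API) to build the two communicators with world sizes 4 and 12.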
Differential Revision: D55577262