support non contiguous sharding #1950
base: main
Conversation
This pull request was exported from Phabricator. Differential Revision: D55577262
…torch#1952) Summary: This allows use of `from torch.testing._internal.common_distributed import spawn_threads_and_init_comms`, aka the threaded process group, for lightweight "distributed" tests. That avoids heavier process-based tests when they are unnecessary. Reviewed By: henrylhtsang Differential Revision: D56960671
Summary: PP requires non-contiguous DMP sharding. In today's torchrec planner, ranks are assumed to be contiguous in various places, which prevents intra-host pipeline parallelism from utilizing NVLink. This set of changes basically: 1. introduces `device_ranks` in `Topology`, defaulting to `list(range(world_size))`, which matches today's behavior, while letting the caller pass in a specific topology instead; 2. changes lists to dicts in various places, since the contiguity assumption no longer holds. Differential Revision: D55482028
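As a hedged sketch of the idea in this summary (the names `Topology` and `device_ranks` come from the text above, but the class shown here is a simplified stand-in, not the actual torchrec signature), a planner topology that supports non-contiguous ranks might look like:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Topology:
    """Simplified stand-in for a planner topology supporting
    non-contiguous device ranks (illustrative, not the real torchrec class)."""

    world_size: int
    device_ranks: List[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Default preserves today's behavior: contiguous ranks 0..world_size-1.
        if not self.device_ranks:
            self.device_ranks = list(range(self.world_size))

    def per_rank_state(self) -> Dict[int, int]:
        # Keyed by actual rank (a dict, not a list), so a non-contiguous
        # rank set such as [0, 4, 8, 12] works without index gaps.
        return {rank: 0 for rank in self.device_ranks}


# Contiguous default, same as today.
default_topo = Topology(world_size=4)
# Non-contiguous placement: one GPU out of every NVSwitch-connected group.
sparse_topo = Topology(world_size=4, device_ranks=[0, 4, 8, 12])
```

The dict keyed by rank is the crux: list-indexed state silently assumes ranks `0..n-1`, which is exactly the assumption this change removes.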
Summary:
More of an RFC diff.
depends on #1837 (see diagram there)
The high-level idea is that we want to disaggregate the dense and sparse tower placement in recommendation model distributed training.
Let's say we have 2 DGX hosts with 16 GPUs.
Today:
We flat-shard DMP/FSDP onto all 16 GPUs, so A2A/AG/RS collectives run at a world size of 16.
This poses a scalability challenge: the model quickly becomes comm-bound above 128 GPUs.
After:
We allow for logically segregated placement.
E.g. for the same 16 GPUs, we can do a 1:3 split, placing sparse onto 4 GPUs and dense onto 12.
To leverage the intra-host NVSwitch connectivity, we can use the placement

```
[
  [0 | 1 2 3], [4 | 5 6 7],
  [8 | 9 10 11], [12 | 13 14 15],
]
```
That way, the world sizes become 4 and 12 respectively, and between the two groups we use P2P comm.
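A minimal sketch of the rank split described above (pure Python, illustrative only; the `split_placement` helper and its `sparse_per_group` parameter are assumptions for this sketch, not part of the PR):

```python
from typing import List, Tuple


def split_placement(groups: List[List[int]],
                    sparse_per_group: int = 1) -> Tuple[List[int], List[int]]:
    """Split each NVSwitch-connected group of ranks into sparse and dense
    sub-lists, e.g. [0 | 1 2 3] -> sparse rank 0, dense ranks 1-3.
    (Illustrative helper; not part of the actual PR.)"""
    sparse_ranks: List[int] = []
    dense_ranks: List[int] = []
    for group in groups:
        sparse_ranks.extend(group[:sparse_per_group])
        dense_ranks.extend(group[sparse_per_group:])
    return sparse_ranks, dense_ranks


# The 16-GPU example from the summary: a 1:3 split per group of four.
groups = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
sparse, dense = split_placement(groups)
# sparse world size 4 ([0, 4, 8, 12]); dense world size 12 (the rest).
```

With real process groups, `sparse` and `dense` would each be passed to something like `torch.distributed.new_group(ranks=...)` (an existing PyTorch API) to build the two communicators with world sizes 4 and 12.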
Differential Revision: D55577262