support non contiguous sharding #1950

Open
xunnanxu wants to merge 3 commits into main

Conversation

@xunnanxu (Contributor) commented May 3, 2024

Summary:
More of an RFC diff.
Depends on #1837 (see the diagram there).

The high-level idea is to disaggregate the dense and sparse tower placement in recommendation model distributed training.

Let's say we have 2 DGX hosts with 16 GPUs in total.

Today:

We flat-shard DMP/FSDP onto all 16 GPUs, so A2A/AG/RS collectives run at a world size of 16.
This poses a scalability challenge, as the model quickly becomes comm bound above 128 GPUs.

After:

We allow for logically segregated placement.
E.g., for the same 16 GPUs, we can do a 1:3 split and place the sparse tower onto 4 GPUs and the dense tower onto 12.

To leverage the intra-host NVSwitch connectivity, we can use the placement

[
  [0 | 1 2 3],
  [4 | 5 6 7],
],
[
  [8 | 9 10 11],
  [12 | 13 14 15],
]

with one bracketed group per host. That way, the world sizes become 4 and 12 respectively, and the sparse and dense groups communicate with each other via P2P.
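
To make this concrete, here is a minimal sketch (plain `torch.distributed`, not the torchrec API this diff proposes) of how the two logical groups could be built; the helper name and the quad-based rank layout are illustrative assumptions mirroring the placement above:

```
import torch.distributed as dist

def build_disagg_groups(world_size: int = 16):
    # Assumes the default (size-16) process group is already initialized.
    # One sparse rank per NVSwitch quad ([0 | 1 2 3], [4 | 5 6 7], ...);
    # the remaining three ranks in each quad host the dense tower.
    sparse_ranks = [r for r in range(world_size) if r % 4 == 0]  # [0, 4, 8, 12]
    dense_ranks = [r for r in range(world_size) if r % 4 != 0]   # the other 12

    # new_group() is collective: every rank must call it for both groups,
    # including the group it does not belong to.
    sparse_pg = dist.new_group(ranks=sparse_ranks)  # A2A at world size 4
    dense_pg = dist.new_group(ranks=dense_ranks)    # AG/RS at world size 12
    return sparse_pg, dense_pg
```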

Differential Revision: D55577262

@facebook-github-bot added the CLA Signed label May 3, 2024
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D55577262

Commits

…torch#1952)

Summary:

This allows usage of `from torch.testing._internal.common_distributed import spawn_threads_and_init_comms`,
i.e. the threaded process group, for lightweight "distributed" tests.

That avoids heavier process-based tests when they are unnecessary.

Reviewed By: henrylhtsang

Differential Revision: D56960671
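
As context, a minimal sketch of the kind of threaded-pg test this enables (the test class, world size, and tensor values are illustrative, not taken from this diff):

```
import unittest

import torch
import torch.distributed as dist
from torch.testing._internal.common_distributed import spawn_threads_and_init_comms

class ThreadedCommsTest(unittest.TestCase):
    # Runs the test body on 4 threads within one process, each thread acting
    # as one rank of a threaded process group (no process spawning needed).
    @spawn_threads_and_init_comms(world_size=4)
    def test_all_reduce(self) -> None:
        t = torch.ones(2) * dist.get_rank()
        dist.all_reduce(t)  # default op is SUM across ranks 0..3
        torch.testing.assert_close(t, torch.full((2,), 6.0))  # 0 + 1 + 2 + 3

if __name__ == "__main__":
    unittest.main()
```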
Summary:

Pipeline parallelism (PP) requires non-contiguous DMP sharding.
In today's torchrec planner, ranks are assumed to be contiguous in various places, which prevents intra-host pipeline parallelism from utilizing NVLink.

This set of changes basically:
1. Introduces `device_ranks` in `Topology`, defaulting to `list(range(world_size))`, which matches today's behavior; callers can pass in a specific topology instead (see the sketch below).
2. Changes lists to dicts in various places, since the contiguity assumption no longer holds.

Differential Revision: D55482028
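
A rough sketch of how the proposed knob might be used; `device_ranks` and its default are this diff's proposal rather than a released torchrec API, and the rank layout is just an example:

```
from torchrec.distributed.planner.types import Topology

# Today's behavior: contiguous ranks 0..15 (the proposed default for `device_ranks`).
default_topology = Topology(world_size=16, compute_device="cuda")

# Proposed: a planner topology over non-contiguous device ranks, e.g. the
# first GPU of each NVSwitch quad from the placement in this PR.
sparse_topology = Topology(
    world_size=4,
    compute_device="cuda",
    device_ranks=[0, 4, 8, 12],  # proposed argument; defaults to list(range(world_size))
)
```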
Summary:

More of an RFC diff (same summary as the pull request description above).

Differential Revision: D55577262