Update _dedup_save_plans.py #126569

Open · bigning wants to merge 7 commits into main
Conversation

@bigning (Contributor) commented May 17, 2024

To resolve pytorch#125740, save each duplicated tensor on the lowest rank that holds it.
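For context, here is a toy, self-contained sketch of that idea (illustrative only; it is not the actual dedup_save_plans implementation): when a write item is duplicated across several ranks' save plans, keep it only on the lowest-rank plan, so a loader always knows which single file holds it.

# Toy sketch of lowest-rank dedup (not the real torch.distributed.checkpoint code).
from collections import defaultdict
from typing import Dict, List, Set

def dedup_to_lowest_rank(plans: List[List[str]]) -> List[List[str]]:
    """plans[rank] is the list of item keys that rank intends to save."""
    item_to_ranks: Dict[str, Set[int]] = defaultdict(set)
    for rank, items in enumerate(plans):
        for item in items:
            item_to_ranks[item].add(rank)

    deduped: List[List[str]] = [[] for _ in plans]
    for item, ranks in item_to_ranks.items():
        deduped[min(ranks)].append(item)  # the lowest rank keeps the duplicate
    return deduped

# Ranks 0-3 all hold a replicated item "step"; after dedup only rank 0 saves it,
# so a loader never has to fetch another rank's file just for "step".
print(dedup_to_lowest_rank([["step", "shard0"], ["step", "shard1"],
                            ["step", "shard2"], ["step", "shard3"]]))
# [['step', 'shard0'], ['shard1'], ['shard2'], ['shard3']]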
@pytorch-bot bot added the module: distributed_checkpoint and oncall: distributed labels May 17, 2024
pytorch-bot bot commented May 17, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126569

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

⏳ 1 Pending, 1 Unrelated Failure

As of commit c102e7f with merge base 64c581a:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@LucasLLC (Contributor) left a comment

At minimum, we would need this to be optional, and it should not replace the current deduplication logic. The current logic is a storage optimization, and this change would cause a regression when calling dcp.save on models that are replica heavy.

I think in the originally linked issue we considered not applying this logic to scalars, as a fix for minimizing the number of files that need to be loaded.

@bigning (Contributor, Author) commented May 17, 2024

The current logic is a storage optimization, and this would cause a regression in terms of calling dcp.save on models which are replica heavy.

It's an "optimization" only in terms of balancing the save load across ranks, but it hurts loading performance in the multi-node case.

we considered not applying this logic to scalars as a fix

I don't think that works. The root cause is that duplicated tensors are saved in different files, whether they are scalar tensors or not.

@bigning requested a review from LucasLLC May 17, 2024 20:20
@bigning (Contributor, Author) commented May 17, 2024

@LucasLLC, I replied to your comment. Could you please take a look?

@LucasLLC (Contributor) commented May 20, 2024

@bigning, I believe this generally isn't seen as a large issue during loading, since the files are all expected to live in the same NFS directory. Additionally, I think we would prioritize saving latency over loading latency, at least in this case, since users typically save much more often. Could we change this PR to make the de-duplication optional (and true by default)?

# essentially ignores the storage size of anything that is not a tensor, since
# we don't know how much storage they represent
plan_to_size[select_plan_idx] += write_item.tensor_storage_size() or 1
select_plan_idx = min(plan_indices, key=lambda plan_idx: plan_idx)
A collaborator left a review comment suggesting a change:

Suggested change
- select_plan_idx = min(plan_indices, key=lambda plan_idx: plan_idx)
+ select_plan_idx = min(plan_indices)
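(Since the key in the original line is the identity function, min(plan_indices, key=lambda plan_idx: plan_idx) and min(plan_indices) both return the lowest plan index, i.e. the lowest rank; the suggestion only drops the redundant key argument.)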

@bigning (Contributor, Author) commented May 20, 2024

@LucasLLC

make the de-duplication optional

If we skip the dedup, it fails here https://github.com/pytorch/pytorch/blob/main/torch/distributed/checkpoint/default_planner.py#L386-L387.

files are all expected to live in the same NFS directory.

For NFS, the issue is not whether the file exists on the NFS. It's that each node needs to download Nx more files. With cloud storage, this introduces:

  1. Nx the cloud storage cost for downloading more data.
  2. Nx more time blocking training. Saving can be async, but loading a checkpoint usually blocks training.
  3. Downloading large files from the cloud is not stable, which is a real pain point of using cloud storage. The more files you download, the higher the chance that the checkpoint download fails.

Can I just add a save_to_same_rank param to dedup_save_plans and DefaultSavePlanner?
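As a hedged sketch of what such a flag could look like (the parameter name save_to_same_rank comes from this comment, and the toy representation reuses the earlier example; neither is the merged API):

# Sketch of the proposed configurable dedup behavior (illustrative only).
from collections import defaultdict
from typing import Dict, List, Set

def dedup_plans(
    plans: List[List[str]],
    item_sizes: Dict[str, int],
    save_to_same_rank: bool = False,
) -> List[List[str]]:
    item_to_ranks: Dict[str, Set[int]] = defaultdict(set)
    for rank, items in enumerate(plans):
        for item in items:
            item_to_ranks[item].add(rank)

    deduped: List[List[str]] = [[] for _ in plans]
    rank_to_size = [0] * len(plans)
    for item, ranks in item_to_ranks.items():
        if save_to_same_rank:
            # Duplicated items always land on the lowest participating rank,
            # so each node only needs to fetch a small, predictable set of files.
            owner = min(ranks)
        else:
            # Existing behavior: balance accumulated storage size across ranks.
            owner = min(ranks, key=lambda r: rank_to_size[r])
        deduped[owner].append(item)
        rank_to_size[owner] += item_sizes.get(item, 1)
    return deduped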

@drisspg added the triaged label May 20, 2024
@LucasLLC (Contributor) commented

Can I just add a save_to_same_rank param to dedup_save_plans and DefaultSavePlanner

@bigning I understand the pain points of having to download multiple files. I think we would accept a PR that makes this behavior configurable.

@bigning (Contributor, Author) commented May 21, 2024

makes this behavior configurable.

@LucasLLC, this makes sense. I added a param to dedup_save_plans and DefaultSavePlanner; could you take a look? Thanks!
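For readers following along, usage would look roughly like the sketch below; the flag name dedup_save_to_lowest_rank is an assumption based on this thread, so check the merged DefaultSavePlanner signature before relying on it.

# Rough usage sketch; the planner flag name is assumed, not confirmed by this thread.
import torch
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.default_planner import DefaultSavePlanner

# In practice this would be the (sharded) model/optimizer state on each rank.
state_dict = {"step": torch.tensor(100), "weight": torch.randn(4, 4)}

planner = DefaultSavePlanner(dedup_save_to_lowest_rank=True)  # assumed flag name
dcp.save(state_dict, checkpoint_id="/tmp/ckpt", planner=planner)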

@LucasLLC (Contributor) commented

Thanks @bigning! This looks good to me. I will merge once tests pass.

@bigning (Contributor, Author) commented May 23, 2024

Thanks @LucasLLC, it seems two lints failed. I can't find a useful error message; do you know how to fix them or how to re-run?

@bigning (Contributor, Author) commented May 23, 2024

Thanks @LucasLLC, it seems two lints failed. I can't find a useful error message; do you know how to fix them or how to re-run?

Never mind, I just submitted another commit. Looks like all tests are green now.

@bigning (Contributor, Author) commented May 29, 2024

@LucasLLC, can you help merge?

@LucasLLC (Contributor) commented

@pytorchbot merge

pytorch-bot bot commented May 31, 2024

Pull workflow has not been scheduled for the PR yet. It could be because author doesn't have permissions to run those or skip-checks keywords were added to PR/commits, aborting merge. Please get/give approval for the workflows and/or remove skip ci decorators before next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

@bigning (Contributor, Author) commented May 31, 2024

@pytorchbot merge

pytorch-bot bot commented May 31, 2024

Pull workflow has not been scheduled for the PR yet. It could be because author doesn't have permissions to run those or skip-checks keywords were added to PR/commits, aborting merge. Please get/give approval for the workflows and/or remove skip ci decorators before next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

@LucasLLC added the topic: not user facing label May 31, 2024
@LucasLLC (Contributor) commented

@pytorchbot merge

@pytorch-bot bot added the ciflow/trunk label May 31, 2024
@pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator) commented

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / macos-13-py3-arm64 / build

Details for Dev Infra team: raised by workflow job.

Labels

ciflow/trunk (Trigger trunk jobs on your pull request), module: distributed_checkpoint, oncall: distributed (Add this issue/PR to distributed oncall triage queue), open source, topic: not user facing (topic category), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Development

Successfully merging this pull request may close these issues.

[Distributed Checkpoint] When loading FSDP sharded checkpointing each rank needs all the checkpointing files
6 participants