Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DSD] Correctly handle _extra_state (#125336) #126567

Merged
merged 4 commits into from
May 27, 2024

Conversation

mvpatel2000
Copy link
Contributor

@mvpatel2000 mvpatel2000 commented May 17, 2024

Pr for 2.3 cherry-pick

Summary:
distributed_state_dict should not try to use getattr to get _extra_state as this is not well-defined.

Pull Request resolved: #125336
Approved by: https://github.com/LucasLLC
ghstack dependencies: #125333, #125501, #125334, #125335

Fixes #ISSUE_NUMBER

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @LucasLLC

Summary:
distributed_state_dict should not try to use `getattr` to get `_extra_state` as this is not well-defined.

Pull Request resolved: pytorch#125336
Approved by: https://github.com/LucasLLC
ghstack dependencies: pytorch#125333, pytorch#125501, pytorch#125334, pytorch#125335
Copy link

pytorch-bot bot commented May 17, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126567

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 67f0410 with merge base 86a2d67 (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added module: distributed_checkpoint oncall: distributed Add this issue/PR to distributed oncall triage queue labels May 17, 2024
@yf225 yf225 requested review from fegin and LucasLLC May 20, 2024 17:32
Copy link
Contributor

@fegin fegin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for submitting the cherry pick.

@atalman
Copy link
Contributor

atalman commented May 24, 2024

@pytorchbot rebase -b release/2.3

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/release/2.3. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/release/2.3 pull/126567/head returned non-zero exit code 1

Rebasing (1/1)
Auto-merging test/distributed/checkpoint/test_state_dict.py
CONFLICT (content): Merge conflict in test/distributed/checkpoint/test_state_dict.py
Auto-merging torch/distributed/checkpoint/state_dict.py
error: could not apply 90619a2a174... [DSD] Correctly handle _extra_state (#125336)
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Could not apply 90619a2a174... [DSD] Correctly handle _extra_state (#125336)

Raised by https://github.com/pytorch/pytorch/actions/runs/9225766192

@atalman
Copy link
Contributor

atalman commented May 24, 2024

@pytorchbot rebase -b release/2.3

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/release/2.3. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/release/2.3 pull/126567/head returned non-zero exit code 1

Rebasing (1/1)
Auto-merging test/distributed/checkpoint/test_state_dict.py
CONFLICT (content): Merge conflict in test/distributed/checkpoint/test_state_dict.py
Auto-merging torch/distributed/checkpoint/state_dict.py
error: could not apply 90619a2a174... [DSD] Correctly handle _extra_state (#125336)
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Could not apply 90619a2a174... [DSD] Correctly handle _extra_state (#125336)

Raised by https://github.com/pytorch/pytorch/actions/runs/9225866547

@atalman atalman merged commit 00804a7 into pytorch:release/2.3 May 27, 2024
96 of 97 checks passed
@PaliC
Copy link
Contributor

PaliC commented May 31, 2024

Validated this works for cpu on the 2.3 release branch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module: distributed_checkpoint oncall: distributed Add this issue/PR to distributed oncall triage queue open source
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants