Save and load checkpoints #29

Open
xrsrke opened this issue Nov 8, 2023 · 0 comments
Labels: good first issue (Good for newcomers), help wanted (Extra attention is needed)

xrsrke commented Nov 8, 2023

Notes

  • Suppose we previously trained a model and saved a checkpoint using a configuration with tensor_parallel_size=2 and pipeline_parallel_size=4. Now we want to load this checkpoint and continue training with a new configuration that has tensor_parallel_size=4 and pipeline_parallel_size=3, so the saved partitions no longer map one-to-one onto the new ranks.

  • With merge_checkpoints=True, the per-rank partitions are merged into a single file instead of each rank saving its own partition. The merged file is saved in a format that both an unparallelized model and a parallelized model can load.

  • With save_config=True, the configuration is saved alongside the checkpoint: tensor_parallel_size, pipeline_parallel_size, and the arguments passed to the XParallel and DistributedOptimizer classes, where present.
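
To make the first two notes concrete, here is a minimal sketch of what merging and resharding involve. It is not the project's implementation: plain Python lists stand in for tensors, partitions are assumed to be split by rows, layers are assumed to be divided evenly across pipeline stages, and the helper names (`merge_shards`, `reshard`, `assign_layers`) are hypothetical.

```python
# Illustrative only: plain lists stand in for tensors, and every name
# below is hypothetical rather than part of the project's API.


def merge_shards(shards):
    """Concatenate per-rank row partitions back into one full weight."""
    full = []
    for shard in shards:  # shards are ordered by tensor-parallel rank
        full.extend(shard)
    return full


def reshard(full, tensor_parallel_size):
    """Split a full weight into equal row partitions, one per rank."""
    if len(full) % tensor_parallel_size != 0:
        raise ValueError("rows must divide evenly across ranks")
    rows_per_rank = len(full) // tensor_parallel_size
    return [
        full[rank * rows_per_rank:(rank + 1) * rows_per_rank]
        for rank in range(tensor_parallel_size)
    ]


def assign_layers(num_layers, pipeline_parallel_size):
    """Map each pipeline stage to a contiguous block of layer indices."""
    if num_layers % pipeline_parallel_size != 0:
        raise ValueError("layers must divide evenly across stages")
    per_stage = num_layers // pipeline_parallel_size
    return [
        list(range(stage * per_stage, (stage + 1) * per_stage))
        for stage in range(pipeline_parallel_size)
    ]


# A weight with 8 rows, saved by 2 tensor-parallel ranks (4 rows each).
weight = [[float(row), float(row) + 0.5] for row in range(8)]
saved_shards = reshard(weight, tensor_parallel_size=2)

# Merge into a single weight, then split again for 4 ranks (2 rows each).
merged = merge_shards(saved_shards)
new_shards = reshard(merged, tensor_parallel_size=4)

# A 12-layer model moves from 4 pipeline stages to 3.
old_stages = assign_layers(num_layers=12, pipeline_parallel_size=4)
new_stages = assign_layers(num_layers=12, pipeline_parallel_size=3)
```

The merged weight is the same object an unparallelized model would load directly, which is why a single merged file can serve both kinds of model.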

APIs

import torch

# Save checkpoints of a parallelized model.
# `model` is assumed to be a model that has already been parallelized.
model.save_pretrained(
    save_directory="./checkpoints",
    save_config=True,  # default
    save_function=torch.save,  # default
    merge_checkpoints=True,  # defaults to False
)

# Load checkpoints saved from a parallelized model.
model.from_parallelized(path="./checkpoints")
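
As a rough illustration of what save_config=True could write next to the checkpoint, the sketch below stores the parallel sizes as JSON and reads them back so a loader can tell whether resharding is needed. The `ParallelConfig` class, the helper functions, and the `parallel_config.json` filename are all assumptions made for this example. None of them come from the proposal.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class ParallelConfig:
    # Hypothetical container for the sizes named in the notes.
    tensor_parallel_size: int
    pipeline_parallel_size: int


def save_parallel_config(config, save_directory, filename="parallel_config.json"):
    """Write the config as JSON inside the checkpoint directory."""
    directory = Path(save_directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / filename
    path.write_text(json.dumps(asdict(config), indent=2))
    return path


def load_parallel_config(save_directory, filename="parallel_config.json"):
    """Read the config back from the checkpoint directory."""
    path = Path(save_directory) / filename
    return ParallelConfig(**json.loads(path.read_text()))


# Round trip through a temporary directory standing in for ./checkpoints.
with tempfile.TemporaryDirectory() as tmp:
    saved = ParallelConfig(tensor_parallel_size=2, pipeline_parallel_size=4)
    save_parallel_config(saved, tmp)
    restored = load_parallel_config(tmp)

# The new run uses different sizes, so the partitions must be resharded.
target = ParallelConfig(tensor_parallel_size=4, pipeline_parallel_size=3)
needs_resharding = restored != target
```

Keeping the configuration in a separate, human-readable file means the loader can decide how to reshard before it opens any weight file.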

xrsrke changed the title from "Merge checkpoints" to "Distributed Checkpoint" on Nov 14, 2023
xrsrke assigned xrsrke and unassigned xrsrke on Nov 14, 2023
xrsrke added the "help wanted" and "good first issue" labels on Nov 14, 2023
xrsrke changed the title from "Distributed Checkpoint" to "Save and Load Checkpoints" on Nov 15, 2023
xrsrke changed the title from "Save and Load Checkpoints" to "Save and load checkpoints" on Nov 15, 2023
xrsrke removed the "help wanted" label on Nov 27, 2023
xrsrke added the "help wanted" label on Dec 10, 2023
Projects
Status: In Progress