Error when saving a checkpoint after multi-node training with the nccl backend #32

Open
13416157913 opened this issue Sep 22, 2023 · 1 comment
@13416157913

Traceback (most recent call last):
  File "/xxx/Megatron-LLaMA/pretrain_llama.py", line 119, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider,
  File "/xxx/Megatron-LLaMA/megatron/training.py", line 153, in pretrain
    iteration = train(forward_step_func,
  File "/xxx/Megatron-LLaMA/megatron/training.py", line 759, in train
    save_checkpoint_and_time(iteration, model, optimizer,
  File "/xxx/Megatron-LLaMA/megatron/training.py", line 679, in save_checkpoint_and_time
    save_checkpoint(iteration, model, optimizer, opt_param_scheduler)
  File "/xxx/Megatron-LLaMA/megatron/checkpointing.py", line 373, in save_checkpoint
    optimizer.save_parameter_state(
  File "/xxx/Megatron-LLaMA/megatron/optimizer/overlapped_dist_optimizer.py", line 1000, in save_parameter_state
    torch.distributed.gather(
  File "/xxx/anaconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2540, in gather
    work = group.gather(output_tensors, input_tensors, opts)
RuntimeError: Tensors must be CUDA and dense
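
For readers who hit the same message: the thread does not record how it was fixed, so what follows is only a general sketch, not the Megatron-LLaMA change. The NCCL backend accepts only dense CUDA tensors in collectives, so one common cause of this error is calling `torch.distributed.gather` on CPU tensors through an NCCL process group. The helper names `collective_device` and `gather_to_rank` below are hypothetical, and the sketch assumes every rank contributes a tensor of the same shape and dtype.

```python
def collective_device(backend: str, local_rank: int = 0) -> str:
    """Return the device a tensor must live on for a collective on `backend`.

    NCCL only handles CUDA tensors; other backends (e.g. gloo) are treated
    here as CPU backends.
    """
    return f"cuda:{local_rank}" if backend == "nccl" else "cpu"


def gather_to_rank(tensor, dst: int = 0, group=None):
    """Gather `tensor` from every rank onto global rank `dst` (hypothetical helper).

    Returns a list of CPU tensors on `dst` and None on all other ranks.
    """
    # Imported here so collective_device() stays usable without torch installed.
    import torch
    import torch.distributed as dist

    backend = str(dist.get_backend(group))
    local_rank = torch.cuda.current_device() if backend == "nccl" else 0

    send = tensor.detach()
    if send.is_sparse:
        # NCCL also rejects sparse tensors, so densify first.
        send = send.to_dense()
    # Move to the device the backend expects and make the memory contiguous.
    send = send.to(collective_device(backend, local_rank)).contiguous()

    gather_list = None
    if dist.get_rank() == dst:
        # Only the destination rank supplies receive buffers.
        gather_list = [torch.empty_like(send)
                       for _ in range(dist.get_world_size(group))]

    dist.gather(send, gather_list, dst=dst, group=group)

    if gather_list is None:
        return None
    return [t.cpu() for t in gather_list]
```

An alternative that avoids the device copy is to create a second process group with `torch.distributed.new_group(backend="gloo")` and run the CPU gather on that group; whether that fits depends on how the training script sets up its groups.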

@li-yi-dong assigned li-yi-dong and thuhujin and unassigned li-yi-dong Sep 25, 2023
@13416157913
Author

This issue has been resolved!
