Error when saving a checkpoint after training finishes in multi-node training with the nccl backend #32

13416157913 · 2023-09-22T11:17:30Z

Traceback (most recent call last):
  File "/xxx/Megatron-LLaMA/pretrain_llama.py", line 119, in
    pretrain(train_valid_test_datasets_provider, model_provider,
  File "/xxx/Megatron-LLaMA/megatron/training.py", line 153, in pretrain
    iteration = train(forward_step_func,
  File "/xxx/Megatron-LLaMA/megatron/training.py", line 759, in train
    save_checkpoint_and_time(iteration, model, optimizer,
  File "/xxx/Megatron-LLaMA/megatron/training.py", line 679, in save_checkpoint_and_time
    save_checkpoint(iteration, model, optimizer, opt_param_scheduler)
  File "/xxx/Megatron-LLaMA/megatron/checkpointing.py", line 373, in save_checkpoint
    optimizer.save_parameter_state(
  File "/xxx/Megatron-LLaMA/megatron/optimizer/overlapped_dist_optimizer.py", line 1000, in save_parameter_state
    torch.distributed.gather(
  File "/xxx/anaconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2540, in gather
    work = group.gather(output_tensors, input_tensors, opts)
RuntimeError: Tensors must be CUDA and dense

13416157913 · 2023-09-25T03:04:22Z

This issue has been resolved!
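
For readers who hit the same traceback: the thread does not record what the fix was, so the following is only a sketch of one common workaround, not the change that was actually made. PyTorch's NCCL process group raises "Tensors must be CUDA and dense" when a collective is handed a tensor that is not on a GPU. The sketch assumes that this is what happens in `save_parameter_state`, i.e. that a CPU buffer is passed to `torch.distributed.gather` while the group uses nccl. The helper names `staging_device` and `gather_on_rank0` are hypothetical and do not exist in Megatron-LLaMA, and the sketch uses the default process group rather than Megatron's data-parallel group.

```python
def staging_device(backend: str, local_rank: int) -> str:
    """Return the device a tensor should live on before a collective call.

    NCCL collectives only accept CUDA tensors; Gloo also accepts CPU tensors.
    """
    return f"cuda:{local_rank}" if backend == "nccl" else "cpu"


def gather_on_rank0(send_buf, local_rank: int):
    """Hypothetical helper: gather one tensor per rank onto global rank 0.

    Uses the default process group, so `dst=0` is global rank 0.
    Returns a list of CPU tensors on rank 0 and None on every other rank.
    """
    # Imported lazily so staging_device() stays usable without torch installed.
    import torch
    import torch.distributed as dist

    # Assumption: the backend string is "nccl" or "gloo".
    backend = str(dist.get_backend())
    device = torch.device(staging_device(backend, local_rank))

    # Move the buffer to the device the backend requires before the collective.
    send_buf = send_buf.to(device).contiguous()

    if dist.get_rank() == 0:
        # Only the destination rank supplies receive buffers.
        recv_bufs = [torch.empty_like(send_buf) for _ in range(dist.get_world_size())]
    else:
        recv_bufs = None

    dist.gather(send_buf, recv_bufs, dst=0)

    # Copy back to host memory so the result can be written with torch.save.
    return [t.cpu() for t in recv_bufs] if recv_bufs is not None else None
```

Staging the gather through GPU memory costs extra device memory on rank 0. Another option, under the same assumption, is to run the gather over a separate Gloo process group so the buffers can stay on the CPU.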

li-yi-dong assigned li-yi-dong and thuhujin and unassigned li-yi-dong Sep 25, 2023

13416157913 mentioned this issue Oct 7, 2023

solve the RuntimeError: Tensors must be CUDA and dense #33

Open
