torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1704987288773/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, internal error - please report this issue to the NCCL developers, NCCL version 2.19.3 ncclInternalError: Internal check failed. #3405

Closed
1 task done
lostsollar opened this issue Apr 24, 2024 · 1 comment
Labels
wontfix This will not be worked on

Comments

@lostsollar

Reminder

  • I have read the README and searched the existing issues.

Reproduction

deepspeed \
    --include localhost:0 \
    --master_port=9910 src/train_bash.py \
    --deepspeed ./ds_config.json \
    --stage sft \
    --model_name_or_path /disk1/models/Qwen/Qwen1.5-14B-Chat/ \
    --do_train \
    --dataset dag_sample_shuxue_v1 \
    --template qwen \
    --finetuning_type lora \
    --lora_rank 32 \
    --lora_target q_proj,k_proj,v_proj,o_proj \
    --output_dir /output_dir \
    --adapter_name_or_path /checkpoint-800 \
    --val_size 0.05 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 1 \
    --optim paged_adamw_32bit \
    --lr_scheduler_type cosine \
    --logging_steps 5 \
    --save_steps 100 \
    --eval_steps 100 \
    --warmup_steps 100 \
    --learning_rate 2e-5 \
    --max_steps 1500 \
    --max_grad_norm 0.5 \
    --num_train_epochs 2.0 \
    --seed 7321 \
    --overwrite_output_dir True \
    --quantization_bit 4 \
    --evaluation_strategy steps \
    --plot_loss \
    --fp16


Environment:
pytorch 2.2.0
pytorch-cuda 12.1
nccl 2.19.3
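
The versions listed above can be gathered with a short script like the one below. This is only a sketch and was not part of the original report: the helper name `collect_env` is made up for illustration, and it assumes PyTorch is importable in the same environment used for training. `NCCL_DEBUG` is the standard NCCL environment variable for verbose logging; setting it to `INFO` before launching usually produces a more specific message than "Internal check failed".

```python
import importlib
import os


def collect_env():
    """Return version details that are useful in an NCCL bug report.

    Hypothetical helper for illustration; not part of LLaMA-Factory or DeepSpeed.
    """
    # NCCL_DEBUG=INFO makes NCCL log details about the failing call.
    info = {"NCCL_DEBUG": os.environ.get("NCCL_DEBUG", "unset")}

    try:
        torch = importlib.import_module("torch")
    except ImportError:
        info["pytorch"] = "not installed"
        return info

    info["pytorch"] = torch.__version__
    info["cuda"] = str(torch.version.cuda)  # None on CPU-only builds

    try:
        nccl = importlib.import_module("torch.cuda.nccl")
        version = nccl.version()
        # Recent PyTorch returns a tuple such as (2, 19, 3); older builds an int.
        if isinstance(version, tuple):
            info["nccl"] = ".".join(str(part) for part in version)
        else:
            info["nccl"] = str(version)
    except Exception as exc:  # e.g. a build compiled without NCCL support
        info["nccl"] = f"unavailable ({type(exc).__name__})"

    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

Pasting this output under "System Info", together with a log captured with `NCCL_DEBUG=INFO` set, may make the failure easier for maintainers to reproduce.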

Expected behavior

No response

System Info

No response

Others

No response

@hiyouga added the "pending" label (This problem is yet to be addressed.) on Apr 24, 2024
@belle9217

I'm encountering the same problem! How did you resolve it?

@hiyouga added the "wontfix" label (This will not be worked on) and removed the "pending" label on May 15, 2024
@hiyouga closed this as not planned on May 15, 2024