torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1704987288773/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, internal error - please report this issue to the NCCL developers, NCCL version 2.19.3 ncclInternalError: Internal check failed. #3405
Labels: wontfix (This will not be worked on)
Reminder
Reproduction
```shell
deepspeed \
    --include localhost:0 \
    --master_port=9910 src/train_bash.py \
    --deepspeed ./ds_config.json \
    --stage sft \
    --model_name_or_path /disk1/models/Qwen/Qwen1.5-14B-Chat/ \
    --do_train \
    --dataset dag_sample_shuxue_v1 \
    --template qwen \
    --finetuning_type lora \
    --lora_rank 32 \
    --lora_target q_proj,k_proj,v_proj,o_proj \
    --output_dir /output_dir \
    --adapter_name_or_path /checkpoint-800 \
    --val_size 0.05 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 1 \
    --optim paged_adamw_32bit \
    --lr_scheduler_type cosine \
    --logging_steps 5 \
    --save_steps 100 \
    --eval_steps 100 \
    --warmup_steps 100 \
    --learning_rate 2e-5 \
    --max_steps 1500 \
    --max_grad_norm 0.5 \
    --num_train_epochs 2.0 \
    --seed 7321 \
    --overwrite_output_dir True \
    --quantization_bit 4 \
    --evaluation_strategy steps \
    --plot_loss \
    --fp16
```
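Not part of the original report, but a common first debugging step for `ncclInternalError`: re-run the command with NCCL's own logging enabled so the rank logs show which initialization or transport step fails. `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are standard NCCL environment variables; a sketch:

```shell
# Hypothetical debugging step (not from the original report): make NCCL
# print detailed diagnostics before the internal check fails.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,GRAPH
# Then re-run the deepspeed command above; look for NCCL WARN lines in
# the per-rank output to locate the failing step.
echo "NCCL_DEBUG=$NCCL_DEBUG NCCL_DEBUG_SUBSYS=$NCCL_DEBUG_SUBSYS"
```

The extra output usually narrows the failure down to topology detection, shared-memory setup, or the network transport, which is more actionable than the generic internal-error message.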
Environment:
- pytorch 2.2.0
- pytorch-cuda 12.1
- nccl 2.19.3
Expected behavior
No response
System Info
No response
Others
No response