torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1704987288773/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, internal error - please report this issue to the NCCL developers, NCCL version 2.19.3 ncclInternalError: Internal check failed. #3405
Labels: wontfix (This will not be worked on)
Reminder
Reproduction
```shell
deepspeed \
    --include localhost:0 \
    --master_port=9910 src/train_bash.py \
    --deepspeed ./ds_config.json \
    --stage sft \
    --model_name_or_path /disk1/models/Qwen/Qwen1.5-14B-Chat/ \
    --do_train \
    --dataset dag_sample_shuxue_v1 \
    --template qwen \
    --finetuning_type lora \
    --lora_rank 32 \
    --lora_target q_proj,k_proj,v_proj,o_proj \
    --output_dir /output_dir \
    --adapter_name_or_path /checkpoint-800 \
    --val_size 0.05 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 1 \
    --optim paged_adamw_32bit \
    --lr_scheduler_type cosine \
    --logging_steps 5 \
    --save_steps 100 \
    --eval_steps 100 \
    --warmup_steps 100 \
    --learning_rate 2e-5 \
    --max_steps 1500 \
    --max_grad_norm 0.5 \
    --num_train_epochs 2.0 \
    --seed 7321 \
    --overwrite_output_dir True \
    --quantization_bit 4 \
    --evaluation_strategy steps \
    --plot_loss \
    --fp16
```
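Not part of the original report, but a common first debugging step for `ncclInternalError`: re-run the command with NCCL's own logging enabled so the rank logs show which initialization or transport step fails. `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are standard NCCL environment variables; a sketch:

```shell
# Hypothetical debugging step (not from the original report): make NCCL
# print detailed diagnostics before the internal check fails.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,GRAPH
# Then re-run the deepspeed command above; look for NCCL WARN lines in
# the per-rank output to locate the failing step.
echo "NCCL_DEBUG=$NCCL_DEBUG NCCL_DEBUG_SUBSYS=$NCCL_DEBUG_SUBSYS"
```

The extra output usually narrows the failure down to topology detection, shared-memory setup, or the network transport, which is more actionable than the generic internal-error message.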
Environment:
- pytorch 2.2.0
- pytorch-cuda 12.1
- nccl 2.19.3
Expected behavior
No response
System Info
No response
Others
No response