[W socket.cpp:436] [c10d] The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use).
[W socket.cpp:436] [c10d] The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).
[E socket.cpp:472] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
File "/gpfs/home/ljgroup/touko/Llama-Chinese/train/sft/finetune_clm_lora.py", line 692, in <module>
main()
File "/gpfs/home/ljgroup/touko/Llama-Chinese/train/sft/finetune_clm_lora.py", line 281, in main
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
File "/home/ljgroup/touko/.conda/envs/llama_env/lib/python3.9/site-packages/transformers/hf_argparser.py", line 338, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "<string>", line 123, in __init__
File "/home/ljgroup/touko/.conda/envs/llama_env/lib/python3.9/site-packages/transformers/training_args.py", line 1528, in __post_init__
and (self.device.type != "cuda")
File "/home/ljgroup/touko/.conda/envs/llama_env/lib/python3.9/site-packages/transformers/training_args.py", line 1995, in device
return self._setup_devices
File "/home/ljgroup/touko/.conda/envs/llama_env/lib/python3.9/site-packages/transformers/utils/generic.py", line 56, in __get__
cached = self.fget(obj)
File "/home/ljgroup/touko/.conda/envs/llama_env/lib/python3.9/site-packages/transformers/training_args.py", line 1927, in _setup_devices
self.distributed_state = PartialState(timeout=timedelta(seconds=self.ddp_timeout))
File "/home/ljgroup/touko/.conda/envs/llama_env/lib/python3.9/site-packages/accelerate/state.py", line 190, in __init__
dist.init_distributed(dist_backend=self.backend, auto_mpi_discovery=False, **kwargs)
File "/home/ljgroup/touko/.conda/envs/llama_env/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 670, in init_distributed
cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
File "/home/ljgroup/touko/.conda/envs/llama_env/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 121, in __init__
self.init_process_group(backend, timeout, init_method, rank, world_size)
File "/home/ljgroup/touko/.conda/envs/llama_env/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 149, in init_process_group
torch.distributed.init_process_group(backend,
File "/home/ljgroup/touko/.conda/envs/llama_env/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 74, in wrapper
func_return = func(*args, **kwargs)
File "/home/ljgroup/touko/.conda/envs/llama_env/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1141, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/home/ljgroup/touko/.conda/envs/llama_env/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 241, in _env_rendezvous_handler
store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
File "/home/ljgroup/touko/.conda/envs/llama_env/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 172, in _create_c10d_store
return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).
I also searched finetune_clm_lora.py and could not find anywhere to change this port.
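For anyone hitting the same `errno: 98`, here is a minimal sketch for checking whether port 29500 is really occupied on the node and for picking a free one. The traceback goes through `_env_rendezvous_handler`, which reads `MASTER_ADDR` and `MASTER_PORT` from the environment, so the port is most likely set by the launcher rather than by the training script. The helper names `port_is_free` and `find_free_port` are hypothetical and not part of finetune_clm_lora.py, DeepSpeed, or PyTorch.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically errno 98 (EADDRINUSE) on Linux: something already holds the port.
            return False
        return True


def find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    default_port = 29500  # the port shown in the error message above
    if port_is_free(default_port):
        print(f"Port {default_port} is free right now")
    else:
        # The returned port can be taken by another process before it is used,
        # so treat it as a suggestion only.
        print(f"Port {default_port} is busy, a free alternative: {find_free_port()}")
```

If 29500 is busy, another job or a leftover process from an earlier run on the same node is probably holding it. The suggested port would then need to reach every rank through the launcher, for example with the `--master_port` option of `deepspeed` or `torchrun`. Setting it inside the script is unlikely to help when the launcher has already exported `MASTER_PORT`.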
Run command
I have already changed the DeepSpeed port in the run command, but the error above still occurs.