How can we solve this connection error? #158

MNS57 · 2024-04-27T00:43:38Z

!torchrun --nproc_per_node 1 example_text_completion.py --ckpt_dir ./Meta-Llama-3-8B/ --tokenizer_path ./Meta-Llama-3-8B/tokenizer.model --max_seq_len 128 --max_batch_size 4
NOTE: Redirects are currently not supported in Windows or MacOs.
[W C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:558] [c10d] The client socket has failed to connect to [neo2]:29500 (system error: 10049 - The requested address context is invalid.).
[W C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:558] [c10d] The client socket has failed to connect to [neo2]:29500 (system error: 10049 - The requested address context is invalid.).
[W C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:558] [c10d] The client socket has failed to connect to [neo2]:29500 (system error: 10049 - The requested address context is invalid.).
[W C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:558] [c10d] The client socket has failed to connect to [neo2]:29500 (system error: 10049 - The requested address context is invalid.).
Traceback (most recent call last):
File "C:\python_home\240405_Graph_Neural_Networks\llama3-main\example_text_completion.py", line 64, in <module>
fire.Fire(main)
File "C:\Users\shimo.conda\envs\Network_NLP\lib\site-packages\fire\core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "C:\Users\shimo.conda\envs\Network_NLP\lib\site-packages\fire\core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "C:\Users\shimo.conda\envs\Network_NLP\lib\site-packages\fire\core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "C:\python_home\240405_Graph_Neural_Networks\llama3-main\example_text_completion.py", line 27, in main
generator = Llama.build(
File "C:\python_home\240405_Graph_Neural_Networks\llama3-main\llama\generation.py", line 68, in build
torch.distributed.init_process_group("nccl")
File "C:\Users\shimo.conda\envs\Network_NLP\lib\site-packages\torch\distributed\distributed_c10d.py", line 602, in init_process_group
default_pg = _new_process_group_helper(
File "C:\Users\shimo.conda\envs\Network_NLP\lib\site-packages\torch\distributed\distributed_c10d.py", line 727, in _new_process_group_helper
raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 12264) of binary: C:\Users\shimo.conda\envs\Network_NLP\python.exe
Traceback (most recent call last):
File "C:\Users\shimo.conda\envs\Network_NLP\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\shimo.conda\envs\Network_NLP\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\shimo.conda\envs\Network_NLP\Scripts\torchrun.exe\__main__.py", line 7, in <module>
File "C:\Users\shimo.conda\envs\Network_NLP\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "C:\Users\shimo.conda\envs\Network_NLP\lib\site-packages\torch\distributed\run.py", line 724, in main
run(args)
File "C:\Users\shimo.conda\envs\Network_NLP\lib\site-packages\torch\distributed\run.py", line 715, in run
elastic_launch(
File "C:\Users\shimo.conda\envs\Network_NLP\lib\site-packages\torch\distributed\launcher\api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "C:\Users\shimo.conda\envs\Network_NLP\lib\site-packages\torch\distributed\launcher\api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: