
Issue with 70B instruct #157

Closed
AdamMiltonBarker opened this issue Apr 26, 2024 · 5 comments
Labels
needs-more-information: Issue is not fully clear to be acted upon

Comments

@AdamMiltonBarker

AdamMiltonBarker commented Apr 26, 2024

Machine: Standard NC96ads A100 v4 (96 vCPUs, 880 GiB memory)

W0426 23:50:13.559000 140316710531456 torch/distributed/run.py:757]
W0426 23:50:13.559000 140316710531456 torch/distributed/run.py:757] *****************************************
W0426 23:50:13.559000 140316710531456 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0426 23:50:13.559000 140316710531456 torch/distributed/run.py:757] *****************************************
> initializing model parallel with size 8
> initializing ddp with size 1
> initializing pipeline with size 1
[rank6]: Traceback (most recent call last):
[rank6]:   File "/home/tmsisa/llama3/cognitech.py", line 83, in <module>
[rank6]:     fire.Fire(main)
[rank6]:   File "/home/tmsisa/.conda/envs/llama_3_pytorch_env/lib/python3.12/site-packages/fire/core.py", line 143, in Fire
[rank6]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank6]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/home/tmsisa/.conda/envs/llama_3_pytorch_env/lib/python3.12/site-packages/fire/core.py", line 477, in _Fire
[rank6]:     component, remaining_args = _CallAndUpdateTrace(
[rank6]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/home/tmsisa/.conda/envs/llama_3_pytorch_env/lib/python3.12/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank6]:     component = fn(*varargs, **kwargs)
[rank6]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/home/tmsisa/llama3/cognitech.py", line 58, in main
[rank6]:     generator = Llama.build(
[rank6]:                 ^^^^^^^^^^^^
[rank6]:   File "/home/tmsisa/llama3/llama/generation.py", line 75, in build
[rank6]:     torch.cuda.set_device(local_rank)
[rank6]:   File "/home/tmsisa/.conda/envs/llama_3_pytorch_env/lib/python3.12/site-packages/torch/cuda/__init__.py", line 399, in set_device
[rank6]:     torch._C._cuda_setDevice(device)
[rank6]: RuntimeError: CUDA error: invalid device ordinal
[rank6]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank6]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank6]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[ranks 4, 5, and 7 print the identical traceback, each ending in the same RuntimeError: CUDA error: invalid device ordinal]

W0426 23:50:18.567000 140316710531456 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 126717 closing signal SIGTERM
W0426 23:50:18.568000 140316710531456 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 126718 closing signal SIGTERM
W0426 23:50:18.568000 140316710531456 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 126719 closing signal SIGTERM
W0426 23:50:18.568000 140316710531456 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 126720 closing signal SIGTERM
E0426 23:50:18.796000 140316710531456 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 4 (pid: 126721) of binary: /home/tmsisa/.conda/envs/llama_3_pytorch_env/bin/python
Traceback (most recent call last):
  File "/home/tmsisa/.conda/envs/llama_3_pytorch_env/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.3.0', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tmsisa/.conda/envs/llama_3_pytorch_env/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/tmsisa/.conda/envs/llama_3_pytorch_env/lib/python3.12/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/tmsisa/.conda/envs/llama_3_pytorch_env/lib/python3.12/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/tmsisa/.conda/envs/llama_3_pytorch_env/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tmsisa/.conda/envs/llama_3_pytorch_env/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
cognitech.py FAILED
------------------------------------------------------------
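
For context, the failing call is torch.cuda.set_device(local_rank), and "invalid device ordinal" means that local rank has no matching GPU on the node, i.e. torchrun started more processes per node than there are visible CUDA devices (only ranks 4-7 fail above). A quick way to check what the processes can actually see (a minimal sketch; run it in the same conda environment used for torchrun):

# Compare the number of visible GPUs against torchrun's --nproc_per_node.
# "invalid device ordinal" is raised when torch.cuda.set_device(i) is
# called with i >= torch.cuda.device_count() in that process.
import torch

n = torch.cuda.device_count()
print(f"Visible CUDA devices: {n}")
for i in range(n):
    print(f"  cuda:{i} -> {torch.cuda.get_device_name(i)}")
# The 70B checkpoints in this repo are sharded 8 ways (MP=8), so the
# example commands launch torchrun with --nproc_per_node 8.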

@edzq

edzq commented Apr 27, 2024

I have the same issue.

@subramen added the needs-more-information label on Apr 30, 2024
@Itime-ren

I have the same issue

@nightsSeeker

nightsSeeker commented May 11, 2024

@subramen Same issue. What more information would you need? The current state for me is:

  1. I have PyTorch and CUDA 12.1 installed
  2. the 70B-Instruct model is downloaded to the correct directory, as in the example repo
  3. I run the torchrun command as specified

and I get the above error. I also changed the backend from nccl to gloo to account for the warnings that were appearing; maybe that has something to do with it?
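
If it helps, here is one way to gather the environment details that usually matter for this error (a minimal sketch; run it inside the same environment used for torchrun):

# Print a summary of the local setup: PyTorch build, CUDA/driver
# versions, and the GPUs visible to this interpreter.
from torch.utils import collect_env

collect_env.main()  # the printed report can be pasted into the issue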

@subramen
Contributor

How many GPUs are you using? The 70B model needs 8 GPUs to run from this repo. If you have fewer than 8 GPUs, please use the model from HF.
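
For the HF route, a minimal sketch (the model ID, dtype, and generation settings below are illustrative; it assumes transformers and accelerate are installed, access to the gated meta-llama repo has been granted, and there is enough GPU/CPU memory to hold the weights):

# Load Llama 3 70B Instruct through Hugging Face instead of this repo's
# 8-way model-parallel checkpoints; device_map="auto" shards the layers
# across whatever GPUs are visible.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Hello, who are you?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))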

@Lynn1

Lynn1 commented May 14, 2024

My solution, for reference:
https://github.com/Lynn1/llama3-stream
