Traceback (most recent call last):
  File "/llama-recipes/examples/finetuning.py", line 8, in <module>
    fire.Fire(main)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/llama-recipes/src/llama_recipes/finetuning.py", line 154, in main
    model = FSDP(
TypeError: FullyShardedDataParallel.__init__() got an unexpected keyword argument 'device_mesh'
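The failure mode can be reproduced in miniature without torch: the caller passes a device_mesh keyword that the installed FSDP constructor does not declare. A defensive sketch (hypothetical helper, not part of llama-recipes) would inspect the constructor's signature and only forward the kwarg when it exists:

```python
import inspect

# Hypothetical guard: forward device_mesh to an FSDP-like class only when its
# __init__ actually accepts that keyword (it first appears in PyTorch 2.2).
def build_fsdp_kwargs(fsdp_cls, device_mesh):
    params = inspect.signature(fsdp_cls.__init__).parameters
    kwargs = {}
    if "device_mesh" in params:
        kwargs["device_mesh"] = device_mesh
    return kwargs

# Stand-in mimicking a pre-2.2 FSDP signature (no device_mesh kwarg).
class OldFSDP:
    def __init__(self, module, sharding_strategy=None):
        self.module = module

print(build_fsdp_kwargs(OldFSDP, object()))  # -> {}
```

Passing the resulting kwargs dict instead of a hard-coded `device_mesh=` argument would avoid the TypeError on older torch versions, at the cost of silently dropping the mesh.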
Other notes
Note that running python -c "import torch; print(torch.__version__)" yields 2.1.2+cu118. Furthermore, the output of pip install --extra-index-url https://download.pytorch.org/whl/test/cu118 -e llama-recipes[tests,auditnlg,vllm] involves uninstalling the latest PyTorch version (2.2.1) from the base image and installing an older version.
My understanding from the relevant PyTorch release notes is that the device_mesh abstraction (which is the cause of the original error above) is introduced into torch.distributed only in PyTorch 2.2. However, the requirements.txt here in llama-recipes only specifies torch>=2.0.1.
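The mismatch described above (requirements.txt permits torch>=2.0.1 while the code needs 2.2) could be caught at startup with a runtime check. A minimal sketch, assuming torch version strings of the form "X.Y.Z+local" (the helper names here are illustrative, not from llama-recipes):

```python
# Sketch: fail fast with a clear message when the installed torch is too old
# for the device_mesh keyword, instead of crashing deep inside FSDP.__init__.
def version_tuple(v):
    core = v.split("+")[0]                    # drop local tags like "+cu118"
    return tuple(int(p) for p in core.split(".")[:2])

def check_torch(installed, required="2.2"):
    if version_tuple(installed) < version_tuple(required):
        raise RuntimeError(
            f"torch {installed} lacks device_mesh support; need >= {required}"
        )

check_torch("2.2.1")          # OK, no exception
# check_torch("2.1.2+cu118")  # would raise RuntimeError
```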
Unfortunately, simply changing the requirement to torch>=2.2 results in an error when installing llama-recipes:
  Downloading vllm-0.1.3.tar.gz (102 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 102.7/102.7 kB 35.5 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error

  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [15 lines of output]
      Traceback (most recent call last):
        File "/opt/conda/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
          main()
        File "/opt/conda/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "/opt/conda/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
          return hook(config_settings)
        File "/tmp/pip-build-env-h3jj4ttq/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 325, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=['wheel'])
        File "/tmp/pip-build-env-h3jj4ttq/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 295, in _get_build_requires
          self.run_setup()
        File "/tmp/pip-build-env-h3jj4ttq/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 311, in run_setup
          exec(code, locals())
        File "<string>", line 24, in <module>
      RuntimeError: Cannot find CUDA_HOME. CUDA must be available in order to build the package.
This error does not occur if the only change I make is to revert 2.2 to 2.0.1 in the requirements.txt file.
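For context, CUDA_HOME detection in source builds like vllm's typically works along these lines (an assumption about the general pattern, not vllm's exact setup.py logic; the function name is hypothetical): check the CUDA_HOME/CUDA_PATH environment variables, fall back to a conventional install path, and raise otherwise.

```python
import os

# Sketch of a typical CUDA_HOME lookup (assumption: real build scripts differ
# in detail, e.g. also probing the nvcc binary on PATH).
def find_cuda_home(env=None, default="/usr/local/cuda"):
    env = os.environ if env is None else env
    cuda_home = env.get("CUDA_HOME") or env.get("CUDA_PATH")
    if cuda_home:
        return cuda_home
    if os.path.isdir(default):
        return default
    raise RuntimeError(
        "Cannot find CUDA_HOME. CUDA must be available in order to build the package."
    )

print(find_cuda_home(env={"CUDA_HOME": "/opt/cuda"}))  # -> /opt/cuda
```

This is why the build fails inside the pytorch/pytorch image when the CUDA toolkit (nvcc and headers) is not installed, even though the CUDA runtime shipped with the torch wheel is present.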
devon-research changed the title from "Fine-tuning fails after default installation from source" to "Fine-tuning fails after installation from source" on Mar 13, 2024.
@devon-research Can you please install from src? It works on my end. BTW, we did some refactoring recently, so it would be great to pull the latest first. We are planning a release soon. The HSDP/device_mesh support was added recently and is not present in the binaries yet.
System Info
I use the base Docker image pytorch/pytorch. I then run pip install --extra-index-url https://download.pytorch.org/whl/test/cu118 -e llama-recipes[tests,auditnlg,vllm]
Workaround
A workaround is to simply run

after installing llama-recipes.