Fine-tuning fails after installation from source #393

Open · devon-research opened this issue Mar 13, 2024 · 1 comment

devon-research commented Mar 13, 2024

System Info

I use the base Docker image pytorch/pytorch. I then run

pip install --upgrade pip setuptools
pip install --extra-index-url https://download.pytorch.org/whl/test/cu118 -e llama-recipes[tests,auditnlg,vllm]

Information

  • The official example scripts
  • My own modified scripts

Code to reproduce the bug

torchrun \
    --nnodes 1 \
    --nproc_per_node 4 \
    llama-recipes/examples/finetuning.py \
        --enable_fsdp \
        --model_name meta-llama/Llama-2-7b-hf \
        --dist_checkpoint_root_folder model_checkpoints \
        --dist_checkpoint_folder fine-tuned \
        --pure_bf16 \
        --use_fast_kernels

Error logs

Traceback (most recent call last):
  File "/llama-recipes/examples/finetuning.py", line 8, in <module>
    fire.Fire(main)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/llama-recipes/src/llama_recipes/finetuning.py", line 154, in main
    model = FSDP(
TypeError: FullyShardedDataParallel.__init__() got an unexpected keyword argument 'device_mesh'

Other notes

Note that running python -c "import torch; print(torch.__version__)" yields 2.1.2+cu118. Furthermore, the output of pip install --extra-index-url https://download.pytorch.org/whl/test/cu118 -e llama-recipes[tests,auditnlg,vllm] shows pip uninstalling the newer PyTorch (2.2.1) that ships with the base image and installing an older version in its place.

My understanding from the relevant PyTorch release notes is that the device_mesh abstraction (the cause of the original error above) was introduced into torch.distributed only in PyTorch 2.2. However, the requirements.txt in llama-recipes only specifies torch>=2.0.1.
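For context, here is a minimal sketch of the API mismatch, assuming a single-node run over 4 GPUs launched via torchrun (the mesh shape and the tiny model are illustrative, not taken from finetuning.py):

import torch
from torch.distributed.device_mesh import init_device_mesh  # public as of torch 2.2
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Build a 1-D mesh over the 4 local GPUs; under torchrun this also
# initializes the default process group from the standard env vars.
mesh = init_device_mesh("cuda", (4,))

model = torch.nn.Linear(8, 8).cuda()

# torch 2.1.x has no device_mesh parameter on FSDP.__init__, so this line
# raises the TypeError shown in the logs above; on torch 2.2+ it succeeds.
sharded = FSDP(model, device_mesh=mesh)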

Unfortunately, simply changing the requirement to torch>=2.2 results in an error when installing llama-recipes:

Downloading vllm-0.1.3.tar.gz (102 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 102.7/102.7 kB 35.5 MB/s eta 0:00:00
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [15 lines of output]
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
        main()
      File "/opt/conda/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
        json_out['return_val'] = hook(**hook_input['kwargs'])
      File "/opt/conda/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
        return hook(config_settings)
      File "/tmp/pip-build-env-h3jj4ttq/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 325, in get_requires_for_build_wheel
        return self._get_build_requires(config_settings, requirements=['wheel'])
      File "/tmp/pip-build-env-h3jj4ttq/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 295, in _get_build_requires
        self.run_setup()
      File "/tmp/pip-build-env-h3jj4ttq/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 311, in run_setup
        exec(code, locals())
      File "<string>", line 24, in <module>
    RuntimeError: Cannot find CUDA_HOME. CUDA must be available in order to build the package.

This error does not occur if the only change I make is to revert 2.2 to 2.0.1 in the requirements.txt file.
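The failure seems to come from vllm 0.1.3 being built from source, which requires the CUDA toolkit at build time. A rough sketch of the kind of guard its setup.py appears to run (reconstructed from the error message, not the actual vllm code):

import os

# The source build needs nvcc, located via the CUDA_HOME environment
# variable, which the pytorch/pytorch base image does not necessarily set.
cuda_home = os.environ.get("CUDA_HOME")
if cuda_home is None or not os.path.isdir(cuda_home):
    raise RuntimeError(
        "Cannot find CUDA_HOME. CUDA must be available in order to build the package."
    )
print(f"Using CUDA toolkit at {cuda_home}")

If the toolkit is present in the image, pointing CUDA_HOME at it (commonly /usr/local/cuda) before installing might avoid this, though I have not verified that.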

Workaround

A workaround is to simply run

pip install --index-url https://download.pytorch.org/whl/cu118 torch==2.2.1

after installing llama-recipes.
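After that, a quick sanity check (my own, not from the repo) confirms that the installed torch exposes the API finetuning.py needs:

import torch
print(torch.__version__)  # expect 2.2.1+cu118 after the workaround

# Public as of torch 2.2; on the 2.1.x wheel that the llama-recipes
# install pulls in, this import should fail.
from torch.distributed.device_mesh import init_device_mesh
print("device_mesh API available")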

HamidShojanazeri (Contributor) commented

@devon-research can you please install from source? It works on my end. BTW, we did some refactoring recently, so it would be great to pull the latest first. We are planning a release soon. The HSDP/device_mesh support was added recently and is not present in the binaries yet.
