Fine-tuning fails after installation from source #393

Open · devon-research opened this issue Mar 13, 2024 · 1 comment

devon-research commented Mar 13, 2024

System Info

I use the base Docker image pytorch/pytorch. I then run

pip install --upgrade pip setuptools
pip install --extra-index-url https://download.pytorch.org/whl/test/cu118 -e llama-recipes[tests,auditnlg,vllm]

Information

  • The official example scripts
  • My own modified scripts

Code to reproduce the bug

torchrun \
    --nnodes 1 \
    --nproc_per_node 4 \
    llama-recipes/examples/finetuning.py \
        --enable_fsdp \
        --model_name meta-llama/Llama-2-7b-hf \
        --dist_checkpoint_root_folder model_checkpoints \
        --dist_checkpoint_folder fine-tuned \
        --pure_bf16 \
        --use_fast_kernels

Error logs

Traceback (most recent call last):
  File "/llama-recipes/examples/finetuning.py", line 8, in <module>
    fire.Fire(main)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/llama-recipes/src/llama_recipes/finetuning.py", line 154, in main
    model = FSDP(
TypeError: FullyShardedDataParallel.__init__() got an unexpected keyword argument 'device_mesh'

Other notes

Note that running python -c "import torch; print(torch.__version__)" yields 2.1.2+cu118. Furthermore, the output of pip install --extra-index-url https://download.pytorch.org/whl/test/cu118 -e llama-recipes[tests,auditnlg,vllm] shows pip uninstalling the newer PyTorch (2.2.1) that ships with the base image and installing an older version in its place.

My understanding from the relevant PyTorch release notes is that the device_mesh abstraction (the cause of the original error above) was introduced into torch.distributed only in PyTorch 2.2. However, the requirements.txt in llama-recipes only specifies torch>=2.0.1.
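For context, here is a minimal sketch of the API mismatch, assuming a single-node run over 4 GPUs launched via torchrun (the mesh shape and the tiny model are illustrative, not taken from finetuning.py):

import torch
from torch.distributed.device_mesh import init_device_mesh  # public as of torch 2.2
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Build a 1-D mesh over the 4 local GPUs; under torchrun this also
# initializes the default process group from the standard env vars.
mesh = init_device_mesh("cuda", (4,))

model = torch.nn.Linear(8, 8).cuda()

# torch 2.1.x has no device_mesh parameter on FSDP.__init__, so this line
# raises the TypeError shown in the logs above; on torch 2.2+ it succeeds.
sharded = FSDP(model, device_mesh=mesh)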

Unfortunately, simply changing the requirement to torch>=2.2 results in an error when installing llama-recipes:

Downloading vllm-0.1.3.tar.gz (102 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 102.7/102.7 kB 35.5 MB/s eta 0:00:00
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [15 lines of output]
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
        main()
      File "/opt/conda/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
        json_out['return_val'] = hook(**hook_input['kwargs'])
      File "/opt/conda/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
        return hook(config_settings)
      File "/tmp/pip-build-env-h3jj4ttq/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 325, in get_requires_for_build_wheel
        return self._get_build_requires(config_settings, requirements=['wheel'])
      File "/tmp/pip-build-env-h3jj4ttq/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 295, in _get_build_requires
        self.run_setup()
      File "/tmp/pip-build-env-h3jj4ttq/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 311, in run_setup
        exec(code, locals())
      File "<string>", line 24, in <module>
    RuntimeError: Cannot find CUDA_HOME. CUDA must be available in order to build the package.

This error does not occur if the only change I make is to revert 2.2 to 2.0.1 in the requirements.txt file.
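The failure seems to come from vllm 0.1.3 being built from source, which requires the CUDA toolkit at build time. A rough sketch of the kind of guard its setup.py appears to run (reconstructed from the error message, not the actual vllm code):

import os

# The source build needs nvcc, located via the CUDA_HOME environment
# variable, which the pytorch/pytorch base image does not necessarily set.
cuda_home = os.environ.get("CUDA_HOME")
if cuda_home is None or not os.path.isdir(cuda_home):
    raise RuntimeError(
        "Cannot find CUDA_HOME. CUDA must be available in order to build the package."
    )
print(f"Using CUDA toolkit at {cuda_home}")

If the toolkit is present in the image, pointing CUDA_HOME at it (commonly /usr/local/cuda) before installing might avoid this, though I have not verified that.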

Workaround

A workaround is to simply run

pip install --index-url https://download.pytorch.org/whl/cu118 torch==2.2.1

after installing llama-recipes.
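After that, a quick sanity check (my own, not from the repo) confirms that the installed torch exposes the API finetuning.py needs:

import torch
print(torch.__version__)  # expect 2.2.1+cu118 after the workaround

# Public as of torch 2.2; on the 2.1.x wheel that the llama-recipes
# install pulls in, this import should fail.
from torch.distributed.device_mesh import init_device_mesh
print("device_mesh API available")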

HamidShojanazeri (Contributor) commented

@devon-research can you please install from source? It works on my end. BTW, we did some refactoring recently, so it would be great to pull the latest first. We are planning a release soon. The HSDP/device_mesh support was added recently and is not present in the binaries yet.
