
CUDA out of memory for fsdp training #3494

Open
v-yunbin opened this issue Apr 28, 2024 · 1 comment
Labels: pending (This problem is yet to be addressed.)

Comments


v-yunbin commented Apr 28, 2024

Running the following training script on a machine with 4x V100 (32 GB each):

CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
    --config_file examples/accelerate/fsdp_config.yaml \
    src/train_bash.py \
    --stage sft  \
    --do_train \
    --model_name_or_path ../Chinese-LLM-Chat/models/Meta-Llama-3-70B-Instruct \
    --dataset sjcy_sft_zh,general_intension_sft_zh,in3_interaction_zh,cot_zh,sharegpt4_local,comparison_gpt4_zh  \
    --template llama3 \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir finetuned_models/intention-llama3-70b  \
    --cutoff_len 32768 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 5000 \
    --eval_steps 5000 \
    --learning_rate 5e-5 \
    --num_train_epochs 6.0 \
    --plot_loss \
    --ddp_timeout 1800000 \
    --val_size 0.001 \
    --quantization_bit 4 \
    --shift_attn \
    --rope_scaling linear \
    --fp16
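
(A rough back-of-the-envelope check, not from the original report: 70B parameters at 4 bits come to about 33 GiB of weights alone, so unless the quantized weights are actually sharded across the four GPUs by FSDP, a 32 GB V100 is nearly full before any activations for cutoff_len 32768 are allocated.)

    python -c "print(f'{70e9 * 0.5 / 2**30:.1f} GiB for 4-bit weights alone')"   # prints 32.6 GiB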

error:

  File "/data/disk2/ybZhang/LLaMA-Factory/src/llmtuner/train/tuner.py", line 33, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/data/disk2/ybZhang/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 71, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/home/ybZhang/miniconda3/envs/glm-f/lib/python3.10/site-packages/transformers/trainer.py", line 1780, in train
    return inner_training_loop(
  File "/home/ybZhang/miniconda3/envs/glm-f/lib/python3.10/site-packages/transformers/trainer.py", line 2118, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ybZhang/miniconda3/envs/glm-f/lib/python3.10/site-packages/transformers/trainer.py", line 3045, in training_step
    self.accelerator.backward(loss)
  File "/home/ybZhang/miniconda3/envs/glm-f/lib/python3.10/site-packages/accelerate/accelerator.py", line 2011, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "/home/ybZhang/miniconda3/envs/glm-f/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/home/ybZhang/miniconda3/envs/glm-f/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/ybZhang/miniconda3/envs/glm-f/lib/python3.10/site-packages/torch/autograd/function.py", line 289, in apply
    return user_fn(self, *args)
  File "/home/ybZhang/miniconda3/envs/glm-f/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 319, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/home/ybZhang/miniconda3/envs/glm-f/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.54 GiB. GPU 1 has a total capacity of 31.74 GiB of which 896.38 MiB is free. Including non-PyTorch memory, this process has 30.82 GiB memory in use. Of the allocated memory 25.37 GiB is allocated by PyTorch, and 4.95 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
[2024-04-29 10:38:36,816] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 23805 closing signal SIGTERM
[2024-04-29 10:38:36,821] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 23807 closing signal SIGTERM
[2024-04-29 10:38:36,822] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 23808 closing signal SIGTERM
[2024-04-29 10:38:47,449] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 23806) of binary: /home/ybZhang/miniconda3/envs/glm-f/bin/python
Traceback (most recent call last):
  File "/home/ybZhang/miniconda3/envs/glm-f/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/ybZhang/miniconda3/envs/glm-f/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/home/ybZhang/miniconda3/envs/glm-f/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1062, in launch_command
    multi_gpu_launcher(args)
  File "/home/ybZhang/miniconda3/envs/glm-f/lib/python3.10/site-packages/accelerate/commands/launch.py", line 711, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/ybZhang/miniconda3/envs/glm-f/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/ybZhang/miniconda3/envs/glm-f/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ybZhang/miniconda3/envs/glm-f/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
src/train_bash.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-29_10:38:36
  host      : master
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 23806)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
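
(Not from the original report: the allocator message above itself suggests trying PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A minimal sketch of applying it to the same launch command; this only mitigates fragmentation and does not lower the actual memory requirement:)

    PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
    CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
        --config_file examples/accelerate/fsdp_config.yaml \
        src/train_bash.py \
        ...   # remaining arguments unchanged from above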
@hiyouga added the invalid label ("This doesn't seem right") on Apr 28, 2024
@v-yunbin reopened this on Apr 29, 2024
@hiyouga closed this as completed on Apr 29, 2024
@hiyouga reopened this on Apr 29, 2024
@hiyouga added the pending label ("This problem is yet to be addressed.") and removed the invalid label on Apr 29, 2024
v-yunbin (Author) commented Apr 29, 2024

Training works with the arguments below (rope_scaling removed):

--cutoff_len 4096   
--shift_attn 

It seems that adding rope_scaling did not take effect; it did not actually do the linear interpolation.
There is also a warning during training; does it affect anything?
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
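
(Not from the original thread, just a hedged way to check the premise: the base checkpoint's native context length, and any rope_scaling it already ships, can be printed with a one-liner using the model path from the launch command; --rope_scaling linear is supposed to stretch that native max_position_embeddings. The sparse_attn line looks like DeepSpeed's op-compatibility report and is probably unrelated to shift_attn, though that is an inference.)

    python -c "from transformers import AutoConfig; c = AutoConfig.from_pretrained('../Chinese-LLM-Chat/models/Meta-Llama-3-70B-Instruct'); print(c.max_position_embeddings, getattr(c, 'rope_scaling', None))"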
