File "/data/disk2/ybZhang/LLaMA-Factory/src/llmtuner/train/tuner.py", line 33, in run_exp
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/data/disk2/ybZhang/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 71, in run_sft
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/home/ybZhang/miniconda3/envs/glm-f/lib/python3.10/site-packages/transformers/trainer.py", line 1780, in train
return inner_training_loop(
File "/home/ybZhang/miniconda3/envs/glm-f/lib/python3.10/site-packages/transformers/trainer.py", line 2118, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/ybZhang/miniconda3/envs/glm-f/lib/python3.10/site-packages/transformers/trainer.py", line 3045, in training_step
self.accelerator.backward(loss)
File "/home/ybZhang/miniconda3/envs/glm-f/lib/python3.10/site-packages/accelerate/accelerator.py", line 2011, in backward
self.scaler.scale(loss).backward(**kwargs)
File "/home/ybZhang/miniconda3/envs/glm-f/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
torch.autograd.backward(
File "/home/ybZhang/miniconda3/envs/glm-f/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/ybZhang/miniconda3/envs/glm-f/lib/python3.10/site-packages/torch/autograd/function.py", line 289, in apply
return user_fn(self, *args)
File "/home/ybZhang/miniconda3/envs/glm-f/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 319, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/home/ybZhang/miniconda3/envs/glm-f/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.54 GiB. GPU 1 has a total capacity of 31.74 GiB of which 896.38 MiB is free. Including non-PyTorch memory, this process has 30.82 GiB memory in use. Of the allocated memory 25.37 GiB is allocated by PyTorch, and 4.95 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
[2024-04-29 10:38:36,816] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 23805 closing signal SIGTERM
[2024-04-29 10:38:36,821] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 23807 closing signal SIGTERM
[2024-04-29 10:38:36,822] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 23808 closing signal SIGTERM
[2024-04-29 10:38:47,449] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 23806) of binary: /home/ybZhang/miniconda3/envs/glm-f/bin/python
Traceback (most recent call last):
File "/home/ybZhang/miniconda3/envs/glm-f/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/ybZhang/miniconda3/envs/glm-f/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/home/ybZhang/miniconda3/envs/glm-f/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1062, in launch_command
multi_gpu_launcher(args)
File "/home/ybZhang/miniconda3/envs/glm-f/lib/python3.10/site-packages/accelerate/commands/launch.py", line 711, in multi_gpu_launcher
distrib_run.run(args)
File "/home/ybZhang/miniconda3/envs/glm-f/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/home/ybZhang/miniconda3/envs/glm-f/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ybZhang/miniconda3/envs/glm-f/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
src/train_bash.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-04-29_10:38:36
host : master
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 23806)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Running the training script on a machine with 4× V100 (32 GB) GPUs produces the error shown above.
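For reference, the OOM message itself suggests one mitigation: setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to reduce allocator fragmentation. Below is a minimal sketch of a re-run that combines this with the usual memory-saving knobs (smaller per-device batch size, more gradient accumulation, both standard transformers TrainingArguments that src/train_bash.py forwards). The flag values and the bare accelerate launch invocation are illustrative placeholders, not the original launch command:

```bash
# Sketch only: the allocator setting comes straight from the OOM message above;
# the flags and values below are placeholders, not the original script.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Re-run with a smaller per-device batch and more gradient accumulation,
# keeping the remaining model/data/output arguments from the original script.
accelerate launch src/train_bash.py \
    --stage sft \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8
```

If the OOM persists even at batch size 1, sharding or offloading optimizer state (e.g. with DeepSpeed ZeRO) is typically the next step on 32 GB cards.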