Error when training Yi-34B with zero3_offload + sequence parallel #589

Open
puppet101 opened this issue Apr 19, 2024 · 19 comments

Comments

@puppet101

I'm only using the config file yi_34b_200k_full_alpaca_enzh_32k_sp8 provided on GitHub, and the deepspeed option at runtime is zero3_offload.
However, I get the following error. Is sequence parallel currently unsupported with offload, or is there some other cause? Thanks.

Traceback (most recent call last):
  File "/opt/ml/job/xtuner/tools/train.py", line 342, in <module>
    main()
  File "/opt/ml/job/xtuner/tools/train.py", line 338, in main
    runner.train()
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/_flexible_runner.py", line 1200, in train
    model = self.train_loop.run()  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 286, in run
    self.run_iter(data_batch)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 309, in run_iter
    outputs = self.runner.model.train_step(
  File "/usr/local/lib/python3.10/dist-packages/mmengine/_strategy/deepspeed.py", line 135, in train_step
    optim_wrapper.update_params(parsed_loss)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/_strategy/deepspeed.py", line 83, in update_params
    self.step()
  File "/usr/local/lib/python3.10/dist-packages/mmengine/optim/scheduler/param_scheduler.py", line 115, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/_strategy/deepspeed.py", line 95, in step
    self.model.step()
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 2169, in step
    self._take_model_step(lr_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 2075, in _take_model_step
    self.optimizer.step()
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py", line 2047, in step
    self.unscale_and_clip_grads(sub_group_id, scaled_global_grad_norm)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py", line 2117, in unscale_and_clip_grads
    self.fp32_partitioned_groups_flat[sub_group_id].grad.mul_(1. / combined_scale)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

The same RuntimeError is raised on all 8 ranks (cuda:0 through cuda:7).

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1336) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

@HIT-cwh
Collaborator

HIT-cwh commented Apr 19, 2024

Sorry for the inconvenience!

I just tested full fine-tuning of Yi-200K-34B with CPU offload and could not reproduce your issue. Here are my config (sequence parallel size 2) and training log:
34b_8k_sp2_config
34b_8k_sp2_log.log

In theory, sequence parallel and CPU offload should not affect each other. Could you first disable sequence parallel (by setting sequence_parallel_size = 1) and test CPU offload training again to see whether it still errors? If it does, please check whether your environment is installed correctly.

Feel free to get back to us once you have further results!
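For reference, a minimal sketch of the config knobs involved in this test (an xtuner-style Python config; only sequence_parallel_size needs to change, the other values shown are illustrative placeholders, not the exact upstream config):

```python
# Illustrative excerpt of an xtuner-style config.
max_length = 32768            # per-sequence token length (placeholder)
sequence_parallel_size = 1    # 1 disables sequence parallelism for this test
batch_size = 1                # per-device batch size (placeholder)
accumulative_counts = 8       # gradient accumulation steps (placeholder)
max_epochs = 3
```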

@puppet101
Author

Hi, thanks for the reply. I tried 8k with sp2 on my side, but I still hit the same problem. Could you share your runtime environment? My current config file is:
yi_34b_200k_full_alpaca_zh_32k_sp8.log
My environment is:
deepspeed 0.14.1
transformers 4.40.0
xtuner 0.1.18.dev0
torch 2.0.0+cu118

@HIT-cwh
Collaborator

HIT-cwh commented Apr 22, 2024

Could you first try disabling sequence parallel (by setting sequence_parallel_size = 1) and then test CPU offload training to see whether it still errors? I'm a bit worried that this bug was not introduced by sequence parallel.

@puppet101
Author

Hi, I've confirmed the problem on my side. Previously, no matter how I changed the sequence parallel settings, I got the same error. After downgrading deepspeed from 0.14.0 to 0.12.3, the problem went away. Thanks for the patient help!

One more question: although training now runs, the total number of training steps looks wrong. My settings are:
sequence_parallel_size = 8
batch_size = 1
accumulative_counts = 8
max_epochs = 3
With the alpaca_zh dataset, the total number of training steps is only 32. That doesn't seem right: alpaca-data-gpt4-chinese has more than 50k samples, so 3 epochs should be far more than 32 steps in total. Could you take a look? Thanks!
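For reference, a rough back-of-the-envelope estimate of how many packed training sequences ~50k Alpaca samples should yield; the average token count per sample is an assumption, and only the order of magnitude matters:

```python
# Rough estimate only; avg_tokens_per_sample is an assumption, not a measured value.
num_samples = 52_000           # approximate size of alpaca-data-gpt4-chinese
avg_tokens_per_sample = 200    # assumed average after applying the chat template
max_length = 4096

total_tokens = num_samples * avg_tokens_per_sample
packed_sequences = total_tokens // max_length   # samples are packed into max_length chunks
print(packed_sequences)  # ~2500 packed sequences -> far more than 32 steps per epoch
```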

@HIT-cwh
Collaborator

HIT-cwh commented Apr 22, 2024

What is your sequence length (max_length) set to?

@puppet101
Author

puppet101 commented Apr 22, 2024

It's 4096. Does this concatenate multiple samples together by default? I just changed it to 8192 and the total step count is still 32... although GPU memory usage did indeed increase.

@HIT-cwh
Collaborator

HIT-cwh commented Apr 22, 2024

Could you change the line at https://github.com/InternLM/xtuner/blob/main/xtuner/dataset/huggingface.py#L96 to

num_proc=1

and try again?

@puppet101
Author

I changed it, but nothing changed. Could you briefly explain how this 32 is calculated? Thanks~

@HIT-cwh
Collaborator

HIT-cwh commented Apr 22, 2024

During data preprocessing we implement sample packing through the map_fn interface of Hugging Face datasets, using 32 processes in parallel by default. Each process receives 1000 samples at a time, concatenates them into several long sequences, and drops the leftover remainder.

The larger max_length is, the more data gets dropped. But 8192 is not particularly long, so it should not shrink the dataset that drastically.

I suggest first clearing the Hugging Face datasets cache (by default under ~/.cache/huggingface/datasets/), then changing https://github.com/InternLM/xtuner/blob/main/xtuner/dataset/huggingface.py#L96 to num_proc=1 to avoid data loss, and re-processing the dataset to see whether it still ends up with only 32 iterations.
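A minimal sketch of this kind of packing with datasets.map, not the exact xtuner implementation; it only illustrates why each 1000-sample map batch can leave a dropped remainder:

```python
from datasets import Dataset

max_length = 8192

def pack(batch):
    # Flatten the batch of tokenized samples into one long token stream, then cut it
    # into max_length chunks; the tail shorter than max_length is dropped.
    stream = [tok for ids in batch["input_ids"] for tok in ids]
    n_chunks = len(stream) // max_length
    return {"input_ids": [stream[i * max_length:(i + 1) * max_length]
                          for i in range(n_chunks)]}

# Toy dataset: 1000 "samples" of 100 tokens each -> 100k tokens per map batch.
ds = Dataset.from_dict({"input_ids": [[1] * 100 for _ in range(1000)]})
packed = ds.map(pack, batched=True, batch_size=1000, num_proc=1,
                remove_columns=ds.column_names)
print(len(packed))  # 12 packed sequences; the remaining 1,696 tokens are dropped
```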

@puppet101
Author

OK, I'll try that. Looking at the log you posted, it also seems to show 32 steps, so it doesn't look like it's just my case.

@HIT-cwh
Collaborator

HIT-cwh commented Apr 23, 2024

I just remembered: my log shows 32 steps because I set

train_cfg = dict(type=TrainLoop, max_iters=32)

I only ran the first 32 iters for a quick speed test.

Here are my updated config and log:
34b_32k_sp2_config.txt
yi_34b_32k_sp2_log.txt

To track down why your training has only 32 iters, could you please:

  1. Check the train_cfg setting in your config and see whether max_iters is set.
  2. Go to ~/.cache/huggingface/datasets. There should be two folders, silk-road___alpaca-data-gpt4-chinese and tatsu-lab___alpaca, corresponding to alpaca_zh and alpaca respectively. Delete all cache-* files inside them (intermediate products of map_fn): rm tatsu-lab___alpaca/default/0.0.0/dce01c9b08f87459cf36a430d809084718273017/cache-* and rm silk-road___alpaca-data-gpt4-chinese/default/0.0.0/81a6dfd72f416aff605e7d189bfbbc46a2511fee/cache-* (replace the hash values with your own). Then rerun the training.

If the problem persists, feel free to continue the discussion!

@puppet101
Author

max_iters was indeed 32. I changed it to use the max_epochs value instead and training now runs normally. Thank you very much for the patient help!
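For reference, a minimal sketch of the two train_cfg variants discussed here, assuming xtuner's TrainLoop accepts either max_iters or max_epochs (import path as in recent xtuner configs; adjust to your version):

```python
from xtuner.engine.runner import TrainLoop  # import path may differ per xtuner version

# Iter-based: stops after a fixed number of iterations (used above for a quick speed test).
# train_cfg = dict(type=TrainLoop, max_iters=32)

# Epoch-based: iterates over the full packed dataset for max_epochs epochs.
max_epochs = 3
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
```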

@HIT-cwh HIT-cwh closed this as completed Apr 24, 2024
@puppet101
Author

@HIT-cwh Hi, I've run into a new problem... After an epoch ends, I get an OOM, and it happens reliably: at the 7th step after the end of the epoch (my sequence parallel size is 8). No checkpoint is being saved at that point; it should just be an ordinary iteration. Could there be a problem with how data is concatenated across the epoch boundary?
04/24 17:10:53 - mmengine - INFO - Iter(train) [31/96] lr: 1.6222e-05 eta: 0:11:18 time: 6.3558 data_time: 0.0073 memory: 7805 loss: 1.5815 tflops: 31.7220 tokens_per_sec: 109.9789
04/24 17:11:38 - mmengine - INFO - Exp name: qwen1.5_32b_full_alpaca_zh_32k_sp8_20240424_165414
04/24 17:11:38 - mmengine - INFO - Iter(train) [32/96] lr: 1.5989e-05 eta: 0:12:16 time: 44.8117 data_time: 0.0075 memory: 7970 loss: 1.4993 tflops: 5.5519 tokens_per_sec: 19.0352
04/24 17:11:38 - mmengine - WARNING - Reach the end of the dataloader, it will be restarted and continue to iterate. It is recommended to use mmengine.dataset.InfiniteSampler to enable the dataloader to iterate infinitely.
04/24 17:11:48 - mmengine - INFO - Iter(train) [33/96] lr: 1.5750e-05 eta: 0:12:02 time: 9.9696 data_time: 3.2989 memory: 7895 loss: 1.2497 tflops: 29.2799 tokens_per_sec: 99.4019
04/24 17:11:54 - mmengine - INFO - Iter(train) [34/96] lr: 1.5507e-05 eta: 0:11:41 time: 6.3881 data_time: 0.0065 memory: 7805 loss: 1.3924 tflops: 31.5616 tokens_per_sec: 109.4228
04/24 17:12:01 - mmengine - INFO - Iter(train) [35/96] lr: 1.5259e-05 eta: 0:11:21 time: 6.3998 data_time: 0.0062 memory: 7870 loss: 1.3473 tflops: 38.8749 tokens_per_sec: 133.2853
04/24 17:12:07 - mmengine - INFO - Iter(train) [36/96] lr: 1.5008e-05 eta: 0:11:02 time: 6.3611 data_time: 0.0078 memory: 7833 loss: 1.1118 tflops: 34.9025 tokens_per_sec: 120.4190
04/24 17:12:13 - mmengine - INFO - Iter(train) [37/96] lr: 1.4753e-05 eta: 0:10:44 time: 6.5525 data_time: 0.0073 memory: 7914 loss: 1.2245 tflops: 42.9649 tokens_per_sec: 146.2044
04/24 17:12:20 - mmengine - INFO - Iter(train) [38/96] lr: 1.4494e-05 eta: 0:10:26 time: 6.3583 data_time: 0.0055 memory: 7810 loss: 1.2365 tflops: 32.3295 tokens_per_sec: 111.9793
04/24 17:12:26 - mmengine - INFO - Iter(train) [39/96] lr: 1.4232e-05 eta: 0:10:08 time: 6.3835 data_time: 0.0062 memory: 7827 loss: 1.1699 tflops: 33.9666 tokens_per_sec: 117.3344
It OOMs right here. I set 3 epochs, 32 steps per epoch, and the dataset is alpaca_zh.

@HIT-cwh
Collaborator

HIT-cwh commented Apr 24, 2024

What gradient accumulation value did you set? Also, how much GPU memory do your cards have? The memory printed in the log is less than 8 GB.

@puppet101
Author

accumulative_counts is set to the same value as sequence_parallel_size. I tried both 8 and 4; in both cases it reliably OOMs at the (accumulative_counts - 1)-th step after an epoch ends.
Also, this OOM is in host memory, not GPU memory. I have 1 TB of RAM and 40 GB GPUs, and there is no GPU OOM.
I also tried shrinking the number of samples, and the same problem occurs.

@HIT-cwh
Collaborator

HIT-cwh commented Apr 25, 2024

Let me confirm: your setup is Yi-34B + 32k seq length + sequence parallel size 4 (or 8) + deepspeed zero3 offload, right?

I'll try to reproduce your problem on my side.

@HIT-cwh HIT-cwh reopened this Apr 25, 2024
@puppet101
Author

Yes, that's right. On my side it's Yi-34B + 24k seq length (I also tried 12k) + sequence parallel size 4 (or 8) + deepspeed zero3 offload. It reproduces even with a very small dataset. Thanks~

@HIT-cwh
Collaborator

HIT-cwh commented Apr 25, 2024

I've reproduced your problem. I ran two experiments, both with Yi-34B + deepspeed zero3 offload + 8 * A100 80G (1 TB host memory):

  1. 8k seq len + sequence parallel 4 + grad acc 4
  2. 2k seq len + sequence parallel 1 + grad acc 1

Without CPU offload, these two setups should in theory have nearly identical GPU memory usage (8k seq len / sequence parallel 4 = 2k).

Both experiments hit a host-memory OOM at the (accumulative_counts - 1)-th step of the 2nd epoch, which suggests the OOM is not caused by sequence parallel.

I monitored host memory during the first epoch and found that, with CPU offload, both experiments use more than 90% of host memory.

8k seq len + sequence parallel 4 + grad acc 4:
(screenshot of host memory usage)

2k seq len + sequence parallel 1 + grad acc 1:
(screenshot of host memory usage)

My guess is that, when moving from epoch 1 to epoch 2, deepspeed zero3 offload does not release some host memory in time, so the first parameter update of the 2nd epoch runs out of host memory with CPU offload. I'm still debugging what exactly prevents the memory from being released.

BTW, I tried a 16-GPU run and CPU offload training works fine there; if it's convenient, you could try that first.
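One way to capture the per-iteration host-memory numbers referenced above is a small psutil-based mmengine hook; this is a generic sketch (the hook class name and the custom_hooks registration shown in the comment are illustrative, not part of xtuner):

```python
import psutil
from mmengine.hooks import Hook


class HostMemoryHook(Hook):
    """Log host (CPU) memory usage after every training iteration."""

    def after_train_iter(self, runner, batch_idx, data_batch=None, outputs=None):
        mem = psutil.virtual_memory()
        runner.logger.info(
            f"host memory: {mem.percent:.1f}% used "
            f"({mem.used / 1024 ** 3:.1f} GiB / {mem.total / 1024 ** 3:.1f} GiB)"
        )


# In an mmengine/xtuner-style config this could be registered as a custom hook, e.g.:
# custom_hooks = [dict(type=HostMemoryHook)]
```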

@puppet101
Author

OK, thanks for looking into it~ I don't have 16 GPUs available right now, so I'll wait for your progress.
