[Bug] Fine-tuning with the MoE config raises an error #153

Open
wang-benqiang opened this issue Mar 28, 2024 · 1 comment
@wang-benqiang

Describe the bug

Thank you very much for your work!
I ran into a problem when running SFT with this code. Training works fine without the MoE config, but fails with an error once I switch to the MoE config file.
Command:

torchrun --nnodes=1 --nproc_per_node=8 train.py --config ./configs/7B_MoE4_sft.py --launcher "torch"

Error message:

Traceback (most recent call last):
  File "train.py", line 324, in <module>
    main(args)
  File "train.py", line 105, in main
    model = initialize_model()
  File "/root/wbq/internlm_moe/InternEvo/internlm/utils/timeout.py", line 102, in wrapper
    result = func(*args, **kwargs)
  File "/root/wbq/internlm_moe/InternEvo/internlm/train/pipeline.py", line 167, in initialize_model
    model = MODEL_INITIALIZER.get_module(module_name=gpc.config.model_type)(**(gpc.config.model))
  File "/root/wbq/internlm_moe/InternEvo/internlm/model/modeling_moe.py", line 584, in build_model_with_moe_cfg
    return _build_generic_model_1d(num_layers=num_layers, num_chunks=num_chunks, **cfg)
  File "/root/wbq/internlm_moe/InternEvo/internlm/model/modeling_moe.py", line 482, in _build_generic_model_1d
    chunk = PackedFlashInternLm1D(**filter_kwargs(PackedFlashInternLm1D.__init__, kwargs)).to(device)
  File "/root/wbq/internlm_moe/InternEvo/internlm/model/modeling_moe.py", line 356, in __init__
    [
  File "/root/wbq/internlm_moe/InternEvo/internlm/model/modeling_moe.py", line 357, in <listcomp>
    PackedFlashBaseLayer1D(
  File "/root/wbq/internlm_moe/InternEvo/internlm/model/modeling_moe.py", line 94, in __init__
    self.mixer = MHA(
  File "/root/wbq/internlm_moe/InternEvo/internlm/model/modules/multi_head_attention.py", line 364, in __init__
    self.rotary_emb = RotaryEmbedding(
  File "/root/wbq/internlm_moe/InternEvo/internlm/model/modules/embedding.py", line 287, in __init__
    self.inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, device=device, dtype=torch.float32) / dim))
TypeError: arange() received an invalid combination of arguments - got (int, int, int, dtype=torch.dtype, device=device), but expected one of:
 * (Number end, *, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
 * (Number start, Number end, *, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
 * (Number start, Number end, Number step, *, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
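
A note on the error itself (a minimal sketch, not taken from the repo): torch.arange only accepts a torch.device, a device string, or an int for its device keyword, and anything else fails with exactly this "invalid combination of arguments" TypeError. The "device=device" in the message suggests the object passed down to RotaryEmbedding is some other class that happens to be named device (for example torch.cuda.device, which is a context manager, not a torch.device).

import torch

dim, base = 128, 10000

# Works: device is a device string (a torch.device or an int also work)
inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, device="cpu", dtype=torch.float32) / dim))

# Raises the same TypeError as above, because torch.cuda.device is a
# context-manager class (whose class name is literally "device"), not a
# torch.device instance:
# torch.arange(0, dim, 2, device=torch.cuda.device(0), dtype=torch.float32)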

Environment

torch==2.1.0+cu118
transformers<4.30.0
sentencepiece
numpy
tqdm
psutil
packaging
pre-commit
ninja
gputil
pytest
packaging
boto3
botocore
torch-scatter
pyecharts
py-libnuma
pynvml
tensorboard

Additional information

1. I only changed the training-set and validation-set paths in ./configs/7B_MoE4_sft.py (sketched below).
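
For completeness, the change was of roughly this form (a sketch with placeholder paths; the field names are assumed to match the stock 7B_MoE4_sft.py):

TRAIN_FOLDER = "/path/to/train/dataset"  # placeholder path
VALID_FOLDER = "/path/to/valid/dataset"  # placeholder path

data = dict(
    # ... every other field left exactly as in the stock config ...
    train_folder=TRAIN_FOLDER,
    valid_folder=VALID_FOLDER,
)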

@wang-benqiang wang-benqiang added the bug Something isn't working label Mar 28, 2024
@sunpengsdu sunpengsdu assigned sunpengsdu and unassigned sunpengsdu Mar 28, 2024
@sunpengsdu
Contributor

Let me try to reproduce this.
