[Bug] Fine-tuning with the MoE config raises an error #153

Open
wang-benqiang opened this issue Mar 28, 2024 · 1 comment
@wang-benqiang

Describe the bug

Thank you very much for your work!
I ran into a problem when running SFT with this code. Training works fine without the MoE config, but fails with an error once I switch to the MoE config file.
Command:

torchrun --nnodes=1 --nproc_per_node=8 train.py --config ./configs/7B_MoE4_sft.py --launcher "torch"

Error message:

Traceback (most recent call last):
  File "train.py", line 324, in <module>
    main(args)
  File "train.py", line 105, in main
    model = initialize_model()
  File "/root/wbq/internlm_moe/InternEvo/internlm/utils/timeout.py", line 102, in wrapper
    result = func(*args, **kwargs)
  File "/root/wbq/internlm_moe/InternEvo/internlm/train/pipeline.py", line 167, in initialize_model
    model = MODEL_INITIALIZER.get_module(module_name=gpc.config.model_type)(**(gpc.config.model))
  File "/root/wbq/internlm_moe/InternEvo/internlm/model/modeling_moe.py", line 584, in build_model_with_moe_cfg
    return _build_generic_model_1d(num_layers=num_layers, num_chunks=num_chunks, **cfg)
  File "/root/wbq/internlm_moe/InternEvo/internlm/model/modeling_moe.py", line 482, in _build_generic_model_1d
    chunk = PackedFlashInternLm1D(**filter_kwargs(PackedFlashInternLm1D.__init__, kwargs)).to(device)
  File "/root/wbq/internlm_moe/InternEvo/internlm/model/modeling_moe.py", line 356, in __init__
    [
  File "/root/wbq/internlm_moe/InternEvo/internlm/model/modeling_moe.py", line 357, in <listcomp>
    PackedFlashBaseLayer1D(
  File "/root/wbq/internlm_moe/InternEvo/internlm/model/modeling_moe.py", line 94, in __init__
    self.mixer = MHA(
  File "/root/wbq/internlm_moe/InternEvo/internlm/model/modules/multi_head_attention.py", line 364, in __init__
    self.rotary_emb = RotaryEmbedding(
  File "/root/wbq/internlm_moe/InternEvo/internlm/model/modules/embedding.py", line 287, in __init__
    self.inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, device=device, dtype=torch.float32) / dim))
TypeError: arange() received an invalid combination of arguments - got (int, int, int, dtype=torch.dtype, device=device), but expected one of:
 * (Number end, *, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
 * (Number start, Number end, *, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
 * (Number start, Number end, Number step, *, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
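
A note on the error itself (a minimal sketch, not taken from the repo): torch.arange only accepts a torch.device, a device string, or an int for its device keyword, and anything else fails with exactly this "invalid combination of arguments" TypeError. The "device=device" in the message suggests the object passed down to RotaryEmbedding is some other class that happens to be named device (for example torch.cuda.device, which is a context manager, not a torch.device).

import torch

dim, base = 128, 10000

# Works: device is a device string (a torch.device or an int also work)
inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, device="cpu", dtype=torch.float32) / dim))

# Raises the same TypeError as above, because torch.cuda.device is a
# context-manager class (whose class name is literally "device"), not a
# torch.device instance:
# torch.arange(0, dim, 2, device=torch.cuda.device(0), dtype=torch.float32)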

Environment

torch==2.1.0+cu118
transformers<4.30.0
sentencepiece
numpy
tqdm
psutil
packaging
pre-commit
ninja
gputil
pytest
packaging
boto3
botocore
torch-scatter
pyecharts
py-libnuma
pynvml
tensorboard

Additional information

1. I only changed the training-set and validation-set paths in ./configs/7B_MoE4_sft.py (sketched below).
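
For completeness, the change was of roughly this form (a sketch with placeholder paths; the field names are assumed to match the stock 7B_MoE4_sft.py):

TRAIN_FOLDER = "/path/to/train/dataset"  # placeholder path
VALID_FOLDER = "/path/to/valid/dataset"  # placeholder path

data = dict(
    # ... every other field left exactly as in the stock config ...
    train_folder=TRAIN_FOLDER,
    valid_folder=VALID_FOLDER,
)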

@wang-benqiang wang-benqiang added the bug Something isn't working label Mar 28, 2024
@sunpengsdu sunpengsdu assigned sunpengsdu and unassigned sunpengsdu Mar 28, 2024
@sunpengsdu
Contributor

Let me try to reproduce this.
