Can this package support a one-GPU machine? #206

Open
momo1986 opened this issue May 31, 2023 · 5 comments

@momo1986

Hi, dear Tutel team,

I have run the script with some small modifications:
python -u main_moe.py --cfg configs/swinmoe/swin_moe_small_patch4_window12_192_32expert_32gpu_22k.yaml --data-path /data/user1/junyan/datasets/ImageNet/ImageNet_Val --batch-size 128 --resume checkpoints/swin_moe_small_patch4_window12_192_32expert_32gpu_22k/swin_moe_small_patch4_window12_192_32expert_32gpu_22k.pth

However, I received the following error message:

File "main_moe.py", line 374, in
main(config)
File "main_moe.py", line 141, in main
max_accuracy = load_checkpoint(config, model_without_ddp, optimizer, lr_scheduler, loss_scaler, logger)
File "/data/user1/junyan/adv_training/Swin-Transformer/utils_moe.py", line 45, in load_checkpoint
msg = model.load_state_dict(checkpoint['model'], strict=False)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1039, in load_state_dict
load(self)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1037, in load
load(child, prefix + name + '.')
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1037, in load
load(child, prefix + name + '.')
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1037, in load
load(child, prefix + name + '.')
[Previous line repeated 3 more times]
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1034, in load
state_dict, prefix, local_metadata, True, missing_keys, unexpected_keys, error_msgs)
File "/root/.local/lib/python3.6/site-packages/tutel/impls/moe_layer.py", line 54, in _load_from_state_dict
assert buff_name in state_dict, "Could not find parameter %s in state_dict." % buff_name
AssertionError: Could not find parameter layers.2.blocks.1.mlp._moe_layer.experts.batched_fc2_bias in state_dict.
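
For reference, a quick way to inspect which expert keys the checkpoint file actually contains (plain PyTorch only; the path matches the --resume argument above, and the 'model' key matches what load_checkpoint reads in the traceback):

import torch

# Load the checkpoint on CPU and list the MoE expert parameters it holds.
ckpt = torch.load(
    "checkpoints/swin_moe_small_patch4_window12_192_32expert_32gpu_22k/"
    "swin_moe_small_patch4_window12_192_32expert_32gpu_22k.pth",
    map_location="cpu")
for key, value in ckpt["model"].items():
    if "experts" in key:
        print(key, tuple(value.shape))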

I have only one GPU, and I am not sure whether multiple GPUs are essential for this task. Is it possible to run it on one GPU? Furthermore, how can I resolve this error?

I am looking forward to your response.

Thanks a lot.

Best Regards!

@momo1986
Author

momo1986 commented Jun 1, 2023

Thanks for your kind comment.

@ghostplant
Contributor

ghostplant commented Jun 1, 2023

One GPU per machine? Can you explain how many machines you'd like to run it on? Or do you just want to run it using 1 GPU on 1 machine?

@momo1986
Author

momo1986 commented Jun 1, 2023

Hi @ghostplant,

I have several different one-GPU machines. To save computational resources, running the program on a one-GPU machine would be economical for me. Actually, I mainly study some specific properties of MoE. Therefore, if it is OK, I just want to run it using 1 GPU on 1 machine, as you mentioned.

@ghostplant
Contributor

If you run it on a one-GPU machine, it seems you need to ensure that the GPU memory is large enough to hold all 32 experts' parameters. To convert swin_moe_small_patch4_window12_192_32expert_32gpu_22k.pth for single-GPU use, you can follow the utility here, where the second example merges 32 different checkpoint files into a single checkpoint file that a single GPU can load.
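
For illustration only, here is a minimal sketch of what such a merge could look like in plain PyTorch. The rank-suffixed file names and the concatenation along dim 0 (the expert dimension) are assumptions; the linked utility is the authoritative implementation:

import torch

NUM_RANKS = 32
PREFIX = "swin_moe_small_patch4_window12_192_32expert_32gpu_22k"

# Hypothetical per-rank file names; adjust to the actual checkpoint layout.
shards = [
    torch.load(f"{PREFIX}.rank{r}.pth", map_location="cpu")["model"]
    for r in range(NUM_RANKS)
]

merged = {}
for key, value in shards[0].items():
    if "._moe_layer.experts." in key:
        # Expert tensors are sharded across ranks: concatenate the local
        # slices back into one global tensor (expert dim assumed to be 0).
        merged[key] = torch.cat([shard[key] for shard in shards], dim=0)
    else:
        # Non-expert weights are replicated on every rank; take rank 0's copy.
        merged[key] = value

torch.save({"model": merged}, f"{PREFIX}.merged.pth")

With a merged file like this, --resume should be able to point at it, so the single-GPU model can find every _moe_layer.experts parameter in the state dict.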

@momo1986
Author

momo1986 commented Jun 2, 2023

Hi @ghostplant. Thanks for your guidance. Can this package support running on a single-GPU machine to test on ImageNet? Should the user implement this manually, or is there a relevant demo?
Thanks & Regards!
Momo
