
Error when doing deepcopy of the model #177

Open
yzxing87 opened this issue Aug 3, 2022 · 5 comments
Labels
enhancement New feature or request

Comments


yzxing87 commented Aug 3, 2022

Hi, thanks for this awesome project!

I built my transformer model based on the MoeMlp layer, and I use EMA (exponential moving average) of the weights for better performance. However, when I try to initialize my EMA model with ema_model = copy.deepcopy(my_transformer_model), I encounter this error:

File "/opt/conda/lib/python3.8/copy.py", line 296, in _reconstruct
    value = deepcopy(value, memo)
  File "/opt/conda/lib/python3.8/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/opt/conda/lib/python3.8/copy.py", line 270, in _reconstruct
    state = deepcopy(state, memo)
  File "/opt/conda/lib/python3.8/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/opt/conda/lib/python3.8/copy.py", line 230, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/opt/conda/lib/python3.8/copy.py", line 161, in deepcopy
    rv = reductor(4)
TypeError: cannot pickle 'torch._C._distributed_c10d.ProcessGroupNCCL' object

Could you help me with that? How can I use EMA with Tutel? Thanks!


ghostplant commented Aug 4, 2022

If you pickle the model for a single GPU, everything will be fine, because AllToAll is not included in Tutel's MoE layer in that case. Does that meet your expectation?

PyTorch's NCCL operations (e.g. AllToAll) don't support pickling, so I'm afraid any MoE model that has AllToAll in its forward pass will hit the same issue. You can either ask the PyTorch community to fix it, use a workaround in your EMA that doesn't require deepcopy, or run the MoE model in pure data-parallel mode, which has no AllToAll in the forward pass, though its distributed performance will be very poor at large scale.
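
For illustration, here is a minimal sketch of such a deepcopy-free EMA (assuming a standard PyTorch module; ParamEMA is a hypothetical helper, not part of Tutel or PyTorch). It keeps shadow copies of the parameter tensors only, so the non-picklable process-group handles inside the MoE layers are never copied:

import torch

class ParamEMA:
    def __init__(self, model, decay=0.999):
        self.decay = decay
        # Clone the parameter data (plain tensors), never the module itself.
        self.shadow = {n: p.detach().clone() for n, p in model.named_parameters()}

    @torch.no_grad()
    def update(self, model):
        # shadow = decay * shadow + (1 - decay) * current weights
        for n, p in model.named_parameters():
            self.shadow[n].mul_(self.decay).add_(p.detach(), alpha=1 - self.decay)

    @torch.no_grad()
    def copy_to(self, model):
        # Load the averaged weights into a live model for evaluation.
        for n, p in model.named_parameters():
            p.copy_(self.shadow[n])

Creating ema = ParamEMA(my_transformer_model) once and calling ema.update(my_transformer_model) after each optimizer step gives the usual EMA behaviour without ever touching the process-group objects.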


yzxing87 commented Aug 7, 2022

Thanks for your prompt reply!
I use 8 GPUs to train my model with 8 experts. In that case, can I still pickle my model for a single GPU? I would also like to know whether it is required to save the checkpoints separately for each rank. Does inference always need 8 GPUs if the model was trained with 8 GPUs?


ghostplant commented Aug 9, 2022

You can go through these examples to convert training checkpoints between the distributed and single-device versions: https://github.com/microsoft/tutel#how-to-convert-checkpoint-files-that-adapt-to-different-distributed-world-sizes
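
For illustration only, the per-rank saving pattern that this conversion tooling works on usually looks like the sketch below (save_rank_checkpoint is a hypothetical helper, not Tutel's API; the exact file-name pattern and the gather/scatter commands are described in the linked README):

import os
import torch
import torch.distributed as dist

def save_rank_checkpoint(model, out_dir="./checkpoints"):
    # Each rank stores only its local state (including its local experts);
    # a gather step later merges the per-rank files into a single-device checkpoint.
    rank, world_size = dist.get_rank(), dist.get_world_size()
    os.makedirs(out_dir, exist_ok=True)
    torch.save(model.state_dict(),
               os.path.join(out_dir, f"rank-{rank}-of-{world_size}.ckpt"))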


yzxing87 commented Aug 9, 2022

Thanks for the quick update for this feature! I notice you use mpiexec to launch the job and save the checkpoint. If I use torch.distributed.launch to train my MoE, is it still valid to use tutel/checkpoint/gather.py to combine my checkpoints?

ghostplant commented

Yes, both are compatible, as mpiexec is just an alternative way to launch cross-node processes instead of torch.distributed.launch.
