
IndexError with multi-GPU training on a single machine #170

Open
zhangxiaaobo opened this issue Oct 24, 2023 · 6 comments

Comments

@zhangxiaaobo

zhangxiaaobo commented Oct 24, 2023

The same script works fine with a single GPU on one machine, but with multiple GPUs on one machine it fails near the end of an epoch with the following error:
File "/home/jovyan/tasks/training_dataset.py", line 107, in collater
return default_collate(samples)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 128, in default_collate
elem = batch[0]
IndexError: list index out of range

Launch command:
python -m torch.distributed.launch --nproc_per_node=3 --master_port=10096 /opt/conda/bin/unicore-train {data_path} --user-dir {user_dir} --train-subset train --valid-subset valid
--num-workers 8 --ddp-backend=no_c10d
--task {task} --loss {loss_func} --arch {arch}
--optimizer adam --adam-betas '(0.9, 0.99)' --adam-eps 1e-6 --clip-norm 1.0
--lr-scheduler polynomial_decay --lr {lr} --warmup-ratio {warmup} --max-epoch {epoch} --batch-size {local_batch_size}
--update-freq {update_freq} --seed {seed}
--log-interval 100 --log-format simple
--validate-interval 1 --keep-last-epochs 10
--best-checkpoint-metric {metric} --patience 20
--save-dir {save_dir}

Notes:
1. --ddp-backend=c10d raised an error and suggested switching to no_c10d.
2. In training_dataset.py, the dataset's collater uses torch's default implementation:

from torch.utils.data.dataloader import default_collate

def collater(self, samples):
    return default_collate(samples)

__getitem__ returns a tuple of tensors:

return (
    torch.LongTensor(ligand_data["ligand_token"]),
    torch.LongTensor(ligand_encoder_data["ligand_encoder_coords"]),
    torch.LongTensor(ligand_encoder_data["ligand_encoder_token"]),
    torch.FloatTensor(ligand_encoder_data["ligand_mask"]),
    torch.LongTensor(ligand_data["ligand_coords"]),
    torch.LongTensor(ligand_data["root_idxs"]),
    torch.LongTensor(ligand_data["root_root_idxs"]),
    torch.LongTensor(ligand_data["root_root_root_idxs"]),
    torch.LongTensor(ligand_data["theta"]),
    torch.LongTensor(ligand_data["dist"]),
    torch.LongTensor(ligand_data["degree"])
)
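(Side note: torch's default_collate starts by indexing the first element of the batch — the elem = batch[0] line in the traceback above — so it raises exactly this IndexError whenever a worker hands it an empty sample list. A minimal sketch reproducing the failure, independent of Uni-Core:)

from torch.utils.data.dataloader import default_collate
import torch

# A non-empty batch of tuples collates fine...
batch = [(torch.LongTensor([1, 2]), torch.FloatTensor([0.5]))]
collated = default_collate(batch)

# ...but an empty batch hits elem = batch[0] and raises
# IndexError: list index out of range.
default_collate([])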

@Naplessss
Contributor

Can you provide more details? (CUDA version, c10d error info, etc.)

@zhangxiaaobo
Author

I added an empty-batch check to my dataset's collater (the dataloader callback), which fixed the IndexError: list index out of range error:

def collater(self, samples):
    if len(samples) == 0:
        return []
    return default_collate(samples)

Question:
With multiple GPUs on a single machine, --ddp-backend=no_c10d now runs. Setting --ddp-backend=c10d still fails with the error below. How much will using no_c10d affect training speed?

CUDA version:
NVIDIA-SMI 470.86 Driver Version: 470.86 CUDA Version: 11.4
Error when configured with c10d:
File "/opt/conda/lib/python3.8/site-packages/unicore-0.0.1-py3.8-linux-x86_64.egg/unicore/trainer.py", line 674, in train_step
self._check_grad_norms(grad_norm)
loss, sample_size, logging_output = loss(model, sample)

File "/opt/conda/lib/python3.8/site-packages/unicore-0.0.1-py3.8-linux-x86_64.egg/unicore/trainer.py", line 1041, in _check_grad_norms
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
raise FloatingPointError(
FloatingPointError: Fatal error: gradients are inconsistent between workers. Try --ddp-backend=legacy_ddp. Or are you mixing up different generation of GPUs in training?

grad_norm across the workers:
rank 0 = 92.33207703
rank 1 = 92.39173126
rank 2 = 92.78269958

@Naplessss
Contributor

It seems there are some unused modules in your network; you can add --find-unused-parameters to avoid this.
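(For context: this flag is forwarded to PyTorch DDP's find_unused_parameters option, which lets the gradient reducer tolerate parameters that take no part in a given forward pass. A minimal sketch in plain PyTorch, assuming the process group has already been initialized by the launcher; MyModel and local_rank are placeholders:)

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# `MyModel` and `local_rank` are hypothetical; the process group is assumed
# to have been initialized already by torchrun / torch.distributed.launch.
model = MyModel().cuda(local_rank)
ddp_model = DDP(
    model,
    device_ids=[local_rank],
    find_unused_parameters=True,  # tolerate parameters that receive no gradient this step
)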

@zhangxiaaobo
Author

I added an empty-batch check to my dataset's collater (the dataloader callback), which fixed the IndexError: list index out of range error:

def collater(self, samples):
    if len(samples) == 0:
        return []
    return default_collate(samples)

Question: with multiple GPUs on a single machine, --ddp-backend=no_c10d now runs, but --ddp-backend=c10d fails with the error below. How much will using no_c10d affect training speed?

CUDA version: NVIDIA-SMI 470.86, Driver Version: 470.86, CUDA Version: 11.4

Error when configured with c10d:
File "/opt/conda/lib/python3.8/site-packages/unicore-0.0.1-py3.8-linux-x86_64.egg/unicore/trainer.py", line 674, in train_step
self._check_grad_norms(grad_norm)
loss, sample_size, logging_output = loss(model, sample)

File "/opt/conda/lib/python3.8/site-packages/unicore-0.0.1-py3.8-linux-x86_64.egg/unicore/trainer.py", line 1041, in _check_grad_norms

File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
raise FloatingPointError(
FloatingPointError: Fatal error: gradients are inconsistent between workers. Try --ddp-backend=legacy_ddp. Or are you mixing up different generation of GPUs in training?

grad_norm across the workers:
rank 0 = 92.33207703
rank 1 = 92.39173126
rank 2 = 92.78269958

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/conda/bin/unicore-train", line 33, in
sys.exit(load_entry_point('swcore==0.0.1', 'console_scripts', 'unicore-train')())
File "/opt/conda/lib/python3.8/site-packages/unicore-0.0.1-py3.8-linux-x86_64.egg/unicore_cli/train.py", line 418, in cli_main
distributed_utils.call_main(args, main)
File "/opt/conda/lib/python3.8/site-packages/unicore-0.0.1-py3.8-linux-x86_64.egg/unicore/distributed/utils.py", line 190, in call_main
distributed_main(int(os.environ['LOCAL_RANK']), main, args, kwargs)
File "/opt/conda/lib/python3.8/site-packages/unicore-0.0.1-py3.8-linux-x86_64.egg/unicore/distributed/utils.py", line 164, in distributed_main
main(args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/unicore-0.0.1-py3.8-linux-x86_64.egg/unicore_cli/train.py", line 125, in main
valid_losses, should_stop = train(
File "/opt/conda/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.8/site-packages/unicore-0.0.1-py3.8-linux-x86_64.egg/unicore_cli/train.py", line 219, in train
log_output = trainer.train_step(samples)
File "/opt/conda/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.8/site-packages/unicore-0.0.1-py3.8-linux-x86_64.egg/unicore/trainer.py", line 705, in train_step
self.task.train_step(
File "/opt/conda/lib/python3.8/site-packages/unicore-0.0.1-py3.8-linux-x86_64.egg/unicore/tasks/unicore_task.py", line 279, in train_step
loss, sample_size, logging_output = loss(model, sample)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jovyan/multi_model/losses/3dmg_pretrain_loss.py", line 99, in forward
) = model(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/unicore-0.0.1-py3.8-linux-x86_64.egg/swcore/distributed/module_proxy_wrapper.py", line 56, in forward
return self.module(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1139, in forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by
making sure all forward function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
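(For reference, that debug variable has to be visible to every worker process before torch.distributed initializes; a minimal sketch in Python, though it can equally be exported in the shell that runs the launcher:)

import os

# Set before the process group / DDP wrapper is created so the error report
# includes parameter names, not just indices, for each rank.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"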

@zhangxiaaobo
Author

zhangxiaaobo commented Oct 25, 2023

I tried --find-unused-parameters and it runs, but it is very, very slow — much slower than running without the Uni-Core framework with --find-unused-parameters added. Perhaps without the framework the --ddp-backend uses NCCL; it looks like Uni-Core currently does not support NCCL. Do you plan to support it later?

Another observation: during training, memory usage keeps growing. Does this require any special parameter settings? My arguments are as follows:
python -m torch.distributed.launch --nproc_per_node=4 --master_port=10097 --use-env /opt/conda/bin/unicore-train /home/jovyan/test_dir/3dmg --user-dir /home/jovyan/3dmg_model --train-subset train --valid-subset valid --num-workers 40 --ddp-backend=no_c10d --data-buffer-size 20 --task 3dmg_pretrain --loss 3dmg_pretrain_loss --arch 3dmg_pretrain --optimizer adam --adam-betas '(0.9, 0.99)' --adam-eps 1e-6 --clip-norm 1.0 --lr-scheduler polynomial_decay --lr 0.0001 --warmup-ratio 0.06 --max-epoch 100 --batch-size 140 --update-freq 1.0 --seed 0 --log-interval 100 --log-format simple --fp16 --fp16-init-scale 4 --fp16-scale-window 256 --validate-interval 1 --keep-last-epochs 10 --best-checkpoint-metric loss --patience 20 --save-dir /home/jovyan/test_dir/save_dir_3dmg_mul

@guolinke
Member

@zhangxiaaobo the NCCL support is built into PyTorch, not Uni-Core; Uni-Core itself is a wrapper around PyTorch. Neither c10d nor no_c10d is related to NCCL; they are different kinds of data-parallel algorithms. You can refer to https://zhuanlan.zhihu.com/p/580852851
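(For illustration, a sketch in plain PyTorch rather than Uni-Core internals: the communication backend such as NCCL is chosen when the process group is initialized, independently of which data-parallel wrapper later synchronizes the gradients. Assumes a launcher such as torchrun or torch.distributed.launch --use-env sets LOCAL_RANK and the rendezvous variables:)

import os
import torch
import torch.distributed as dist

# NCCL ships with PyTorch and is selected here as the communication backend;
# the c10d vs. no_c10d choice only decides how gradients are synchronized on top of it.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)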
