IndexError with multiple GPUs on a single machine #170
Comments
Can you provide more details? (CUDA version, c10d error info, etc.)
I added an empty-check in my dataset's collater (the dataloader callback function), which fixed the IndexError: list index out of range error. CUDA-side traceback:
File "/opt/conda/lib/python3.8/site-packages/unicore-0.0.1-py3.8-linux-x86_64.egg/unicore/trainer.py", line 1041, in _check_grad_norms
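The empty-check described above can be sketched roughly as follows; the exact handling of an empty batch (returning None and having the caller skip it) is an assumption, not the commenter's verbatim code:

```python
import torch
from torch.utils.data.dataloader import default_collate

def collater(samples):
    # Dataset method (`self` omitted in this standalone sketch).
    # With multi-GPU training, the last batch of an epoch can arrive
    # empty, and default_collate raises
    # "IndexError: list index out of range" on an empty list.
    samples = [s for s in samples if s is not None]
    if len(samples) == 0:
        return None  # assumption: the training loop skips empty batches
    return default_collate(samples)
```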
Seems there are some unused modules in your network; you can add --find-unused-parameters.
During handling of the above exception, another exception occurred: Traceback (most recent call last):
I tried --find-unused-parameters. It runs, but very, very slowly, much slower than a non-Unicore framework with --find-unused-parameters. Perhaps that is because, without the framework, --ddp-backend uses nccl, and Unicore currently does not seem to support nccl. Do you plan to add support later? Another observation: memory usage keeps growing during training. Does this require special parameter configuration? My parameters are as follows:
@zhangxiaaobo the NCCL support is built into PyTorch, not Uni-Core. Uni-Core itself is a wrapper around PyTorch. Both c10d and no_c10d are unrelated to NCCL; they are different kinds of data-parallel algorithms. You can refer to https://zhuanlan.zhihu.com/p/580852851
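To illustrate the distinction the reply draws: in plain PyTorch, NCCL is the communication backend passed to init_process_group, while c10d/no_c10d select the gradient-synchronization strategy layered on top. A minimal sketch (the helper names here are hypothetical, not Uni-Core API):

```python
import torch
import torch.distributed as dist

def pick_backend():
    # NCCL is PyTorch's built-in GPU communication backend;
    # Gloo is the CPU fallback. Neither is provided by Uni-Core.
    return "nccl" if torch.cuda.is_available() else "gloo"

def init_ddp(rank, world_size):
    # Hypothetical helper: assumes MASTER_ADDR/MASTER_PORT are set in the
    # environment (torch.distributed.launch sets them for each worker).
    dist.init_process_group(backend=pick_backend(),
                            rank=rank, world_size=world_size)
```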
The same script works fine on a single machine with a single GPU. With multiple GPUs on a single machine, it fails near the end of an epoch, as follows:
File "/home/jovyan/tasks/training_dataset.py", line 107, in collater
return default_collate(samples)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 128, in default_collate
elem = batch[0]
IndexError: list index out of range
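The traceback above comes from default_collate receiving an empty batch: with three processes, the final batch of an epoch can be empty when the dataset size is not evenly divisible. A minimal reproduction:

```python
from torch.utils.data.dataloader import default_collate

# default_collate begins with `elem = batch[0]`, so an empty
# batch raises IndexError, exactly as in the traceback.
try:
    default_collate([])
except IndexError as exc:
    print(f"IndexError: {exc}")
```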
Launch script:
python -m torch.distributed.launch --nproc_per_node=3 --master_port=10096 /opt/conda/bin/unicore-train {data_path} --user-dir {user_dir} --train-subset train --valid-subset valid \
--num-workers 8 --ddp-backend=no_c10d \
--task {task} --loss {loss_func} --arch {arch} \
--optimizer adam --adam-betas '(0.9, 0.99)' --adam-eps 1e-6 --clip-norm 1.0 \
--lr-scheduler polynomial_decay --lr {lr} --warmup-ratio {warmup} --max-epoch {epoch} --batch-size {local_batch_size} \
--update-freq {update_freq} --seed {seed} \
--log-interval 100 --log-format simple \
--validate-interval 1 --keep-last-epochs 10 \
--best-checkpoint-metric {metric} --patience 20 \
--save-dir {save_dir}
Notes:
1. ddp-backend=c10d raised an error and suggested switching to no_c10d.
2. In training_dataset.py, the dataset's collater uses torch's default implementation, as follows:
from torch.utils.data.dataloader import default_collate
def collater(self, samples):
    return default_collate(samples)
__getitem__ returns a tuple of tensor objects, comma-separated, as follows:
return (
torch.LongTensor(ligand_data["ligand_token"]),
torch.LongTensor(ligand_encoder_data["ligand_encoder_coords"]),
torch.LongTensor(ligand_encoder_data["ligand_encoder_token"]),
torch.FloatTensor(ligand_encoder_data["ligand_mask"]),
torch.LongTensor(ligand_data["ligand_coords"]),
torch.LongTensor(ligand_data["root_idxs"]),
torch.LongTensor(ligand_data["root_root_idxs"]),
torch.LongTensor(ligand_data["root_root_root_idxs"]),
torch.LongTensor(ligand_data["theta"]),
torch.LongTensor(ligand_data["dist"]),
torch.LongTensor(ligand_data["degree"])
)
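For a tuple-returning __getitem__ like the one above, default_collate collates field by field, so each position in the tuple is stacked across the batch. A small sketch with two illustrative fields (the names are not from the original dataset):

```python
import torch
from torch.utils.data.dataloader import default_collate

# Two samples, each a (token, mask) tuple, mimicking __getitem__ above.
samples = [
    (torch.LongTensor([1, 2, 3]), torch.FloatTensor([1.0, 1.0, 0.0])),
    (torch.LongTensor([4, 5, 6]), torch.FloatTensor([1.0, 0.0, 0.0])),
]
tokens, masks = default_collate(samples)  # one stacked tensor per field
```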