NCCL communication boundary issue? #17
Comments
We haven't run into any problems so far.

Have you verified precision bit-by-bit? Or is DeepSpeed's handling of this just redundant?

As far as I know, alignment does not affect the correctness of NCCL collective communication; at most it affects speed. With the current model dimensions and partitioning scheme, the buffers are already 4-byte aligned. Precision has been verified op-by-op and end-to-end against the huggingface implementation.
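To make that 4-byte-alignment claim concrete, here is a minimal sketch (the helper names and sizes are hypothetical, not the repo's code) that checks whether every per-rank partition of a flat fp16 buffer starts on a 4-byte boundary, assuming the even split that `reduce_scatter_tensor` requires:

```python
import torch

ALIGN_BYTES = 4
ELEM_SIZE = torch.finfo(torch.float16).bits // 8  # fp16 = 2 bytes per element

def partition_offsets(numel: int, world_size: int):
    """Starting element index of each rank's partition of a flat buffer."""
    # reduce_scatter_tensor requires numel to divide evenly by world_size.
    assert numel % world_size == 0
    part = numel // world_size
    return [r * part for r in range(world_size)]

def partitions_aligned(numel: int, world_size: int) -> bool:
    """True if every partition starts on an ALIGN_BYTES boundary
    (assuming the base buffer itself is aligned)."""
    return all((off * ELEM_SIZE) % ALIGN_BYTES == 0
               for off in partition_offsets(numel, world_size))

print(partitions_aligned(4096 * 4096, 8))  # True: every offset is an even element index
print(partitions_aligned(10, 2))           # False: offset 5 -> byte 10, not 4-byte aligned
```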
Got it, thanks. One more question: your PR write-up says Megatron-LM's DistributedOptimizer implements ZeRO-2, but reading the code, it looks to me like it implements ZeRO-1: after the grad_buffer is summed, the gradients that do not belong to the local rank are never freed. Am I reading this correctly? PR write-up

I think so too.

Time to dock a drumstick from the guy who wrote the PR write-up.

It should be ZeRO-2; the buffer is released right after reduce_scatter_grad.

Could you paste the code? I'd like to study it. Thanks!
```python
def _collect_grad(self, param, group_idx):
    bucket = self._bucket_assignment[group_idx].get_param_bucket(param)
    bucket.collect_param_grad(param)
    if bucket.is_all_grad_collected():
        target_buffer = self._param_buffer[group_idx].get_bucket_receiving_buffer(bucket)
        bucket.reduce_scatter_grad(target_buffer)
```

Once a bucket has collected all of its gradients, it calls the reduce-scatter communication and, at the end, returns the borrowed buffer, thereby freeing the gradients (i.e. `_grad_buffer`):

```python
@nvtx.annotate("reduce_scatter_grad", color="indigo")
def reduce_scatter_grad(self, target_buffer):
    assert self.is_all_grad_collected()
    dist.reduce_scatter_tensor(output=target_buffer,
                               input=self._grad_buffer,
                               group=self._dp_group,
                               async_op=False)
    Bucket._grad_buffer_pool.return_buffer(self._borrowed_grad_buffer)
    self._borrowed_grad_buffer = None
    self._grad_buffer = None
    target_buffer.div_(self._num_partitions)
```
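Stripped of the bucketing machinery, the reduce-scatter-then-free pattern above boils down to something like this minimal sketch (not the repo's code; it assumes `torch.distributed` is already initialized with an NCCL backend, e.g. via `torchrun`, and that the flat buffer divides evenly across the data-parallel group):

```python
import torch
import torch.distributed as dist

def reduce_scatter_and_free(flat_grads: torch.Tensor, dp_group=None) -> torch.Tensor:
    """Reduce-scatter a flat gradient buffer and keep only the local shard."""
    world_size = dist.get_world_size(group=dp_group)
    assert flat_grads.numel() % world_size == 0
    shard = torch.empty(flat_grads.numel() // world_size,
                        dtype=flat_grads.dtype, device=flat_grads.device)
    dist.reduce_scatter_tensor(output=shard, input=flat_grads,
                               group=dp_group, async_op=False)
    # The caller must also drop its own reference (as the snippet above does
    # with `self._grad_buffer = None`) so the full buffer can be reclaimed;
    # keeping only the local 1/world_size shard is what makes this ZeRO-2-like.
    del flat_grads
    shard.div_(world_size)  # turn the SUM into an average, as in the snippet above
    return shard
```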
Hey, what I said to the author was "Megatron-LM's DistributedOptimizer implements ZeRO-1, and the PR write-up is incorrect"; I never mentioned Megatron-LLaMA. I'm not sure how you read it, could you take another look at my question?

I'd like to ask: when communication happens over target_buffer, does the boundary need to be aligned? Below I list how the actually-communicated tensors are created, compared against DeepSpeed:
Megatron-LLaMA:
DeepSpeed:
So my question is: if the boundaries are not aligned at communication time, can NCCL collective communication produce incorrect results?
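As the earlier reply noted, NCCL does not require any particular byte alignment for correctness; `dist.reduce_scatter_tensor` only requires `input.numel() == output.numel() * world_size`, and alignment mainly matters for throughput. As a hedged illustration of the padding idea being contrasted here (the helper below is illustrative, not DeepSpeed's actual API), one can zero-pad the flat buffer so that every per-rank partition starts on an aligned element boundary:

```python
import torch

def pad_for_reduce_scatter(flat: torch.Tensor, world_size: int,
                           align_elems: int = 2) -> torch.Tensor:
    """Zero-pad `flat` so numel is divisible by world_size * align_elems,
    making every per-rank partition start on an aligned element index."""
    multiple = world_size * align_elems
    remainder = flat.numel() % multiple
    if remainder == 0:
        return flat
    return torch.cat([flat, flat.new_zeros(multiple - remainder)])

# Example: 10 fp16 elements, dp=4, 4-byte alignment (= 2 fp16 elements).
# Padded length is 16, so each rank's partition is 4 elements long and
# starts on an even (4-byte-aligned) element index.
x = torch.arange(10, dtype=torch.float16)
print(pad_for_reduce_scatter(x, world_size=4).numel())  # 16
```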