NCCL communication boundary issue? #17

Open
Baibaifan opened this issue Sep 14, 2023 · 10 comments

Comments

@Baibaifan

I'd like to ask: when communicating with target_buffer, does the buffer need boundary alignment? Below, for comparison with DeepSpeed, is the tensor-creation path actually used for communication.
Megatron-LLaMA:

        self._partitioned_param = torch.empty(total_size,
                                              device=self._flatted_buffer.device,
                                              dtype=self._flatted_buffer.dtype)

DeepSpeed:

        #align nccl all-gather send buffers to 4-byte boundary
        self.nccl_start_alignment_factor = 2  # 4-byte alignment/sizeof(fp16) = 2
...
          self.bf16_groups_flat.append(
              self._flatten_dense_tensors_aligned(self.bf16_groups[i],
                                                  self.nccl_start_alignment_factor * dp_world_size))

My question: if the buffer boundaries are not aligned, can NCCL communication produce wrong results?
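
For context, a minimal sketch of what an aligned flatten could look like, assuming DeepSpeed's _flatten_dense_tensors_aligned simply pads the flattened length up to a multiple of alignment (nccl_start_alignment_factor * dp_world_size); flatten_aligned below is a hypothetical helper for illustration, not the actual DeepSpeed implementation:

    # Pad the concatenated parameters so the total element count is a multiple of
    # `alignment`, so every rank's partition starts on a 4-byte boundary for
    # 2-byte dtypes (fp16/bf16).
    import torch
    from torch._utils import _flatten_dense_tensors

    def flatten_aligned(tensors, alignment):
        total = sum(t.numel() for t in tensors)
        remainder = total % alignment
        if remainder:
            pad = torch.zeros(alignment - remainder,
                              device=tensors[0].device,
                              dtype=tensors[0].dtype)
            tensors = list(tensors) + [pad]
        return _flatten_dense_tensors(tensors)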

@li-yi-dong
Collaborator

We haven't run into any problems so far.

@Baibaifan
Author

We haven't run into any problems so far.

Have you verified precision bit for bit? Or is DeepSpeed's alignment step simply unnecessary?

@li-yi-dong
Collaborator

We haven't run into any problems so far.

Have you verified precision bit for bit? Or is DeepSpeed's alignment step simply unnecessary?

As far as I know, alignment does not affect the correctness of NCCL collective communication; it may affect speed. With the current model dimensions and partitioning scheme, the buffers are already 4-byte aligned.

Precision has been verified against the huggingface implementation both op by op and end to end.
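
To illustrate the alignment point above, a hedged sketch of the check (shard_starts_are_4byte_aligned and the 4096 × 4096 example are illustrative assumptions, not numbers taken from Megatron-LLaMA): with 2-byte dtypes such as fp16/bf16, each rank's shard starts on a 4-byte boundary as long as the per-rank element count is even.

    def shard_starts_are_4byte_aligned(total_numel, dp_world_size, elem_bytes=2):
        # reduce-scatter requires an even split across data-parallel ranks
        assert total_numel % dp_world_size == 0
        shard_numel = total_numel // dp_world_size
        # shard k starts at byte offset k * shard_numel * elem_bytes
        return (shard_numel * elem_bytes) % 4 == 0

    # e.g. a flattened 4096 x 4096 weight sharded across 8 data-parallel ranks
    print(shard_starts_are_4byte_aligned(4096 * 4096, 8))  # True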

@Baibaifan
Author

Baibaifan commented Sep 15, 2023

Precision has been verified against the huggingface implementation both op by op and end to end

Got it, thanks. I have another question: your PR write-up says that Megatron-LM's DistributedOptimizer implements ZeRO-2, but reading the code it looks more like ZeRO-1: after grad_buffer is reduced, the gradients that do not belong to the local rank are not freed. I'm not sure whether my reading is correct. PR write-up

@li-yi-dong
Collaborator

Precision has been verified against the huggingface implementation both op by op and end to end

Got it, thanks. I have another question: your PR write-up says that Megatron-LM's DistributedOptimizer implements ZeRO-2, but reading the code it looks more like ZeRO-1: after grad_buffer is reduced, the gradients that do not belong to the local rank are not freed. I'm not sure whether my reading is correct. PR write-up

I think so too.

@Baibaifan
Author

Precision has been verified against the huggingface implementation both op by op and end to end

Got it, thanks. I have another question: your PR write-up says that Megatron-LM's DistributedOptimizer implements ZeRO-2, but reading the code it looks more like ZeRO-1: after grad_buffer is reduced, the gradients that do not belong to the local rank are not freed. I'm not sure whether my reading is correct. PR write-up

I think so too.

Time to dock a drumstick from the young fellow who wrote the PR write-up.

@yinzhijian

It should be ZeRO-2; the buffer is released right after reduce_scatter_grad.

@Baibaifan
Author

reduce_scatter_grad

Could you paste the code? I'd like to learn from it, thanks!

@yinzhijian

reduce_scatter_grad

Could you paste the code? I'd like to learn from it, thanks!

    def _collect_grad(self, param, group_idx):
        bucket = self._bucket_assignment[group_idx].get_param_bucket(param)
        bucket.collect_param_grad(param)

        if bucket.is_all_grad_collected():
            target_buffer = self._param_buffer[group_idx].get_bucket_receiving_buffer(bucket)
            bucket.reduce_scatter_grad(target_buffer)

Once the bucket has collected all of its gradients, it invokes the reduce-scatter collective and finally returns the borrowed buffer, thereby releasing the gradients (i.e. _grad_buffer).

    @nvtx.annotate("reduce_scatter_grad", color="indigo")
    def reduce_scatter_grad(self, target_buffer):
        assert self.is_all_grad_collected()

        dist.reduce_scatter_tensor(output=target_buffer,
                                   input=self._grad_buffer,
                                   group=self._dp_group,
                                   async_op=False)

        Bucket._grad_buffer_pool.return_buffer(self._borrowed_grad_buffer)
        self._borrowed_grad_buffer = None
        self._grad_buffer = None
        target_buffer.div_(self._num_partitions)
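
As a side note, a minimal sketch of the borrow/return pattern behind Bucket._grad_buffer_pool (GradBufferPool below is hypothetical, not the actual Megatron-LLaMA class): returning the full-size _grad_buffer to a shared pool after the reduce-scatter means only the per-rank shard in target_buffer stays alive, which is the ZeRO-2-style gradient memory saving discussed above.

    import torch

    class GradBufferPool:
        def __init__(self):
            self._free = []

        def borrow_buffer(self, numel, dtype, device):
            # reuse a matching buffer if one was returned earlier
            for i, buf in enumerate(self._free):
                if buf.numel() == numel and buf.dtype == dtype and buf.device == device:
                    return self._free.pop(i)
            return torch.empty(numel, dtype=dtype, device=device)

        def return_buffer(self, buf):
            # the buffer becomes reusable for the next bucket instead of holding
            # a full unsharded gradient copy for the rest of the step
            self._free.append(buf)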

@Baibaifan
Author

Baibaifan commented Sep 19, 2023

_collect_grad

What I said to the author was that "Megatron-LM's DistributedOptimizer implements ZeRO-1, so the PR write-up is incorrect"; I was not talking about Megatron-LLaMA. I'm not sure how you read my question; could you take another look at it?
