Megatron-LLaMA/megatron/optimizer/distrib_optimizer.py
Lines 926 to 939 in 25306de
If I understand correctly, what runs here makes each member of the DP group receive the reduced (summed) gradients only for the parameter shard it maintains, right?
But if so, wouldn't the grad_norm computed later in optimizer.step() be inaccurate?
From what I can see, when grad_norm is computed, each member of the DP group squares and sums the gradients of all the params in its portion of the model, yet only part of each member's gradients has actually been summed across the DP group. The grad_norm computed this way seems wrong.
Does this problem actually exist?
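(For reference, a minimal sketch of the reduce-scatter pattern I mean; this is not the code at the cited lines, and the names here are made up. After it runs, a rank holds DP-summed gradient values only for its own contiguous shard.)

```python
import torch
import torch.distributed as dist

def reduce_scatter_grads(grad_buffer, dp_group):
    """Reduce-scatter a flat gradient buffer across the DP group.

    Returns a tensor holding the DP-summed gradients for this rank's
    contiguous shard only; the rest of grad_buffer stays un-reduced.
    Assumes grad_buffer.numel() is divisible by the DP group size.
    """
    world_size = dist.get_world_size(group=dp_group)
    shard_numel = grad_buffer.numel() // world_size
    shard_out = torch.empty(
        shard_numel, dtype=grad_buffer.dtype, device=grad_buffer.device)
    # Each rank r receives the sum over the DP group of the r-th slice.
    dist.reduce_scatter_tensor(
        shard_out, grad_buffer, op=dist.ReduceOp.SUM, group=dp_group)
    return shard_out
```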
Megatron-LLaMA/megatron/optimizer/clip_grads.py
Line 92 in 25306de
You can take a look at this piece of code.
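In other words, if the referenced clip_grads.py logic all-reduces the squared norm across ranks (as the line above suggests), then each rank contributing only its own shard's sum of squares is exactly what makes the result correct. A minimal sketch of that idea, with illustrative names and not the repository's actual code:

```python
import torch
import torch.distributed as dist

def sharded_global_grad_norm(owned_shard_grads):
    """2-norm over ALL model gradients, computed from per-rank shards.

    Each rank passes only the gradients of the shard it owns (already
    DP-summed by the reduce-scatter), so every parameter contributes to
    exactly one rank's partial sum.
    """
    local_sq_sum = torch.zeros(1, device=owned_shard_grads[0].device)
    for grad in owned_shard_grads:
        local_sq_sum += grad.float().pow(2).sum()
    # Summing the squared partial norms over every rank (DP and MP alike)
    # reassembles the exact global sum of squares.
    dist.all_reduce(local_sq_sum, op=dist.ReduceOp.SUM)
    return local_sq_sum.sqrt().item()
```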