
Memory leak when writing log values to logging_output #173

Open
smxzhangxiaaobo opened this issue Oct 27, 2023 · 3 comments

Comments

@smxzhangxiaaobo

During model training I noticed a memory leak (host RAM, not GPU memory): usage kept climbing as the training epochs went on. I eventually traced it to the following statement in the criterion (loss) class:
logging_output = {
    "loss": loss,
}
Changing it to
logging_output = {
    "loss": loss.data
}
made the leak go away. Question: why does omitting .data cause a memory leak?

Without .data, reduce_metrics receives:
outputs [{'sample_size': 1, 'loss': tensor(5.6797, device='cuda:0', dtype=torch.float16, grad_fn=)}]
With .data it receives:
outputs [{'sample_size': 1, 'loss': tensor(5.6797, device='cuda:0', dtype=torch.float16)}]
So the only visible difference is the extra grad_fn (is that the activation function?) when .data is omitted. Why does that cause the leak?
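
For context, a minimal standalone PyTorch sketch (not taken from this repo's criterion code) showing what the grad_fn in the printout means: any tensor produced by differentiable ops carries a grad_fn linking it back to the computation graph, while .data and .detach() return the same value with that link removed.

import torch

x = torch.randn(4, requires_grad=True)
loss = (x * 2).mean()    # produced by autograd ops, so it carries a grad_fn
print(loss)              # tensor(..., grad_fn=<MeanBackward0>)
print(loss.data)         # same value, no grad_fn
print(loss.detach())     # equivalent, and the form PyTorch recommends today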

@guolinke
Member

This is a common beginner mistake: if the variable is not detached, the computation graph it references is kept alive and never released, so memory keeps growing.
Refer to https://discuss.pytorch.org/t/memory-leak-when-appending-tensors-to-a-list/25937
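
As a minimal sketch of the pattern (a hypothetical model and training loop, not this repository's trainer): every loss stored without detaching keeps its grad_fn, and through it the autograd bookkeeping for that step, so memory grows with the number of stored steps.

import torch

model = torch.nn.Linear(1024, 1024)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
history = []  # stands in for the accumulated logging_output list

for step in range(1000):
    x = torch.randn(64, 1024)
    loss = model(x).pow(2).mean()

    history.append(loss)             # leaks: the stored tensor pins this step's graph objects
    # history.append(loss.detach())  # fix: store only the value (loss.data behaves the same here)

    loss.backward()
    opt.step()
    opt.zero_grad()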

@zhuhui-in

Could you point out which file and line the buggy code you mentioned is in? In my training of a protein-small molecule complex generation task, logging_output already uses loss.data, yet memory still keeps rising as the training epochs increase. Are you seeing the same behavior?

@smxzhangxiaaobo
Author

smxzhangxiaaobo commented Jun 2, 2024 via email
