Memory leak when writing logs to logging_output #173
This is a common beginner mistake: if the variable is not detached, the computation graph is retained and never released, so memory keeps growing.
Could you say which file and line the buggy code you mentioned is in? In my protein–small-molecule complex generation training task, logging_output already uses loss.data, but memory still keeps climbing as training epochs go by. Are you seeing the same thing?
My bug was in a loss class I wrote myself, where I forgot to use loss.data. The Uni-Mol code itself is fine.
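The retention mechanism discussed above can be sketched in plain Python, without PyTorch: holding a reference to an object that points back into a chain (as a loss tensor's grad_fn points back through the computation graph) keeps the whole chain alive, while storing only a plain value lets it be garbage-collected. The `Node` class here is a hypothetical stand-in for an autograd graph node, not any real PyTorch API.

```python
import gc
import weakref


class Node:
    """Hypothetical stand-in for an autograd graph node."""
    def __init__(self, value, parent=None):
        self.value = value
        self.parent = parent  # reference back into the "graph"


# Build a small chain, like the computation graph behind a loss.
root = Node(1.0)
loss = Node(5.6797, parent=root)
probe = weakref.ref(root)  # watch whether the graph root stays alive

# Leaky logging: keeps the node, and with it the whole parent chain.
logging_output = {"loss": loss}
del root, loss
gc.collect()
print(probe() is not None)  # True: graph still reachable via logging_output

# Safe logging: keep only the plain scalar value.
logging_output = {"loss": logging_output["loss"].value}
gc.collect()
print(probe() is None)  # True: the chain has been freed
```

The same principle is why `loss.data` (or `loss.detach()` / `loss.item()`) fixes the leak: the logged value no longer references the graph.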
During model training I found a memory leak (in host RAM, not GPU memory): memory kept rising as training epochs increased. I eventually traced the cause to the following statement in my loss class:
logging_output = {
"loss": loss,
}
Changing it to
logging_output = {
"loss": loss.data
}
fixed it. Question: why does omitting .data cause a memory leak?
Without .data, the data received by reduce_metrics is:
outputs [{'sample_size': 1, 'loss': tensor(5.6797, device='cuda:0', dtype=torch.float16, grad_fn=<...>)}]
With .data it becomes:
outputs [{'sample_size': 1, 'loss': tensor(5.6797, device='cuda:0', dtype=torch.float16)}]
So it looks like the cause is the extra grad_fn (an activation function?) that appears when .data is omitted. Why is that?
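A minimal PyTorch sketch of the difference, assuming torch is installed: grad_fn is not an activation function but the autograd graph node recording how the tensor was produced. As long as the logged tensor carries it, the entire backward graph, including the intermediate activations it saved, stays reachable and cannot be freed. `.data`, `.detach()`, and `.item()` all drop that link.

```python
import torch

x = torch.randn(4, requires_grad=True)
loss = (x * 2).mean()  # produces a tensor with a grad_fn attached

print(loss.grad_fn is not None)        # True: backward graph attached
print(loss.detach().grad_fn is None)   # True: detached copy drops the graph

# Safe logging patterns: none of these retain the computation graph.
logging_output = {"loss": loss.detach()}   # tensor without grad_fn
# logging_output = {"loss": loss.data}     # older equivalent of detach()
# logging_output = {"loss": loss.item()}   # plain Python float
```

Note that `.detach()` (or `.item()`) is generally preferred over `.data` in current PyTorch, but all three break the reference to the graph and stop the leak.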