[Feature]: Activation Checkpoint #666

Open
samplise opened this issue Sep 6, 2023 · 0 comments
samplise commented Sep 6, 2023

Activation checkpointing is a technique that reduces the memory footprint on a single GPU by trading computation for memory. When a checkpoint is applied to a group of consecutive layers, only the output of the last layer in the group is cached for the backward pass; the other intermediate outputs are discarded during the forward pass. During the backward pass, recomputation is triggered to reproduce those intermediate outputs temporarily for gradient computation. As a result, the memory consumed by intermediate activations can be significantly reduced, freeing memory to accommodate larger models. Because this technique does not shard tensors, it is orthogonal to other parallelization techniques.
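As a concrete illustration, here is a minimal sketch assuming a PyTorch setup and using `torch.utils.checkpoint.checkpoint_sequential`; the toy model and the segment count are hypothetical, chosen only to show the trade-off of recomputing intermediate activations instead of storing them:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Hypothetical stack of 8 consecutive blocks used only for illustration.
model = nn.Sequential(
    *[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(8)]
)

x = torch.randn(32, 1024, requires_grad=True)

# Split the 8 blocks into 2 checkpointed segments: during the forward pass
# only the segment-boundary outputs are kept; the activations inside each
# segment are discarded and recomputed during the backward pass.
out = checkpoint_sequential(model, 2, x)
out.sum().backward()
```

Choosing how many segments to use (and which layers to group) is exactly the placement decision this issue is about: more checkpointed segments save more activation memory but add more recomputation during the backward pass.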

This issue tracks finding the best strategy for deciding where to place activation checkpoints in the training pipeline, so as to achieve the best running time under a given memory budget.
