[Feature]: Activation Checkpoint #666

Open
samplise opened this issue Sep 6, 2023 · 0 comments
samplise commented Sep 6, 2023

Activation checkpointing is a technique that reduces the memory footprint on a single GPU by trading computation for memory. When a checkpoint is applied to a group of consecutive layers, only the output of the last layer in the group is cached for the backward pass; the other intermediate outputs are discarded during the forward pass. During the backward pass, recomputation is triggered to reproduce those intermediate outputs temporarily for gradient computation. As a result, the memory consumed by intermediate activations can be significantly reduced, freeing memory to accommodate larger models. Because this technique does not shard tensors, it is orthogonal to other parallelization techniques.
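As a concrete illustration, here is a minimal sketch assuming a PyTorch setup and using `torch.utils.checkpoint.checkpoint_sequential`; the toy model and the segment count are hypothetical, chosen only to show the trade-off of recomputing intermediate activations instead of storing them:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Hypothetical stack of 8 consecutive blocks used only for illustration.
model = nn.Sequential(
    *[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(8)]
)

x = torch.randn(32, 1024, requires_grad=True)

# Split the 8 blocks into 2 checkpointed segments: during the forward pass
# only the segment-boundary outputs are kept; the activations inside each
# segment are discarded and recomputed during the backward pass.
out = checkpoint_sequential(model, 2, x)
out.sum().backward()
```

Choosing how many segments to use (and which layers to group) is exactly the placement decision this issue is about: more checkpointed segments save more activation memory but add more recomputation during the backward pass.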

This issue tracks finding the best strategy for deciding where to place activation checkpoints in the training pipeline, so as to achieve the best running time under a given memory budget.
