-
This is only the memory needed to store the parameters in GPU memory, before doing any work with them. During training you also need to hold the gradients and the optimizer states, which is what explains the actual memory usage being 2x-3x bigger. On top of that, memory usage grows with the batch size. Full fine-tuning is slow and memory-hungry; you can look into LoRA and QLoRA strategies, which provide comparable (or sometimes better) performance with much lower memory usage.
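For reference, a minimal sketch of the LoRA route with Hugging Face's peft library (the model name and LoRA hyperparameters below are illustrative, not a recommendation):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative checkpoint; swap in whatever base model you are fine-tuning.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,                         # scaling applied to the update
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter weights require gradients
```

Only the adapter parameters get gradients and optimizer states, which is where most of the memory saving comes from; QLoRA additionally keeps the frozen base weights quantized to 4-bit.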
-
This question is more focused on full fine-tuning memory requirements rather than low-memory / efficient inference, but I'm hoping it'll be relevant and helpful to community members here, especially as fine-tuning with llama.cpp graduates from being an experimental feature!
I'm working on fine-tuning LLMs of various sizes and I'm trying to get an understanding of the GPU memory requirements for training them. According to the rough rule of thumb in this article, each 1B parameters should cost me 4GB at float32 for the weights plus 8GB for the optimizer states. That works out to 12GB total per billion parameters at float32, or 10GB per billion parameters at float16 (2GB for the weights plus the 8GB of optimizer states).
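In code, the arithmetic I'm using is just the following (it follows the rule of thumb above and deliberately ignores gradients, activations, and framework overhead):

```python
def rough_training_memory_gb(params_billion: float, bytes_per_param: int = 4) -> float:
    """Rule-of-thumb estimate: weights + optimizer states only."""
    weights_gb = params_billion * bytes_per_param   # 4 GB per 1B at float32, 2 GB per 1B at float16
    optimizer_gb = params_billion * 8               # ~8 GB per 1B params for optimizer states
    return weights_gb + optimizer_gb

print(rough_training_memory_gb(3, bytes_per_param=2))  # 3B model at float16 -> 30.0 GB
print(rough_training_memory_gb(7, bytes_per_param=2))  # 7B model at float16 -> 70.0 GB
```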
I'm currently working on training 3B and 7B models (Llama 2) using HF accelerate + FSDP. I'm training in float16 with a batch size of 2 (I've also tried 1). Based on my math I should need somewhere on the order of 30GB of GPU memory for the 3B model and 70GB for the 7B model. In practice I'm fine-tuning with GPU memory on the order of 2x-3x the estimates above, and I'm still getting CUDA OOMs trying to allocate on the order of 80GB of memory for the 3B model.

I'm also seeing indications of far larger memory requirements when reading about fine-tuning other LLMs. According to this article, a 176B parameter BLOOM model takes 5760GB of GPU memory, which is roughly 32GB per 1B parameters, and I've seen mentions of using 8x A100s for fine-tuning Llama 2, which is nearly 10x what I'd expect based on the rule of thumb. So it's clear that my understanding of this is wrong, and I'm hoping someone can help me get an intuitive understanding of two things:
Any insights on either of these would be greatly appreciated!
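In case it's useful, here's how I've been reading the peak memory that blows past my estimates (a minimal, self-contained sketch; the tiny model and dummy batch stand in for my real accelerate + FSDP loop, but the measurement calls are the same):

```python
import torch
import torch.nn as nn

# Tiny stand-in model and batch; in my real run this is the 3B/7B checkpoint
# wrapped by accelerate/FSDP, but the peak-memory bookkeeping is identical.
model = nn.Linear(4096, 4096).cuda()
optimizer = torch.optim.AdamW(model.parameters())
batch = torch.randn(2, 4096, device="cuda")

torch.cuda.reset_peak_memory_stats()

loss = model(batch).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()

print(f"peak allocated this step: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
```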