-
This is only the memory needed to store the parameters in GPU memory, before doing any work with them. During training you also need to hold the gradients and the optimizer states, which is what explains the actual memory usage being 2x-3x bigger. On top of that, memory usage grows with the batch size. Full fine-tuning is slow and memory-hungry; you can look into LoRA and QLoRA strategies, which provide comparable (or sometimes better) performance with much lower memory usage.
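For reference, a minimal sketch of the LoRA route with Hugging Face's peft library (the model name and LoRA hyperparameters below are illustrative, not a recommendation):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative checkpoint; swap in whatever base model you are fine-tuning.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,                         # scaling applied to the update
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter weights require gradients
```

Only the adapter parameters get gradients and optimizer states, which is where most of the memory saving comes from; QLoRA additionally keeps the frozen base weights quantized to 4-bit.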
-
This question is more focused on full fine-tuning memory requirements rather than low-memory / efficient inference, but I'm hoping it'll be relevant and helpful to community members here, especially as fine-tuning with llama.cpp graduates from being an experimental feature!
I'm working on fine-tuning LLMs of various sizes and I'm trying to get an understanding of the GPU memory requirements for training them. According to the rough rule of thumb in this article, each 1B parameters should cost me 4GB at float32 for the weights plus 8GB for the optimizer states. That works out to 12GB total per billion parameters at float32, or 10GB per billion parameters at float16 (2GB for the weights plus the 8GB of optimizer states).
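In code, the arithmetic I'm using is just the following (it follows the rule of thumb above and deliberately ignores gradients, activations, and framework overhead):

```python
def rough_training_memory_gb(params_billion: float, bytes_per_param: int = 4) -> float:
    """Rule-of-thumb estimate: weights + optimizer states only."""
    weights_gb = params_billion * bytes_per_param   # 4 GB per 1B at float32, 2 GB per 1B at float16
    optimizer_gb = params_billion * 8               # ~8 GB per 1B params for optimizer states
    return weights_gb + optimizer_gb

print(rough_training_memory_gb(3, bytes_per_param=2))  # 3B model at float16 -> 30.0 GB
print(rough_training_memory_gb(7, bytes_per_param=2))  # 7B model at float16 -> 70.0 GB
```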
I'm currently working on training 3B and 7B models (Llama 2) using HF accelerate + FSDP. I'm training in float16 with a batch size of 2 (I've also tried 1). Based on my math I should need somewhere on the order of 30GB of GPU memory for the 3B model and 70GB for the 7B model. In practice I'm fine-tuning with GPU memory on the order of 2x-3x the estimates above, and I'm still getting CUDA OOMs trying to allocate on the order of 80GB of memory for the 3B model.

I'm also seeing indications of far larger memory requirements when reading about fine-tuning other LLMs. According to this article, a 176B parameter BLOOM model takes 5760GB of GPU memory, which is roughly 32GB per 1B parameters, and I've seen mentions of using 8x A100s for fine-tuning Llama 2, which is nearly 10x what I'd expect based on the rule of thumb. So it's clear that my understanding of this is wrong, and I'm hoping someone can help me get an intuitive understanding of two things:
Any insights on either of these would be greatly appreciated!
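In case it's useful, here's how I've been reading the peak memory that blows past my estimates (a minimal, self-contained sketch; the tiny model and dummy batch stand in for my real accelerate + FSDP loop, but the measurement calls are the same):

```python
import torch
import torch.nn as nn

# Tiny stand-in model and batch; in my real run this is the 3B/7B checkpoint
# wrapped by accelerate/FSDP, but the peak-memory bookkeeping is identical.
model = nn.Linear(4096, 4096).cuda()
optimizer = torch.optim.AdamW(model.parameters())
batch = torch.randn(2, 4096, device="cuda")

torch.cuda.reset_peak_memory_stats()

loss = model(batch).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()

print(f"peak allocated this step: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
```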