
text_to_image multi-gpu not working #7897

Open
Sunflower54 opened this issue May 9, 2024 · 5 comments

Comments

@Sunflower54

We are training text_to_image on Google Cloud Platform; the JupyterLab instance has 2 GPUs (NVIDIA Tesla P100) with a total memory of 32 GB (16 GB each). I tried using accelerate to train the text_to_image model with multi-GPU support, but I am still getting an out-of-memory error. Even with 32 GB available, I don't understand why it only uses 16 GB of memory.

Command used: accelerate launch --multi_gpu train_text_to_image.py --pretrained_model_name_or_path=$MODEL_NAME --train_data_dir=$DATASET_DIR --image_column="image" --caption_column="text" --output_dir=$OUTPUT_DIR --train_batch_size=2 --resolution=512 --gradient_accumulation_steps=5 --num_train_epochs=1000 --learning_rate=1e-06 --gradient_checkpointing --enable_xformers_memory_efficient_attention

[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 114.00 MiB. GPU has a total capacity of 15.89 GiB of which 89.12 MiB is free. Including non-PyTorch memory, this process has 15.80 GiB memory in use. Of the allocated memory 15.35 GiB is allocated by PyTorch, and 71.31 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
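
For reference, the allocator hint suggested in the error message has to be in the environment before CUDA is initialized. A minimal sketch of one way to do that (setting it at the very top of the training script); note it only mitigates fragmentation and does not add capacity:

```python
# Minimal sketch: set the allocator hint from the error message before any
# CUDA context exists (i.e. before importing torch). This reduces
# fragmentation of reserved-but-unallocated memory; it does NOT increase the
# 16 GiB capacity of each GPU.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported only after the environment variable is in place
```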


Any help will be much appreciated. Thanks.

@bghira
Contributor

bghira commented May 9, 2024

I'm not sure which model you're training with it, but it looks like you're running into the classic problem with DDP training, aka Distributed Data Parallel.

This style of multi-GPU training runs a single instance of the trainer on each GPU and loads everything identically on both. That means when using 2x 16 GB GPUs you don't have access to a pooled 32 GB, just 2x separate 16 GB budgets.
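
To illustrate, a minimal sketch of what DDP does under the hood (the tiny `Sequential` model and the `torchrun` launch are placeholders, not the Diffusers trainer): every rank keeps a full copy of the model in its own GPU memory and only gradients are synchronized.

```python
# Launch with e.g. `torchrun --nproc_per_node=2 ddp_sketch.py`.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Sequential(            # stand-in for the UNet / text encoder
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
).to(local_rank)

# Each rank holds ALL parameters; DDP only adds gradient all-reduce hooks,
# so per-GPU memory equals the full model, not half of it.
ddp_model = DDP(model, device_ids=[local_rank])
print(f"rank {dist.get_rank()}: "
      f"{torch.cuda.memory_allocated(local_rank) / 2**20:.1f} MiB allocated")

dist.destroy_process_group()
```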

What you're looking for to split a model across two GPUs is called FSDP, fully sharded data parallel, which effectively splits layers across devices and has a high communication overhead between GPUs. This kind of setup benefits a lot more from NVLink, and it also isn't supported in the Diffusers example trainers, or really any publicly accessible diffusion training toolkit that I'm aware of.
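
For contrast, a hedged sketch of what FSDP wrapping looks like in plain PyTorch (again with a placeholder model; the Diffusers example scripts don't do this): parameters, gradients and optimizer state are sharded across ranks instead of replicated.

```python
# Launch with e.g. `torchrun --nproc_per_node=2 fsdp_sketch.py`.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Sequential(            # stand-in for the diffusion UNet
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
).to(local_rank)

# Parameters are flattened and sharded across the two GPUs; full layers are
# materialized only transiently during forward/backward, which is where the
# extra inter-GPU communication comes from.
fsdp_model = FSDP(model, device_id=local_rank)
print(f"rank {dist.get_rank()}: "
      f"{torch.cuda.memory_allocated(local_rank) / 2**20:.1f} MiB allocated")

dist.destroy_process_group()
```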

@Sunflower54
Author

Hello, I am using Stable Diffusion 2.1 as the model. Is FSDP not supported for Stable Diffusion? Is there any alternative way to train the model?

@bghira
Contributor

bghira commented May 10, 2024

pytorch/pytorch#91165

FSDP isn't supported by PyTorch in general.

You need GPUs with more VRAM, and in my experience GCP is one of the most expensive routes to get them.

@Sunflower54
Author

We have to use GCP at the office, as there's no access to physical GPUs. Even with accelerate or --multi_gpu, we can't run the PyTorch models on GCP?

@bghira
Contributor

bghira commented May 16, 2024

What I meant is that a 16 GB GPU through GCP is not as cost-effective as other platforms like Vast or RunPod, where you can likely rent a single 48 GB GPU for less than a dual 16 GB instance on GCP.

You can possibly get away with low-rank (LoRA) training on the two 16 GB devices, but as they lack native bf16 support (IIRC), they are limited in utility.
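
A hedged sketch of what the LoRA route changes, assuming a recent diffusers release with PEFT integration and the "stabilityai/stable-diffusion-2-1" checkpoint (mixed precision would then be fp16 rather than bf16 on a P100): the base UNet weights are frozen and only small adapter matrices are optimized, which is why it can fit where a full fine-tune cannot.

```python
from diffusers import UNet2DConditionModel
from peft import LoraConfig

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-2-1", subfolder="unet"
)
unet.requires_grad_(False)              # freeze all base UNet weights

lora_config = LoraConfig(
    r=8,                                # adapter rank; small => small memory cost
    lora_alpha=8,
    init_lora_weights="gaussian",
    target_modules=["to_k", "to_q", "to_v", "to_out.0"],  # attention projections
)
unet.add_adapter(lora_config)           # only these adapters are trainable

trainable = sum(p.numel() for p in unet.parameters() if p.requires_grad)
total = sum(p.numel() for p in unet.parameters())
print(f"trainable params: {trainable:,} / {total:,}")
```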
