
CPU memory leak when training with CsvDataset-like dataset #849

Open

estherxue opened this issue Mar 31, 2024 · 10 comments
@estherxue

Hi, I tried to train with SigLIP loss on a large dataset and found that during training (not evaluation), CPU memory usage kept increasing until the program was finally killed by the system. The data loading process is nothing special, similar to what CsvDataset does. Has anyone encountered a similar problem?

@rwightman
Collaborator

@estherxue does it behave differently than with normal CLIP (infonce) loss on the exact setup?

@estherxue
Author

> @estherxue does it behave differently than with normal CLIP (infonce) loss on the exact setup?

Below is the running script for siglip:
torchrun --nproc_per_node 1 \
  --nnodes $WORLD_SIZE \
  --node_rank $RANK \
  --master_addr $MASTER_ADDR \
  --master_port $MASTER_PORT \
  -m training.main \
  --train-data '' \
  --val-data '' \
  --dataset-type hugface \
  --batch-size 512 \
  --precision amp \
  --csv-img-key image_hash \
  --csv-caption-key caption \
  --local-loss \
  --gather-with-grad \
  --logs /home/work/data_mm_pretrain/models/siglip_b16_60m_large_bs_no_wd/ \
  --name large_bs \
  --workers 12 \
  --epochs 10 \
  --model ViT-B-16-SigLIP \
  --pretrained webli \
  --warmup 0 \
  --beta2 0.95 \
  --lr 5e-5 \
  --wd 0. \
  --torchcompile \
  --siglip

Below is the running script for clip:
torchrun --nproc_per_node 1 \
  --nnodes $WORLD_SIZE \
  --node_rank $RANK \
  --master_addr $MASTER_ADDR \
  --master_port $MASTER_PORT \
  -m training.main \
  --train-data '' \
  --val-data '' \
  --dataset-type hugface \
  --batch-size 768 \
  --precision amp \
  --csv-img-key image_hash \
  --csv-caption-key caption \
  --local-loss \
  --gather-with-grad \
  --logs /home/work/data_mm_pretrain/models/clip_b32_id_2.5m_baseline/ \
  --name large_bs \
  --workers 12 \
  --epochs 4 \
  --model ViT-B-32-quickgelu \
  --pretrained openai \
  --use-thumbnail \
  --warmup 0 \
  --lr 5e-5 \
  --wd 0. \
  --torchcompile
The only difference in behavior that I notice between the two runs is the CPU memory usage.

@miguelalba96

Did you try training with standard CLIP loss?

@darkasevgen

Hi, are there any updates?

@estherxue
Author

> Did you try training with standard CLIP loss?

I tried training with standard CLIP loss. I was wrong: standard CLIP loss also has the memory leak problem.

@estherxue
Author

> Hi, are there any updates?

I finally worked around this problem by training with a limited number of batches per epoch, since memory usage drops back down once the code finishes an epoch.

@estherxue
Author

It seems that this has nothing to do with the loss. The memory leak exists when doing evaluation.

@miguelalba96

miguelalba96 commented Apr 26, 2024

I did some research on CPU memory leaks. People say that most of the time they appear when tensors are accumulated without being detached (they then carry the entire computational graph with them), or from data loader issues such as copy-on-access: storing naive Python objects in the dataset definition whose reference counts get incremented whenever they are accessed by multiple processes (dataloader workers).
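
For illustration, here is a minimal sketch of the two pitfalls above (the class and field names are hypothetical, not from this repo):

```python
from torch.utils.data import Dataset

# Pitfall 1: accumulating tensors that are still attached to the autograd graph.
# losses.append(loss)         # leaks: every element keeps its whole graph alive
# losses.append(loss.item())  # fine: a plain Python float

# Pitfall 2: copy-on-access with naive Python containers.
# CPython updates an object's refcount on every access, which dirties the memory
# page holding it; in forked dataloader workers those pages are then copied, so
# RSS keeps growing as more of the dataset gets touched.
class LeakyCsvStyleDataset(Dataset):  # hypothetical example
    def __init__(self, rows):
        # a big list of Python strings -> one refcounted object per caption
        self.captions = [r["caption"] for r in rows]

    def __len__(self):
        return len(self.captions)

    def __getitem__(self, idx):
        return self.captions[idx]
```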

These resources might help you to debug:

If you find something, let me know. I am experiencing RAM memory leaks too when fine-tuning CLIP using LoRA and the standard distributed CLIP loss implemented in this repo, but I use the PyTorch Lightning Fabric launcher and the MosaicML streaming dataset instead of WebDataset.

@rwightman
Collaborator

rwightman commented May 9, 2024

We've done a lot of large-scale training, long durations, big datasets, and never found any noteworthy memory leak issues with the dataloader and the webdataset code. We don't use csv datasets though, so possibly an issue there.

There is significant memory churn when you're plowing through really large datasets, and some allocators have issues with fragmentation over time. I usually patch the allocator to use tcmalloc by launching with LD_PRELOAD=/lib/<system dependent>/libtcmalloc.so.4 ...; apt-get install google-perftools to get the lib.

Should point out that normal 'validation' is VERY memory intensive if you have a lot of samples in your val dataset, since it computes a full similarity matrix; it should be treated as a 'gallery'-style dataset with a hand-picked, limited set of test samples. That could really spike memory. We usually use zero-shot eval to gauge progress, as it's more sane to run across larger val sets and is often the metric most focus on (though there are valid arguments for preferring other val metrics too).

A batch-wise evaluation (averaging over batched similarities) would be possible but is not implemented.
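
To make the memory argument concrete, here is a rough sketch (not the repo's actual eval code; the sizes and the chunked top-1 metric are made up for illustration) contrasting the full similarity matrix with a batch-wise reduction:

```python
import torch

# With e.g. 500k val samples, a full float32 similarity matrix would need
# 500_000 * 500_000 * 4 bytes ~ 1 TB. Toy sizes below so this actually runs.
num_samples, dim = 5_000, 512
image_feats = torch.randn(num_samples, dim)
text_feats = torch.randn(num_samples, dim)

# Full matrix: fine at this toy size, explodes on a large val set.
full_sims = image_feats @ text_feats.T  # (num_samples, num_samples)

# A batch-wise variant keeps memory bounded by computing one chunk of rows
# at a time and reducing the metric immediately.
chunk = 512
top1_hits = 0
for start in range(0, num_samples, chunk):
    block = image_feats[start:start + chunk] @ text_feats.T  # (chunk, num_samples)
    preds = block.argmax(dim=1)
    targets = torch.arange(start, start + block.shape[0])
    top1_hits += (preds == targets).sum().item()
print("image->text top-1 recall:", top1_hits / num_samples)
```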

@estherxue estherxue changed the title CPU memory leak when training with siglip loss CPU memory leak when training with CsvDataset-like dataset May 11, 2024
@jn2clark

Using CSV datasets with the native implementation will lead to an increase in memory. As @miguelalba96 linked, it is not a bug but expected behavior. The solution is either:

  1. use a different dataset format like webdataset (this is streaming, though, not map-style), or
  2. if you want map-style with minimal changes, move to another backing structure that has no copy-on-read; PyArrow has worked in the past (see the sketch below).
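
A minimal sketch of option 2, assuming a hypothetical PyArrow-backed map-style dataset (the class and column names are illustrative, not from this repo):

```python
import pyarrow as pa
from torch.utils.data import Dataset

class ArrowCaptionDataset(Dataset):  # hypothetical example
    """Map-style dataset whose rows live in a PyArrow table.

    The strings sit in contiguous Arrow buffers rather than one refcounted
    Python object per row, so dataloader workers reading the table do not
    trigger the copy-on-read growth seen with plain Python lists/DataFrames.
    """

    def __init__(self, image_paths, captions):
        self.table = pa.table({"image_path": image_paths, "caption": captions})

    def __len__(self):
        return self.table.num_rows

    def __getitem__(self, idx):
        # .as_py() materialises only the single requested row as Python objects
        return (
            self.table["image_path"][idx].as_py(),
            self.table["caption"][idx].as_py(),
        )
```

Usage would be the same as any map-style dataset: build it once from the CSV columns and wrap it in a DataLoader with multiple workers.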
