Gradient overflow when training 13B Llama model on 7 A100s #19

Open
awrd2019 opened this issue May 4, 2023 · 1 comment

awrd2019 commented May 4, 2023

[Screenshot of the training log showing the gradient overflow / skipped step messages]

I'm getting gradient overflow and a skipped step every two or so steps while training the 13B Llama model on 7 A100s with a context window of 512. The command line I run is below. When I tried a stage 3 config, or tried to remove gradient accumulation steps, the GPUs ran out of memory while loading the model at the start of training. Any suggestions on how to get rid of the gradient overflow issue, or on how to partition the model and load parts of it onto multiple GPUs at the start of training? I would be super grateful for help!

deepspeed --num_gpus=7 run_clm.py --deepspeed ds_config_stage2.json --model_name_or_path decapoda-research/llama-13b-hf --train_file train.csv --validation_file validation.csv --do_train --do_eval --bf16 --overwrite_cache --evaluation_strategy=steps --output_dir finetuned --num_train_epochs 1 --eval_steps 400 --gradient_accumulation_steps 3 --per_device_train_batch_size 2 --use_fast_tokenizer False --learning_rate 5e-06 --warmup_steps 10 --save_total_limit 1 --save_steps 400 --save_strategy steps --load_best_model_at_end=True --block_size=512 --report_to=wandb
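A hedged note and sketch, not the repo's actual config: DeepSpeed's "gradient overflow, skipping step" messages typically come from the fp16 dynamic loss scaler, so if ds_config_stage2.json enables fp16 while --bf16 is passed on the command line, that mismatch could explain the skipped steps; disabling fp16 and enabling bf16 in the JSON should avoid the loss scaler entirely. Moving to ZeRO stage 3 with CPU offload also shards the 13B parameters across the ranks instead of replicating them, which is what the load-time OOM needs. The file name ds_config_stage3_bf16.json is made up for illustration, and the "auto" values assume the Hugging Face Trainer integration used by run_clm.py, which fills them in from the command-line flags.

{
  "bf16": { "enabled": true },
  "fp16": { "enabled": false },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto"
}

The launcher line would then point at this file instead, e.g. deepspeed --num_gpus=7 run_clm.py --deepspeed ds_config_stage3_bf16.json with the remaining flags unchanged. With stage 3 plus parameter offload, each rank should only materialize its own shard of the weights at startup rather than a full copy of the model.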

mallorbc (Owner) commented

Odd. Looking at the DeepSpeed GitHub may yield some help.
