
Program hangs with no output #626

Open · Luo-Z13 opened this issue Apr 28, 2024 · 11 comments

Luo-Z13 commented Apr 28, 2024

I am running instruction tuning of llava_llama3 on my own dataset with the following script:

NPROC_PER_NODE=${GPU_NUM} xtuner train llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune \
                                       --deepspeed deepspeed_zero3_offload --seed 1024

After the output below, the program stops producing any new logs but keeps running:

 - mmengine - INFO - Iter(train) [   10/23076]  lr: 1.3034e-07  eta: 3 days, 3:30:35  time: 11.7851  data_time: 0.0298  memory: 15547  loss: nan
 - mmengine - INFO - Iter(train) [   20/23076]  lr: 2.7506e-07  eta: 3 days, 5:46:56  time: 12.5050  data_time: 0.0199  memory: 9964  loss: nan

The training has stayed in this state for 2 hours. What could be the possible cause?

LZHgrla (Collaborator) commented Apr 29, 2024

@Luo-Z13
The total number of iterations looks a bit strange. Did you modify any settings in the config?

Luo-Z13 (Author) commented Apr 29, 2024

> The total number of iterations looks a bit strange. Did you modify any settings in the config?

My script:

NPROC_PER_NODE=${GPU_NUM} xtuner train llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune \
                                       --deepspeed deepspeed_zero3_offload --seed 1024

The training schedule:

# Scheduler & Optimizer
batch_size = 4  # per_device
accumulative_counts = 4
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 1e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1  # grad clip
warmup_ratio = 0.03

Then I modified save_steps and changed the paths to point to my own data and local model weights. Besides that, there were no other changes.
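
Concretely, the edits were along these lines (a rough sketch; the paths are placeholders for my local copies, and the variable names are as I recall them from the stock config):

# Model: point to local checkpoints instead of the hub ids
llm_name_or_path = '/path/to/Meta-Llama-3-8B-Instruct'
visual_encoder_name_or_path = '/path/to/clip-vit-large-patch14-336'

# Data: my own instruction-tuning annotations and image root
data_path = '/path/to/my_finetune_data.json'
image_folder = '/path/to/my_images'

# Checkpointing: adjusted save interval (placeholder value)
save_steps = 500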

LZHgrla (Collaborator) commented Apr 29, 2024

@Luo-Z13
How many GPUs are you using for training?

Luo-Z13 (Author) commented Apr 29, 2024

> How many GPUs are you using for training?

I am using 4 × A100 (40 GB) GPUs.

Luo-Z13 (Author) commented Apr 29, 2024

> How many GPUs are you using for training?

Also, the pre-training stage of LLaVA-Llama-3 ran normally.

LZHgrla (Collaborator) commented Apr 29, 2024

@Luo-Z13

Under your configuration, the implied dataset size is 4 (per-device batch size) * 4 (GPUs) * 23076 (iterations) = 369216 samples. However, the standard LLaVA fine-tuning dataset has ~650000 samples. This size mismatch seems a bit unusual. Have you modified the training data?
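
For clarity, here is that estimate spelled out with the values from your schedule and GPU count (just a restatement of the arithmetic, as a small Python sketch):

per_device_batch_size = 4      # batch_size in your config
num_gpus = 4                   # 4 x A100, i.e. NPROC_PER_NODE
iters_per_epoch = 23076        # total iterations from your log, with max_epochs = 1

implied_dataset_size = per_device_batch_size * num_gpus * iters_per_epoch
print(implied_dataset_size)    # 369216, versus ~650000 samples in the standard LLaVA fine-tuning set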

Luo-Z13 (Author) commented Apr 29, 2024

> Under your configuration, the implied dataset size is 4 (per-device batch size) * 4 (GPUs) * 23076 (iterations) = 369216 samples, whereas the standard LLaVA fine-tuning dataset has ~650000 samples. This size mismatch seems a bit unusual. Have you modified the training data?

Hello, I'm using my own instruction-tuning data, so the total number of iterations is different. Do I need to check the format of my dataset?

LZHgrla (Collaborator) commented Apr 29, 2024

@Luo-Z13

Yes, that's possible. I suggest comparing your data format and content with llava's to see if the issue lies within the data.

Additionally, here are some other suggestions:

  1. Keep the global batch size at 128. In your case, consider setting accumulative_counts to 8 (see the sketch below).
  2. Adjust the learning rate to 2e-5. (Of course, 1e-5 should also work fine.)
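
For concreteness, those two changes would look roughly like this in the config (variable names taken from the schedule you posted; everything else stays the same):

# Scheduler & Optimizer (suggested)
batch_size = 4              # per_device, unchanged
accumulative_counts = 8     # 4 GPUs * 4 per device * 8 accumulation = global batch of 128
lr = 2e-5                   # 1e-5 should also work fine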

Luo-Z13 (Author) commented Apr 29, 2024

> Yes, that's possible. I suggest comparing your data format and content with llava's to see if the issue lies within the data.
>
> Additionally, here are some other suggestions:
>
>   1. Keep the global batch size at 128. In your case, consider setting accumulative_counts to 8.
>   2. Adjust the learning rate to 2e-5. (Of course, 1e-5 should also work fine.)

Thank you very much, I will try them.

Luo-Z13 (Author) commented Apr 30, 2024

> Yes, that's possible. I suggest comparing your data format and content with llava's to see if the issue lies within the data.
>
> Additionally, here are some other suggestions:
>
>   1. Keep the global batch size at 128. In your case, consider setting accumulative_counts to 8.
>   2. Adjust the learning rate to 2e-5. (Of course, 1e-5 should also work fine.)

Thank you for your suggestions. The loss is now normal, but there is a new problem: after training for a while, the iteration speed becomes very slow, as shown below:

...
04/30 01:03:42 - mmengine - INFO - Iter(train) [  100/23072]  lr: 2.8656e-06  eta: 1 day, 13:54:11  time: 5.0159  data_time: 0.0117  memory: 9195  loss: 1.8088
04/30 01:04:31 - mmengine - INFO - Iter(train) [  110/23072]  lr: 3.1550e-06  eta: 1 day, 13:16:56  time: 4.8975  data_time: 0.0158  memory: 9167  loss: 1.3998
04/30 01:05:52 - mmengine - INFO - Iter(train) [  120/23072]  lr: 3.4444e-06  eta: 1 day, 14:26:13  time: 8.0493  data_time: 0.0101  memory: 9146  loss: 1.3203
04/30 01:06:42 - mmengine - INFO - Iter(train) [  130/23072]  lr: 3.7339e-06  eta: 1 day, 13:56:50  time: 5.0641  data_time: 0.0210  memory: 9125  loss: 1.2123
04/30 01:07:35 - mmengine - INFO - Iter(train) [  140/23072]  lr: 4.0233e-06  eta: 1 day, 13:37:29  time: 5.2818  data_time: 0.0184  memory: 9104  loss: 1.0494
04/30 03:07:10 - mmengine - INFO - Iter(train) [  150/23072]  lr: 4.3127e-06  eta: 14 days, 3:40:15  time: 717.5106  data_time: 0.0726  memory: 9090  loss: 0.8822
04/30 03:42:26 - mmengine - INFO - Iter(train) [  160/23072]  lr: 4.6022e-06  eta: 16 days, 18:28:43  time: 211.6158  data_time: 0.1037  memory: 9069  loss: 0.8258
04/30 06:08:50 - mmengine - INFO - Iter(train) [  170/23072]  lr: 4.8916e-06  eta: 29 days, 11:20:41  time: 878.3878  data_time: 0.1637  memory: 9055  loss: 0.7201
04/30 09:23:10 - mmengine - INFO - Iter(train) [  180/23072]  lr: 5.1810e-06  eta: 44 days, 23:40:10  time: 1165.9963  data_time: 0.1712  memory: 9041  loss: 0.7931

What could be the cause of this? @LZHgrla

LZHgrla (Collaborator) commented May 8, 2024

@Luo-Z13
It seems to be caused by fluctuations in machine performance. Can this issue be reliably reproduced, and which commands did you use?
