
Program hangs with no output #626

Open · Luo-Z13 opened this issue Apr 28, 2024 · 11 comments

Luo-Z13 commented Apr 28, 2024

I am running instruction tuning of llava_llama3 on my own dataset with the following script:

NPROC_PER_NODE=${GPU_NUM} xtuner train llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune \
                                       --deepspeed deepspeed_zero3_offload --seed 1024

After the output below, the program stops producing any new logs but keeps running:

 - mmengine - INFO - Iter(train) [   10/23076]  lr: 1.3034e-07  eta: 3 days, 3:30:35  time: 11.7851  data_time: 0.0298  memory: 15547  loss: nan
 - mmengine - INFO - Iter(train) [   20/23076]  lr: 2.7506e-07  eta: 3 days, 5:46:56  time: 12.5050  data_time: 0.0199  memory: 9964  loss: nan

The training has stayed in this state for 2 hours. What could be the possible cause?

LZHgrla (Collaborator) commented Apr 29, 2024

@Luo-Z13
The total number of iterations looks a bit strange. Did you modify any settings in the config?

Luo-Z13 (Author) commented Apr 29, 2024

> The total number of iterations looks a bit strange. Did you modify any settings in the config?

My script:

NPROC_PER_NODE=${GPU_NUM} xtuner train llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune \
                                       --deepspeed deepspeed_zero3_offload --seed 1024

The training schedule:

# Scheduler & Optimizer
batch_size = 4  # per_device
accumulative_counts = 4
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 1e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1  # grad clip
warmup_ratio = 0.03

Then I modified save_steps and changed the paths to point to my own data and local model weights. Besides that, there were no other changes.
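
Concretely, the edits were along these lines (a rough sketch; the paths are placeholders for my local copies, and the variable names are as I recall them from the stock config):

# Model: point to local checkpoints instead of the hub ids
llm_name_or_path = '/path/to/Meta-Llama-3-8B-Instruct'
visual_encoder_name_or_path = '/path/to/clip-vit-large-patch14-336'

# Data: my own instruction-tuning annotations and image root
data_path = '/path/to/my_finetune_data.json'
image_folder = '/path/to/my_images'

# Checkpointing: adjusted save interval (placeholder value)
save_steps = 500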

LZHgrla (Collaborator) commented Apr 29, 2024

@Luo-Z13
How many GPUs are you using for training?

Luo-Z13 (Author) commented Apr 29, 2024

> How many GPUs are you using for training?

I am using 4 × A100 (40 GB) GPUs.

Luo-Z13 (Author) commented Apr 29, 2024

> How many GPUs are you using for training?

Also, the pre-training stage of LLaVA-Llama-3 ran normally.

LZHgrla (Collaborator) commented Apr 29, 2024

@Luo-Z13

Under your configuration, the implied dataset size is 4 (per-device batch size) * 4 (GPUs) * 23076 (iterations) = 369216 samples. However, the standard LLaVA fine-tuning dataset has ~650000 samples. This size mismatch seems a bit unusual. Have you modified the training data?
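
For clarity, here is that estimate spelled out with the values from your schedule and GPU count (just a restatement of the arithmetic, as a small Python sketch):

per_device_batch_size = 4      # batch_size in your config
num_gpus = 4                   # 4 x A100, i.e. NPROC_PER_NODE
iters_per_epoch = 23076        # total iterations from your log, with max_epochs = 1

implied_dataset_size = per_device_batch_size * num_gpus * iters_per_epoch
print(implied_dataset_size)    # 369216, versus ~650000 samples in the standard LLaVA fine-tuning set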

Luo-Z13 (Author) commented Apr 29, 2024

> Under your configuration, the implied dataset size is 4 (per-device batch size) * 4 (GPUs) * 23076 (iterations) = 369216 samples, whereas the standard LLaVA fine-tuning dataset has ~650000 samples. This size mismatch seems a bit unusual. Have you modified the training data?

Hello, I'm using my own instruction-tuning data, so the total number of iterations is different. Do I need to check the format of my dataset?

LZHgrla (Collaborator) commented Apr 29, 2024

@Luo-Z13

Yes, that's possible. I suggest comparing your data format and content with llava's to see if the issue lies within the data.

Additionally, here are some other suggestions:

  1. Keep the global batch size at 128. In your case, consider setting accumulative_counts to 8 (see the sketch below).
  2. Adjust the learning rate to 2e-5. (Of course, 1e-5 should also work fine.)
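
For concreteness, those two changes would look roughly like this in the config (variable names taken from the schedule you posted; everything else stays the same):

# Scheduler & Optimizer (suggested)
batch_size = 4              # per_device, unchanged
accumulative_counts = 8     # 4 GPUs * 4 per device * 8 accumulation = global batch of 128
lr = 2e-5                   # 1e-5 should also work fine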

Luo-Z13 (Author) commented Apr 29, 2024

> Yes, that's possible. I suggest comparing your data format and content with llava's to see if the issue lies within the data.
>
> Additionally, here are some other suggestions:
>
>   1. Keep the global batch size at 128. In your case, consider setting accumulative_counts to 8.
>   2. Adjust the learning rate to 2e-5. (Of course, 1e-5 should also work fine.)

Thank you very much, I will try them.

Luo-Z13 (Author) commented Apr 30, 2024

> Yes, that's possible. I suggest comparing your data format and content with llava's to see if the issue lies within the data.
>
> Additionally, here are some other suggestions:
>
>   1. Keep the global batch size at 128. In your case, consider setting accumulative_counts to 8.
>   2. Adjust the learning rate to 2e-5. (Of course, 1e-5 should also work fine.)

Thank you for your suggestions. The loss is now normal, but there is a new problem: after training for a while, the iteration speed becomes very slow, as shown below:

...
04/30 01:03:42 - mmengine - INFO - Iter(train) [  100/23072]  lr: 2.8656e-06  eta: 1 day, 13:54:11  time: 5.0159  data_time: 0.0117  memory: 9195  loss: 1.8088
04/30 01:04:31 - mmengine - INFO - Iter(train) [  110/23072]  lr: 3.1550e-06  eta: 1 day, 13:16:56  time: 4.8975  data_time: 0.0158  memory: 9167  loss: 1.3998
04/30 01:05:52 - mmengine - INFO - Iter(train) [  120/23072]  lr: 3.4444e-06  eta: 1 day, 14:26:13  time: 8.0493  data_time: 0.0101  memory: 9146  loss: 1.3203
04/30 01:06:42 - mmengine - INFO - Iter(train) [  130/23072]  lr: 3.7339e-06  eta: 1 day, 13:56:50  time: 5.0641  data_time: 0.0210  memory: 9125  loss: 1.2123
04/30 01:07:35 - mmengine - INFO - Iter(train) [  140/23072]  lr: 4.0233e-06  eta: 1 day, 13:37:29  time: 5.2818  data_time: 0.0184  memory: 9104  loss: 1.0494
04/30 03:07:10 - mmengine - INFO - Iter(train) [  150/23072]  lr: 4.3127e-06  eta: 14 days, 3:40:15  time: 717.5106  data_time: 0.0726  memory: 9090  loss: 0.8822
04/30 03:42:26 - mmengine - INFO - Iter(train) [  160/23072]  lr: 4.6022e-06  eta: 16 days, 18:28:43  time: 211.6158  data_time: 0.1037  memory: 9069  loss: 0.8258
04/30 06:08:50 - mmengine - INFO - Iter(train) [  170/23072]  lr: 4.8916e-06  eta: 29 days, 11:20:41  time: 878.3878  data_time: 0.1637  memory: 9055  loss: 0.7201
04/30 09:23:10 - mmengine - INFO - Iter(train) [  180/23072]  lr: 5.1810e-06  eta: 44 days, 23:40:10  time: 1165.9963  data_time: 0.1712  memory: 9041  loss: 0.7931

What could be the cause of this? @LZHgrla

LZHgrla (Collaborator) commented May 8, 2024

@Luo-Z13
It seems to be caused by fluctuations in machine performance. Can this issue be reliably reproduced, and which commands did you use?
