Program hangs with no output #626
Comments
@Luo-Z13
My script:
The training schedule:
Then, I modified the save_steps and changed the paths to point to my own data or local paths. Besides that, there were no other changes.
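For concreteness, a hypothetical sketch of the kind of edits described above, written against a copy of the stock xtuner config; the variable names below follow the usual layout of xtuner LLaVA configs but are assumptions and may differ between versions.

```python
# Hypothetical excerpt of the edits described above (variable names assumed).

# Point the dataset at local annotation / image paths instead of the defaults.
data_root = '/path/to/my_data/'                  # local data root (placeholder)
data_path = data_root + 'my_instructions.json'   # LLaVA-style JSON annotations
image_folder = data_root + 'images'              # folder holding the referenced images

# Checkpointing frequency (the exact value used in the run is not given here).
save_steps = 500
save_total_limit = 2   # keep only the most recent checkpoints
```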
@Luo-Z13
I use 4 × A100 (40G).
And the pre-training of LLaVA-llama3 ran normally.
Under your configuration, the total dataset size is 4 * 4 * 23076 = 369216. However, the correct size of the LLaVA fine-tuning dataset is ~650000. This size mismatch seems a bit unusual. Have you modified the training data?
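As a quick, illustrative check of the arithmetic quoted above (not from the thread): with a one-epoch schedule, the iteration count implies the dataset size; the per-device batch size of 4 is an assumption inferred from the 4 * 4 * 23076 figure.

```python
# Illustrative check of the dataset-size arithmetic quoted above.
num_gpus = 4            # 4 x A100 (40G)
batch_per_device = 4    # assumed per-device batch size
iterations = 23076      # iterations shown in the training schedule

global_batch = num_gpus * batch_per_device
implied_dataset_size = global_batch * iterations
print(implied_dataset_size)   # 369216, vs. ~650000 samples in the stock LLaVA mix
```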
Hello, I'm using my own instruction-tuning data, so the total number of iterations is different. Do I need to check the format of my dataset?
Yes, that's possible. I suggest comparing your data format and content with LLaVA's to see if the issue lies within the data. Additionally, here are some other suggestions:
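For the data-format comparison, a minimal sketch (not part of the maintainer's reply) that checks a custom annotation file against the LLaVA instruction-tuning JSON layout; the file path is a placeholder.

```python
# Minimal structural check of a custom annotation file against the LLaVA
# instruction-tuning JSON layout: a list of samples, each with an optional
# "image" field and a "conversations" list of {"from", "value"} turns.
import json

def check_llava_format(path):
    with open(path) as f:
        data = json.load(f)
    assert isinstance(data, list), "top level must be a list of samples"
    for i, sample in enumerate(data):
        convs = sample.get("conversations")
        assert isinstance(convs, list) and convs, f"sample {i}: missing 'conversations'"
        for turn in convs:
            assert turn.get("from") in ("human", "gpt"), f"sample {i}: unexpected 'from' value"
            assert isinstance(turn.get("value"), str), f"sample {i}: 'value' must be a string"
        if "image" in sample and "<image>" not in convs[0]["value"]:
            print(f"sample {i}: has an 'image' field but no <image> token in the first turn")
    print(f"{path}: {len(data)} samples look structurally OK")

check_llava_format("my_instructions.json")  # placeholder path
```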
Thank you very much, I will try them.
Thank you for your suggestions. The loss is now normal, but there is a new problem: after training for a few batches, the iteration speed becomes very slow, as shown:
What could be the cause of this? @LZHgrla
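One generic way to narrow a slowdown like this down (a sketch under assumptions, not xtuner-specific; the training log usually also reports per-iteration time and data time) is to time the data-wait and the full step separately, so a dataloader stall can be told apart from slow compute. `dataloader`, `model`, and `optimizer` below are placeholders.

```python
# Generic timing sketch: separate data-loading time from total step time.
import time
import torch

def timed_epoch(dataloader, model, optimizer):
    end = time.perf_counter()
    for step, batch in enumerate(dataloader):
        data_time = time.perf_counter() - end      # time spent waiting for the batch
        loss = model(**batch).loss                 # forward pass (HF-style API assumed)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        torch.cuda.synchronize()                   # make GPU work visible to the timer
        total_time = time.perf_counter() - end
        print(f"step {step}: data {data_time:.2f}s / total {total_time:.2f}s")
        end = time.perf_counter()
```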
@Luo-Z13
I am conducting instruction tuning of llama3_llava on my own dataset using the script:
NPROC_PER_NODE=${GPU_NUM} xtuner train llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune --deepspeed deepspeed_zero3_offload --seed 1024
After the following output, the program stops producing output but is still running. This state has been going on for 2 hours. What could be the possible cause?
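When a multi-GPU DeepSpeed run hangs silently, one generic first step (a sketch under assumptions, not part of the original report) is to rule out broken inter-GPU communication with a tiny torchrun test, and to set NCCL_DEBUG=INFO so NCCL reports where it stalls.

```python
# Minimal multi-GPU communication sanity check (illustrative; file name is a placeholder):
#   NCCL_DEBUG=INFO torchrun --nproc_per_node=4 nccl_sanity_check.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    x = torch.ones(1, device="cuda") * dist.get_rank()
    dist.all_reduce(x)   # hangs here if inter-GPU communication is broken
    print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```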