NaN during llama3 finetuning #427
Are you training on
Hi @danielhanchen, Thank you for your response. I'm unsure about the inner workings of get_peft_model in Unsloth, but assuming it functions similarly to other PEFT methods, it should freeze the base model, including the embedding matrix, correct? Consequently, I believe my scripts are only training the LoRA parameters. I attempted to use Unsloth's fix_untrained_tokens, but it didn't work out for me. Additionally, I noticed that Unsloth's blog mentions the llama-3-8b base model, whereas I'm using the llama-3-8b-instruct model. The instruct model's reserved tokens should not cause any issues since they are finetuned (unlike in the base model), right?
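One quick sanity check for the freezing question above: in a PEFT-style setup, only the adapter parameters should have gradients enabled. Here is a stdlib-only sketch of that check — the parameter names are made-up stand-ins for what `model.named_parameters()` would yield, not real unsloth output:

```python
# Illustrative (name, requires_grad) pairs standing in for model.named_parameters()
params = [
    ("model.embed_tokens.weight", False),        # base weight: frozen
    ("model.layers.0.mlp.lora_A.weight", True),  # LoRA adapter: trainable
    ("model.layers.0.mlp.lora_B.weight", True),  # LoRA adapter: trainable
    ("lm_head.weight", False),                   # base weight: frozen
]

trainable = [name for name, requires_grad in params if requires_grad]
frozen = [name for name, requires_grad in params if not requires_grad]

# If the base model is properly frozen, every trainable name is a LoRA adapter
assert all("lora" in name for name in trainable)
print(f"trainable: {len(trainable)}, frozen: {len(frozen)}")  # -> trainable: 2, frozen: 2
```

If the embedding matrix (or any non-LoRA weight) showed up as trainable, that would point at the untrained-token problem rather than a backward-pass bug.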
@mano3-1 what does the traceback say if you run
Hi @lapp0,
I'm running into issues with back-propagation in unsloth as well, albeit I'm using a custom loss function and Mistral instead of llama-3. It works fine for a while, but then fails with `RuntimeError: Function 'LoRA_MLPBackward' returned nan values in its 0th output.`
I'd be interested in the cause of your issue; perhaps it is the same as mine. If I figure anything out with mine, I'll let you know.
Hi @lapp0
I'm not sure. The backwards step where yours fails is in a different layer of the model than mine, and the only thing our scripts have in common is unsloth. How about some debug details?
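For collecting the kind of debug details asked for here, a small stdlib-only helper can gather the usual environment facts; the package list below is just a guess at what's relevant to this thread:

```python
import platform
import sys
from importlib import metadata

def debug_report(packages=("torch", "transformers", "unsloth", "xformers")):
    """Collect the environment details usually requested in bug reports."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info

for key, value in debug_report().items():
    print(f"{key}: {value}")
```

Pasting this kind of report alongside a traceback makes it much easier to spot a mismatched dependency.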
Here is the pip freeze:
Here is the full training script: link
This is how I trigger the training scripts:
You may set hf_token to the string "None" if you are loading unsloth models, I guess.
requirements.txt isn't the same as pip freeze.
Oh no, sorry guys - I will take a look
Thanks @danielhanchen. Here is my reproduction script as well, run on a 4090 with CUDA 12.1. @mano3-1 has a standard SFT script, so his is probably worth looking at first.
Hi @lapp0,
Sorry about my confusion @mano3-1. I reviewed and compared our installed packages. Nothing noteworthy in the shared dependencies, other than perhaps the issue being related to the use of xformers. Will experiment with this later.
Thanks for the code repro - will test this out - sorry about the issue again!
Also facing the same issue while using Colab and the standard notebook in the unsloth folder. Thought I'd add that.
Hey,
Sorry guys, just started debugging this.
For Colab / Kaggle, things should be fine after a restart.
@DementedWeasel1971 When you said the Colab notebook we provided broke, could you point to exactly which one? Thanks.
@mano3-1 Extremely weird actually - I reran Colab with Instruct and it seems fine - would you be able to run just the conversational notebook for Llama-3 here: https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing
@lapp0 I'm currently running your PPO example here: https://colab.research.google.com/drive/1fgJv0eKlRKexOl2RqcxoiZ-HhGrdNWQW?usp=sharing (will wait for it to complete)
Thanks so much for looking into it! Unfortunately I'm still getting
Please let me know if there are any other debug details that would help. Also FYI, to speed up debugging you can set
Edit: I pushed a bad commit to my branch; I reverted the broken change. Should be good to try again with the head of
Hi,
I'm currently fine-tuning llama3-instruct-8b on a custom dataset using unsloth's FastLanguageModel, with Hugging Face's SFTTrainer to train the model. Surprisingly, the gradient norm and evaluation loss become NaN after a few steps. I've seen a blog post from unsloth mentioning that NaNs may appear due to a bug, but it also says that the bug has since been fixed by Hugging Face and unsloth (here, under the Llama-3 Quirks section). So I not only updated unsloth and Hugging Face but also added the "pad_token" mentioned in the blog. Despite these attempts, the NaN problem still persists. Is there something else that I'm missing? Can someone help me out with this?
Here's the code snippet of how I'm loading the model:
Following is the training code:
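Since the failure mode described above is the loss and grad norm turning NaN after a few steps, one cheap way to narrow it down is to scan the trainer's logged loss history for the first non-finite value. A stdlib-only sketch — the loss list would come from the trainer's logs, and the values below are made up:

```python
import math

def first_bad_step(losses):
    """Return the index of the first NaN/inf loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None

logged = [1.32, 1.05, 0.91, float("nan"), 0.88]  # made-up loss history
print(first_bad_step(logged))  # -> 3
```

Knowing the exact step the numbers go bad helps distinguish a data-dependent issue (one poisoned batch) from a systematic one (e.g. untrained tokens or a bad pad_token).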