I have been fine-tuning Mistral-7B-Instruct-v0.2 recently, and I noticed that when I disable SWA and train with a sequence length of 32K, the initial loss is unusually high (~6.0). However, when I train with a sequence length of 4096, the loss is normal (~1.5). This leads me to suspect that Mistral-7B-Instruct-v0.2 may actually have been trained with a sliding window of 4096, rather than without one as officially stated.