I have been fine-tuning Mistral-7B-Instruct-v0.2 recently, and I noticed that when I disable SWA and train with a sequence length of 32K, the initial loss is unusually high (~6.0). However, when I train with a sequence length of 4096, the loss is normal (~1.5). This leads me to suspect that Mistral-7B-Instruct-v0.2 may actually have been trained with a sliding window of 4096, rather than without one as officially stated.