Training an LLM w/ shifted sparse attention from scratch? #173

Open
we1k opened this issue Jan 24, 2024 · 0 comments

Comments

we1k commented Jan 24, 2024

Thanks for the great work. I'm curious about model performance if GQA is replaced with S^2 attention at the start and the model is then trained from scratch: would there be a degradation in performance?
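
For context, here is a minimal PyTorch sketch of the S^2-Attn pattern I have in mind (my own simplification for illustration, not the repo's implementation: causal masking is omitted, and the sequence length is assumed divisible by `group_size`):

```python
import torch

def shifted_sparse_attention(q, k, v, group_size):
    """Group-wise attention with half the heads shifted by group_size // 2.

    q, k, v: (batch, heads, seq_len, head_dim); seq_len must be divisible
    by group_size. Causal masking is omitted for brevity.
    """
    b, h, n, d = q.shape
    shift = group_size // 2

    def shift_half_heads(x, s):
        # Roll the second half of the heads along the sequence dimension.
        x = x.clone()
        x[:, h // 2:] = torch.roll(x[:, h // 2:], shifts=s, dims=2)
        return x

    # Shift tokens in half the heads so their groups straddle the
    # group boundaries of the unshifted heads.
    q, k, v = (shift_half_heads(t, -shift) for t in (q, k, v))

    # Compute attention independently within each group of tokens.
    g = n // group_size
    q = q.reshape(b, h, g, group_size, d)
    k = k.reshape(b, h, g, group_size, d)
    v = v.reshape(b, h, g, group_size, d)
    attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    out = (attn @ v).reshape(b, h, n, d)

    # Undo the shift so all heads are token-aligned again.
    return shift_half_heads(out, shift)
```

With this layout, half the heads attend within aligned groups and the other half within shifted groups, so information still flows across group boundaries. My question is whether this restricted pattern hurts quality when used for full pretraining rather than only for fine-tuning.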
