Token-wise the same generalization? #99

Ageliss · 2024-04-16T07:10:08Z

Does a Medusa1 model generate token-wise the same output as the base model without the Medusa heads?

I found that changing the medusa choices changes the output.

Ageliss · 2024-05-21T09:24:48Z

We solved this problem by shrinking the medusa choices to top-1 predictions only, i.e., [(0), (0,0), (0,0,0), (0,0,0,0), (0,0,0,0,0)].

This way, the MHCA computation produces logits that are bitwise identical to the baseline without Medusa decoding.

Hope this helps other people interested in bitwise-identical decoding.
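
For anyone who wants to try this, here is a minimal sketch of the top-1 setting, assuming the choices are written as lists of per-head candidate indices. The helper names `top1_choices` and `tree_attention_mask` are hypothetical and not taken from the Medusa codebase. The sketch builds the single-chain choice list and checks that the tree-attention mask it implies is an ordinary causal (lower-triangular) mask, which may be part of why the logits match the baseline.

```python
def top1_choices(num_heads):
    """Single-chain choices: [0], [0, 0], ..., one path of depth d per head."""
    return [[0] * depth for depth in range(1, num_heads + 1)]


def tree_attention_mask(choices):
    """Boolean mask over [root] + choices.

    Node i may attend to itself and to its ancestors (its prefixes).
    Assumes every ancestor of a listed path is also listed.
    """
    nodes = [[]] + sorted(choices, key=lambda c: (len(c), c))
    index = {tuple(node): i for i, node in enumerate(nodes)}
    size = len(nodes)
    mask = [[False] * size for _ in range(size)]
    for i, node in enumerate(nodes):
        for depth in range(len(node) + 1):
            mask[i][index[tuple(node[:depth])]] = True
    return mask


choices = top1_choices(5)
mask = tree_attention_mask(choices)

# With a single chain, every earlier node is an ancestor, so the mask is causal.
is_causal = all(
    mask[i][j] == (j <= i)
    for i in range(len(mask))
    for j in range(len(mask))
)
print(choices)
print("causal mask:", is_causal)
```

With a branching choice list such as `[[0], [1], [0, 0]]`, the same check fails, because sibling candidates must not attend to each other.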

Ageliss closed this as completed May 21, 2024
