Improve warmup checking for max new tokens when using speculative decoding #474

tgaddair · 2024-05-17T22:29:35Z

When speculative decoding is enabled and a request asks to generate up to the model's maximum positional embeddings, generation can fail at runtime with a CUDA device-side assert. We should either detect this condition during warmup or handle the edge case gracefully per request.
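For illustration, a per-request guard might look roughly like the sketch below. It assumes the failure comes from drafted speculative tokens pushing position indices past the positional embedding table, which this issue does not confirm. `GenerationLimits`, `max_safe_new_tokens`, and `validate_request` are hypothetical names, not part of the existing codebase, and the headroom of one slot per speculative token is a guess.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationLimits:
    """Hypothetical container for the two values the check needs."""
    max_position_embeddings: int  # taken from the model config
    speculative_tokens: int       # tokens drafted per decode step (0 = disabled)


def max_safe_new_tokens(limits: GenerationLimits, input_length: int) -> int:
    """Largest max_new_tokens that keeps every position in range.

    Assumption: each decode step may touch up to `speculative_tokens`
    positions beyond the last accepted token, so that many slots are
    reserved as headroom.
    """
    budget = (
        limits.max_position_embeddings
        - input_length
        - limits.speculative_tokens
    )
    return max(budget, 0)


def validate_request(
    limits: GenerationLimits, input_length: int, max_new_tokens: int
) -> None:
    """Reject a request up front instead of failing inside a CUDA kernel."""
    allowed = max_safe_new_tokens(limits, input_length)
    if max_new_tokens > allowed:
        raise ValueError(
            f"max_new_tokens={max_new_tokens} exceeds the limit of {allowed} "
            f"for input_length={input_length} with "
            f"{limits.speculative_tokens} speculative tokens"
        )
```

The same arithmetic could run once during warmup, using the largest configured input length, so that an unsafe configuration is reported at startup rather than on the first long request.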

tgaddair added the bug and good first issue labels on May 17, 2024
