
Improve warmup checking for max new tokens when using speculative decoding #474

Open
tgaddair opened this issue May 17, 2024 · 0 comments
Labels
bug Something isn't working good first issue Good for newcomers

Comments

@tgaddair
Contributor

When speculative decoding is enabled and a request asks to generate up to the model's max positional embeddings, generation can fail at runtime with a CUDA device-side assert error. We should either detect this condition during warmup or handle the edge case gracefully per request.
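
A minimal sketch of what the per-request handling could look like, assuming the failure comes from speculative tokens being written at positions past the model's max positional embeddings (the issue does not confirm the root cause). The function and parameter names here (`max_safe_new_tokens`, `validate_request`, `num_speculative_tokens`) are hypothetical and are not taken from the project's code:

```python
def max_safe_new_tokens(
    input_length: int,
    max_position_embeddings: int,
    num_speculative_tokens: int,
) -> int:
    """Largest max_new_tokens that leaves room for speculative tokens.

    Assumption: each decode step may touch up to `num_speculative_tokens`
    positions beyond the last accepted token, so that many positions are
    reserved as headroom.
    """
    headroom = max_position_embeddings - input_length - num_speculative_tokens
    return max(headroom, 0)


def validate_request(
    input_length: int,
    max_new_tokens: int,
    max_position_embeddings: int,
    num_speculative_tokens: int,
) -> int:
    """Reject a request up front instead of hitting a device-side assert."""
    limit = max_safe_new_tokens(
        input_length, max_position_embeddings, num_speculative_tokens
    )
    if max_new_tokens > limit:
        raise ValueError(
            f"max_new_tokens={max_new_tokens} exceeds the safe limit of {limit} "
            f"for input_length={input_length} with "
            f"{num_speculative_tokens} speculative tokens"
        )
    return max_new_tokens
```

The same arithmetic could be applied at warmup by checking the largest allowed `input_length + max_new_tokens` against `max_position_embeddings - num_speculative_tokens`. Clamping `max_new_tokens` to the limit instead of raising is another option if rejecting requests is too strict.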

@tgaddair tgaddair added bug Something isn't working good first issue Good for newcomers labels May 17, 2024