
Implement Speculative Decoding #242

Merged: merged 79 commits into master from speculative on May 11, 2024

Conversation

@EricLBuehler (Owner) commented Apr 28, 2024

Speculative decoding: https://arxiv.org/pdf/2211.17192

This will refactor the pipeline structure to abstract the sampling process, and will likewise abstract the scheduling and KV-cache management.
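As a rough illustration of that abstraction (trait and method names here are hypothetical, not the actual interfaces this PR introduces), the point is that a speculative pipeline can drive both the draft and target models through one interface:

```rust
// Hypothetical sketch only; not the traits actually added by this PR.
trait SamplingPipeline {
    /// Run a forward pass over the scheduled tokens and return logits.
    fn forward(&mut self, token_ids: &[u32]) -> Vec<f32>;
    /// Turn logits into the next token id.
    fn sample(&mut self, logits: &[f32]) -> u32;
}

trait CacheManager {
    /// Copy a sequence's KV cache into the model before its forward pass.
    fn clone_in(&mut self, seq_id: usize);
    /// Copy it back out afterward.
    fn clone_out(&mut self, seq_id: usize);
}
```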

Restriction

  • Requires the draft and target models to share the same vocabulary

Algorithm

Given draft model q and target model p with probability distributions $q_i(x)$ and $p_i(x)$ for each token $i$ (see the code sketch after this list):

  • Keep the sampled token $i$ if $q_i(x) \le p_i(x)$
    • This means the target model agrees
  • Else (if $q_i(x) > p_i(x)$), accept that token with probability $\frac{p_i(x)}{q_i(x)}$
    • If rejected, sample a token from $p'(x) = \mathrm{norm}(\max(0, p(x) - q(x)))$ and do not accept any further draft tokens
    • Note that $\max(0, \cdot)$ is just ReLU, so $p'(x) = \mathrm{norm}(\mathrm{ReLU}(p(x) - q(x)))$
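A minimal Rust sketch of this accept/reject step, assuming per-token probabilities from both models are available (function name and signature are hypothetical, and the `rand` crate is assumed):

```rust
use rand::Rng;

/// Verify one drafted token. `p` and `q` are the target/draft probabilities
/// of the drafted token; `p_dist` and `q_dist` are the full distributions
/// over the vocabulary. Returns `None` to keep the drafted token, or
/// `Some(tok)` with a replacement token, after which the remaining draft
/// tokens are discarded.
fn verify_token<R: Rng>(
    rng: &mut R,
    p: f32,
    q: f32,
    p_dist: &[f32],
    q_dist: &[f32],
) -> Option<usize> {
    // Accept outright if the target assigns at least as much probability.
    if q <= p {
        return None;
    }
    // Otherwise accept with probability p / q.
    if rng.gen::<f32>() < p / q {
        return None;
    }
    // Rejected: resample from p'(x) = norm(max(0, p(x) - q(x))).
    let residual: Vec<f32> = p_dist
        .iter()
        .zip(q_dist)
        .map(|(&pi, &qi)| (pi - qi).max(0.0))
        .collect();
    let total: f32 = residual.iter().sum();
    let mut u = rng.gen::<f32>() * total;
    for (tok, &w) in residual.iter().enumerate() {
        if u < w {
            return Some(tok);
        }
        u -= w;
    }
    Some(residual.len() - 1) // numerical fallback
}
```

Drafted tokens are verified in order; the first rejection resamples from the residual distribution and discards the rest of the draft, which is what keeps the output distribution identical to sampling from the target model alone.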

@EricLBuehler added labels: new feature (New feature or request), backend (Backend work), models (Additions to model or architectures) on Apr 28, 2024
github-actions bot commented Apr 28, 2024

Code Metrics Report
  ───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Rust                        72     23863     1572       530    21761       1325
───────────────────────────────────────────────────────────────────────────────
Total                       72     23863     1572       530    21761       1325
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop 85,737
Estimated Schedule Effort 11.916649 months
Estimated People Required 5.112342
───────────────────────────────────────────────────────────────────────────────
Processed 793364 bytes, 0.793 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────
  

@EricLBuehler marked this pull request as draft April 28, 2024 22:13
@kir-gadjello

It would be very useful to relax the requirement of exact same tokenizer for main and draft models like here: vllm-project/vllm#2188

@EricLBuehler (Owner, Author)

It would be very useful to relax the requirement of exact same tokenizer for main and draft models like here: vllm-project/vllm#2188

Yes, this implementation only checks if the vocabs are the same: see this check.
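For reference, such a check is essentially an equality test on the two token-to-id maps; a hypothetical sketch, not the PR's actual code:

```rust
use std::collections::HashMap;

// Hypothetical sketch of the vocab check; the PR's actual code may differ.
fn ensure_same_vocab(
    target_vocab: &HashMap<String, u32>,
    draft_vocab: &HashMap<String, u32>,
) -> Result<(), String> {
    if target_vocab != draft_vocab {
        return Err(
            "speculative decoding requires the draft and target models to share a vocabulary"
                .to_string(),
        );
    }
    Ok(())
}
```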

@kir-gadjello

It would be very useful to relax the requirement of exact same tokenizer for main and draft models like here: vllm-project/vllm#2188

Yes, this implementation only checks if the vocabs are the same: see this check.

I understand that the same-vocab case is much easier to code, but if this requirement is relaxed, people can use a ready-made small draft model even if their LLM is incompatible with it (which will often be the case).
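One way to lift the restriction (roughly the direction the linked vLLM issue explores) is to round-trip draft tokens through text and re-encode with the target tokenizer. A hypothetical sketch using the `tokenizers` crate; tokens that split differently across the two vocabularies would still need careful handling at the boundaries:

```rust
use tokenizers::Tokenizer;

/// Map draft-model token ids to target-model token ids via text.
/// Hypothetical sketch; boundary effects are not handled here.
fn map_draft_tokens(
    draft_tok: &Tokenizer,
    target_tok: &Tokenizer,
    draft_ids: &[u32],
) -> anyhow::Result<Vec<u32>> {
    // Detokenize with the draft tokenizer...
    let text = draft_tok
        .decode(draft_ids, true)
        .map_err(|e| anyhow::anyhow!(e))?;
    // ...then re-encode with the target tokenizer.
    let enc = target_tok
        .encode(text, false)
        .map_err(|e| anyhow::anyhow!(e))?;
    Ok(enc.get_ids().to_vec())
}
```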

@EricLBuehler (Owner, Author) commented May 1, 2024

I understand that the same-vocab case is much easier to code, but if this requirement is relaxed, people can use a ready-made small draft model even if their LLM is incompatible with it (which will often be the case).

That sounds great! Can you please give an example of how I should relax the requirement?

@EricLBuehler marked this pull request as ready for review May 2, 2024 00:51
@EricLBuehler (Owner, Author)

This PR adds the base framework for speculative decoding (SD). Further speed improvements, as well as self-speculative decoding, will be added in follow-ups.

@EricLBuehler merged commit ce8028e into master May 11, 2024
10 checks passed
@EricLBuehler deleted the speculative branch May 11, 2024 02:07