
Implement Speculative Decoding #242

Merged: merged 79 commits into master from speculative on May 11, 2024

Conversation

@EricLBuehler (Owner) commented Apr 28, 2024

Speculative decoding: https://arxiv.org/pdf/2211.17192

This will refactor the pipeline structure to abstract the sampling process, and will likewise abstract the scheduling and KV-cache management.
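As a rough illustration of that abstraction (trait and method names here are hypothetical, not the actual interfaces this PR introduces), the point is that a speculative pipeline can drive both the draft and target models through one interface:

```rust
// Hypothetical sketch only; not the traits actually added by this PR.
trait SamplingPipeline {
    /// Run a forward pass over the scheduled tokens and return logits.
    fn forward(&mut self, token_ids: &[u32]) -> Vec<f32>;
    /// Turn logits into the next token id.
    fn sample(&mut self, logits: &[f32]) -> u32;
}

trait CacheManager {
    /// Copy a sequence's KV cache into the model before its forward pass.
    fn clone_in(&mut self, seq_id: usize);
    /// Copy it back out afterward.
    fn clone_out(&mut self, seq_id: usize);
}
```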

Restriction

  • Requires the draft and target models to share the same vocabulary

Algorithm

Given draft model q and target model p with probability distributions $q_i(x)$ and $p_i(x)$ for each token $i$ (see the code sketch after this list):

  • Keep the sampled token $i$ if $q_i(x) \le p_i(x)$
    • This means the target model agrees
  • Else (if $q_i(x) > p_i(x)$), accept that token with probability $\frac{p_i(x)}{q_i(x)}$
    • If rejected, sample a token from $p'(x) = \mathrm{norm}(\max(0, p(x) - q(x)))$ and do not accept any further draft tokens
    • Note that $\max(0, \cdot)$ is just ReLU, so $p'(x) = \mathrm{norm}(\mathrm{ReLU}(p(x) - q(x)))$
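A minimal Rust sketch of this accept/reject step, assuming per-token probabilities from both models are available (function name and signature are hypothetical, and the `rand` crate is assumed):

```rust
use rand::Rng;

/// Verify one drafted token. `p` and `q` are the target/draft probabilities
/// of the drafted token; `p_dist` and `q_dist` are the full distributions
/// over the vocabulary. Returns `None` to keep the drafted token, or
/// `Some(tok)` with a replacement token, after which the remaining draft
/// tokens are discarded.
fn verify_token<R: Rng>(
    rng: &mut R,
    p: f32,
    q: f32,
    p_dist: &[f32],
    q_dist: &[f32],
) -> Option<usize> {
    // Accept outright if the target assigns at least as much probability.
    if q <= p {
        return None;
    }
    // Otherwise accept with probability p / q.
    if rng.gen::<f32>() < p / q {
        return None;
    }
    // Rejected: resample from p'(x) = norm(max(0, p(x) - q(x))).
    let residual: Vec<f32> = p_dist
        .iter()
        .zip(q_dist)
        .map(|(&pi, &qi)| (pi - qi).max(0.0))
        .collect();
    let total: f32 = residual.iter().sum();
    let mut u = rng.gen::<f32>() * total;
    for (tok, &w) in residual.iter().enumerate() {
        if u < w {
            return Some(tok);
        }
        u -= w;
    }
    Some(residual.len() - 1) // numerical fallback
}
```

Drafted tokens are verified in order; the first rejection resamples from the residual distribution and discards the rest of the draft, which is what keeps the output distribution identical to sampling from the target model alone.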

@EricLBuehler added labels: new feature (New feature or request), backend (Backend work), models (Additions to model or architectures) on Apr 28, 2024
github-actions bot commented Apr 28, 2024

Code Metrics Report
  ───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Rust                        72     23863     1572       530    21761       1325
───────────────────────────────────────────────────────────────────────────────
Total                       72     23863     1572       530    21761       1325
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop 85,737
Estimated Schedule Effort 11.916649 months
Estimated People Required 5.112342
───────────────────────────────────────────────────────────────────────────────
Processed 793364 bytes, 0.793 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────
  

@EricLBuehler marked this pull request as draft April 28, 2024 22:13
@kir-gadjello

It would be very useful to relax the requirement of exact same tokenizer for main and draft models like here: vllm-project/vllm#2188

@EricLBuehler (Owner, Author)

It would be very useful to relax the requirement of exact same tokenizer for main and draft models like here: vllm-project/vllm#2188

Yes, this implementation only checks if the vocabs are the same: see this check.
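For reference, such a check is essentially an equality test on the two token-to-id maps; a hypothetical sketch, not the PR's actual code:

```rust
use std::collections::HashMap;

// Hypothetical sketch of the vocab check; the PR's actual code may differ.
fn ensure_same_vocab(
    target_vocab: &HashMap<String, u32>,
    draft_vocab: &HashMap<String, u32>,
) -> Result<(), String> {
    if target_vocab != draft_vocab {
        return Err(
            "speculative decoding requires the draft and target models to share a vocabulary"
                .to_string(),
        );
    }
    Ok(())
}
```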

@kir-gadjello

It would be very useful to relax the requirement of exact same tokenizer for main and draft models like here: vllm-project/vllm#2188

Yes, this implementation only checks if the vocabs are the same: see this check.

I understand that the same-vocab case is much easier to code, but if this requirement is relaxed, people can use a ready-made small draft model even if their LLM is incompatible with it (which will often be the case).
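One way to lift the restriction (roughly the direction the linked vLLM issue explores) is to round-trip draft tokens through text and re-encode with the target tokenizer. A hypothetical sketch using the `tokenizers` crate; tokens that split differently across the two vocabularies would still need careful handling at the boundaries:

```rust
use tokenizers::Tokenizer;

/// Map draft-model token ids to target-model token ids via text.
/// Hypothetical sketch; boundary effects are not handled here.
fn map_draft_tokens(
    draft_tok: &Tokenizer,
    target_tok: &Tokenizer,
    draft_ids: &[u32],
) -> anyhow::Result<Vec<u32>> {
    // Detokenize with the draft tokenizer...
    let text = draft_tok
        .decode(draft_ids, true)
        .map_err(|e| anyhow::anyhow!(e))?;
    // ...then re-encode with the target tokenizer.
    let enc = target_tok
        .encode(text, false)
        .map_err(|e| anyhow::anyhow!(e))?;
    Ok(enc.get_ids().to_vec())
}
```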

@EricLBuehler (Owner, Author) commented May 1, 2024

I understand that the same-vocab case is much easier to code, but if this requirement is relaxed, people can use a ready-made small draft model even if their LLM is incompatible with it (which will often be the case).

That sounds great! Can you please give an example of how I should relax the requirement?

@EricLBuehler marked this pull request as ready for review May 2, 2024 00:51
@EricLBuehler (Owner, Author)

This PR adds the base framework for speculative decoding (SD). Further speed improvements, as well as self-speculative decoding, will be added in follow-ups.

@EricLBuehler merged commit ce8028e into master May 11, 2024
10 checks passed
@EricLBuehler deleted the speculative branch May 11, 2024 02:07