
Batched Prefill #219

Closed · wants to merge 10 commits

Conversation

lucasavila00 (Contributor)

No description provided.

github-actions bot commented Apr 27, 2024

Code Metrics Report
  ───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Rust                        70     23339     1550       508    21281       1281
───────────────────────────────────────────────────────────────────────────────
Total                       70     23339     1550       508    21281       1281
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop 69,864
Estimated Schedule Effort 11.811066 months
Estimated People Required 5.038645
───────────────────────────────────────────────────────────────────────────────
Processed 768517 bytes, 0.769 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────
  

@EricLBuehler (Owner) left a comment


I think this looks good as an initial implementation! One improvement that I see is to allow a batch size greater than one. To do this, we should modify the scheduler so that it schedules prompt sequences with token length > 512 together. Alternatively, we could skip that and just modify get_prompt_input to return some sort of iterator over the chunks, in which some sequences would not be present. The latter is definitely more complicated, so it is probably the worse option, although I am not sure about the performance cost.

> If we are to continue with this approach, we must change the scheduler to forbid it from scheduling more than 512 tokens for completion

Is this necessary to avoid the memory spikes? If so, we should make this feature a new SchedulerMethod.
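
For reference, a minimal sketch of the chunk-iterator alternative mentioned above. The names (`PromptChunk`, `prompt_chunks`, `CHUNK_SIZE`) are hypothetical stand-ins, not the actual get_prompt_input signature in mistral.rs:

```rust
/// Hypothetical sketch: split each sequence's prompt into fixed-size chunks
/// and yield one scheduling step per chunk index. Sequences whose prompts are
/// shorter than `chunk * CHUNK_SIZE` simply drop out of later steps.
const CHUNK_SIZE: usize = 512;

struct PromptChunk<'a> {
    seq_id: usize,
    tokens: &'a [u32],
    offset: usize, // position of the first token within the full prompt
}

fn prompt_chunks<'a>(
    prompts: &'a [(usize, Vec<u32>)],
) -> impl Iterator<Item = Vec<PromptChunk<'a>>> {
    let max_chunks = prompts
        .iter()
        .map(|(_, p)| (p.len() + CHUNK_SIZE - 1) / CHUNK_SIZE)
        .max()
        .unwrap_or(0);
    (0..max_chunks).map(move |chunk| {
        prompts
            .iter()
            .filter_map(move |(seq_id, prompt)| {
                let start = chunk * CHUNK_SIZE;
                if start >= prompt.len() {
                    return None; // this sequence has no tokens left in this step
                }
                let end = (start + CHUNK_SIZE).min(prompt.len());
                Some(PromptChunk {
                    seq_id: *seq_id,
                    tokens: &prompt[start..end],
                    offset: start,
                })
            })
            .collect::<Vec<_>>()
    })
}
```

The useful property is that later steps naturally contain fewer sequences, so shorter prompts finish early without being padded up to the longest prompt.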

@EricLBuehler linked an issue Apr 27, 2024 that may be closed by this pull request
@lucasavila00 (Contributor, Author)

> Is this necessary to avoid the memory spikes? If so, we should make this feature a new SchedulerMethod.

I ran the benchmark and it is not required. The spike only happens with pp=2048, not with pp=512, c=4.
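
As a rough, hypothetical illustration of why the per-pass chunk length dominates the spike: assuming naive (non-flash) attention that materializes a [heads, tokens, tokens] score matrix, memory grows quadratically with the number of prompt tokens in a single forward pass.

```rust
// Hypothetical back-of-the-envelope estimate (not measured): per-layer
// attention-score buffer for one prefill forward pass, assuming naive
// (non-flash) attention materializing a [heads, tokens, tokens] f32 matrix.
fn score_buffer_mib(heads: usize, tokens: usize) -> usize {
    heads * tokens * tokens * std::mem::size_of::<f32>() / (1024 * 1024)
}

fn main() {
    // 32 query heads, as in Mistral-7B.
    println!("pp=2048: {} MiB per layer", score_buffer_mib(32, 2048)); // 512 MiB
    println!("pp=512:  {} MiB per layer", score_buffer_mib(32, 512)); //  32 MiB
}
```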

> One improvement that I see is to allow a batch size greater than one.

Is it worth supporting parallel prefill? For me, using GGUF, it only helps when prefilling <128 tokens. With more than 128 tokens, padding makes it slower than non-parallel prefill.

So it is only worth running parallel prefill on a batch of very small sequences. Check the benchmarks below (and the padding sketch after them).

+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+
| model                              | backend | test   | t/s           | ms/t        | concurrency | throughput/s |
+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 586.483±0.000 | 1.705±0.000 |           1 |     586.4834 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 344.665±0.348 | 2.901±0.003 |           2 |     689.3309 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 228.368±0.166 | 4.379±0.003 |           3 |     685.1029 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 172.260±0.111 | 5.805±0.004 |           4 |     689.0406 |
+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+
+------------------------------------+---------+-------+---------------+-------------+-------------+--------------+
| model                              | backend | test  | t/s           | ms/t        | concurrency | throughput/s |
+------------------------------------+---------+-------+---------------+-------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 64 | 292.237±0.000 | 3.422±0.000 |           1 |    292.23743 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 64 | 308.450±2.230 | 3.242±0.023 |           2 |     616.8997 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 64 | 206.233±0.828 | 4.849±0.019 |           3 |     618.6996 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 64 | 172.977±0.872 | 5.781±0.029 |           4 |     691.9095 |
+------------------------------------+---------+-------+---------------+-------------+-------------+--------------+
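
As a rough illustration of the padding argument (hypothetical prompt lengths, not taken from the runs above): batching prompts of different lengths means right-padding every prompt to the longest one, so the wasted work grows quickly once the lengths diverge.

```rust
/// Fraction of prefill compute spent on padding when prompts of different
/// lengths are batched together and right-padded to the longest prompt.
fn padding_waste(prompt_lens: &[usize]) -> f64 {
    let longest = *prompt_lens.iter().max().unwrap_or(&0);
    let padded = longest * prompt_lens.len();
    let useful: usize = prompt_lens.iter().sum();
    1.0 - useful as f64 / padded as f64
}

fn main() {
    // Equal lengths: no padding, so parallel prefill only gains throughput.
    println!("{:.0}%", padding_waste(&[512, 512, 512, 512]) * 100.0); // 0%
    // One long prompt dominates: most of the batch is padding.
    println!("{:.0}%", padding_waste(&[512, 64, 64, 64]) * 100.0); // ~66%
}
```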

> Alternatively, we could skip that and just modify get_prompt_input to return some sort of iterator over the chunks, in which some sequences would not be present

This could work...

In the end we need to decide whether to support parallel prefill. It makes batching more complicated, because we need to be aware of the cache and set it up per batch, and it is only faster if we're not padding anything.

@lucasavila00 (Contributor, Author)

@EricLBuehler if we remove parallel prefill then I think the approach of a165b7d might work?

@lucasavila00 lucasavila00 marked this pull request as ready for review April 27, 2024 04:30
@lucasavila00 (Contributor, Author)

So my current reasoning is:

  1. We already have natural parallelization of prefill, as we run multiple tokens at once
  2. Dealing with cache and batches for parallel prefill of multiple sequences is complex
  3. If the prompt size is reasonably big, we're almost at full speed
  4. If we have any padding, it'd be better to run sequentially without padding, given the small gains from running multiple sequence prompts in parallel

So I think we should move forward with the current approach. It is only worse if we're running the prompt phase of many tiny requests in parallel.
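
For context, a minimal sketch of what the sequential, chunked approach amounts to. The trait and types here are hypothetical stand-ins, not the actual mistral.rs model or cache API:

```rust
const CHUNK_SIZE: usize = 512;

struct KvCache;
struct Logits;

/// Hypothetical model interface: each call consumes `tokens` starting at
/// `offset` and appends the corresponding keys/values to `cache`.
trait ChunkedModel {
    fn forward_chunk(&self, tokens: &[u32], offset: usize, cache: &mut KvCache) -> Logits;
}

/// Sequential, chunked prefill: one sequence at a time, one chunk per forward
/// pass. No cross-sequence padding; the KV cache simply grows across chunks.
fn prefill<M: ChunkedModel>(model: &M, prompt: &[u32], cache: &mut KvCache) -> Logits {
    let mut last = None;
    for (i, chunk) in prompt.chunks(CHUNK_SIZE).enumerate() {
        last = Some(model.forward_chunk(chunk, i * CHUNK_SIZE, cache));
    }
    last.expect("prompt must be non-empty")
}
```

Only the logits of the final chunk are needed to sample the first completion token, so intermediate logits can be discarded.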

@EricLBuehler (Owner) left a comment


I like this approach. Can you please implement it for the rest of the models?

@EricLBuehler added the labels new feature (New feature or request), optimization, and models (Additions to model or architectures) on Apr 28, 2024
@lucasavila00 (Contributor, Author)

I cherry-picked the proper commits to #234 and I'll close this

Successfully merging this pull request may close these issues: Batched & chunked prefill.
2 participants