
Batched Prefill #219

Closed · wants to merge 10 commits

Conversation

lucasavila00 (Contributor)

No description provided.

github-actions bot commented Apr 27, 2024

Code Metrics Report
  ───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Rust                        70     23339     1550       508    21281       1281
───────────────────────────────────────────────────────────────────────────────
Total                       70     23339     1550       508    21281       1281
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop 69,864
Estimated Schedule Effort 11.811066 months
Estimated People Required 5.038645
───────────────────────────────────────────────────────────────────────────────
Processed 768517 bytes, 0.769 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────
  

@EricLBuehler (Owner) left a comment


I think this looks good as an initial implementation! One improvement that I see is to allow a batch size greater than one. To do this, we should modify the scheduler so that it schedules prompt sequences with token length > 512 together. Alternatively, we could skip that and just modify get_prompt_input to return some sort of iterator over the chunks, in which some sequences would not be present. The latter is definitely more complicated, so it is probably the worse option, although I am not sure about the performance cost.

> If we are to continue with this approach, we must change the scheduler to forbid it from scheduling more than 512 tokens for completion

Is this necessary to avoid the memory spikes? If so, we should make this feature a new SchedulerMethod.
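
For reference, a minimal sketch of the chunk-iterator alternative mentioned above. The names (`PromptChunk`, `prompt_chunks`, `CHUNK_SIZE`) are hypothetical stand-ins, not the actual get_prompt_input signature in mistral.rs:

```rust
/// Hypothetical sketch: split each sequence's prompt into fixed-size chunks
/// and yield one scheduling step per chunk index. Sequences whose prompts are
/// shorter than `chunk * CHUNK_SIZE` simply drop out of later steps.
const CHUNK_SIZE: usize = 512;

struct PromptChunk<'a> {
    seq_id: usize,
    tokens: &'a [u32],
    offset: usize, // position of the first token within the full prompt
}

fn prompt_chunks<'a>(
    prompts: &'a [(usize, Vec<u32>)],
) -> impl Iterator<Item = Vec<PromptChunk<'a>>> {
    let max_chunks = prompts
        .iter()
        .map(|(_, p)| (p.len() + CHUNK_SIZE - 1) / CHUNK_SIZE)
        .max()
        .unwrap_or(0);
    (0..max_chunks).map(move |chunk| {
        prompts
            .iter()
            .filter_map(move |(seq_id, prompt)| {
                let start = chunk * CHUNK_SIZE;
                if start >= prompt.len() {
                    return None; // this sequence has no tokens left in this step
                }
                let end = (start + CHUNK_SIZE).min(prompt.len());
                Some(PromptChunk {
                    seq_id: *seq_id,
                    tokens: &prompt[start..end],
                    offset: start,
                })
            })
            .collect::<Vec<_>>()
    })
}
```

The useful property is that later steps naturally contain fewer sequences, so shorter prompts finish early without being padded up to the longest prompt.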

@EricLBuehler linked an issue Apr 27, 2024 that may be closed by this pull request
@lucasavila00 (Contributor, Author)

> Is this necessary to avoid the memory spikes? If so, we should make this feature a new SchedulerMethod.

I ran the benchmark and it is not required. The spike only happens with pp=2048, not with pp=512, c=4.
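
As a rough, hypothetical illustration of why the per-pass chunk length dominates the spike: assuming naive (non-flash) attention that materializes a [heads, tokens, tokens] score matrix, memory grows quadratically with the number of prompt tokens in a single forward pass.

```rust
// Hypothetical back-of-the-envelope estimate (not measured): per-layer
// attention-score buffer for one prefill forward pass, assuming naive
// (non-flash) attention materializing a [heads, tokens, tokens] f32 matrix.
fn score_buffer_mib(heads: usize, tokens: usize) -> usize {
    heads * tokens * tokens * std::mem::size_of::<f32>() / (1024 * 1024)
}

fn main() {
    // 32 query heads, as in Mistral-7B.
    println!("pp=2048: {} MiB per layer", score_buffer_mib(32, 2048)); // 512 MiB
    println!("pp=512:  {} MiB per layer", score_buffer_mib(32, 512)); //  32 MiB
}
```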

> One improvement that I see is to allow a batch size greater than one.

Is it worth supporting parallel prefill? For me, using GGUF, it only helps when prefilling <128 tokens. With more than 128 tokens, padding makes it slower than non-parallel prefill.

So it is only worth running parallel prefill on a batch of very small sequences. Check the benchmarks below (and the padding sketch after them).

+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+
| model                              | backend | test   | t/s           | ms/t        | concurrency | throughput/s |
+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 586.483±0.000 | 1.705±0.000 |           1 |     586.4834 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 344.665±0.348 | 2.901±0.003 |           2 |     689.3309 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 228.368±0.166 | 4.379±0.003 |           3 |     685.1029 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 172.260±0.111 | 5.805±0.004 |           4 |     689.0406 |
+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+
+------------------------------------+---------+-------+---------------+-------------+-------------+--------------+
| model                              | backend | test  | t/s           | ms/t        | concurrency | throughput/s |
+------------------------------------+---------+-------+---------------+-------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 64 | 292.237±0.000 | 3.422±0.000 |           1 |    292.23743 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 64 | 308.450±2.230 | 3.242±0.023 |           2 |     616.8997 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 64 | 206.233±0.828 | 4.849±0.019 |           3 |     618.6996 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 64 | 172.977±0.872 | 5.781±0.029 |           4 |     691.9095 |
+------------------------------------+---------+-------+---------------+-------------+-------------+--------------+
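
As a rough illustration of the padding argument (hypothetical prompt lengths, not taken from the runs above): batching prompts of different lengths means right-padding every prompt to the longest one, so the wasted work grows quickly once the lengths diverge.

```rust
/// Fraction of prefill compute spent on padding when prompts of different
/// lengths are batched together and right-padded to the longest prompt.
fn padding_waste(prompt_lens: &[usize]) -> f64 {
    let longest = *prompt_lens.iter().max().unwrap_or(&0);
    let padded = longest * prompt_lens.len();
    let useful: usize = prompt_lens.iter().sum();
    1.0 - useful as f64 / padded as f64
}

fn main() {
    // Equal lengths: no padding, so parallel prefill only gains throughput.
    println!("{:.0}%", padding_waste(&[512, 512, 512, 512]) * 100.0); // 0%
    // One long prompt dominates: most of the batch is padding.
    println!("{:.0}%", padding_waste(&[512, 64, 64, 64]) * 100.0); // ~66%
}
```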

> Alternatively, we could skip that and just modify get_prompt_input to return some sort of iterator over the chunks, in which some sequences would not be present

This could work...

In the end we need to decide whether to support parallel prefill. It makes batching more complicated, because we need to be aware of the cache and set it up per batch, and it is only faster if we're not padding anything.

@lucasavila00 (Contributor, Author)

@EricLBuehler if we remove parallel prefill then I think the approach of a165b7d might work?

@lucasavila00 lucasavila00 marked this pull request as ready for review April 27, 2024 04:30
@lucasavila00 (Contributor, Author)

So my current reasoning is:

  1. We already have natural parallelization of prefill, as we run multiple tokens at once
  2. Dealing with cache and batches for parallel prefill of multiple sequences is complex
  3. If the prompt size is reasonably big, we're almost at full speed
  4. If we have any padding, it'd be better to run sequentially without padding, given the small gains from running multiple sequence prompts in parallel

So I think we should move forward with the current approach. It is only worse if we're running the prompt phase of many tiny requests in parallel.
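
For context, a minimal sketch of what the sequential, chunked approach amounts to. The trait and types here are hypothetical stand-ins, not the actual mistral.rs model or cache API:

```rust
const CHUNK_SIZE: usize = 512;

struct KvCache;
struct Logits;

/// Hypothetical model interface: each call consumes `tokens` starting at
/// `offset` and appends the corresponding keys/values to `cache`.
trait ChunkedModel {
    fn forward_chunk(&self, tokens: &[u32], offset: usize, cache: &mut KvCache) -> Logits;
}

/// Sequential, chunked prefill: one sequence at a time, one chunk per forward
/// pass. No cross-sequence padding; the KV cache simply grows across chunks.
fn prefill<M: ChunkedModel>(model: &M, prompt: &[u32], cache: &mut KvCache) -> Logits {
    let mut last = None;
    for (i, chunk) in prompt.chunks(CHUNK_SIZE).enumerate() {
        last = Some(model.forward_chunk(chunk, i * CHUNK_SIZE, cache));
    }
    last.expect("prompt must be non-empty")
}
```

Only the logits of the final chunk are needed to sample the first completion token, so intermediate logits can be discarded.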

@EricLBuehler (Owner) left a comment


I like this approach. Can you please implement it for the rest of the models?

@EricLBuehler added the labels new feature (New feature or request), optimization, and models (Additions to model or architectures) on Apr 28, 2024
@lucasavila00 (Contributor, Author)

I cherry-picked the proper commits to #234 and I'll close this

Successfully merging this pull request may close these issues: Batched & chunked prefill.
2 participants