Batched & chunked prefill #216

Open
lucasavila00 opened this issue Apr 26, 2024 · 2 comments
Labels
models (Additions to model or architectures), new feature (New feature or request)

Comments

@lucasavila00 (Contributor) commented Apr 26, 2024

Similar to what was described here huggingface/candle#2108

"When prompts get longer than trivial sizes, the memory usage spikes as the prompt is thrown into one Tensor and sent off to a forward pass in the model at whatever length it comes in as. These spikes can be reduced by processing the batch in chunks."

There's a candle implementation here huggingface/candle#2111

Let's say we configure a setting batch_size = 512.

The scheduler would need to be aware of this setting and only schedule two prompts together if they are less than 512 tokens combined.

And the engine should be aware of it too: if a sequence is longer than 512 tokens, it should be split into chunks.

To reproduce it locally, run the benchmark with a high enough -p and you get an OOM:

./mistralrs-bench -p 2048 -g 0 -r 1 -c 1 gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf

2024-04-26T20:17:25.483829Z ERROR mistralrs_core::engine: prompt - Model failed with error: Cuda(Cuda(DriverError(CUDA_ERROR_OUT_OF_MEMORY, "out of memory")))

But generating the same number of tokens works:

./mistralrs-bench -p 0 -g 2048 -r 1 -c 1 gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf

+------------------------------------+---------+---------+--------------+--------------+-------------+--------------+
| model                              | backend | test    | t/s          | ms/t         | concurrency | throughput/s |
+------------------------------------+---------+---------+--------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | tg 2048 | 26.297±0.000 | 38.027±0.000 |           1 |    26.296867 |
+------------------------------------+---------+---------+--------------+--------------+-------------+--------------+
@EricLBuehler added the new feature and urgent labels Apr 26, 2024
@EricLBuehler (Owner) commented

@lucasavila00, this looks great. It'll require modifying the attention mask calculation of every model, so it may be helpful to factor those out into a layers.rs in mistralrs-core.

@EricLBuehler modified the milestone: Version 0.1.0 Apr 26, 2024
@EricLBuehler linked a pull request Apr 27, 2024 that will close this issue
@EricLBuehler added the models label and removed the urgent label Apr 28, 2024
@lucasavila00 changed the title from Batched prefill to Batched & chunkled prefill Apr 29, 2024
@lucasavila00 changed the title from Batched & chunkled prefill to Batched & chunked prefill Apr 29, 2024
@EricLBuehler (Owner) commented

@lucasavila00, I am actually going to end up adding this in #242.
