Batched & chunked prefill #216

Open
lucasavila00 opened this issue Apr 26, 2024 · 2 comments
Labels
models (Additions to model or architectures), new feature (New feature or request)

Comments

@lucasavila00 (Contributor) commented Apr 26, 2024

Similar to what was described here huggingface/candle#2108

"When prompts get longer than trivial sizes, the memory usage spikes as the prompt is thrown into one Tensor and sent off to a forward pass in the model at whatever length it comes in as. These spikes can be reduced by processing the batch in chunks."

There's a candle implementation here huggingface/candle#2111

Let's say we configure a setting batch_size = 512.

The scheduler would need to be aware of this setting and only schedule two prompts together if they are less than 512 tokens combined.

And the engine should be aware of it too: if a sequence is longer than 512 tokens, it should be split into chunks.

To reproduce it locally, run the benchmark with a high enough -p and you get an OOM:

./mistralrs-bench -p 2048 -g 0 -r 1 -c 1 gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf

2024-04-26T20:17:25.483829Z ERROR mistralrs_core::engine: prompt - Model failed with error: Cuda(Cuda(DriverError(CUDA_ERROR_OUT_OF_MEMORY, "out of memory")))

But generating the same number of tokens works:

./mistralrs-bench -p 0 -g 2048 -r 1 -c 1 gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf

+------------------------------------+---------+---------+--------------+--------------+-------------+--------------+
| model                              | backend | test    | t/s          | ms/t         | concurrency | throughput/s |
+------------------------------------+---------+---------+--------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | tg 2048 | 26.297±0.000 | 38.027±0.000 |           1 |    26.296867 |
+------------------------------------+---------+---------+--------------+--------------+-------------+--------------+
@EricLBuehler added the new feature and urgent labels Apr 26, 2024
@EricLBuehler (Owner) commented

@lucasavila00, this looks great. It'll require modifying the attention mask calculation of every model, so it may be helpful to factor those out into a layers.rs in mistralrs-core.

@EricLBuehler modified the milestone: Version 0.1.0 Apr 26, 2024
@EricLBuehler linked a pull request Apr 27, 2024 that will close this issue
@EricLBuehler added the models label and removed the urgent label Apr 28, 2024
@lucasavila00 changed the title from Batched prefill to Batched & chunkled prefill Apr 29, 2024
@lucasavila00 changed the title from Batched & chunkled prefill to Batched & chunked prefill Apr 29, 2024
@EricLBuehler (Owner) commented

@lucasavila00, I am actually going to end up adding this in #242.
