-
@JohannesGaessler @slaren as you are the main contributors on the CUDA backend, feel free to highlight or amend any hypothesis. Thanks a lot for your impressive work here.
-
I have the same objective. What data/prompt did you use, and can I use the same with llama-bench?
-
Performance and improvement areas
This thread's objective is to gather llama.cpp performance 📈 and improvement ideas 💡 against other popular LLM inference frameworks, especially on the CUDA backend. Let's try to fill the gap 🚀 in text generation inference.
I have run a couple of benchmarks from the OpenAI /chat/completions endpoint client point of view, using JMeter on 2 A100 with mixtral8x7b and a fine-tuned llama70b.
Note 1: from the client point of view, it is not possible to get accurate PP and TG because, first, you need streaming enabled, and then PP will always include one generated token. So it is easier to compare the total tokens of the transactions reported in completions.usage (see the example after the metrics list below).
Note 2: from a performance-test server point of view, we generally consider the following metrics:
- iterations: total requests successfully completed during the test
- prompt tokens: average prompt tokens per request, the same by iteration number for all tests
- generated tokens: average generated tokens per request
- RPM: requests per minute
- latency: duration of the HTTP request in seconds
- PP+TG: total tokens HTTP clients send and receive per second
- errors: number of requests in error during the test from the client point of view; it can be an HTTP timeout or a connection close, and is not necessarily caused by the server
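To make Note 1 concrete, here is a minimal way to read the token totals the client sees. This is an illustrative sketch, not the benchmark setup: it assumes a llama.cpp server listening on the default port 8080 and uses jq for JSON extraction.

# Hypothetical single request against a local OpenAI-compatible server;
# jq prints prompt_tokens, completion_tokens and total_tokens from completions.usage.
curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}' \
    | jq '.usage'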
Context size
The transaction token context used here is:
Results
llama70b @ eedd42e
llama.cpp configuration
server --model myllama70b-f16-00001-of-00010.gguf \
    --ctx-size 32768 \
    --n-predict 4096 \
    --n-gpu-layers 81 \
    --batch-size 4096 \
    --ubatch-size 256 \
    --parallel 1|32 \
    --metrics \
    --log-format text
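Since the server is started with --metrics, it exposes a Prometheus-compatible /metrics endpoint. A quick sanity check (the port is assumed to be the default 8080):

# Assumes the server listens on the default port 8080
curl -s http://localhost:8080/metrics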
vLLM configuration
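The exact vLLM command used for the benchmark is not reproduced here. As a rough sketch only, an OpenAI-compatible vLLM server on 2 GPUs is typically launched along these lines; the model id and context length below are placeholders, not the benchmark values:

# Illustrative only: model id and context length are placeholders
python -m vllm.entrypoints.openai.api_server \
    --model <hf-model-id> \
    --tensor-parallel-size 2 \
    --max-model-len 32768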
TGI configuration
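Likewise, the TGI command is not reproduced here; a representative text-generation-launcher invocation for a sharded 2-GPU deployment looks roughly like this (model id and token limits are placeholders):

# Illustrative only: model id and token limits are placeholders
text-generation-launcher \
    --model-id <hf-model-id> \
    --num-shard 2 \
    --max-input-length 3072 \
    --max-total-tokens 4096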
Please note how easy it is:
mixtral8x7b
llama.cpp configuration @ 137fbb8
server --model mixtral-8x7b-instruct-f16-00001-of-00010.gguf \
    --ctx-size 131072 \
    --n-predict 4096 \
    --n-gpu-layers 33 \
    --batch-size 4096 \
    --ubatch-size 256 \
    --parallel 1|32 \
    --metrics \
    --log-format text
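Note that the server splits the context window across parallel slots, so with --ctx-size 131072 and --parallel 32 each slot gets 131072 / 32 = 4096 tokens of context (assuming the default even split across slots).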
Magically, the vLLM and TGI configurations are not changed.
Areas of improvement for llama.cpp
Please @ggerganov edit at will.