-
@JohannesGaessler @slaren as you are the main contributors on the CUDA backend, feel free to highlight or amend any hypothesis. Thanks a lot for your impressive work here.
-
I have the same objective. What data/prompt did you use, and can I use the same with llama-bench?
-
Performance and improvement areas
This thread's objective is to gather llama.cpp performance 📈 and improvement ideas 💡 against other popular LLM inference frameworks, especially on the CUDA backend. Let's try to fill the gap 🚀 in text generation inference.
I have run a couple of benchmarks from the OpenAI /chat/completions endpoint client point of view, using JMeter on 2 A100 with mixtral8x7b and a fine-tuned llama70b.
Note 1: from the client point of view, it is not possible to get accurate PP and TG because, first, you need streaming enabled, and then PP will always include one generated token. So it is easier to compare the total tokens of the transactions reported in completions.usage (see the example after the metrics list below).
Note 2: from a performance-test server point of view, we generally consider the following metrics:
- iterations: total requests successfully completed during the test
- prompt tokens: average prompt tokens per request, the same by iteration number for all tests
- generated tokens: average generated tokens per request
- RPM: requests per minute
- latency: duration of the HTTP request in seconds
- PP+TG: total tokens HTTP clients send and receive per second
- errors: number of requests in error during the test from the client point of view; it can be an HTTP timeout or a connection close, and is not necessarily caused by the server
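To make Note 1 concrete, here is a minimal way to read the token totals the client sees. This is an illustrative sketch, not the benchmark setup: it assumes a llama.cpp server listening on the default port 8080 and uses jq for JSON extraction.

# Hypothetical single request against a local OpenAI-compatible server;
# jq prints prompt_tokens, completion_tokens and total_tokens from completions.usage.
curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}' \
    | jq '.usage'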
Context size
The transaction token context used here is:
Results
llama70b @ eedd42e
llama.cpp configuration
server --model myllama70b-f16-00001-of-00010.gguf \
    --ctx-size 32768 \
    --n-predict 4096 \
    --n-gpu-layers 81 \
    --batch-size 4096 \
    --ubatch-size 256 \
    --parallel 1|32 \
    --metrics \
    --log-format text
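Since the server is started with --metrics, it exposes a Prometheus-compatible /metrics endpoint. A quick sanity check (the port is assumed to be the default 8080):

# Assumes the server listens on the default port 8080
curl -s http://localhost:8080/metrics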
vLLM configuration
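The exact vLLM command used for the benchmark is not reproduced here. As a rough sketch only, an OpenAI-compatible vLLM server on 2 GPUs is typically launched along these lines; the model id and context length below are placeholders, not the benchmark values:

# Illustrative only: model id and context length are placeholders
python -m vllm.entrypoints.openai.api_server \
    --model <hf-model-id> \
    --tensor-parallel-size 2 \
    --max-model-len 32768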
TGI configuration
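Likewise, the TGI command is not reproduced here; a representative text-generation-launcher invocation for a sharded 2-GPU deployment looks roughly like this (model id and token limits are placeholders):

# Illustrative only: model id and token limits are placeholders
text-generation-launcher \
    --model-id <hf-model-id> \
    --num-shard 2 \
    --max-input-length 3072 \
    --max-total-tokens 4096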
Please note how easy it is:
mixtral8x7b
llama.cpp configuration @ 137fbb8
server --model mixtral-8x7b-instruct-f16-00001-of-00010.gguf \
    --ctx-size 131072 \
    --n-predict 4096 \
    --n-gpu-layers 33 \
    --batch-size 4096 \
    --ubatch-size 256 \
    --parallel 1|32 \
    --metrics \
    --log-format text
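Note that the server splits the context window across parallel slots, so with --ctx-size 131072 and --parallel 32 each slot gets 131072 / 32 = 4096 tokens of context (assuming the default even split across slots).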
Magically, the vLLM and TGI configurations are not changed.
Areas of improvement for llama.cpp
Please @ggerganov edit at will.