Quantized Mistral: Prompt processing slower than llama.cpp #153

lucasavila00 · 2024-04-16T06:27:33Z

Since generation speed almost matches llama.cpp after #152, I think it's worth trying to optimize prompt processing now.

lucasavila00 · 2024-04-27T15:37:54Z

Llama.cpp

/home/lucas/oss/llama.cpp/llama-bench  -m /home/lucas/.cache/huggingface/hub/models--TheBloke--Mistral-7B-Instruct-v0.2-GGUF/snapshots/3a6fbf4a41a1d52e415a4958cde6856d34b2db93/mistral-7b-instruct-v0.2.Q4_K_M.gguf -n 0 -p 512 -r 1

Mistral.rs

 "/home/lucas/oss/mistral.rs/target/profiling/mistralrs-bench" -p 512 -g 0 -r 1 -c 1  gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf

Llama.cpp dequantizes first and then does the matmul. We do the dequantization and the matmul together, in a single step.

This issue is useful: ggerganov/llama.cpp#3776, where they enabled the current approach.
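To make the two strategies concrete, here is a minimal CPU-only sketch. It assumes a toy block format (one `f32` scale per row of `i8` values), which is a simplified stand-in and not the Q4_K layout benchmarked in this thread; `QRow`, `dequantize`, `matvec` and `fused_matvec` are hypothetical names used only for illustration.

```rust
/// Toy quantized row: one f32 scale shared by a row of i8 values.
/// Simplified stand-in, not the real Q4_K block layout.
struct QRow {
    scale: f32,
    vals: Vec<i8>,
}

/// Path 1, step 1: expand the whole quantized matrix to f32.
fn dequantize(w: &[QRow]) -> Vec<Vec<f32>> {
    w.iter()
        .map(|r| r.vals.iter().map(|&v| v as f32 * r.scale).collect())
        .collect()
}

/// Path 1, step 2: plain dense matrix-vector product.
fn matvec(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

/// Path 2: dequantize on the fly inside the dot product (fused).
fn fused_matvec(w: &[QRow], x: &[f32]) -> Vec<f32> {
    w.iter()
        .map(|r| {
            let acc: f32 = r.vals.iter().zip(x).map(|(&v, b)| v as f32 * b).sum();
            acc * r.scale
        })
        .collect()
}

fn main() {
    let w = vec![
        QRow { scale: 0.5, vals: vec![2, -4, 6] },
        QRow { scale: 0.25, vals: vec![8, 0, -8] },
    ];
    let x = [1.0f32, 2.0, 3.0];
    let two_step = matvec(&dequantize(&w), &x);
    let fused = fused_matvec(&w, &x);
    println!("two-step: {two_step:?}, fused: {fused:?}");
}
```

Both paths compute the same product. The trade-off discussed in the rest of the thread is about speed: the two-step path can hand the dense matrix to a tuned GEMM library, while the fused path avoids materializing the dequantized weights.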

EricLBuehler · 2024-04-27T17:00:31Z

@lucasavila00, do you think we should also dequantize to F16 for large batch sizes? To my understanding, this is beneficial because the BLAS implementation of the matrix-matrix product becomes faster than our MMQ kernel as the batch size increases.

lucasavila00 · 2024-04-27T17:13:48Z

@EricLBuehler I'd like to test it...

I tried running the candle example using candle before they added the MMQ kernels, and performance was the same-ish.

I also tried to manually dequantize the QMatMuls of the attention layer and saw no improvements.

If you have a different approach I'd be glad to test it.

lucasavila00 · 2024-04-27T21:36:14Z

huggingface/candle#1706

https://github.com/huggingface/candle-cublaslt

I think we need to dequantize and use these cublaslt kernels? I'll try it.

EricLBuehler · 2024-04-27T21:44:48Z

@lucasavila00, that sounds great. Please let me know the results!

lucasavila00 · 2024-04-27T23:03:02Z

@EricLBuehler candle already uses cublaslt, see MR #230

forcing dequantization then matmul

./target/profiling/mistralrs-bench -p 512 -g 0 -r 5 -c 1  gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+
| model                              | backend | test   | t/s           | ms/t        | concurrency | throughput/s |
+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 286.886±6.405 | 3.487±0.080 |           1 |     286.8858 |
+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+

master

./target/profiling/mistralrs-bench -p 512 -g 0 -r 5 -c 1  gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
+------------------------------------+---------+--------+----------------+-------------+-------------+--------------+
| model                              | backend | test   | t/s            | ms/t        | concurrency | throughput/s |
+------------------------------------+---------+--------+----------------+-------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 547.439±18.785 | 1.829±0.065 |           1 |    547.43933 |
+------------------------------------+---------+--------+----------------+-------------+-------------+--------------+

EricLBuehler · 2024-04-27T23:18:20Z

@lucasavila00, that is very interesting. How did you force the dequantization?

lucasavila00 · 2024-04-27T23:19:12Z

> @lucasavila00, that is very interesting. How did you force the dequantization?

With the lt_mul function of the MR https://github.com/EricLBuehler/mistral.rs/pull/230/files#diff-da1e6f56f0e565985ccaa246f41d45f33271525bb3ae0d3a776cb282ce797676R27

I forced it for the attention weights and MLP only

EricLBuehler · 2024-04-27T23:34:03Z

@lucasavila00, does llama.cpp also get a similar T/s to our 547? It seems like dequantizing reduces performance severely, but perhaps it is better for bigger batch sizes?

lucasavila00 · 2024-04-27T23:41:19Z

llama.cpp is 1700t/s, I forced it to use bs=512 and pp=512, which should be equal to our pp=512

$ /home/lucas/oss/llama.cpp/llama-bench  -m /home/lucas/.cache/huggingface/hub/models--TheBloke--Mistral-7B-Instruct-v0.2-GGUF/snapshots/3a6fbf4a41a1d52e415a4958cde6856d34b2db93/mistral-7b-instruct-v0.2.Q4_K_M.gguf -n 0 -p 512 -r 1 -b 512
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
| model                          |       size |     params | backend    | ngl |    n_batch | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------- | ---------------: |
| llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | CUDA       |  99 |        512 | pp 512     |   1747.07 ± 0.00 |

build: 7593639c (2679)

$ ./target/profiling/mistralrs-bench -p 512 -g 0 -r 1 -c 1  gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
2024-04-27T23:38:07.937134Z  INFO mistralrs_bench: avx: true, neon: false, simd128: false, f16c: true
2024-04-27T23:38:07.937150Z  INFO mistralrs_bench: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-04-27T23:38:07.937153Z  INFO mistralrs_bench: Loading model `mistralai/Mistral-7B-Instruct-v0.1` on Cuda(CudaDevice(DeviceId(1)))...
2024-04-27T23:38:07.937168Z  INFO mistralrs_bench: Model kind is: quantized from gguf (no adapters)
[mistralrs-core/src/models/quantized_llama.rs:392:9] &layers.len() = 32
2024-04-27T23:38:09.636351Z  INFO mistralrs_core::pipeline::chat_template: bos_tok = <s>, eos_tok = ["</s>"], unk_tok = <unk>
2024-04-27T23:38:09.667093Z  INFO mistralrs_bench: Model loaded.
+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+
| model                              | backend | test   | t/s           | ms/t        | concurrency | throughput/s |
+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 596.042±0.000 | 1.678±0.000 |           1 |    596.04193 |
+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+

lucasavila00 · 2024-04-27T23:47:54Z

@EricLBuehler when I run llama.cpp and mistral.rs in interactive mode then I get close results...

https://gist.github.com/lucasavila00/0155f94fbf13e988384af53af8841b0f

llama_print_timings: prompt eval time =     706,45 ms /   436 tokens (    1,62 ms per token,   617,17 tokens per second)

2024-04-27T23:46:36.882094Z  INFO mistralrs_core::engine: Prompt[445] Completion[] - 765ms

So I guess our pp benchmark is incorrect in its attempt to match llama.cpp? I'm lost now 😄

lucasavila00 · 2024-04-28T00:29:33Z

Ah, never mind the above. Llama.cpp was running on the CPU there, at ~700 tok/s. I forgot the ngl param.

https://gist.github.com/lucasavila00/646b6f6cb9757d1329dc7296b5f16e3e

llama_print_timings: prompt eval time =     279,40 ms /   436 tokens (    0,64 ms per token,  1560,48 tokens per second)

So llama.cpp is indeed ~3x faster, and both benchmarks measure correctly.

lucasavila00 · 2024-04-28T00:33:31Z

When I force de-quantization & matmul, candle uses these Volta kernels (and so does forcing cublaslt).

But llama.cpp uses some Turing kernels.

EricLBuehler · 2024-04-28T01:24:18Z

@lucasavila00, I wonder if it is the Volta kernels that are slower than the Turing ones? It seems like we spend ~62% of our time in the sgemm function, but llama.cpp spends ~21-27% of its time in h1688gemm.

lucasavila00 · 2024-04-28T01:39:03Z

@EricLBuehler that seems to be the case. I can't find where the Turing kernels come from though. I assume they are from an NVIDIA library, but I can't figure out why llama.cpp uses a different version from candle/cudarc 🤔

lucasavila00 · 2024-04-28T05:45:46Z

The version differs depending on heuristics

Using this for matmuls I can trigger the Turing kernels, but it takes too long on the f32->f16 conversions 🤔

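use candle_core::quantized::QMatMul;
use candle_core::{DType, Result, Tensor};

// Dequantize the weight, then run the matmul in F16 and convert the result
// back to F32.
fn lt_mul(xs: &Tensor, w: &QMatMul) -> Result<Tensor> {
    // Get a dense weight tensor, dequantizing if needed.
    let w = match w {
        QMatMul::QTensor(ref qt) => qt.dequantize(xs.device())?,
        QMatMul::Tensor(w) => w.clone(),
    };

    // Broadcast the weight over the leading batch dims of `xs`, then transpose.
    let w = match *xs.dims() {
        [b1, b2, _, _] => w.broadcast_left((b1, b2))?.t()?,
        [bsize, _, _] => w.broadcast_left(bsize)?.t()?,
        _ => w.t()?,
    };

    // These F32 -> F16 conversions are the slow part mentioned above.
    let xs = xs.to_dtype(DType::F16)?;
    let w = w.to_dtype(DType::F16)?;

    xs.matmul(&w)?.to_dtype(DType::F32)
}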

lucasavila00 · 2024-04-28T06:47:49Z

Llama.cpp can dequantize directly to f16, candle cannot... Maybe it's worth it to raise an issue for direct-f16-dequantization?

EricLBuehler · 2024-04-28T09:15:33Z

@lucasavila00, I have raised an issue.

lucasavila00 · 2024-04-28T15:22:48Z

The PR #238 has the latest iteration of the code.

It uses dequant+matmul only for prompts, and does the matmul in f16.

It also has comparisons of runs between mistralrs-bench and llama-bench, and nvidia profiles of the 2 projects.

lucasavila00 · 2024-04-28T19:52:27Z

I think the current difference is now due to different kernels?

Even though the names of the kernels are almost the same, it seems the ones used by candle are slower.

I'm trying to figure out why they don't use the exact same kernels.

The kernel distribution is almost the same between llama.cpp and mistral.rs, and the overall time difference matches the discrepancy between those 2 kernels.

EricLBuehler · 2024-04-28T20:35:11Z

If I am not mistaken, our completion performance should also be improved by 60% (like prompt perf) because of the new F16 dequant support?

lucasavila00 · 2024-04-28T20:39:08Z

> If I am not mistaken, our completion performance should also be improved by 60% (like prompt perf) because of the new F16 dequant support?

For batch sizes > 8, yes.

For batch sizes <=8 I think we'll want to continue to use MMQ (that's what llama.cpp does)

The cublas MR still has these as TODOs though https://github.com/EricLBuehler/mistral.rs/pull/238/files#diff-da1e6f56f0e565985ccaa246f41d45f33271525bb3ae0d3a776cb282ce797676R20-R22

EricLBuehler · 2024-04-28T20:41:12Z

Ah, ok. I'm interested in how our performance compares to llama.cpp in that situation.

lucasavila00 · 2024-04-28T20:41:47Z

That MR currently uses cublas for prompt and MMQ for completion.

It should be something like cublas for prompt if seq_len > 32, otherwise MMQ.
And for completion it should use MMQ if bs <=8, otherwise cublas.

These are the llama.cpp heuristics if I understood it correctly
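The dispatch rule described above can be sketched as a small function. The thresholds (32 and 8) are the ones quoted in this comment; `MatmulPath` and `choose_path` are hypothetical names for illustration and are not taken from the mistral.rs or llama.cpp code.

```rust
/// Which matmul path to take for a quantized weight.
#[derive(Debug, PartialEq, Clone, Copy)]
enum MatmulPath {
    /// Fused quantized kernel (MMQ).
    Mmq,
    /// Dequantize, then call cuBLAS.
    DequantCublas,
}

// Thresholds as quoted in the thread; treat them as tunable assumptions.
const PROMPT_SEQ_LEN_THRESHOLD: usize = 32;
const COMPLETION_BATCH_THRESHOLD: usize = 8;

/// Pick a path from the phase (prompt or completion) and the problem size.
fn choose_path(is_prompt: bool, seq_len: usize, batch_size: usize) -> MatmulPath {
    if is_prompt {
        // Long prompts give cuBLAS enough work to beat MMQ.
        if seq_len > PROMPT_SEQ_LEN_THRESHOLD {
            MatmulPath::DequantCublas
        } else {
            MatmulPath::Mmq
        }
    } else if batch_size <= COMPLETION_BATCH_THRESHOLD {
        // Small completion batches stay on MMQ.
        MatmulPath::Mmq
    } else {
        MatmulPath::DequantCublas
    }
}

fn main() {
    println!("{:?}", choose_path(true, 512, 1)); // prompt of 512 tokens
    println!("{:?}", choose_path(false, 1, 4)); // completion, batch of 4
}
```

Keeping the thresholds as named constants makes it easy to re-tune them per GPU if the crossover point turns out to differ.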

lucasavila00 · 2024-04-28T20:42:36Z

Ah, I'm not even benchmarking prompts with batch sizes > 1, because I'm assuming we'll move forwards with #234

EricLBuehler · 2024-04-28T20:43:52Z

Yes, I just need to finish the testing and then I'll merge #234. I am looking forward to Candle adding support for calling hgemm, but if that takes a while I can add it.

lucasavila00 · 2024-04-29T01:18:59Z

I think we're not measuring the same timings as llama.cpp exactly. Prompt timings include a memory transfer and the sampling.

After huggingface/candle#2139 (comment)

If I look at just the nvidia profile of a warmed run, llama.cpp takes ~350ms and mistral.rs takes ~400ms.

That puts llama.cpp at ~1500t/s and mistral.rs at ~1300t/s
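As a quick check of these figures, the conversion from wall time to throughput can be written out. This sketch assumes the profiled run is the 512-token prompt used by the benchmarks earlier in the thread, which the comment does not state explicitly; `tokens_per_second` is a hypothetical helper.

```rust
/// Convert a prompt length and an elapsed time in milliseconds to tokens/s.
fn tokens_per_second(tokens: usize, elapsed_ms: f64) -> f64 {
    tokens as f64 / (elapsed_ms / 1000.0)
}

fn main() {
    // Assumed prompt length: 512 tokens (the `-p 512` benchmark setting).
    let llama = tokens_per_second(512, 350.0);
    let mistral = tokens_per_second(512, 400.0);
    println!("llama.cpp: {llama:.0} t/s, mistral.rs: {mistral:.0} t/s");
}
```

Under that assumption the values come out near 1463 and 1280 t/s, which is consistent with the rounded ~1500 and ~1300 quoted above.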

EricLBuehler · 2024-04-29T01:21:09Z

@lucasavila00 yes, that is possible. Are they timing the memory transfer and sampling?

lucasavila00 · 2024-04-29T01:27:43Z

> @lucasavila00 yes, that is possible. Are they timing the memory transfer and sampling?

No, they're just synchronizing.

I wonder why mistral.rs has this 35ms of DtoH transfer. It happens only at prompt time, so it can't be logits transfer to CPU...

lucasavila00 · 2024-04-29T01:28:53Z

BTW this is llama.cpp, filtered

And mistral.rs, filtered

EricLBuehler · 2024-04-29T01:31:18Z

> I wonder why mistral.rs has this 35ms of DtoH transfer. It happens only at prompt time, so it can't be logits transfer to CPU...

Maybe it is our cloning in&out of the cache? I don't think that incurs any dtoh. Can you disable the prefix cacher to make sure it isn't doing anything?

lucasavila00 · 2024-04-29T01:34:24Z

> > I wonder why mistral.rs has this 35ms of DtoH transfer. It happens only at prompt time, so it can't be logits transfer to CPU...
>
> Maybe it is our cloning in&out of the cache? I don't think that incurs any dtoh. Can you disable the prefix cacher to make sure it isn't doing anything?

I'm using mistralrs-bench, which passes the config to disable it. Maybe it is still doing something?

lucasavila00 · 2024-04-29T01:48:30Z

After the latest iteration, removing all volta gemms:

EricLBuehler · 2024-04-29T01:49:34Z

Ah ok. No, all functions are essentially gated by no_prefix_cache. If we remove those 35 ms dtoh, then we are at 351ms which is basically the same.

EricLBuehler · 2024-04-29T01:50:27Z

The only major dtoh I can think of is during sampling...

lucasavila00 · 2024-04-29T01:55:45Z

> Ah ok. No, all functions are essentially gated by no_prefix_cache. If we remove those 35 ms dtoh, then we are at 351ms which is basically the same.

I'm not counting the DtoH, so we're at 380ms regardless of it.

> The only major dtoh I can think of is during sampling...

But why does it last 30ms for the prompt only, and not for completion? 🤔

lucasavila00 · 2024-04-29T01:57:03Z

This runs p=512, then g=1 (otherwise it crashes; we exclude it from the results).

So after every prompt block, it does this DtoH.

After the DtoH it does 1 completion loop, and after it another DtoH but one needs to zoom-in to see it.

lucasavila00 · 2024-04-29T02:04:25Z

I think the rest of the time difference is due to slow attention mask application.

I'm trying to gather evidence and look into improving it here or upstream in candle.

EricLBuehler · 2024-04-29T02:08:13Z

Ah, the htod copy when making the attention mask may be to blame. Perhaps we could pre-generate a bunch (up to 512 tokens) and cache them?

lucasavila00 · 2024-04-29T02:12:08Z

> Ah, the htod copy when making the attention mask may be to blame. Perhaps we could pre-generate a bunch (up to 512 tokens) and cache them?

Doesn't look like it.

It looks like it's a slow kernel in where_u8_f32

It comes from this part of the code (specifically, where_cond)

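use candle_core::{Result, Tensor};

// Select `on_true` wherever `mask` is set, and `on_false` everywhere else.
fn masked_fill(on_false: &Tensor, mask: &Tensor, on_true: &Tensor) -> Result<Tensor> {
    let shape = mask.shape();
    // `where_cond` is the call that shows up as `where_u8_f32` in the profile.
    let m = mask.where_cond(&on_true.broadcast_as(shape.dims())?, on_false)?;
    Ok(m)
}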

In the picture I highlighted the second attention layer: there's no htod at the bottom like in the first layer, yet it's still the slowest kernel of the attention mechanism.
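For readers unfamiliar with the operation, here is a CPU-only sketch of what a masked fill computes on attention scores. It assumes a causal mask where `true` marks future positions and a fill value of negative infinity; both are illustrative assumptions, and `causal_mask` and this plain-`Vec` `masked_fill` are hypothetical helpers, not the candle implementation.

```rust
/// Build a causal mask: `true` marks a key position that lies in the
/// future of the query position and must be hidden.
fn causal_mask(seq_len: usize) -> Vec<Vec<bool>> {
    (0..seq_len)
        .map(|q| (0..seq_len).map(|k| k > q).collect())
        .collect()
}

/// Element-wise select: take `on_true` where the mask is set,
/// otherwise keep the original score.
fn masked_fill(scores: &[Vec<f32>], mask: &[Vec<bool>], on_true: f32) -> Vec<Vec<f32>> {
    scores
        .iter()
        .zip(mask)
        .map(|(srow, mrow)| {
            srow.iter()
                .zip(mrow)
                .map(|(&s, &m)| if m { on_true } else { s })
                .collect()
        })
        .collect()
}

fn main() {
    let scores = vec![vec![1.0f32; 3]; 3];
    let mask = causal_mask(3);
    let masked = masked_fill(&scores, &mask, f32::NEG_INFINITY);
    for row in &masked {
        println!("{row:?}");
    }
}
```

The work is one select per score element, so for a prompt of length n it touches n x n values per head, which may be why it stands out in a prompt-only profile.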

lucasavila00 · 2024-04-29T02:30:35Z

It could also be that llama.cpp does not use attention mask in the benchmark.

I can't find the timings for attention mask in the profile.

Since we spend ~10% of the time on the attention mask, that is roughly the 35ms out of 385ms by which we differ from llama.cpp.

EricLBuehler · 2024-04-29T02:39:13Z

Ok. Do you think we can find a way to disable this elegantly?

lucasavila00 · 2024-04-29T04:17:34Z

I disabled attention mask for mistralrs-bench here 97c0324

It used the builders so it did not require a public API change.

I also re-ran the profiles with my GPU at the same temps, and with the latest commit I see:

llama.cpp 320ms
mistral.rs 340ms

From the profiles, I'm pretty sure llama.cpp is indeed not doing masking.

There's no big difference now, but I can see improvements we could make:

- fuse copy + tof16 into copy_to_f16 in repeat_kv
- fuse the affine division of (self.head_dim as f64).sqrt() into the matrix multiplication (candle doesn't expose it but the underlying kernel has this parameter)

The affine division above is surprisingly one of the most expensive operations of the attention mechanism. I wonder if candle can optimize that...

Also, it looks like llama.cpp copies data and converts dtypes faster than candle.
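The affine-division fusion suggested above relies on BLAS-style GEMM routines accepting a scalar `alpha` that multiplies the product. A minimal sketch with a naive CPU GEMM follows; `gemm_scaled` is a hypothetical stand-in for the real kernel, and the tiny shapes are chosen only so the numbers are easy to follow.

```rust
/// Naive row-major GEMM computing `alpha * (a @ b)`.
/// `a` is m x k and `b` is k x n. The `alpha` argument mirrors the
/// scaling factor that BLAS-style GEMM routines accept.
fn gemm_scaled(alpha: f32, a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            // The scale is applied while writing the result: no extra pass.
            out[i * n + j] = alpha * acc;
        }
    }
    out
}

fn main() {
    let head_dim = 4usize;
    let sqrt_d = (head_dim as f32).sqrt();
    let q = [1.0f32, 2.0, 3.0, 4.0]; // 1 x 4 query row
    let kt = [1.0f32, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0]; // 4 x 2 transposed keys

    // Unfused: matmul, then a separate element-wise division pass.
    let unfused: Vec<f32> = gemm_scaled(1.0, &q, &kt, 1, 4, 2)
        .iter()
        .map(|v| v / sqrt_d)
        .collect();

    // Fused: fold 1/sqrt(head_dim) into alpha.
    let fused = gemm_scaled(1.0 / sqrt_d, &q, &kt, 1, 4, 2);

    println!("unfused: {unfused:?}, fused: {fused:?}");
}
```

Both variants give the same scores; the fused one skips a full read and write of the score matrix, which is the cost the separate affine kernel pays.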

EricLBuehler · 2024-05-15T18:20:00Z

I think we could close this now, after the merge of #238, but please feel free to reopen. Thank you for your help, I really appreciate it. I will be looking into fusing the affine division.

lucasavila00 mentioned this issue Apr 16, 2024

Optimize quantized masked fill #162

Merged

lucasavila00 changed the title ~~Quantized Mistral: Prompt processing at 50% of llama.cpp speed~~ Quantized Mistral: Prompt processing slower than llama.cpp Apr 16, 2024

lucasavila00 mentioned this issue Apr 17, 2024

mistralrs-bench #166

Merged

EricLBuehler mentioned this issue Apr 28, 2024

Adding direct-F16 quantization huggingface/candle#2136

Closed

lucasavila00 mentioned this issue Apr 28, 2024

Candle won't use half-gemm from cublas when doing fp16 matmul huggingface/candle#2139

Closed

EricLBuehler closed this as completed May 15, 2024

polarathene mentioned this issue May 19, 2024

bug: If device layers requested exceed model layers, host layers overflow #329

Open
