Adding direct-F16 quantization #2136

EricLBuehler · 2024-04-28T09:15:17Z

Hello all,

During our work on mistral.rs we noticed that Candle only dequantizes to F32, whereas llama.cpp can dequantize to F16. This affects performance because, on certain hardware, the Turing matmul kernels are used instead of the slower Volta ones when the inputs are F16. Are there any plans to add support for dequantizing to arbitrary floating point datatypes in the future?

For reference, here is our tracking issue: EricLBuehler/mistral.rs#153

Thank you!

lucasavila00 · 2024-04-28T15:40:58Z

Some extra context (the numbers are from an RTX 2070).

A prompt of 512 tokens is processed at ~600t/s using the MMQ kernels.

If I force it to dequantize first, convert to f16, do the matmuls in f16, and then convert the result back to f32, I can get candle to use the same kernels llama.cpp uses for prompt processing (I think? The names are almost the same). This runs at ~700t/s.

With this latter approach, 25% of the GPU time is spent doing f32 -> f16 conversion. Ideally we'd dequantize directly to f16 to reduce some of that workload.
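
As a rough sanity check on the 25% figure, the sketch below estimates the best-case throughput if the f32 -> f16 conversion cost disappeared entirely and everything else stayed the same. The helper name `throughput_without_fraction` is made up for illustration, and the model (GPU time is the only bottleneck, and the removed work costs nothing afterwards) is an assumption, not a measurement.

```rust
/// Best-case throughput if `removed_fraction` of the current run time is
/// eliminated and the remaining work is unchanged (an Amdahl-style bound).
/// Hypothetical helper, not part of candle or mistral.rs.
fn throughput_without_fraction(current_tps: f64, removed_fraction: f64) -> f64 {
    // The fraction must leave some work behind, otherwise the bound is infinite.
    assert!((0.0..1.0).contains(&removed_fraction));
    current_tps / (1.0 - removed_fraction)
}
```

Under these assumptions, `throughput_without_fraction(700.0, 0.25)` gives roughly 933 t/s as the ceiling from removing the conversion alone.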

The PR EricLBuehler/mistral.rs#238 implements what I described above. It contains comparisons between llama-bench and mistralrs-bench, and NVIDIA profiles of both applications.

These lines of llama.cpp do the same f32 -> f16 conversion and matmul: https://github.com/ggerganov/llama.cpp/blob/master/ggml-cuda.cu#L1232-L1270. They are called from https://github.com/ggerganov/llama.cpp/blob/master/ggml-cuda.cu#L1959

LaurentMazare · 2024-04-28T16:18:36Z

That sounds like some pretty neat speedup to get. Is it just useful for cuda or also for cpu/metal?

EricLBuehler · 2024-04-28T16:36:33Z

I think this would be an optimization for CUDA.

LaurentMazare · 2024-04-28T16:50:58Z

Ok thanks, let me have a quick look. I don't think that the kernels do any float-specific magic, so the conversion shouldn't be tricky.
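
To illustrate why the output type can be treated as a detail of the final store, here is a CPU-side sketch. It is not Candle's or llama.cpp's kernel code: `BlockQ8`, its `f32` scale, and every function name are assumptions made for the example. f16 values are represented as raw `u16` bit patterns because stable Rust has no built-in f16 type, and the conversion flushes results below the normal f16 range to zero for brevity.

```rust
/// Hypothetical Q8_0-style block: one scale shared by 32 signed 8-bit quants.
/// (Assumption: the scale is kept as f32 here to keep the example short.)
struct BlockQ8 {
    d: f32,
    qs: [i8; 32],
}

/// Convert an f32 to IEEE 754 binary16 bits, rounding to nearest even.
/// Simplification: values too small for a normal f16 are flushed to zero.
fn f32_to_f16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    let sign = ((bits >> 16) & 0x8000) as u16;
    let exp = ((bits >> 23) & 0xff) as i32;
    let mant = bits & 0x007f_ffff;
    if exp == 0xff {
        // Infinity or NaN: keep a mantissa bit set for NaN.
        let nan_bit: u16 = if mant != 0 { 0x0200 } else { 0 };
        return sign | 0x7c00 | nan_bit;
    }
    let e = exp - 127 + 15; // re-bias the exponent for f16
    if e >= 0x1f {
        return sign | 0x7c00; // overflow saturates to infinity
    }
    if e <= 0 {
        return sign; // flush to signed zero
    }
    let mut half = ((e as u32) << 10) | (mant >> 13);
    let rem = mant & 0x1fff; // the 13 bits dropped from the mantissa
    if rem > 0x1000 || (rem == 0x1000 && (half & 1) == 1) {
        half += 1; // a carry into the exponent is the correct rounding
    }
    sign | half as u16
}

/// Shared dequantization loop; only the `store` step knows the output type.
fn dequantize_with<T>(blocks: &[BlockQ8], store: impl Fn(f32) -> T) -> Vec<T> {
    let mut out = Vec::with_capacity(blocks.len() * 32);
    for b in blocks {
        for &q in b.qs.iter() {
            out.push(store(b.d * q as f32));
        }
    }
    out
}

/// Dequantize to f32 (the pre-existing behaviour described in this issue).
fn dequantize_f32(blocks: &[BlockQ8]) -> Vec<f32> {
    dequantize_with(blocks, |v| v)
}

/// Dequantize straight to f16 bits, with no intermediate f32 buffer.
fn dequantize_f16(blocks: &[BlockQ8]) -> Vec<u16> {
    dequantize_with(blocks, f32_to_f16_bits)
}
```

In this sketch the direct path produces the same bits as dequantizing to f32 and converting afterwards, but it never writes or re-reads a full f32 buffer, which is the extra work the thread wants to avoid.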

LaurentMazare · 2024-04-28T17:16:52Z

See #2137. I'm just going to add a bit of testing, but this should hopefully all be fine.

LaurentMazare · 2024-04-28T18:32:56Z

#2137 has been merged. I'll also put up some small changes in #2138 so that it's easier to control which version gets used.

lucasavila00 · 2024-04-28T20:24:39Z

After direct f16 dequantization we're at 1000t/s EricLBuehler/mistral.rs#238 (comment)

Thank you!

EricLBuehler · 2024-04-28T20:26:57Z

@LaurentMazare, thank you for adding this! We observe about a 60% performance increase for prompt processing.

It seems, though, that the Candle matmul kernels here are slower than the llama.cpp ones overall by about 60%, which matches our prompt processing deficit relative to llama.cpp of about 60%.
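
For reference, the relative gains implied by the approximate throughput figures quoted in this thread (~600, ~700 and 1000 t/s) can be computed as below. `speedup_percent` is a hypothetical helper, and the inputs are the rounded numbers from the comments above, so the results are only indicative.

```rust
/// Relative throughput gain of `new_tps` over `old_tps`, in percent.
/// Hypothetical helper for illustration only.
fn speedup_percent(old_tps: f64, new_tps: f64) -> f64 {
    (new_tps / old_tps - 1.0) * 100.0
}
```

With these inputs, `speedup_percent(600.0, 1000.0)` is about 67% (relative to the MMQ path) and `speedup_percent(700.0, 1000.0)` is about 43% (relative to the dequantize-then-convert path).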

lucasavila00 · 2024-04-28T20:29:48Z

I created a new issue about the different kernels #2139

lucasavila00 mentioned this issue Apr 28, 2024

Candle won't use half-gemm from cublas when doing fp16 matmul #2139

Closed

EricLBuehler closed this as completed Apr 28, 2024
