Adding direct-F16 quantization #2136

Closed
EricLBuehler opened this issue Apr 28, 2024 · 9 comments

Comments

@EricLBuehler
Contributor

Hello all,

During our work on mistral.rs, we noticed that Candle only dequantizes to F32, whereas llama.cpp can dequantize to F16. This affects performance: on certain hardware, F16 matmuls are dispatched to the Turing kernels rather than the slower Volta kernels. Are there any plans to add support for dequantizing to arbitrary floating-point datatypes in the future?

For reference, here is our tracking issue: EricLBuehler/mistral.rs#153

Thank you!

@lucasavila00
Contributor

lucasavila00 commented Apr 28, 2024

Some extra context (the numbers are from an RTX 2070):

A 512-token prompt is processed at ~600 t/s using the MMQ kernels.

If I force it to dequantize first, convert the result to f16, do the matmuls in f16, and then convert the output back to f32, I can get candle to use the same kernels llama.cpp uses for prompt processing (I think? The names are almost the same). This runs at ~700 t/s.

With this latter approach, 25% of the GPU time is spent doing the f32 -> f16 conversion. Ideally we'd dequantize directly to f16 to remove some of that workload.

This PR EricLBuehler/mistral.rs#238 implements what I described above, and it contains comparisons between llama-bench and mistralrs-bench, and nvidia profiles of both applications.

These lines of llama.cpp do the same f32 -> f16 conversion and matmul: https://github.com/ggerganov/llama.cpp/blob/master/ggml-cuda.cu#L1232-L1270, which is called from https://github.com/ggerganov/llama.cpp/blob/master/ggml-cuda.cu#L1959
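
To illustrate the two paths described above, here is a minimal CPU-side sketch in Rust. It is not Candle's or llama.cpp's code: `BlockQ8`, its f32 scale, and all function names are hypothetical, and the f16 conversion is a simplified stand-in that flushes subnormals to zero. It only shows that the direct path yields the same f16 values without materializing an intermediate f32 buffer and making a second pass over it.

```rust
/// Number of quantized values per block (assumed, Q8_0-style).
const BLOCK: usize = 32;

/// Hypothetical Q8_0-style block: one scale plus 32 signed 8-bit quants.
/// (Real formats typically store the scale as f16; f32 keeps the sketch short.)
struct BlockQ8 {
    scale: f32,
    quants: [i8; BLOCK],
}

/// Simplified f32 -> IEEE binary16 bit conversion (illustrative only):
/// round-to-nearest-even for normal values, tiny values flushed to zero,
/// overflow mapped to infinity, NaN mapped to a quiet NaN.
fn f32_to_f16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    let sign = ((bits >> 16) & 0x8000) as u16;
    let exp = ((bits >> 23) & 0xff) as i32;
    let man = bits & 0x007f_ffff;
    if exp == 0xff {
        // Infinity or NaN.
        return sign | 0x7c00 | if man != 0 { 0x0200 } else { 0 };
    }
    let e = exp - 127 + 15; // re-bias the exponent for binary16
    if e >= 0x1f {
        return sign | 0x7c00; // too large: infinity
    }
    if e <= 0 {
        return sign; // too small: flush to signed zero (simplification)
    }
    let mut half = ((e as u32) << 10) | (man >> 13);
    let rem = man & 0x1fff; // the 13 mantissa bits that get dropped
    if rem > 0x1000 || (rem == 0x1000 && (half & 1) == 1) {
        half += 1; // round to nearest, ties to even
    }
    sign | half as u16
}

/// Dequantize to f32 only (the single output type originally available).
fn dequantize_f32(blocks: &[BlockQ8]) -> Vec<f32> {
    blocks
        .iter()
        .flat_map(|b| b.quants.iter().map(move |&q| b.scale * q as f32))
        .collect()
}

/// Two-step path: full f32 buffer first, then a separate conversion pass.
fn dequantize_then_convert(blocks: &[BlockQ8]) -> Vec<u16> {
    let tmp = dequantize_f32(blocks);
    tmp.iter().map(|&v| f32_to_f16_bits(v)).collect()
}

/// Direct path: each value is written as f16 bits as soon as it is computed.
fn dequantize_f16_direct(blocks: &[BlockQ8]) -> Vec<u16> {
    blocks
        .iter()
        .flat_map(|b| {
            b.quants
                .iter()
                .map(move |&q| f32_to_f16_bits(b.scale * q as f32))
        })
        .collect()
}
```

On a GPU the same idea would mean one kernel launch and one write of half-width data instead of a dequantize kernel followed by a conversion kernel; the sketch above makes no claim about how the actual kernels are structured.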

@LaurentMazare
Collaborator

That sounds like some pretty neat speedup to get. Is it just useful for cuda or also for cpu/metal?

@EricLBuehler
Contributor Author

I think this would be an optimization for CUDA.

@LaurentMazare
Collaborator

Ok, thanks, let me have a quick look. I don't think the kernels do any float-specific magic, so the conversion shouldn't be tricky.

@LaurentMazare
Collaborator

See #2137. I'm just going to add a bit of testing, but this should hopefully all be fine.

@LaurentMazare
Collaborator

#2137 has been merged. I'll also put some small changes in #2138 so that it's easier to control which version gets used.

@lucasavila00
Contributor

After direct f16 dequantization we're at ~1000 t/s: EricLBuehler/mistral.rs#238 (comment)

Thank you!

@EricLBuehler
Contributor Author

@LaurentMazare, thank you for adding this! We observe about a 60% performance increase for prompt processing.

It seems, though, that the Candle matmul kernels here are slower overall than the llama.cpp ones by about 60%, which correlates with our prompt processing deficit relative to llama.cpp of also about 60%.
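
As a rough sanity check on the figures reported in this thread (~600 t/s with MMQ, ~700 t/s with the extra f32 -> f16 conversion, ~1000 t/s with direct f16 dequantization), the arithmetic can be worked through in a few lines of Rust. The helper names are hypothetical, and the projection assumes throughput scales inversely with GPU time and that nothing else changes, which is a simplification.

```rust
/// Relative throughput gain, in percent, when going from `before` to `after`.
fn speedup_percent(before: f64, after: f64) -> f64 {
    (after / before - 1.0) * 100.0
}

/// Projected throughput if a fraction of the GPU time were removed entirely.
/// Assumes throughput is inversely proportional to total GPU time.
fn projected_without_overhead(throughput: f64, overhead_fraction: f64) -> f64 {
    throughput / (1.0 - overhead_fraction)
}

/// Prints the comparisons for the numbers quoted in the thread.
fn report() {
    // MMQ baseline vs. direct f16 dequantization: about 66.7%.
    println!("{:.1}%", speedup_percent(600.0, 1000.0));
    // Two-step path with the 25% conversion cost removed: about 933 t/s.
    println!("{:.0} t/s", projected_without_overhead(700.0, 0.25));
}
```

Under these assumptions the 600 -> 1000 t/s change works out to roughly a 67% gain, close to the ~60% quoted above, and removing a 25% conversion overhead from the 700 t/s path projects to about 933 t/s, in the same ballpark as the observed ~1000 t/s.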

@lucasavila00
Contributor

I created a new issue about the different kernels: #2139
