
Quantized: Use cublas for prompt #238

Merged: 15 commits merged into EricLBuehler:master on May 15, 2024

Conversation

lucasavila00 (Contributor) commented on Apr 28, 2024

This PR

$ ./target/profiling/mistralrs-bench -r 5 -c 1,2,4 gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
2024-04-28T05:58:00.751771Z  INFO mistralrs_bench: avx: true, neon: false, simd128: false, f16c: true
2024-04-28T05:58:00.751790Z  INFO mistralrs_bench: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-04-28T05:58:00.751793Z  INFO mistralrs_bench: Loading model `mistralai/Mistral-7B-Instruct-v0.1` on Cuda(CudaDevice(DeviceId(1)))...
2024-04-28T05:58:00.751810Z  INFO mistralrs_bench: Model kind is: quantized from gguf (no adapters)
2024-04-28T05:58:02.469281Z  INFO mistralrs_core::pipeline::chat_template: bos_tok = <s>, eos_tok = ["</s>"], unk_tok = <unk>
2024-04-28T05:58:02.499735Z  INFO mistralrs_bench: Model loaded.
+------------------------------------+---------+--------+----------------+--------------+-------------+--------------+
| model                              | backend | test   | t/s            | ms/t         | concurrency | throughput/s |
+------------------------------------+---------+--------+----------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | tg 128 |   57.762±0.583 | 17.314±0.176 |           1 |    57.762444 |
+------------------------------------+---------+--------+----------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 719.654±10.389 |  1.390±0.020 |           1 |     719.6544 |
+------------------------------------+---------+--------+----------------+--------------+-------------+--------------+
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+
| model                              | backend | test   | t/s           | ms/t         | concurrency | throughput/s |
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | tg 128 |  46.236±1.463 | 21.650±0.692 |           2 |      92.4723 |
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 458.195±3.664 |  2.183±0.017 |           2 |    916.38995 |
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+
| model                              | backend | test   | t/s           | ms/t         | concurrency | throughput/s |
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | tg 128 |  29.450±0.103 | 33.956±0.118 |           4 |    117.80009 |
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 260.459±1.187 |  3.839±0.018 |           4 |    1041.8367 |
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+

Master

$ ./target/profiling/mistralrs-bench -r 5 -c 1,2,4  gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
2024-04-28T06:00:24.120633Z  INFO mistralrs_bench: avx: true, neon: false, simd128: false, f16c: true
2024-04-28T06:00:24.120654Z  INFO mistralrs_bench: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-04-28T06:00:24.120657Z  INFO mistralrs_bench: Loading model `mistralai/Mistral-7B-Instruct-v0.1` on Cuda(CudaDevice(DeviceId(1)))...
2024-04-28T06:00:24.120671Z  INFO mistralrs_bench: Model kind is: quantized from gguf (no adapters)
2024-04-28T06:00:25.850501Z  INFO mistralrs_core::pipeline::chat_template: bos_tok = <s>, eos_tok = ["</s>"], unk_tok = <unk>
2024-04-28T06:00:25.882085Z  INFO mistralrs_bench: Model loaded.
+------------------------------------+---------+--------+----------------+--------------+-------------+--------------+
| model                              | backend | test   | t/s            | ms/t         | concurrency | throughput/s |
+------------------------------------+---------+--------+----------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | tg 128 |   58.091±0.919 | 17.219±0.274 |           1 |     58.09086 |
+------------------------------------+---------+--------+----------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 620.491±10.625 |  1.612±0.028 |           1 |     620.4911 |
+------------------------------------+---------+--------+----------------+--------------+-------------+--------------+
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+
| model                              | backend | test   | t/s           | ms/t         | concurrency | throughput/s |
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | tg 128 |  47.455±0.311 | 21.073±0.138 |           2 |     94.91029 |
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 328.383±1.779 |  3.045±0.017 |           2 |     656.7665 |
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+
| model                              | backend | test   | t/s           | ms/t         | concurrency | throughput/s |
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | tg 128 |  29.232±0.055 | 34.209±0.064 |           4 |   116.927444 |
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 163.163±0.807 |  6.129±0.030 |           4 |     652.6506 |
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+

Llama.cpp

$ /home/lucas/oss/llama.cpp/llama-bench  -m /home/lucas/.cache/huggingface/hub/models--TheBloke--Mistral-7B-Instruct-v0.2-GGUF/snapshots/3a6fbf4a41a1d52e415a4958cde6856d34b2db93/mistral-7b-instruct-v0.2.Q4_K_M.gguf -n 0 -p 512 -r 1 -b 512
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
| model                          |       size |     params | backend    | ngl |    n_batch | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------- | ---------------: |
| llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | CUDA       |  99 |        512 | pp 512     |   1747.07 ± 0.00 |

build: 7593639c (2679)

Honestly, I don't think it's worth merging just for this small win. This breaks usage for GPUs that don't support F16...

But it's an improvement...
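
(A minimal sketch of the approach the PR title describes, not the actual change set; dequantize_f16 below is a stand-in name, not a confirmed candle API. The idea is to route prompt-sized matmuls through a dense F16 GEMM, which cuBLAS executes on CUDA, while keeping the fused quantized kernel for single-token decoding.)

use candle_core::quantized::QMatMul;
use candle_core::{DType, Module, Result, Tensor};

fn qmatmul_via_f16(w: &QMatMul, x: &Tensor, via_f16: bool) -> Result<Tensor> {
    if via_f16 {
        // Prompt path: dequantize the weights, cast activations to F16, and
        // let the dense GEMM (cuBLAS on CUDA) do the heavy lifting.
        let w_f16 = w.dequantize_f16()?; // hypothetical helper, see note above
        let x_f16 = x.to_dtype(DType::F16)?;
        x_f16.broadcast_matmul(&w_f16.t()?)?.to_dtype(x.dtype())
    } else {
        // Decode path: the fused quantized matvec kernel is still faster for a single token.
        w.forward(x)
    }
}

(Decoding multiplies one activation row against the weights, so the dequantize-then-GEMM detour does not pay off there; prefill multiplies hundreds of rows at once, which is where cuBLAS wins.)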

github-actions bot commented on Apr 28, 2024

Code Metrics Report
  ===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 Dockerfile              1           34           25            0            9
 Happy                   1          442          369            0           73
 JSON                    5            9            9            0            0
 Python                 21          741          622           21           98
 TOML                   16          419          378            1           40
-------------------------------------------------------------------------------
 Jupyter Notebooks       1            0            0            0            0
 |- Markdown             1           60           30           22            8
 |- Python               1           96           87            1            8
 (Total)                            156          117           23           16
-------------------------------------------------------------------------------
 Markdown               16         1026            0          758          268
 |- BASH                 6          205          192            0           13
 |- Python               6          121          110            0           11
 |- Rust                 3          185          172            9            4
 (Total)                           1537          474          767          296
-------------------------------------------------------------------------------
 Rust                   81        26376        24282          334         1760
 |- Markdown            38          359            0          354            5
 (Total)                          26735        24282          688         1765
===============================================================================
 Total                 143        29047        25685         1114         2248
===============================================================================
  

lucasavila00 (Contributor, Author) commented:

mistral.rs

[image]

llama.cpp

[image]

lucasavila00 changed the title from "Use cublas for prompt" to "Quantized: Use cublas for prompt" on Apr 28, 2024
lucasavila00 (Contributor, Author) commented:

After direct f16 dequant:

This PR

+------------------------------------+---------+--------+----------------+-------------+-------------+--------------+
| model                              | backend | test   | t/s            | ms/t        | concurrency | throughput/s |
+------------------------------------+---------+--------+----------------+-------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 1016.721±6.399 | 0.984±0.006 |           1 |    1016.7206 |
+------------------------------------+---------+--------+----------------+-------------+-------------+--------------+

Master

+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+
| model                              | backend | test   | t/s           | ms/t        | concurrency | throughput/s |
+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 614.365±2.955 | 1.628±0.008 |           1 |     614.3652 |
+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+

[image]
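
(For reference, a rough sketch of what "direct f16 dequant" refers to, assuming the quantized tensor offers both paths; QTensor::dequantize exists in candle, while dequantize_f16 here is a hypothetical direct-to-F16 variant. The win is skipping the intermediate F32 copy of the weights before the GEMM.)

use candle_core::quantized::QTensor;
use candle_core::{DType, Device, Result, Tensor};

fn dequant_paths(q: &QTensor, device: &Device) -> Result<(Tensor, Tensor)> {
    // Indirect path: materialize the full F32 weight matrix, then cast it.
    let indirect = q.dequantize(device)?.to_dtype(DType::F16)?;
    // Direct path (hypothetical name): write F16 straight from the quantized blocks.
    let direct = q.dequantize_f16(device)?;
    Ok((indirect, direct))
}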

EricLBuehler (Owner) commented:

@lucasavila00, that's amazing! I am looking forward to merging this.

Can you please add the special matmul function to the layer.rs file so we can use it in all models? Ideally, we can also implement it in QLinear so ISQ can benefit, too.
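
(A rough sketch of the kind of shared wrapper being requested; the type and method names are assumptions, not the actual layer.rs or QLinear API in mistral.rs. Keeping the routing in one layer would let GGUF models and ISQ-quantized models pick up the same prompt-time cuBLAS path.)

use candle_core::quantized::QMatMul;
use candle_core::{DType, Module, Result, Tensor};

pub struct QLinearSketch {
    inner: QMatMul,
    bias: Option<Tensor>,
}

impl QLinearSketch {
    pub fn forward(&self, x: &Tensor, via_f16: bool) -> Result<Tensor> {
        let y = if via_f16 {
            // Prompt path: dense F16 GEMM via cuBLAS on CUDA.
            let w = self.inner.dequantize_f16()?; // hypothetical helper
            x.to_dtype(DType::F16)?
                .broadcast_matmul(&w.t()?)?
                .to_dtype(x.dtype())?
        } else {
            // Decode path: fused quantized kernel.
            self.inner.forward(x)?
        };
        match &self.bias {
            Some(b) => y.broadcast_add(b),
            None => Ok(y),
        }
    }
}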

lucasavila00 (Contributor, Author) commented:

I need to integrate these candle changes: huggingface/candle#2141

Then this PR should be ready.

I'll try to do it tonight.

lucasavila00 (Contributor, Author) commented:

@EricLBuehler would you mind updating your fork? It does not include the precision changes.

EricLBuehler (Owner) commented:

@lucasavila00 sure, I just updated it.

lucasavila00 marked this pull request as ready for review on April 29, 2024, 22:42
lucasavila00 (Contributor, Author) commented:

I did not implement it for the other models. I can try to do it later in another PR.

I don't have the bandwidth to work with different models and setups today.

EricLBuehler (Owner) left a review:

Thanks for implementing this, I'm looking forward to merging it. Once I do, I'll implement the F16 gemm for the rest of the models.

mistralrs-core/src/lib.rs (review thread, resolved)
    None
} else {
    Some(self.mask(seq_len, x.device())?)
};

let via_f16 = if is_prompt { seq_len > 32 } else { false };
EricLBuehler (Owner) suggested a change:

- let via_f16 = if is_prompt { seq_len > 32 } else { false };
+ let via_f16 = is_prompt && seq_len > 32;

EricLBuehler (Owner) commented:

Hi @lucasavila00, I'm looking forward to merging this PR!

I left one requested change, and there is a conflict which should be pretty easy to resolve. Please let me know if you need any help.

lucasavila00 (Contributor, Author) commented:

> Hi @lucasavila00, I'm looking forward to merging this PR!
>
> I left one requested change, and there is a conflict which should be pretty easy to resolve. Please let me know if you need any help.

There were changes to the is_prompt bool I was using to decide which version to use. I changed it to a different heuristic but I'm not sure it's optimal.

I'm sorry but I don't have time to benchmark it soon.

EricLBuehler (Owner) commented:

Can we check if seq_len>1? I think that would be pretty reliable.
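
(A minimal sketch of that heuristic: prefill pushes the whole prompt through in one pass, so seq_len > 1, while decoding handles one token at a time, so seq_len == 1.)

fn use_f16_gemm(seq_len: usize) -> bool {
    // More than one token in the batch means we are in the prompt/prefill
    // phase, where the dense F16 GEMM path is worthwhile.
    seq_len > 1
}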

EricLBuehler (Owner) left a review:

This looks good already. If I understand correctly, the Candle backend already doesn't support GPUs which don't support F16/BF16, as there is no gating on those dtypes, so I think this is good to merge. If in the future we decide to add more compat for those devices, it can be easily changed here.

I think this is ready to merge.

mistralrs-core/src/models/quantized_llama.rs (review thread, outdated, resolved)
EricLBuehler merged commit 2fcb106 into EricLBuehler:master on May 15, 2024. 11 checks passed.
EricLBuehler (Owner) commented:

@lucasavila00 thank you!
