
Quantized: Use cublas for prompt #238

Merged: 15 commits merged into EricLBuehler:master on May 15, 2024

Conversation

lucasavila00 (Contributor) commented on Apr 28, 2024

This PR

$ ./target/profiling/mistralrs-bench -r 5 -c 1,2,4 gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
2024-04-28T05:58:00.751771Z  INFO mistralrs_bench: avx: true, neon: false, simd128: false, f16c: true
2024-04-28T05:58:00.751790Z  INFO mistralrs_bench: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-04-28T05:58:00.751793Z  INFO mistralrs_bench: Loading model `mistralai/Mistral-7B-Instruct-v0.1` on Cuda(CudaDevice(DeviceId(1)))...
2024-04-28T05:58:00.751810Z  INFO mistralrs_bench: Model kind is: quantized from gguf (no adapters)
2024-04-28T05:58:02.469281Z  INFO mistralrs_core::pipeline::chat_template: bos_tok = <s>, eos_tok = ["</s>"], unk_tok = <unk>
2024-04-28T05:58:02.499735Z  INFO mistralrs_bench: Model loaded.
+------------------------------------+---------+--------+----------------+--------------+-------------+--------------+
| model                              | backend | test   | t/s            | ms/t         | concurrency | throughput/s |
+------------------------------------+---------+--------+----------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | tg 128 |   57.762±0.583 | 17.314±0.176 |           1 |    57.762444 |
+------------------------------------+---------+--------+----------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 719.654±10.389 |  1.390±0.020 |           1 |     719.6544 |
+------------------------------------+---------+--------+----------------+--------------+-------------+--------------+
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+
| model                              | backend | test   | t/s           | ms/t         | concurrency | throughput/s |
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | tg 128 |  46.236±1.463 | 21.650±0.692 |           2 |      92.4723 |
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 458.195±3.664 |  2.183±0.017 |           2 |    916.38995 |
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+
| model                              | backend | test   | t/s           | ms/t         | concurrency | throughput/s |
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | tg 128 |  29.450±0.103 | 33.956±0.118 |           4 |    117.80009 |
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 260.459±1.187 |  3.839±0.018 |           4 |    1041.8367 |
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+

Master

$ ./target/profiling/mistralrs-bench -r 5 -c 1,2,4  gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
2024-04-28T06:00:24.120633Z  INFO mistralrs_bench: avx: true, neon: false, simd128: false, f16c: true
2024-04-28T06:00:24.120654Z  INFO mistralrs_bench: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-04-28T06:00:24.120657Z  INFO mistralrs_bench: Loading model `mistralai/Mistral-7B-Instruct-v0.1` on Cuda(CudaDevice(DeviceId(1)))...
2024-04-28T06:00:24.120671Z  INFO mistralrs_bench: Model kind is: quantized from gguf (no adapters)
2024-04-28T06:00:25.850501Z  INFO mistralrs_core::pipeline::chat_template: bos_tok = <s>, eos_tok = ["</s>"], unk_tok = <unk>
2024-04-28T06:00:25.882085Z  INFO mistralrs_bench: Model loaded.
+------------------------------------+---------+--------+----------------+--------------+-------------+--------------+
| model                              | backend | test   | t/s            | ms/t         | concurrency | throughput/s |
+------------------------------------+---------+--------+----------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | tg 128 |   58.091±0.919 | 17.219±0.274 |           1 |     58.09086 |
+------------------------------------+---------+--------+----------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 620.491±10.625 |  1.612±0.028 |           1 |     620.4911 |
+------------------------------------+---------+--------+----------------+--------------+-------------+--------------+
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+
| model                              | backend | test   | t/s           | ms/t         | concurrency | throughput/s |
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | tg 128 |  47.455±0.311 | 21.073±0.138 |           2 |     94.91029 |
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 328.383±1.779 |  3.045±0.017 |           2 |     656.7665 |
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+
| model                              | backend | test   | t/s           | ms/t         | concurrency | throughput/s |
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | tg 128 |  29.232±0.055 | 34.209±0.064 |           4 |   116.927444 |
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 163.163±0.807 |  6.129±0.030 |           4 |     652.6506 |
+------------------------------------+---------+--------+---------------+--------------+-------------+--------------+

Llama.cpp

$ /home/lucas/oss/llama.cpp/llama-bench  -m /home/lucas/.cache/huggingface/hub/models--TheBloke--Mistral-7B-Instruct-v0.2-GGUF/snapshots/3a6fbf4a41a1d52e415a4958cde6856d34b2db93/mistral-7b-instruct-v0.2.Q4_K_M.gguf -n 0 -p 512 -r 1 -b 512
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
| model                          |       size |     params | backend    | ngl |    n_batch | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------- | ---------------: |
| llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | CUDA       |  99 |        512 | pp 512     |   1747.07 ± 0.00 |

build: 7593639c (2679)

Honestly, I don't think it's worth merging just for this small win. This breaks usage for GPUs that don't support F16...

But it's an improvement...
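
(A minimal sketch of the approach the PR title describes, not the actual change set; dequantize_f16 below is a stand-in name, not a confirmed candle API. The idea is to route prompt-sized matmuls through a dense F16 GEMM, which cuBLAS executes on CUDA, while keeping the fused quantized kernel for single-token decoding.)

use candle_core::quantized::QMatMul;
use candle_core::{DType, Module, Result, Tensor};

fn qmatmul_via_f16(w: &QMatMul, x: &Tensor, via_f16: bool) -> Result<Tensor> {
    if via_f16 {
        // Prompt path: dequantize the weights, cast activations to F16, and
        // let the dense GEMM (cuBLAS on CUDA) do the heavy lifting.
        let w_f16 = w.dequantize_f16()?; // hypothetical helper, see note above
        let x_f16 = x.to_dtype(DType::F16)?;
        x_f16.broadcast_matmul(&w_f16.t()?)?.to_dtype(x.dtype())
    } else {
        // Decode path: the fused quantized matvec kernel is still faster for a single token.
        w.forward(x)
    }
}

(Decoding multiplies one activation row against the weights, so the dequantize-then-GEMM detour does not pay off there; prefill multiplies hundreds of rows at once, which is where cuBLAS wins.)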

github-actions bot commented on Apr 28, 2024

Code Metrics Report
  ===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 Dockerfile              1           34           25            0            9
 Happy                   1          442          369            0           73
 JSON                    5            9            9            0            0
 Python                 21          741          622           21           98
 TOML                   16          419          378            1           40
-------------------------------------------------------------------------------
 Jupyter Notebooks       1            0            0            0            0
 |- Markdown             1           60           30           22            8
 |- Python               1           96           87            1            8
 (Total)                            156          117           23           16
-------------------------------------------------------------------------------
 Markdown               16         1026            0          758          268
 |- BASH                 6          205          192            0           13
 |- Python               6          121          110            0           11
 |- Rust                 3          185          172            9            4
 (Total)                           1537          474          767          296
-------------------------------------------------------------------------------
 Rust                   81        26376        24282          334         1760
 |- Markdown            38          359            0          354            5
 (Total)                          26735        24282          688         1765
===============================================================================
 Total                 143        29047        25685         1114         2248
===============================================================================
  

lucasavila00 (Contributor, Author) commented:

mistral.rs

[image]

llama.cpp

[image]

lucasavila00 changed the title from "Use cublas for prompt" to "Quantized: Use cublas for prompt" on Apr 28, 2024
lucasavila00 (Contributor, Author) commented:

After direct f16 dequant:

This PR

+------------------------------------+---------+--------+----------------+-------------+-------------+--------------+
| model                              | backend | test   | t/s            | ms/t        | concurrency | throughput/s |
+------------------------------------+---------+--------+----------------+-------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 1016.721±6.399 | 0.984±0.006 |           1 |    1016.7206 |
+------------------------------------+---------+--------+----------------+-------------+-------------+--------------+

Master

+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+
| model                              | backend | test   | t/s           | ms/t        | concurrency | throughput/s |
+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | pp 512 | 614.365±2.955 | 1.628±0.008 |           1 |     614.3652 |
+------------------------------------+---------+--------+---------------+-------------+-------------+--------------+

[image]
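
(For reference, a rough sketch of what "direct f16 dequant" refers to, assuming the quantized tensor offers both paths; QTensor::dequantize exists in candle, while dequantize_f16 here is a hypothetical direct-to-F16 variant. The win is skipping the intermediate F32 copy of the weights before the GEMM.)

use candle_core::quantized::QTensor;
use candle_core::{DType, Device, Result, Tensor};

fn dequant_paths(q: &QTensor, device: &Device) -> Result<(Tensor, Tensor)> {
    // Indirect path: materialize the full F32 weight matrix, then cast it.
    let indirect = q.dequantize(device)?.to_dtype(DType::F16)?;
    // Direct path (hypothetical name): write F16 straight from the quantized blocks.
    let direct = q.dequantize_f16(device)?;
    Ok((indirect, direct))
}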

EricLBuehler (Owner) commented:

@lucasavila00, that's amazing! I am looking forward to merging this.

Can you please add the special matmul function to the layer.rs file so we can use it in all models? Ideally, we can also implement it in QLinear so ISQ can benefit, too.
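
(A rough sketch of the kind of shared wrapper being requested; the type and method names are assumptions, not the actual layer.rs or QLinear API in mistral.rs. Keeping the routing in one layer would let GGUF models and ISQ-quantized models pick up the same prompt-time cuBLAS path.)

use candle_core::quantized::QMatMul;
use candle_core::{DType, Module, Result, Tensor};

pub struct QLinearSketch {
    inner: QMatMul,
    bias: Option<Tensor>,
}

impl QLinearSketch {
    pub fn forward(&self, x: &Tensor, via_f16: bool) -> Result<Tensor> {
        let y = if via_f16 {
            // Prompt path: dense F16 GEMM via cuBLAS on CUDA.
            let w = self.inner.dequantize_f16()?; // hypothetical helper
            x.to_dtype(DType::F16)?
                .broadcast_matmul(&w.t()?)?
                .to_dtype(x.dtype())?
        } else {
            // Decode path: fused quantized kernel.
            self.inner.forward(x)?
        };
        match &self.bias {
            Some(b) => y.broadcast_add(b),
            None => Ok(y),
        }
    }
}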

lucasavila00 (Contributor, Author) commented:

I need to integrate these candle changes: huggingface/candle#2141

Then this PR should be ready.

I'll try to do it tonight.

lucasavila00 (Contributor, Author) commented:

@EricLBuehler would you mind updating your fork? It does not include the precision changes.

EricLBuehler (Owner) commented:

@lucasavila00 sure, I just updated it.

lucasavila00 marked this pull request as ready for review on April 29, 2024, 22:42
lucasavila00 (Contributor, Author) commented:

I did not implement it for the other models. I can try to do it later in another PR.

I don't have the bandwidth to work with different models and setups today.

EricLBuehler (Owner) left a review:

Thanks for implementing this, I'm looking forward to merging it. Once I do, I'll implement the F16 gemm for the rest of the models.

mistralrs-core/src/lib.rs (review thread, resolved)
    None
} else {
    Some(self.mask(seq_len, x.device())?)
};

let via_f16 = if is_prompt { seq_len > 32 } else { false };
EricLBuehler (Owner) suggested a change:

- let via_f16 = if is_prompt { seq_len > 32 } else { false };
+ let via_f16 = is_prompt && seq_len > 32;

EricLBuehler (Owner) commented:

Hi @lucasavila00, I'm looking forward to merging this PR!

I left one requested change, and there is a conflict which should be pretty easy to resolve. Please let me know if you need any help.

lucasavila00 (Contributor, Author) commented:

> Hi @lucasavila00, I'm looking forward to merging this PR!
>
> I left one requested change, and there is a conflict which should be pretty easy to resolve. Please let me know if you need any help.

There were changes to the is_prompt bool I was using to decide which version to use. I changed it to a different heuristic but I'm not sure it's optimal.

I'm sorry but I don't have time to benchmark it soon.

EricLBuehler (Owner) commented:

Can we check if seq_len>1? I think that would be pretty reliable.
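
(A minimal sketch of that heuristic: prefill pushes the whole prompt through in one pass, so seq_len > 1, while decoding handles one token at a time, so seq_len == 1.)

fn use_f16_gemm(seq_len: usize) -> bool {
    // More than one token in the batch means we are in the prompt/prefill
    // phase, where the dense F16 GEMM path is worthwhile.
    seq_len > 1
}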

EricLBuehler (Owner) left a review:

This looks good already. If I understand correctly, the Candle backend already doesn't support GPUs which don't support F16/BF16, as there is no gating on those dtypes, so I think this is good to merge. If in the future we decide to add more compat for those devices, it can be easily changed here.

I think this is ready to merge.

mistralrs-core/src/models/quantized_llama.rs (review thread, outdated, resolved)
EricLBuehler merged commit 2fcb106 into EricLBuehler:master on May 15, 2024. 11 checks passed.
EricLBuehler (Owner) commented:

@lucasavila00 thank you!
