Optimize quantized masked fill #162

lucasavila00 · 2024-04-16T22:24:39Z

Closes #161

This MR:

Prompt T/s = 70.70707
Completion T/s = 61.244022

Master:

Prompt T/s = 59.82906
Completion T/s = 61.538464
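
For a sense of scale, the figures above correspond to a prompt-throughput gain of roughly 18%, with completion throughput within about half a percent of master. A minimal sketch of that arithmetic follows; `pct_change` is a hypothetical helper written for illustration, not part of mistral.rs:

```rust
/// Relative change from `baseline` to `new`, expressed in percent.
/// Hypothetical helper for illustration only.
fn pct_change(baseline: f64, new: f64) -> f64 {
    (new - baseline) / baseline * 100.0
}

fn main() {
    // Throughput figures (tokens per second) copied from the PR description.
    let prompt = pct_change(59.82906, 70.70707);
    let completion = pct_change(61.538464, 61.244022);
    println!("prompt: {prompt:+.1}%, completion: {completion:+.1}%");
}
```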

github-actions · 2024-04-16T22:25:00Z

Code Metrics Report

───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Rust                        60     19995     1439       821    17735       1128
───────────────────────────────────────────────────────────────────────────────
Total                       60     19995     1439       821    17735       1128
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop 53,182
Estimated Schedule Effort 10.982569 months
Estimated People Required 4.474866
───────────────────────────────────────────────────────────────────────────────
Processed 676190 bytes, 0.676 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────

lucasavila00 · 2024-04-16T22:28:52Z

It reduces completion speed too much.

I wonder whether that's due to bad measurements or a real regression. I'm profiling it.

lucasavila00 · 2024-04-16T22:48:17Z

It was a measurement issue, fixed by #163

EricLBuehler · 2024-04-16T23:09:19Z

@lucasavila00, I just merged #163. Can you please update the benchmarks?

lucasavila00 · 2024-04-16T23:14:17Z

@EricLBuehler I just did.

I'm having a hard time measuring small changes with any statistical significance.

I ran it a bunch of times and completion speed is unchanged. Prompt speed improved.

I think we need something like https://github.com/ggerganov/llama.cpp/blob/master/examples/llama-bench/README.md instead of the --prompt setup to benchmark.

I'm also using my local GPU, which has bad cooling and other processes using it, etc.

The llama-bench setup runs a bunch of repetitions to remove this noise.

I created an issue for it: #164
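
The repetition idea can be sketched roughly as follows: run the workload several times, discard a few warm-up runs, and report the mean and sample standard deviation of tokens per second, so that a small change can be compared against run-to-run noise. This is only an illustration under stated assumptions. `bench`, `summarize`, and `BenchStats` are hypothetical names, and the closure is a stand-in for a real prompt or completion pass; none of this is the llama-bench or mistral.rs implementation.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Summary statistics over repeated throughput measurements (tokens/s).
#[derive(Debug)]
struct BenchStats {
    mean: f64,
    std_dev: f64,
}

/// Mean and sample standard deviation of `samples`.
/// Assumes `samples` is non-empty.
fn summarize(samples: &[f64]) -> BenchStats {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let variance = if samples.len() > 1 {
        samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0)
    } else {
        0.0
    };
    BenchStats { mean, std_dev: variance.sqrt() }
}

/// Runs `workload` `warmup` times untimed, then `reps` times timed.
/// `workload` returns the number of tokens it processed.
fn bench<F: FnMut() -> usize>(mut workload: F, warmup: usize, reps: usize) -> BenchStats {
    for _ in 0..warmup {
        workload();
    }
    let samples: Vec<f64> = (0..reps)
        .map(|_| {
            let start = Instant::now();
            let tokens = workload();
            tokens as f64 / start.elapsed().as_secs_f64()
        })
        .collect();
    summarize(&samples)
}

fn main() {
    // Hypothetical stand-in for a model forward pass that handles 128 tokens.
    let fake_pass = || {
        let mut acc = 0u64;
        for i in 0..1_000_000u64 {
            acc = acc.wrapping_add(black_box(i).wrapping_mul(i));
        }
        black_box(acc);
        128usize
    };
    let stats = bench(fake_pass, 2, 5);
    println!("{:.2} ± {:.2} T/s", stats.mean, stats.std_dev);
}
```

If the difference between two builds is smaller than the reported standard deviation, more repetitions or a quieter machine are probably needed before drawing a conclusion.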

EricLBuehler · 2024-04-16T23:45:57Z

That is a great idea. I think we should add that so we can measure future improvements. I'll merge this, as I think the performance gains are very significant!

Optimize quantized masked fill

ee07106

lucasavila00 marked this pull request as draft April 16, 2024 22:28

lucasavila00 marked this pull request as ready for review April 16, 2024 22:48

lucasavila00 mentioned this pull request Apr 16, 2024

Measure prompt time after sampling #163

Merged

lucasavila00 marked this pull request as draft April 16, 2024 23:07

EricLBuehler added the optimization label Apr 16, 2024

lucasavila00 marked this pull request as ready for review April 16, 2024 23:12

lucasavila00 mentioned this pull request Apr 16, 2024

Infra: Create mistralrs-bench #164

Closed

EricLBuehler merged commit e309168 into EricLBuehler:master Apr 16, 2024
11 checks passed
