Presentation on llama.cpp on 22.02.24 at LIPS #5551
-
Thanks for sharing your WIP slides. I can give you some feedback on them, although you don't have much time left before presenting them in a few days.

The conference seems to focus on applied LLM usage; your presentation is on the layer below that. It looks to me like you have already worked on making the concepts and text easy to understand for a more general audience. On slide 2 there is CUDA, which a few people in the audience might not have been exposed to. Adding something like "parallel processing on NVIDIA devices" would help them.

Slide 4: Weight Quantization
Slide 6: Matrix shape matters
Slide 12: Problems With Quantized Blocks
Slide 14: llama.cpp int8 Tensor Cores

There seems to be some topic mixing going on across these slides, and I'm not sure the audience will easily see one of the main achievements here. The previous slides were all about speed; then suddenly numerical precision becomes important, and the slides go back to a speed comparison before finally highlighting precision again. Maybe inserting an additional diagram around slide 12 would help, to illustrate the mentioned small-block and big-block performance, and another one for the PPL or KV comparison of the results. Based on those you can make it clear that the target is big-block performance with small-block PPL. The solution is then built up on slides 13 & 14. Afterwards you can show the achievement, performance vs. PPL of the methods, again.

Slide 20: Appendix: Memory Scaling

Slides 20 and 21 remind me that there were multiple changes to the yielding code that had an impact on performance. Depending on the CPU and the yielding code, there was on average 10% better memory and thread scaling, if I remember correctly. There is probably not enough time left to experiment with that just to maybe get a slightly nicer graph for the presentation.

In general there is a strong contrast between the bullet-point text size and the graph text size on most of the slides. If the slide text needs to be that big for the audience to read it, then the audience will be unable to read the diagrams. Assuming that the audience will be perfectly capable of reading the diagram text: you could reduce the size of the slide text a bit and increase the spacing between the individual bullet points. That makes the bullets easier to read and gives less of an impression that they're crowding out the graph text.
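In case it helps when drawing that diagram: here is a minimal sketch of what block quantization looks like, assuming a symmetric int8 scheme loosely in the spirit of llama.cpp's Q8_0 (the struct name, BLOCK_SIZE, and layout are illustrative, not the actual llama.cpp code). The block size is the knob behind the trade-off: smaller blocks mean more scales per weight (better PPL, more storage and scale-handling overhead), while bigger blocks mean fewer scales (faster, worse PPL).

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 32 /* try 16 vs. 64 to move along the PPL/speed trade-off */

typedef struct {
    float  scale;          /* one fp32 scale shared by the whole block */
    int8_t q[BLOCK_SIZE];  /* quantized values in [-127, 127] */
} block_q8;

/* Symmetric quantization: pick the scale from the largest magnitude in
 * the block, then round every value to int8. Smaller blocks track local
 * magnitudes more closely, which is where the precision gain comes from. */
static void quantize_block(const float *x, block_q8 *b) {
    float amax = 0.0f;
    for (int i = 0; i < BLOCK_SIZE; i++) {
        const float a = fabsf(x[i]);
        if (a > amax) amax = a;
    }
    b->scale = amax / 127.0f;
    const float inv = b->scale != 0.0f ? 1.0f / b->scale : 0.0f;
    for (int i = 0; i < BLOCK_SIZE; i++) {
        b->q[i] = (int8_t)roundf(x[i] * inv);
    }
}

int main(void) {
    float x[BLOCK_SIZE];
    for (int i = 0; i < BLOCK_SIZE; i++) {
        x[i] = sinf((float)i); /* arbitrary test data */
    }

    block_q8 b;
    quantize_block(x, &b);

    float max_err = 0.0f;
    for (int i = 0; i < BLOCK_SIZE; i++) {
        const float err = fabsf(x[i] - b.scale * (float)b.q[i]);
        if (err > max_err) max_err = err;
    }
    printf("max round-trip error: %g\n", max_err); /* bounded by scale/2 */
    return 0;
}
```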
-
The graph on slide 17 looks like a duplicate of the one on slide 7. I'm not sure if it's supposed to be the same or to show something different.
-
This is so useful. I already knew that llama.cpp can do multiplications on some quantized values without dequantizing them, but I couldn't really understand how it's done. It's now clear thanks to your presentation. Thank you!
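For anyone else wondering about the same thing: the core trick, as I understand it from the slides, is that the multiply-adds stay in integer arithmetic and the per-block float scales are applied only once at the end. A minimal self-contained sketch, with an illustrative block layout rather than the actual llama.cpp structs:

```c
#include <stdint.h>
#include <stdio.h>

#define QK 32 /* values per block; purely illustrative */

typedef struct {
    float  scale; /* per-block fp32 scale */
    int8_t q[QK]; /* quantized int8 values */
} blk_t;

/* Dot product of two quantized blocks without dequantizing them first:
 * all multiply-adds run in integer arithmetic (the part that maps onto
 * int8 SIMD or Tensor Core instructions), and the two float scales are
 * applied exactly once per block at the end. */
static float vec_dot(const blk_t *a, const blk_t *b) {
    int32_t isum = 0;
    for (int i = 0; i < QK; i++) {
        isum += (int32_t)a->q[i] * (int32_t)b->q[i];
    }
    return (float)isum * a->scale * b->scale;
}

int main(void) {
    blk_t a = { .scale = 0.01f };
    blk_t b = { .scale = 0.02f };
    for (int i = 0; i < QK; i++) {
        a.q[i] = (int8_t)(i - 16);
        b.q[i] = (int8_t)(16 - i);
    }
    printf("dot = %f\n", vec_dot(&a, &b));
    return 0;
}
```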
-
I will give a presentation on llama.cpp at the First Large Language Models in Physics Symposium on the 22nd of February. The title of the presentation is "Efficient Matrix Multiplication Algorithms for Quantized Language Models" with the following abstract:
As the abstract implies, the talk will be about my work on llama.cpp regarding performance optimizations for matrix multiplication algorithms. Registration for in-person attendance is already closed, but it is still possible to register for attendance via Zoom if someone wants to.
This is the first draft of the slides that I will be using:
Efficient_Mat_Mul.pdf
Edit: second draft:
Efficient_Mat_Mul.pdf