Presentation on llama.cpp on 22.02.24 at LIPS #5551
-
Thanks for sharing your WIP slides. I can give you some feedback on them, although you don't have much time left before presenting them in a few days.

The conference seems to focus on applied LLM usage; your presentation is on the layer below that. It looks to me like you have already worked on making the concepts and text easy to understand for a more general audience. On slide 2 there is CUDA, which a few people in the audience might not have been exposed to. Adding something like "parallel processing on NVIDIA devices" would help them.

Slide 4: Weight Quantization
Slide 6: Matrix shape matters
Slide 12: Problems With Quantized Blocks
Slide 14: llama.cpp int8 Tensor Cores

There seems to be some topic mixing going on across these slides, and I'm not sure the audience will easily see one of the main achievements here. The previous slides were all about speed; then suddenly numerical precision becomes important, and the slides go back to a speed comparison before finally highlighting precision again. Maybe inserting an additional diagram around slide 12 would help, to illustrate the mentioned small-block and big-block performance, and another one for the PPL or KV comparison of the results. Based on those you can make it clear that the target is big-block performance with small-block PPL. The solution is then built up on slides 13 & 14. Afterwards you can show the achievement, performance vs. PPL of the methods, again.

Slide 20: Appendix: Memory Scaling

Slides 20 and 21 remind me that there were multiple changes to the yielding code that had an impact on performance. Depending on the CPU and the yielding code, there was on average 10% better memory and thread scaling, if I remember correctly. There is probably not enough time left to experiment with that just to maybe get a slightly nicer graph for the presentation.

In general there is a strong contrast between the bullet-point text size and the graph text size on most of the slides. If the slide text needs to be that big for the audience to read it, then the audience will be unable to read the diagrams. Assuming that the audience will be perfectly capable of reading the diagram text: you could reduce the size of the slide text a bit and increase the spacing between the individual bullet points. That makes the bullets easier to read and gives less of an impression that they're crowding out the graph text.
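In case it helps when drawing that diagram: here is a minimal sketch of what block quantization looks like, assuming a symmetric int8 scheme loosely in the spirit of llama.cpp's Q8_0 (the struct name, BLOCK_SIZE, and layout are illustrative, not the actual llama.cpp code). The block size is the knob behind the trade-off: smaller blocks mean more scales per weight (better PPL, more storage and scale-handling overhead), while bigger blocks mean fewer scales (faster, worse PPL).

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 32 /* try 16 vs. 64 to move along the PPL/speed trade-off */

typedef struct {
    float  scale;          /* one fp32 scale shared by the whole block */
    int8_t q[BLOCK_SIZE];  /* quantized values in [-127, 127] */
} block_q8;

/* Symmetric quantization: pick the scale from the largest magnitude in
 * the block, then round every value to int8. Smaller blocks track local
 * magnitudes more closely, which is where the precision gain comes from. */
static void quantize_block(const float *x, block_q8 *b) {
    float amax = 0.0f;
    for (int i = 0; i < BLOCK_SIZE; i++) {
        const float a = fabsf(x[i]);
        if (a > amax) amax = a;
    }
    b->scale = amax / 127.0f;
    const float inv = b->scale != 0.0f ? 1.0f / b->scale : 0.0f;
    for (int i = 0; i < BLOCK_SIZE; i++) {
        b->q[i] = (int8_t)roundf(x[i] * inv);
    }
}

int main(void) {
    float x[BLOCK_SIZE];
    for (int i = 0; i < BLOCK_SIZE; i++) {
        x[i] = sinf((float)i); /* arbitrary test data */
    }

    block_q8 b;
    quantize_block(x, &b);

    float max_err = 0.0f;
    for (int i = 0; i < BLOCK_SIZE; i++) {
        const float err = fabsf(x[i] - b.scale * (float)b.q[i]);
        if (err > max_err) max_err = err;
    }
    printf("max round-trip error: %g\n", max_err); /* bounded by scale/2 */
    return 0;
}
```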
-
The graph on slide 17 looks like a duplicate of the one on slide 7. I'm not sure if it's supposed to be the same or to show something different.
-
This is so useful. I already knew that llama.cpp can do multiplications on some quantized values without dequantizing them, but I couldn't really understand how it's done. It's now clear thanks to your presentation. Thank you!
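For anyone else wondering about the same thing: the core trick, as I understand it from the slides, is that the multiply-adds stay in integer arithmetic and the per-block float scales are applied only once at the end. A minimal self-contained sketch, with an illustrative block layout rather than the actual llama.cpp structs:

```c
#include <stdint.h>
#include <stdio.h>

#define QK 32 /* values per block; purely illustrative */

typedef struct {
    float  scale; /* per-block fp32 scale */
    int8_t q[QK]; /* quantized int8 values */
} blk_t;

/* Dot product of two quantized blocks without dequantizing them first:
 * all multiply-adds run in integer arithmetic (the part that maps onto
 * int8 SIMD or Tensor Core instructions), and the two float scales are
 * applied exactly once per block at the end. */
static float vec_dot(const blk_t *a, const blk_t *b) {
    int32_t isum = 0;
    for (int i = 0; i < QK; i++) {
        isum += (int32_t)a->q[i] * (int32_t)b->q[i];
    }
    return (float)isum * a->scale * b->scale;
}

int main(void) {
    blk_t a = { .scale = 0.01f };
    blk_t b = { .scale = 0.02f };
    for (int i = 0; i < QK; i++) {
        a.q[i] = (int8_t)(i - 16);
        b.q[i] = (int8_t)(16 - i);
    }
    printf("dot = %f\n", vec_dot(&a, &b));
    return 0;
}
```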
-
I will give a presentation on llama.cpp at the First Large Language Models in Physics Symposium on the 22nd of February. The title of the presentation is "Efficient Matrix Multiplication Algorithms for Quantized Language Models" with the following abstract:
As the abstract implies, the talk will be about my work on llama.cpp regarding performance optimizations for matrix multiplication algorithms. Registration for in-person attendance is already closed, but it is still possible to register for attendance via Zoom if someone wants to.
This is the first draft of the slides that I will be using:
Efficient_Mat_Mul.pdf
Edit: second draft:
Efficient_Mat_Mul.pdf