Inference bottleneck #248

Open
wonkyoc opened this issue May 3, 2024 · 11 comments

wonkyoc commented May 3, 2024

What I have experienced is that CPU inference with stable-diffusion.cpp is far slower than the latest diffusers. In particular, the UNet sampling alone takes about 30 s per step, which is roughly 23x slower than diffusers.

Results: a single step w/ DPM++, 24 threads
stable-diffusion.cpp: 32.95 s vs. diffusers: 1.43 s

I want to discuss this in detail. I saw the author's comment:

The current implementation of ggml_conv_2d is slow and has high memory usage

If this is true, is the slow inference rooted in ggml? The CPU seems well utilized in matrix multiplication: comparing a single thread against 2/4/12/24 threads, MUL_MAT scales with the thread count, but MUL_MAT itself is inherently slow (a stand-alone sketch of this kind of measurement is at the end of this post). This is quite abnormal to me, because I would expect a C++ implementation to be faster than Python by nature.

My testing environment:

$ lscpu
CPU: AMD Ryzen Threadripper 3960X 24-Core Processor (Hyperthreading off)

Update
I found that compiling without -O3 leads to slow inference. The default build of stable-diffusion.cpp uses -O3, but I had somehow excluded the flag in my tests. The updated result is around 5 s per step, which is still slower than diffusers.
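
A minimal sketch of the kind of thread-scaling measurement described above, assuming GCC with OpenMP. This is only an illustration of a naive, unblocked matmul kernel, not stable-diffusion.cpp's or ggml's actual MUL_MAT code; the file name and sizes are made up:

// naive_matmul_bench.c -- illustrative only, not ggml's actual MUL_MAT kernel.
// Build: gcc -O3 -fopenmp naive_matmul_bench.c -o bench
// Run:   OMP_NUM_THREADS=1 ./bench ; OMP_NUM_THREADS=24 ./bench
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024  /* arbitrary square size */

int main(void) {
    float *a = malloc(sizeof(float) * N * N);
    float *b = malloc(sizeof(float) * N * N);
    float *c = calloc((size_t)N * N, sizeof(float));
    for (int i = 0; i < N * N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    double t0 = omp_get_wtime();
    /* one row of C per iteration; rows are split across threads */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)      /* i-k-j order keeps reads of b contiguous */
            for (int j = 0; j < N; j++)
                c[i * N + j] += a[i * N + k] * b[k * N + j];
    double t1 = omp_get_wtime();

    printf("threads=%d  time=%.3f s  checksum=%f\n",
           omp_get_max_threads(), t1 - t0, (double)c[0]);
    free(a); free(b); free(c);
    return 0;
}

Run with OMP_NUM_THREADS=1, 2, 4, 12, 24: the wall time drops roughly with the thread count, but each thread still executes a scalar, cache-unfriendly kernel. That is the same pattern as MUL_MAT scaling with threads while remaining far behind a vectorized, cache-blocked GEMM of the kind PyTorch calls into.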

FSSRepo (Contributor) commented May 4, 2024

The truth is that, yes, the CPU backend isn't as optimized as it could be; the culprit may be the im2col kernel, since it overuses memory accesses. In all ML software, the main bottleneck is always the matrix multiplications.

Diffusers and the other inference engines have years of head start over ggml, with code optimized in every way. Performance work in ML is largely directed at improving the PyTorch library.
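
For readers unfamiliar with the lowering being discussed: im2col turns a 2D convolution into one large matrix multiplication by copying every receptive field of the input into a column of a scratch matrix, so the convolution can reuse the MUL_MAT kernel at the cost of extra memory traffic. A minimal single-channel, stride-1, no-padding sketch follows; it is a generic illustration, not ggml's actual im2col implementation:

// Generic im2col for one input channel, stride 1, no padding -- illustration only.
// input:  h x w image, kernel: kh x kw
// output: (kh*kw) x (oh*ow) scratch matrix; each column is one flattened patch.
// The convolution then becomes a single big matrix multiply of the flattened
// filters against this matrix -- i.e. one MUL_MAT call.
#include <stddef.h>

void im2col(const float *input, int h, int w,
            int kh, int kw,
            float *cols /* size (kh*kw) * (h-kh+1) * (w-kw+1) */) {
    int oh = h - kh + 1;
    int ow = w - kw + 1;
    for (int oy = 0; oy < oh; oy++) {
        for (int ox = 0; ox < ow; ox++) {
            int col = oy * ow + ox;              /* which output pixel / column */
            for (int ky = 0; ky < kh; ky++) {
                for (int kx = 0; kx < kw; kx++) {
                    int row = ky * kw + kx;      /* which kernel element / row */
                    cols[(size_t)row * oh * ow + col] =
                        input[(oy + ky) * w + (ox + kx)];
                }
            }
        }
    }
}

Each input pixel gets copied up to kh*kw times into the scratch matrix (about 9x for a 3x3 kernel), which is the memory-access overhead referred to above.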

SA-j00u commented May 4, 2024

Where is this main-bottleneck CPU code?
In ggml\src\ggml.c?

ring-c (Contributor) commented May 4, 2024

I was thinking of creating a similar issue; I hope my CUDA case is not off-topic.
With txt2img on CUDA, there seems to be a CPU bottleneck: 100% load on one CPU core and only 70-80% load on the GPU. I assumed there would be almost no CPU load if we offload to the GPU? I tried to see what is happening myself, but you know ... math.

My hardware:
Intel i9-13900K
NVIDIA GeForce RTX 4080

wonkyoc (Author) commented May 5, 2024

@SA-j00u The bottleneck mainly comes from the MUL_MAT operator. You can profile your run with GGML_PERF (see the stand-alone sketch below).

@ring-c If you are using CUDA, that is normal behavior. If you offload part of the execution to the CPU, you will incur communication latency over PCIe, which is normally bad unless you need the extra memory because of a large model.
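
For anyone who wants to reproduce the MUL_MAT measurement outside the full pipeline, a stand-alone micro-benchmark against the bundled ggml could look like the sketch below. It assumes a mid-2024 ggml API (ggml_new_graph / ggml_graph_compute_with_ctx) and that the GGML_PERF compile-time define is still honored; the sizes, thread count, and arena size are arbitrary, and other ggml revisions may need adjustments:

// mulmat_profile.c -- stand-alone MUL_MAT micro-benchmark for ggml (sketch only;
// the API shown matches mid-2024 ggml revisions and may differ in others).
// Build against the ggml bundled with stable-diffusion.cpp, adding -DGGML_PERF to
// the C flags of a release build to get per-node timings from ggml_graph_print().
#include "ggml.h"
#include <stdio.h>

int main(void) {
    const int64_t M = 1024, N = 1024, K = 1024;  /* arbitrary sizes */
    const int n_threads = 24;

    struct ggml_init_params params = {
        /*.mem_size   =*/ 256 * 1024 * 1024,     /* generous scratch arena */
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    /* ggml_mul_mat expects the shared dimension (K) first on both operands */
    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, K, M);
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, K, N);
    ggml_set_f32(a, 1.0f);
    ggml_set_f32(b, 2.0f);
    struct ggml_tensor * c = ggml_mul_mat(ctx, a, b);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);
    ggml_graph_compute_with_ctx(ctx, gf, n_threads);

    ggml_graph_print(gf);   /* with GGML_PERF defined this should include per-node timing */

    ggml_free(ctx);
    return 0;
}

Rebuilding with different n_threads (or wrapping the compute call in a loop) is enough to see whether MUL_MAT alone accounts for the per-step time observed above.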

ring-c (Contributor) commented May 5, 2024

@wonkyoc I don't think this is normal. There is about 50% free VRAM in this scenario. There is not enough load on the GPU; the CPU-side work is not fast enough to keep the GPU fed.

wonkyoc (Author) commented May 6, 2024

@FSSRepo I understand the repo is a bit of an infant compared to PyTorch, and therefore the slow inference comes down to under-optimization. Yet the one thing I do not understand is that llama.cpp and whisper.cpp are quite comparable to (or even better than) PyTorch. This tells me ggml itself might not be the main cause; Stable Diffusion's distinctive inference pattern might be.

Anyway, I will leave this issue open for a bit and close it sometime this week, since this won't be solved soon.

wonkyoc (Author) commented May 10, 2024

I found that this issue was actually caused on my end, although diffusers is still faster. For some reason I built with CMAKE_BUILD_TYPE=Debug, which dropped the -O3 flag from compilation, and that is why I saw such a high inference time. With the flag, the CPU takes roughly 5 s per step instead of 30 s.

wonkyoc closed this as completed May 10, 2024

SA-j00u commented May 10, 2024

yep debug version is sloooooooow

JohnAlcatraz commented May 12, 2024

But you say there is still a difference of ~5 seconds here vs 1.43 seconds in diffusers?

I think in that case this is a very interesting benchmark, and it would be good to keep this issue open to track the speed difference and give some visibility to possible improvements. Maybe just update the first post with accurate benchmarks from a release build, or create a new issue with the corrected results.

SA-j00u commented May 12, 2024

I think part of the bottleneck is the C compiler (which is not as actively maintained as the C++ compilers).
Execution times for some of my code (with nested for loops) compiled with MinGW:
0.244521  C++ compiler
0.116364  C++ compiler, all variables register
0.290956  C compiler
0.230809  C compiler, all variables register

I replaced all possible variable declarations with register in ggml.c but did not get any speedup; I even got the same binary as without the register variables.

Also, -Ofast produces different results compared to -O3 (!) but does not give a significant or noticeable speedup.
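
The -Ofast observation is expected behavior rather than a bug: -Ofast implies -ffast-math, which allows the compiler to reassociate floating-point operations, so long sums and dot products can round differently. A minimal stand-alone demonstration (hypothetical example, unrelated to ggml.c):

// fastmath_demo.c -- shows why -Ofast (-ffast-math) can change results:
// reassociation changes rounding when magnitudes differ a lot.
//   gcc -O3    fastmath_demo.c -o demo && ./demo
//   gcc -Ofast fastmath_demo.c -o demo && ./demo
#include <stdio.h>

int main(void) {
    float sum = 1.0e8f;          /* large head value */
    for (int i = 0; i < 10000000; i++)
        sum += 0.01f;            /* tiny increments, each smaller than half an ulp of 1e8 */
    printf("sum = %.1f\n", sum);
    return 0;
}

With a strict -O3 build the increments are added one by one and mostly absorbed by rounding; with -Ofast the reduction is typically vectorized into partial sums, so the two binaries usually print different totals. The same kind of reassociation inside reductions is the usual reason -Ofast changes numerical results without a noticeable speedup.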

wonkyoc (Author) commented May 13, 2024

@JohnAlcatraz I just updated the first post and am reopening the issue.

wonkyoc reopened this May 13, 2024