Am I being limited by single core CPU performance when fully offloaded? #5803
Replies: 5 comments 1 reply
-
Found a response here that suggests it's spinning a thread and not doing anything, so it's not the bottleneck: #3210 (comment)
-
I feel like I'm having the same issue.
-
Similar behaviour here on AMD/rocBLAS. Most of the CPU time is spent in …
-
I would have suspected that more CPUs should at least benefit initial tokenization (is that the …
-
I'm using a system with the following hardware:
Running a Q6 quant of mixtral with:
```
./main -m ~/models/mixtral-8x7b-instruct-v0.1.Q6_K.gguf -c 32768 -ngl 128 -ts 39,61,0 -sm row -t 8 --prompt "Once upon a time"
```
And getting very decent 21-22 t/s:
```
llama_print_timings: sample time      =   155.18 ms /   352 runs   ( 0.44 ms per token, 2268.29 tokens per second)
llama_print_timings: prompt eval time =   168.00 ms /     5 tokens (33.60 ms per token,   29.76 tokens per second)
llama_print_timings: eval time        = 16486.27 ms /   351 runs   (46.97 ms per token,   21.29 tokens per second)
llama_print_timings: total time       = 16912.39 ms /   356 tokens
```
However, I noticed that during generation there's a single CPU thread pegged at 100%.
Single-threaded behavior is expected here, judging from commits like #5238.
But that single thread sits at 100% while nvidia-smi shows only 50-60% GPU utilization.
Could my single thread performance be my bottleneck?