Am I being limited by single core CPU performance when fully offloaded? #5803
Replies: 5 comments 1 reply
-
Found a response here that suggests it's spinning a thread and not doing anything, so it's not the bottleneck: #3210 (comment)
-
I feel like I'm having the same issue.
-
Similar behaviour here on AMD/rocBLAS. Most of the CPU time is spent in …
-
I would have suspected that more CPUs should at least benefit initial tokenization (is that the …
-
I'm using a system with the following hardware:
Running a Q6 quant of mixtral with:
```
./main -m ~/models/mixtral-8x7b-instruct-v0.1.Q6_K.gguf -c 32768 -ngl 128 -ts 39,61,0 -sm row -t 8 --prompt "Once upon a time"
```
And getting very decent 21-22 t/s:
```
llama_print_timings: sample time      =   155.18 ms /   352 runs   ( 0.44 ms per token, 2268.29 tokens per second)
llama_print_timings: prompt eval time =   168.00 ms /     5 tokens (33.60 ms per token,   29.76 tokens per second)
llama_print_timings: eval time        = 16486.27 ms /   351 runs   (46.97 ms per token,   21.29 tokens per second)
llama_print_timings: total time       = 16912.39 ms /   356 tokens
```
However, I noticed that during generation there's a single CPU thread pegged at 100%.
Single-threaded behavior is expected here, judging from commits like #5238.
But that single thread sits at 100% while nvidia-smi shows only 50-60% GPU utilization.
Could my single thread performance be my bottleneck?