llama3 generation time (GPU) #6878
-
Hello all, I am running the llama3-8B-instruct Q6_K model on dual GV100 GPUs. I get very inconsistent timing results while using the model: sometimes I get 50 tps, sometimes 5 tps. I also notice that in the slower case the total time is several times the sum of its components. What could be the cause of this unreliable behavior? Thanks.
Replies: 1 comment 2 replies
-
There are two possible reasons:

1. You are not actually running on the GPU, and you have a recent CPU with performance cores and efficiency cores. If the process does not force itself onto the performance cores, the OS will send some of the processing to the performance cores and the rest to the efficiency cores (see the pinning sketch below).
2. You have two GPUs: one CPU-integrated GPU and one dedicated GPU. The same problem applies here: if the process does not state a preference for the fastest GPU, some processing may be sent to your small GPU (the iGPU) and the rest to your dedicated GPU (see the device-selection sketch below).

It comes down to the code (the process) giving a preference. If it does not, the OS will choose. The same thing can happen with NPUs: NPUs might be much faster than GPUs for LLM inference because they are optimised for it. The best thing the developer can do is run a micro-benchmark against all the available devices and see which one solves the problem fastest. For example, llama.cpp could simulate a "Hello world" on each of the available hardware processors, see which one comes out as the winner, and use the winner as the device on which llama.cpp runs. That way the OS will not hop on and off those devices; it will strictly follow the developer's preference (see the toy benchmark sketch at the end).
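For the first case, pinning the process to the performance cores removes the scheduler as a variable. Here is a minimal Linux sketch of the idea; the core IDs are placeholders (on a hybrid CPU, check `lscpu` first to see which IDs are the P-cores):

```cpp
// pin_pcores.cpp -- sketch: pin this process to a fixed set of cores so the
// scheduler cannot migrate the heavy threads onto efficiency cores (Linux only).
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <cstdio>

int main() {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int core : {0, 1, 2, 3}) {  // assumed P-core IDs -- verify with lscpu
        CPU_SET(core, &set);
    }
    // pid 0 = the calling thread; threads created afterwards inherit the mask
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return 1;
    }
    std::printf("pinned to cores 0-3; start the workload from here\n");
    return 0;
}
```

The same effect is available without code changes via `taskset -c 0-3 <command>`.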
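For the second case, the fix is to make the GPU choice explicit instead of leaving it to the driver. Here is a sketch using the CUDA runtime API; the "most SMs wins" heuristic is only an illustration (build with `nvcc`):

```cpp
// pick_gpu.cpp -- sketch: enumerate CUDA devices and select one explicitly so
// work is never silently scheduled onto a weaker GPU.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::fprintf(stderr, "no CUDA devices visible\n");
        return 1;
    }
    int best = 0, best_sms = -1;
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("device %d: %s (%d SMs)\n", i, prop.name, prop.multiProcessorCount);
        if (prop.multiProcessorCount > best_sms) {  // toy heuristic: most SMs wins
            best_sms = prop.multiProcessorCount;
            best = i;
        }
    }
    cudaSetDevice(best);  // all subsequent CUDA work in this thread targets `best`
    std::printf("selected device %d\n", best);
    return 0;
}
```

With llama.cpp itself you can get the same effect from the command line, e.g. `CUDA_VISIBLE_DEVICES=0` to hide one GPU from the process entirely, or the `--main-gpu` and `--split-mode none` options to keep the whole model on a single device.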
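Finally, a minimal sketch of the "Hello world" micro-benchmark idea. Everything here is a stand-in: in a real integration each candidate would run a short prompt on one backend rather than the dummy loops below:

```cpp
// bench_pick.cpp -- toy illustration: time a tiny fixed workload on each
// candidate device and commit all real work to the fastest one.
#include <chrono>
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

struct Candidate {
    std::string name;
    std::function<void()> tiny_workload;  // stand-in for a "Hello world" generation
};

int main() {
    std::vector<Candidate> candidates = {
        {"cpu",  [] { volatile long s = 0; for (long i = 0; i < 50000000; ++i) s += i; }},
        {"igpu", [] { volatile long s = 0; for (long i = 0; i < 80000000; ++i) s += i; }},
    };
    std::size_t best = 0;
    double best_ms = 1e300;
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        auto t0 = std::chrono::steady_clock::now();
        candidates[i].tiny_workload();
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        std::printf("%-5s %8.1f ms\n", candidates[i].name.c_str(), ms);
        if (ms < best_ms) { best_ms = ms; best = i; }
    }
    // use the winner for all real work so the OS never hops between devices
    std::printf("winner: %s\n", candidates[best].name.c_str());
    return 0;
}
```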