llama3 generation time (GPU) #6878
-
Hello all, I am running the llama3-8B-instruct Q6_K model on dual GV100 GPUs. I get very inconsistent timing results while using the model: sometimes I get 50 tps, sometimes 5 tps. I also notice that in the slower case the total time is several times the sum of its components. What could be the cause of this unreliable behavior? Thanks.
Replies: 1 comment 2 replies
-
There are two possible reasons:

1. You are not actually running on the GPU, and you have a recent CPU with performance cores and efficiency cores. If the process does not force itself onto the performance cores, the OS will send some of the processing to the performance cores and the rest to the efficiency cores (see the pinning sketch below).
2. You have two GPUs: one CPU-integrated GPU and one dedicated GPU. The same problem applies here: if the process does not state a preference for the fastest GPU, some processing may be sent to your small GPU (the iGPU) and the rest to your dedicated GPU (see the device-selection sketch below).

It comes down to the code (the process) giving a preference. If it does not, the OS will choose. The same thing can happen with NPUs: NPUs might be much faster than GPUs for LLM inference because they are optimised for it. The best thing the developer can do is run a micro-benchmark against all the available devices and see which one solves the problem fastest. For example, llama.cpp could simulate a "Hello world" on each of the available hardware processors, see which one comes out as the winner, and use the winner as the device on which llama.cpp runs. That way the OS will not hop on and off those devices; it will strictly follow the developer's preference (see the toy benchmark sketch at the end).
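For the first case, pinning the process to the performance cores removes the scheduler as a variable. Here is a minimal Linux sketch of the idea; the core IDs are placeholders (on a hybrid CPU, check `lscpu` first to see which IDs are the P-cores):

```cpp
// pin_pcores.cpp -- sketch: pin this process to a fixed set of cores so the
// scheduler cannot migrate the heavy threads onto efficiency cores (Linux only).
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <cstdio>

int main() {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int core : {0, 1, 2, 3}) {  // assumed P-core IDs -- verify with lscpu
        CPU_SET(core, &set);
    }
    // pid 0 = the calling thread; threads created afterwards inherit the mask
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return 1;
    }
    std::printf("pinned to cores 0-3; start the workload from here\n");
    return 0;
}
```

The same effect is available without code changes via `taskset -c 0-3 <command>`.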
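For the second case, the fix is to make the GPU choice explicit instead of leaving it to the driver. Here is a sketch using the CUDA runtime API; the "most SMs wins" heuristic is only an illustration (build with `nvcc`):

```cpp
// pick_gpu.cpp -- sketch: enumerate CUDA devices and select one explicitly so
// work is never silently scheduled onto a weaker GPU.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::fprintf(stderr, "no CUDA devices visible\n");
        return 1;
    }
    int best = 0, best_sms = -1;
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("device %d: %s (%d SMs)\n", i, prop.name, prop.multiProcessorCount);
        if (prop.multiProcessorCount > best_sms) {  // toy heuristic: most SMs wins
            best_sms = prop.multiProcessorCount;
            best = i;
        }
    }
    cudaSetDevice(best);  // all subsequent CUDA work in this thread targets `best`
    std::printf("selected device %d\n", best);
    return 0;
}
```

With llama.cpp itself you can get the same effect from the command line, e.g. `CUDA_VISIBLE_DEVICES=0` to hide one GPU from the process entirely, or the `--main-gpu` and `--split-mode none` options to keep the whole model on a single device.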
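Finally, a minimal sketch of the "Hello world" micro-benchmark idea. Everything here is a stand-in: in a real integration each candidate would run a short prompt on one backend rather than the dummy loops below:

```cpp
// bench_pick.cpp -- toy illustration: time a tiny fixed workload on each
// candidate device and commit all real work to the fastest one.
#include <chrono>
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

struct Candidate {
    std::string name;
    std::function<void()> tiny_workload;  // stand-in for a "Hello world" generation
};

int main() {
    std::vector<Candidate> candidates = {
        {"cpu",  [] { volatile long s = 0; for (long i = 0; i < 50000000; ++i) s += i; }},
        {"igpu", [] { volatile long s = 0; for (long i = 0; i < 80000000; ++i) s += i; }},
    };
    std::size_t best = 0;
    double best_ms = 1e300;
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        auto t0 = std::chrono::steady_clock::now();
        candidates[i].tiny_workload();
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        std::printf("%-5s %8.1f ms\n", candidates[i].name.c_str(), ms);
        if (ms < best_ms) { best_ms = ms; best = i; }
    }
    // use the winner for all real work so the OS never hops between devices
    std::printf("winner: %s\n", candidates[best].name.c_str());
    return 0;
}
```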