Slow processing of follow-up prompt #54

Open
woheller69 opened this issue May 9, 2024 · 1 comment

Comments

@woheller69

In a multi-turn conversation I see that the combination of llama-cpp-python and llama-cpp-agent is much slower on the second prompt than the Python bindings of gpt4all. See the two screenshots below. The evaluation of the first prompt is faster, probably due to the recent speed improvements for prompt processing, which have not yet been adopted in gpt4all. But when I reply to that first answer from the AI, the second reply of gpt4all comes much faster than the first, whereas llama-cpp-python/llama-cpp-agent is even slower than on the first prompt. My setup is CPU only.
Do you have an idea why this is the case? Do they handle memory in a more efficient way?

Llama-3-8b-instruct Q8
Prompt processing
round        gpt4all        llama-cpp-python/agent
1            12.03 s              7.17 s
2             3.73 s              8.46 s

[screenshot: gpt4all timings]
[screenshot: llama-cpp-agent timings]
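
For reference, the call pattern is just an ordinary multi-turn chat through llama-cpp-python, where the whole message history is passed again on every turn. A minimal sketch (the model path and parameters here are placeholders, not the exact setup from the screenshots):

```python
# Minimal multi-turn sketch with llama-cpp-python on CPU.
# Model path, context size and thread count are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./Meta-Llama-3-8B-Instruct.Q8_0.gguf",  # hypothetical path
    n_ctx=4096,
    n_threads=8,
    verbose=True,  # prints the llama_print_timings blocks quoted below
)

messages = [{"role": "user", "content": "Explain KV caching in one paragraph."}]

# Turn 1: the whole prompt has to be evaluated from scratch.
reply = llm.create_chat_completion(messages=messages)["choices"][0]["message"]
messages.append(reply)

# Turn 2: the previous answer is now part of the prompt again. If the
# re-tokenized history exactly matches the tokens already in the cache,
# llama-cpp-python reports "Llama.generate: prefix-match hit" and only the
# new tokens need prompt eval; if the sequences diverge early, the old
# answer is re-processed, which looks like the slowdown measured above.
messages.append({"role": "user", "content": "Now make it shorter."})
reply2 = llm.create_chat_completion(messages=messages)["choices"][0]["message"]
```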

@woheller69
Author

It seems that the model always needs to re-evaluate its own previous answer as part of the prompt.
In the following examples my own new prompt was quite short each time.
The number of tokens in prompt eval is always the number of tokens in the previous answer plus a few more:

llama_print_timings: load time = 2561.54 ms
llama_print_timings: sample time = 157.69 ms / 59 runs ( 2.67 ms per token, 374.15 tokens per second)
llama_print_timings: prompt eval time = 143124.12 ms / 356 tokens ( 402.03 ms per token, 2.49 tokens per second)
llama_print_timings: eval time = 28492.60 ms / 58 runs ( 491.25 ms per token, 2.04 tokens per second)
llama_print_timings: total time = 62822.10 ms / 414 tokens
Inference terminated
Llama.generate: prefix-match hit

llama_print_timings: load time = 2561.54 ms
llama_print_timings: sample time = 429.62 ms / 154 runs ( 2.79 ms per token, 358.46 tokens per second)
llama_print_timings: prompt eval time = 8277.50 ms / 88 tokens ( 94.06 ms per token, 10.63 tokens per second)
llama_print_timings: eval time = 75161.81 ms / 153 runs ( 491.25 ms per token, 2.04 tokens per second)
llama_print_timings: total time = 86088.57 ms / 241 tokens
Inference terminated
Llama.generate: prefix-match hit

llama_print_timings: load time = 2561.54 ms
llama_print_timings: sample time = 195.41 ms / 68 runs ( 2.87 ms per token, 347.98 tokens per second)
llama_print_timings: prompt eval time = 15788.26 ms / 163 tokens ( 96.86 ms per token, 10.32 tokens per second)
llama_print_timings: eval time = 33127.41 ms / 67 runs ( 494.44 ms per token, 2.02 tokens per second)
llama_print_timings: total time = 50100.81 ms / 230 tokens

Shouldn't the model already find the tokenization of its previous answer in the cache? Or could it be that the applied chat template differs slightly from what the model produced in its answer, so it does not recognize it?
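
One way to test that (a rough sketch, not code from llama-cpp-agent; the prompt string, answer text and model path are all placeholders): approximate the cached sequence by tokenizing the answer on its own, then compare it with the tokens produced when the whole history is re-tokenized in one pass. If the two sequences diverge where the answer starts, the prefix match ends there and the answer gets re-evaluated:

```python
# Rough diagnostic sketch: does re-tokenizing the history reproduce the
# token sequence that was already in the cache? All strings and paths are
# placeholders, not taken from the issue.
from llama_cpp import Llama

llm = Llama(model_path="./Meta-Llama-3-8B-Instruct.Q8_0.gguf", verbose=False)  # hypothetical path

turn1_prompt = "User: Hello\nAssistant: "          # simplified stand-in for the chat template
answer_text = "Hello! How can I help you today?"   # stand-in for the generated answer

# Approximation of the cache after turn 1: the prompt tokenized once,
# plus the answer tokens appended during generation.
cached = llm.tokenize(turn1_prompt.encode("utf-8"))
cached += llm.tokenize(answer_text.encode("utf-8"), add_bos=False)

# Tokens produced when the chat template rebuilds the whole history for turn 2.
rebuilt = llm.tokenize((turn1_prompt + answer_text).encode("utf-8"))

# Length of the common prefix; everything after this point is re-evaluated.
common = 0
for a, b in zip(cached, rebuilt):
    if a != b:
        break
    common += 1
print(f"cached={len(cached)}  rebuilt={len(rebuilt)}  common prefix={common}")
```

If the common prefix stops right before the answer, a template or tokenization mismatch would explain the timings above.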
