Why is the feed_prompt process so slow? #439
Comments
Hey there! Thanks for reporting this and providing lots of detail :) The issue here is that the version of GGML we use doesn’t support a specific operation required for feeding more than one token at a time with Metal (i.e. this works fine with CUDA, not Metal). See also #403. This has been fixed in upstream GGML/llama.cpp, but we haven’t integrated that fix yet. The work has started in #428 and that should hopefully be finished within the next week (I’m out of town but I hope to get back to it soon). Hope that helps clarify the state of affairs!
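As a rough illustration of why feeding more than one token at a time matters, the sketch below counts how many forward passes a prompt needs for a given batch size. It is not code from llm or GGML: the function names, the 300-token prompt, and the batch size of 8 are all hypothetical, and it only models the number of passes, not their actual cost on any backend.

```rust
/// Number of forward passes needed to feed `n_tokens` prompt tokens
/// when at most `n_batch` tokens can be evaluated per pass.
/// (Hypothetical helper, not part of any library.)
fn forward_passes(n_tokens: usize, n_batch: usize) -> usize {
    assert!(n_batch > 0, "batch size must be at least 1");
    // Ceiling division: a final partial batch still costs one pass.
    (n_tokens + n_batch - 1) / n_batch
}

/// Splits a prompt into the batches that would be fed one pass at a time.
fn batches(tokens: &[u32], n_batch: usize) -> Vec<&[u32]> {
    assert!(n_batch > 0, "batch size must be at least 1");
    tokens.chunks(n_batch).collect()
}

fn main() {
    // Stand-in token ids; a real prompt would come from a tokenizer.
    let prompt: Vec<u32> = (0..300).collect();
    println!("one token at a time: {} passes", forward_passes(prompt.len(), 1));
    println!("batches of 8: {} passes", batches(&prompt, 8).len());
}
```

If each pass carries some fixed overhead, a backend limited to one token per pass pays that overhead once per prompt token, which would be consistent with the slow prompt feeding reported here.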
I'm very happy to hear this news and looking forward to the merged version. Thank you for your work. Can I wait until after the release to close this issue?
hello @philpax has there been any recent movement on this?
I started working on it, but realised that it would end up being quite a large task. Still working on it, but it'll take some time.
thanks
LLM is indeed a fantastic library and very easy to use. However, after using LLM for a few days, I noticed that the process of feed_prompt is always very slow. It consumes a significant amount of CPU resources and doesn't utilize GPU resources (I found in the hardware acceleration documentation that feed_prompt currently doesn't use GPU resources). As a result, if I add some context during the conversation, it takes a long time for feed_prompt to complete, which is not ideal for the actual user experience. I used TheBloke/Llama-2-7B-Chat-GGML/llama-2-7b-chat.ggmlv3.q2_K.bin for testing.

Using the same model and prompt, I tested with llama.cpp, and its first token response time is very fast. I'm not sure what the difference is in the feed_prompt process between llm and llama.cpp. By observing CPU history and GPU history, it seems like llama.cpp is fully utilizing the GPU for inference. Can you please help me identify what's wrong?
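To make a comparison like this more concrete, prompt feeding and first-token generation can be timed separately. The sketch below is a minimal, library-agnostic way to do that with the Rust standard library. The `timed` helper is hypothetical, and the closures are placeholders standing in for the real prompt-feeding and generation calls; no llm or llama.cpp API is used or implied.

```rust
use std::time::{Duration, Instant};

/// Runs `f` and returns its result together with the elapsed wall-clock time.
/// (Hypothetical helper for illustration.)
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}

fn main() {
    // Placeholder workloads: in a real test these would be the prompt-feeding
    // call and the first-token generation call of the library being measured.
    let (_, feed) = timed(|| std::thread::sleep(Duration::from_millis(20)));
    let (token, first) = timed(|| 42u32);
    println!("prompt feed: {feed:?}, first token ({token}): {first:?}");
}
```

Reporting the two durations separately shows whether the delay before the first token comes from prompt feeding or from generation itself.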
Model:
System:
llama.cpp command:
llama.cpp Result:
llm sample code:
llm sample code result: