Asynchronous request parsing & prompt processing in server #7280
tristandruyen started this conversation in Ideas
Description
Response streaming using SSE is awesome for reducing perceived interaction latency during token generation, but it does nothing for prompt processing latency.
I propose adding support for parsing & processing completion requests asynchronously in the examples/server of llama.cpp. The essential part would be to process the prompt while the request is still ongoing, hiding the prompt processing latency so that the model can start streaming its response basically immediately after the request is finished.
This is probably a pretty complex feature, so feel free to judge it out of scope and close the idea; it would be pretty awesome though :)
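To make the idea concrete, here is a minimal, self-contained sketch of the overlap I have in mind (plain C++ with only the standard library, not actual examples/server code): request body chunks are pushed onto a queue as they arrive, while a worker thread evaluates them concurrently, so generation can begin as soon as the last chunk lands. `process_prompt_chunk()` is a hypothetical stand-in for the real tokenization + `llama_decode` steps.

```cpp
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// Thread-safe queue carrying pieces of the request body as they arrive.
struct ChunkQueue {
    std::queue<std::string> chunks;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void push(std::string chunk) {
        { std::lock_guard<std::mutex> lock(m); chunks.push(std::move(chunk)); }
        cv.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lock(m); done = true; }
        cv.notify_one();
    }
    // Returns false once the stream is closed and fully drained.
    bool pop(std::string &out) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return !chunks.empty() || done; });
        if (chunks.empty()) return false;
        out = std::move(chunks.front());
        chunks.pop();
        return true;
    }
};

// Hypothetical stand-in: tokenize the partial prompt text and feed it to
// the model (in llama.cpp terms, roughly llama_tokenize + llama_decode).
void process_prompt_chunk(const std::string &text) {
    std::cout << "evaluating chunk: " << text << "\n";
}

int main() {
    ChunkQueue queue;

    // Worker: evaluates prompt chunks while the request is still arriving.
    std::thread worker([&] {
        std::string chunk;
        while (queue.pop(chunk)) {
            process_prompt_chunk(chunk);
        }
        // At this point the prompt is already evaluated, so token
        // generation can start immediately after the request completes.
        std::cout << "request finished -> start generation immediately\n";
    });

    // Simulated request body arriving in pieces (e.g. chunked transfer,
    // or incremental transcription piped in from whisper.cpp).
    for (const char *part : {"You are a helpful ", "assistant. ", "Summarize..."}) {
        queue.push(part);
    }
    queue.close();
    worker.join();
}
```

In the real server this would of course have to respect slot assignment, batching, and KV cache management; the sketch only shows the scheduling idea of overlapping prompt evaluation with request arrival.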
Motivation
The inspiration for this feature comes from the low latency observed in the GPT-4o demo.
The feature would be particularly beneficial for voice-assistant use cases.
By enabling asynchronous prompt processing, we can achieve much lower response latency when e.g. using whisper.cpp to transcribe & pipe audio into llama.cpp to build a FOSS voice assistant.
Open Questions
I would appreciate feedback and suggestions from the community and maintainers regarding this idea. Here are a few questions to consider: