Asynchronous request parsing & prompt processing in server #7280
tristandruyen started this conversation in Ideas
Description
Response streaming using SSE is awesome for reducing perceived interaction latency during token generation, but it does nothing for prompt processing latency.
I propose adding support for parsing & processing completion requests asynchronously in the examples/server of llama.cpp. The essential part would be to process the prompt while the request is still ongoing, hiding the prompt processing latency so that the model can start streaming its response basically immediately after the request is finished.
This is probably a pretty complex feature, so feel free to judge it out of scope and close the idea; it would be pretty awesome though :)
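To make the idea concrete, here is a minimal, self-contained sketch of the overlap I have in mind (plain C++ with only the standard library, not actual examples/server code): request body chunks are pushed onto a queue as they arrive, while a worker thread evaluates them concurrently, so generation can begin as soon as the last chunk lands. `process_prompt_chunk()` is a hypothetical stand-in for the real tokenization + `llama_decode` steps.

```cpp
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// Thread-safe queue carrying pieces of the request body as they arrive.
struct ChunkQueue {
    std::queue<std::string> chunks;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void push(std::string chunk) {
        { std::lock_guard<std::mutex> lock(m); chunks.push(std::move(chunk)); }
        cv.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lock(m); done = true; }
        cv.notify_one();
    }
    // Returns false once the stream is closed and fully drained.
    bool pop(std::string &out) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return !chunks.empty() || done; });
        if (chunks.empty()) return false;
        out = std::move(chunks.front());
        chunks.pop();
        return true;
    }
};

// Hypothetical stand-in: tokenize the partial prompt text and feed it to
// the model (in llama.cpp terms, roughly llama_tokenize + llama_decode).
void process_prompt_chunk(const std::string &text) {
    std::cout << "evaluating chunk: " << text << "\n";
}

int main() {
    ChunkQueue queue;

    // Worker: evaluates prompt chunks while the request is still arriving.
    std::thread worker([&] {
        std::string chunk;
        while (queue.pop(chunk)) {
            process_prompt_chunk(chunk);
        }
        // At this point the prompt is already evaluated, so token
        // generation can start immediately after the request completes.
        std::cout << "request finished -> start generation immediately\n";
    });

    // Simulated request body arriving in pieces (e.g. chunked transfer,
    // or incremental transcription piped in from whisper.cpp).
    for (const char *part : {"You are a helpful ", "assistant. ", "Summarize..."}) {
        queue.push(part);
    }
    queue.close();
    worker.join();
}
```

In the real server this would of course have to respect slot assignment, batching, and KV cache management; the sketch only shows the scheduling idea of overlapping prompt evaluation with request arrival.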
Motivation
The inspiration for this feature comes from the low latency observed in the GPT-4o demo.
The feature would be particularly beneficial for voice-assistant use cases.
By enabling asynchronous prompt processing, we can achieve much lower response latency when e.g. using whisper.cpp to transcribe & pipe audio into llama.cpp to build a FOSS voice assistant.
Open Questions
I would appreciate feedback and suggestions from the community and maintainers regarding this idea. Here are a few questions to consider: