A Python-based streaming API and web page for Large Language Models.
This repository contains:
- Transformers streaming generation: REAL token-by-token streaming generation for all pre-trained models (based on transformers).
- Flask API: a streaming response interface.
- Gradio app: a fast and easy LLM web page.

Take Llama3 as an example:

- Follow Llama3 download to download the Meta-Llama-3-8B-Instruct model, or get it from Hugging Face / ModelScope.
- Follow Llama3 quick-start to install the dependencies for Llama3.

- Clone this repository and install the dependencies:

  ```bash
  git clone https://github.com/JinHanLei/LLM-Stream-Service
  pip install flask gradio transformers
  ```

- Run the Flask service:

  ```bash
  python llama3_service.py --host 0.0.0.0 --port 8800 --ckpts /Meta-Llama-3-8B-Instruct
  ```

  Note:

  - Replace `Meta-Llama-3-8B-Instruct/` with the path to your checkpoint directory.
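
  Once the service is running, the streaming response can be consumed incrementally over HTTP. The snippet below is only a sketch: the endpoint path and JSON fields are assumptions, so check `llama3_service.py` for the actual route and payload.

  ```python
  import requests

  # Hypothetical endpoint and payload -- adjust to match llama3_service.py.
  url = "http://127.0.0.1:8800/chat"
  payload = {"prompt": "Tell me a joke about penguins."}

  # stream=True makes requests hand the body back chunk by chunk
  # instead of waiting for the full response.
  with requests.post(url, json=payload, stream=True) as resp:
      resp.raise_for_status()
      for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
          if chunk:
              print(chunk, end="", flush=True)
  ```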

- Run the Gradio service:

  ```bash
  gradio llama3_app.py
  ```

  Note:

  - Replace the `Address` variable in `llama3_app.py` with your service address.
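
For reference, a minimal Gradio app that relays the Flask stream into a chat UI might look like the sketch below. The service address, endpoint path, payload fields, and relay logic are illustrative assumptions; `llama3_app.py` in this repository is the authoritative version.

```python
import gradio as gr
import requests

# Assumed service address; mirrors the `Address` variable mentioned above.
Address = "http://127.0.0.1:8800/chat"  # hypothetical endpoint path

def chat(message, history):
    # Forward the user message to the Flask service and relay the
    # streamed chunks to the chat window as they arrive.
    partial = ""
    with requests.post(Address, json={"prompt": message}, stream=True) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
            if chunk:
                partial += chunk
                yield partial

gr.ChatInterface(chat).launch()
```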

- The streaming output scheme the project initially adopted was the TextIteratorStreamer that ships with the official transformers library. However, generation was still very slow. After researching it, I found that TextIteratorStreamer actually converts print-ready text into a streaming structure, meaning the LLM first needs to generate the entire text block before converting it, which is not what I wanted: I wanted the LLM to yield each token as it is generated (typical TextIteratorStreamer usage is sketched below for reference).
- Subsequently, I came across LowinLi's project, which truly implements streaming output for pretrained models. When I eagerly applied it to the Llama3 model, it threw an error. After debugging, I found that Llama3 has two eos_tokens, which caused the loop to generate negative ids. So I made modifications based on that project, cleaned up redundancies, adapted it for Llama3, and made it easier to read and understand (a simplified, illustrative token-by-token loop follows).
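
For context, the stock approach looks roughly like the first sketch below: `TextIteratorStreamer` is attached to `generate()`, which runs in a background thread while the main thread consumes the decoded pieces. The checkpoint path and prompt are placeholders.

```python
from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

ckpt = "Meta-Llama-3-8B-Instruct"  # placeholder checkpoint path
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, device_map="auto")

inputs = tokenizer("Tell me a joke about penguins.", return_tensors="pt").to(model.device)
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# generate() blocks, so it runs in a background thread while the main
# thread prints decoded text pieces as the streamer produces them.
Thread(target=model.generate, kwargs=dict(**inputs, streamer=streamer, max_new_tokens=256)).start()
for piece in streamer:
    print(piece, end="", flush=True)
```

The second sketch illustrates the idea behind true token-by-token streaming with Llama3's two end-of-sequence tokens handled explicitly. It is an assumption-laden simplification (greedy decoding only, per-token decode, hypothetical helper name), not the code in `llama3_service.py`.

```python
import torch

def stream_generate(model, tokenizer, prompt, max_new_tokens=256):
    """Greedy decoding loop that yields text as soon as each token is chosen.

    Illustrative only -- see llama3_service.py for the real implementation.
    """
    # Llama3 defines two end-of-sequence tokens; stop on either of them.
    eos_ids = {
        tokenizer.convert_tokens_to_ids("<|end_of_text|>"),
        tokenizer.convert_tokens_to_ids("<|eot_id|>"),
    }
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    past_key_values = None
    for _ in range(max_new_tokens):
        with torch.no_grad():
            out = model(input_ids=input_ids, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        if next_id.item() in eos_ids:
            break
        yield tokenizer.decode(next_id[0], skip_special_tokens=True)
        input_ids = next_id  # with the KV cache, only the new token is fed back


# Reusing the model and tokenizer loaded above:
for piece in stream_generate(model, tokenizer, "Tell me a joke about penguins."):
    print(piece, end="", flush=True)
```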