A simple chat interface for running the Llama 3 model locally, using OpenVINO Runtime for inference, the transformers library for tokenization, and Flask for the chat interface.
-
Build the Docker image with the following command. The source files and model weights are pulled using git, so an active internet connection is required.
docker build -t chat-llama .
You can optionally pass the --no-cache flag to build with the latest upstream changes.
-
Start the container using:
docker run -p 5000:5000 chat-llama
This should start the Flask development server, available at http://localhost:5000
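The container may need some time before the server accepts connections. As a rough sketch, the following helper polls the URL until it responds. The function name `wait_for_server` and its default timings are illustrative assumptions, not part of this project.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll `url` until it returns any HTTP response or `timeout` seconds pass."""
    # The server is local, so bypass any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with opener.open(url, timeout=interval + 1):
                return True
        except urllib.error.HTTPError:
            # The server answered, even if with an error status.
            return True
        except (urllib.error.URLError, OSError):
            # Not accepting connections yet: wait and retry.
            time.sleep(interval)
    return False
```

For example, `wait_for_server("http://localhost:5000")` returns `True` once the Flask server answers, or `False` if nothing responds within the timeout.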
- Python 3.11
To download the original model weights from HuggingFace, visit the HuggingFace model page and accept the license. Once your request has been accepted, use huggingface-cli to log in to your HuggingFace account in your current runtime with the following command:
huggingface-cli login
-
To download the INT-4 quantized Meta-Llama-3-8B-Instruct model, already converted to the OpenVINO IR format, from HuggingFace, use the following command:
huggingface-cli download rajatkrishna/Meta-Llama-3-8B-Instruct-OpenVINO-INT4 --local-dir models/llama-3-instruct-8b
-
Clone the repository
git clone https://github.com/rajatkrishna/llama3-openvino
-
Create a new virtual environment to avoid dependency conflicts:
python3 -m venv .env
source .env/bin/activate
-
Install the dependencies in requirements.txt:
pip install -r requirements.txt
-
Start the Flask server from the project root using:
python3 -m flask run
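Before starting the server, it can help to confirm that the model directory is complete. The sketch below reports which files are missing. The file names in `EXPECTED_FILES` are an assumption about what an OpenVINO IR export typically contains, and `missing_model_files` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path
from typing import List, Sequence

# Assumed file names for an OpenVINO IR export; adjust if your export differs.
EXPECTED_FILES = ("openvino_model.xml", "openvino_model.bin", "config.json")


def missing_model_files(model_dir: str, expected: Sequence[str] = EXPECTED_FILES) -> List[str]:
    """Return the expected file names that are not present in `model_dir`."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]
```

Calling `missing_model_files("models/llama-3-instruct-8b")` from the project root returns an empty list when all of the assumed files are present.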
-
To export the meta-llama/Meta-Llama-3-8B-Instruct model quantized to INT-8 yourself using the optimum-intel CLI, install the requirements in requirements_export.txt:
pip install -r requirements_export.txt
Then run the following from the project root:
optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B-Instruct --weight-format int8 models/llama-3-instruct-8b
-
Alternatively, use the following steps to export the INT-4 quantized model using the Python API:
-
Import the dependencies:
>>> from optimum.intel.openvino import OVWeightQuantizationConfig, OVModelForCausalLM
>>> from transformers import AutoTokenizer
-
Load the model using the OVModelForCausalLM class. Set export=True to export the model on the fly.
>>> model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
>>> export_path = "models/llama-3-instruct-8b"
>>> q_config = OVWeightQuantizationConfig(bits=4, sym=True, group_size=128)
>>> model = OVModelForCausalLM.from_pretrained(model_name, export=True, quantization_config=q_config)
>>> model.save_pretrained(export_path)
-
Now use AutoTokenizer to save the tokenizer.
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>> tokenizer.save_pretrained(export_path)
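To make the bits=4, sym=True, group_size=128 settings used above more concrete, here is a simplified pure-Python sketch of group-wise symmetric quantization. It is illustrative only and is not the algorithm that optimum-intel runs internally. The function names are hypothetical, and the choice of scale (largest absolute value mapped to the largest positive integer) is an assumption.

```python
from typing import List, Sequence, Tuple

Group = Tuple[List[int], float]  # (quantized integers, scale for the group)


def quantize_group_sym(weights: Sequence[float], bits: int = 4) -> Group:
    """Quantize one group symmetrically around zero with a single scale."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Assumed scale choice: map the largest magnitude onto qmax.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return q, scale


def quantize_sym(weights: Sequence[float], bits: int = 4, group_size: int = 128) -> List[Group]:
    """Split `weights` into consecutive groups and quantize each separately."""
    return [
        quantize_group_sym(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


def dequantize(groups: Sequence[Group]) -> List[float]:
    """Reconstruct approximate weights from the integers and per-group scales."""
    return [q * scale for qs, scale in groups for q in qs]
```

In this sketch, a smaller group_size means more scales are stored, so each scale fits its weights more closely at the cost of extra storage.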
-