GitHub - gotzmann/booster: Booster

Booster, according to Merriam-Webster dictionary:

an auxiliary device for increasing force, power, pressure, or effectiveness
the first stage of a multistage rocket providing thrust for the launching and the initial part of the flight

Large Model Booster aims to be a simple and mighty LLM inference accelerator, both for those who need to scale GPTs within a production environment and for those who just want to experiment with models on their own.

Superpowers

Built with performance and scaling in mind thanks to Golang and C++
No more problems with Python dependencies and broken compatibility
Most modern CPUs are supported: any Intel/AMD x64 platforms, server and Mac ARM64
GPUs supported as well: Nvidia CUDA, Apple Metal, OpenCL cards
Split really big models between a number of GPUs (run LLaMA 70B with 2x RTX 3090)
Not bad performance on modest CPU machines, fast as hell inference on monsters with beefy GPUs
Both regular FP16/FP32 models and their quantised versions are supported - 4-bit really rocks!
Popular LLM architectures already there: LLaMA, Starcoder, Baichuan, Mistral, etc...
Special bonus: proprietary Janus Sampling for code generation and non-English languages
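
To see why 4-bit quantisation and multi-GPU splitting matter together, here is a rough back-of-the-envelope sketch in Go. It is not part of Booster: the helper `weightGB` is hypothetical, the 4.5 bits per weight figure is an assumption matching a simple 4-bit block format (4-bit weights plus one 16-bit scale per 32 weights), and the estimate covers weights only, ignoring the KV cache and other runtime buffers.

```go
package main

import "fmt"

// weightGB estimates the memory needed for model weights alone,
// in decimal gigabytes. Hypothetical helper, for illustration only.
func weightGB(params, bitsPerWeight float64) float64 {
	return params * bitsPerWeight / 8 / 1e9
}

func main() {
	const params = 70e9   // LLaMA 70B
	const vram = 2 * 24.0 // two RTX 3090 cards, 24 GB each

	formats := []struct {
		name string
		bits float64
	}{
		{"FP32", 32},
		{"FP16", 16},
		{"4-bit", 4.5}, // assumption: 4-bit weights plus per-block scales
	}

	for _, f := range formats {
		gb := weightGB(params, f.bits)
		fmt.Printf("%-6s %6.1f GB, fits in %.0f GB of VRAM: %v\n",
			f.name, gb, vram, gb <= vram)
	}
}
```

Under these assumptions only the 4-bit weights fit into the combined memory of the two cards, which is why the quantised version is the practical choice for that setup.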

Motivation

Within the first month of llama.go development I was literally shocked by how the original ggml.cpp project made it very clear: there are no limits for talented people in bringing mind-blowing features and moving toward an AI future.

So I've decided to start a new project where a best-in-class C++ / CUDA core is embedded into a mighty Golang server, ready for robust and performant inference at large scale within real production environments.

V0 Roadmap - Fall'23

V1 Roadmap - Winter'23

Rebrand project: LLaMAZoo => Large Model Collider
Is it 2023, 30th of November? First birthday of ChatGPT! Celebrate ...
... then release Collider V1 after half a year of honing it :)

V2 Roadmap - Spring'24

Full LLaMA v2 support
Freeze JSON / YAML config format for Native API

V3 Roadmap - Summer'24

Rebrand project again :) Collider => Booster
Complete LLaMA v3 support
Release OpenAI API compatible endpoints
Allow native Windows support
Prebuilt binaries for all platforms
Support LLaVA multi-modal models inference
Better test coverage
Perplexity computation [ useful for benchmarking ]

How to build on Mac?

Booster was (and still is) developed on a Mac with an Apple Silicon M1 processor, so it's really easy peasy:

make mac

How to compile for CUDA on Ubuntu?

Follow step 1 and step 2, then just make!

Ubuntu Step 1: Install C++ and Golang compilers, as well as some developer libraries

sudo apt update -y && sudo apt upgrade -y && \
sudo apt install -y git git-lfs make build-essential && \
wget https://golang.org/dl/go1.21.5.linux-amd64.tar.gz && \
sudo tar -xf go1.21.5.linux-amd64.tar.gz -C /usr/local && \
rm go1.21.5.linux-amd64.tar.gz && \
echo 'export PATH="${PATH}:/usr/local/go/bin"' >> ~/.bashrc && source ~/.bashrc

Ubuntu Step 2: Install Nvidia drivers and CUDA Toolkit 12.2 with NVCC

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin && \
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600 && \
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub && \
sudo add-apt-repository "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /" && \
sudo apt update -y && \
sudo apt install -y cuda-toolkit-12-2

Now you are ready to rock!

make cuda

How to Run?

You should go through the steps below:

Build the server from sources [ Mac inference as an example ]

make clean && make mac

Download a model, for example [ Hermes 2 Pro ] based on [ LLaMA-v3-8B ], quantized to the GGUF Q4_K_M format:

wget https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B-GGUF/resolve/main/Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf

Create a configuration file and place it in the same directory [ see config.sample.yaml ]

id: mac
host: localhost
port: 8080
log: booster.log
deadline: 180
debug:
swap:

pods: 
  gpu:
    model: hermes
    prompt: chat
    sampling: janus
    threads: 1
    gpus: [ 100 ]
    batchsize: 512

models:
  hermes:
    name: Hermes2 Pro 8B
    path: ~/models/Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf
    contextsize: 8192
    predict: 1024

prompts:

  chat:
    locale: ru_RU
    system: "<|im_start|>system\nToday is {DATE}. You are virtual assistant. Please answer the question.<|im_end|>"
    user: "\n<|im_start|>user\n{USER}<|im_end|>"
    assistant: "\n<|im_start|>assistant\n{ASSISTANT}<|im_end|>"

samplings:

  janus:
    janus: 1
    depth: 200
    scale: 0.97
    hi: 0.99
    lo: 0.96

  mirostat:
    mirostat: 0
    mirostatent: 3.0
    mirostatlr: 0.1

  basic:
    temperature: 0.8
    top_k: 8
    topp: 0.9
    typicalp: 1.0
    repetition_penalty: 1.1
    penaltylastn: 200
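
The `prompts` section above uses the placeholders `{DATE}`, `{USER}` and `{ASSISTANT}`. The Go sketch below shows one plausible way such templates could be expanded into a single prompt string. It is not Booster's actual code: the helper `renderPrompt`, the date format, and the choice to cut the assistant template right before `{ASSISTANT}` (so the model continues from there) are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Templates copied verbatim from the sample config above.
const (
	systemTmpl    = "<|im_start|>system\nToday is {DATE}. You are virtual assistant. Please answer the question.<|im_end|>"
	userTmpl      = "\n<|im_start|>user\n{USER}<|im_end|>"
	assistantTmpl = "\n<|im_start|>assistant\n{ASSISTANT}<|im_end|>"
)

// renderPrompt is a hypothetical helper that fills the placeholders
// and returns the text a model would be asked to continue.
func renderPrompt(date, question string) string {
	system := strings.ReplaceAll(systemTmpl, "{DATE}", date)
	user := strings.ReplaceAll(userTmpl, "{USER}", question)

	// Assumption: keep only the part before {ASSISTANT}, so the prompt
	// ends where the assistant's answer is expected to begin.
	assistant, _, _ := strings.Cut(assistantTmpl, "{ASSISTANT}")

	return system + user + assistant
}

func main() {
	fmt.Println(renderPrompt("2024-06-01", "Who are you?"))
}
```

Printing the result is a quick way to check that the special tokens and newlines in a template line up with the chat format the chosen model was trained on.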

When all is done, start the server with debug enabled to make sure it is working

./booster --server --debug

Now POST JSON with a unique ID and your question to localhost:8080/jobs

{
    "id": "5fb8ebd0-e0c9-4759-8f7d-35590f6c9fc6",
    "prompt": "Who are you?"
}
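
If you prefer to send the request from code, a minimal Go client could look like the sketch below. The endpoint and the two JSON fields come from the example above; everything else is an assumption: the `job` type and `jobPayload` helper are hypothetical names, the 180 second timeout simply mirrors the `deadline` value from the sample config (assuming it is in seconds), and since the response format is not described here the reply is printed as raw text.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// job mirrors the two fields shown in the JSON example above.
type job struct {
	ID     string `json:"id"`
	Prompt string `json:"prompt"`
}

// jobPayload encodes a job as the JSON body for POST /jobs.
func jobPayload(id, prompt string) ([]byte, error) {
	return json.Marshal(job{ID: id, Prompt: prompt})
}

func main() {
	body, err := jobPayload("5fb8ebd0-e0c9-4759-8f7d-35590f6c9fc6", "Who are you?")
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}

	// Assumption: a generous timeout, since generation can take a while.
	client := &http.Client{Timeout: 180 * time.Second}

	resp, err := client.Post("http://localhost:8080/jobs", "application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Println("request failed (is the server running?):", err)
		return
	}
	defer resp.Body.Close()

	raw, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("read error:", err)
		return
	}

	// The response schema is not documented here, so print it as-is.
	fmt.Println(resp.Status)
	fmt.Println(string(raw))
}
```

Use a fresh ID for every job, as the text above asks for a unique one per request.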

See the instructions within the booster.service file on how to create a daemon service out of this API server.
