Merge branch 'Mozilla-Ocho:main' into readme-instaling-a-llamafile
mofosyne committed May 13, 2024
2 parents 6918b30 + d4099fe commit 9503aea
Showing 176 changed files with 37,155 additions and 23,609 deletions.
45 changes: 0 additions & 45 deletions .github/workflows/ci.yml

This file was deleted.

2 changes: 2 additions & 0 deletions Makefile
@@ -28,13 +28,15 @@ install: llamafile/zipalign.1 \
llama.cpp/perplexity/perplexity.1 \
llama.cpp/llava/llava-quantize.1 \
o/$(MODE)/llamafile/zipalign \
o/$(MODE)/llamafile/tokenize \
o/$(MODE)/llama.cpp/main/main \
o/$(MODE)/llama.cpp/imatrix/imatrix \
o/$(MODE)/llama.cpp/quantize/quantize \
o/$(MODE)/llama.cpp/perplexity/perplexity \
o/$(MODE)/llama.cpp/llava/llava-quantize
mkdir -p $(PREFIX)/bin
$(INSTALL) o/$(MODE)/llamafile/zipalign $(PREFIX)/bin/zipalign
$(INSTALL) o/$(MODE)/llamafile/tokenize $(PREFIX)/bin/llamafile-tokenize
$(INSTALL) o/$(MODE)/llama.cpp/main/main $(PREFIX)/bin/llamafile
$(INSTALL) o/$(MODE)/llama.cpp/imatrix/imatrix $(PREFIX)/bin/llamafile-imatrix
$(INSTALL) o/$(MODE)/llama.cpp/quantize/quantize $(PREFIX)/bin/llamafile-quantize
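This hunk adds the new `tokenize` tool to the install set, where it is installed as `llamafile-tokenize` alongside the other renamed binaries. As a minimal sketch of exercising that install target from a source checkout (the usual documented build flow; `PREFIX` defaults to `/usr/local` per `build/config.mk` further down):

```sh
# Build the tree, then install the tools (now including llamafile-tokenize)
make -j8
sudo make install PREFIX=/usr/local
```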
38 changes: 20 additions & 18 deletions README.md
@@ -42,7 +42,7 @@ chmod +x llava-v1.5-7b-q4.llamafile
5. Run the llamafile. e.g.:

```sh
./llava-v1.5-7b-q4.llamafile -ngl 9999
./llava-v1.5-7b-q4.llamafile
```

6. Your browser should open automatically and display a chat interface.
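If the tab does not open on its own, the web UI is reachable at the server's default address (port 8080, the same one the curl example below targets); a small sketch:

```sh
# Open the chat UI manually if the browser didn't launch
open http://localhost:8080        # macOS
xdg-open http://localhost:8080    # most Linux desktops
```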
@@ -88,7 +88,7 @@ README](llama.cpp/server/README.md#api-endpoints).
<summary>Curl API Client Example</summary>

The simplest way to get started using the API is to copy and paste the
following curl command into your terminal.

```shell
curl http://localhost:8080/v1/chat/completions \
```

@@ -185,33 +185,35 @@ ChatCompletionMessage(content='There once was a programmer named Mike\nWho wrote
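The collapsed hunks above truncate both API client examples (the curl request and the Python client's ChatCompletionMessage output). Purely as a hedged sketch, not the README's verbatim payload, this is the shape of an OpenAI-compatible request the server accepts on that endpoint:

```sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "LLaMA_CPP",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Write a limerick about llamas."}
        ]
      }'
```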
We also provide example llamafiles for other models, so you can easily
try out llamafile with different kinds of LLMs.

| Model | Size | License | llamafile |
| --- | --- | --- | --- |
| LLaVA 1.5 | 3.97 GB | [LLaMA 2](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) | [llava-v1.5-7b-q4.llamafile](https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-q4.llamafile?download=true) |
| Mistral-7B-Instruct | 5.15 GB | [Apache 2.0](https://choosealicense.com/licenses/apache-2.0/) | [mistral-7b-instruct-v0.2.Q5\_K\_M.llamafile](https://huggingface.co/jartine/Mistral-7B-Instruct-v0.2-llamafile/resolve/main/mistral-7b-instruct-v0.2.Q5_K_M.llamafile?download=true) |
| Mixtral-8x7B-Instruct | 30.03 GB | [Apache 2.0](https://choosealicense.com/licenses/apache-2.0/) | [mixtral-8x7b-instruct-v0.1.Q5\_K\_M.llamafile](https://huggingface.co/jartine/Mixtral-8x7B-Instruct-v0.1-llamafile/resolve/main/mixtral-8x7b-instruct-v0.1.Q5_K_M.llamafile?download=true) |
| WizardCoder-Python-34B | 22.23 GB | [LLaMA 2](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) | [wizardcoder-python-34b-v1.0.Q5\_K\_M.llamafile](https://huggingface.co/jartine/WizardCoder-Python-34B-V1.0-llamafile/resolve/main/wizardcoder-python-34b-v1.0.Q5_K_M.llamafile?download=true) |
| WizardCoder-Python-13B | 7.33 GB | [LLaMA 2](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) | [wizardcoder-python-13b.llamafile](https://huggingface.co/jartine/wizardcoder-13b-python/resolve/main/wizardcoder-python-13b.llamafile?download=true) |
| TinyLlama-1.1B | 0.76 GB | [Apache 2.0](https://choosealicense.com/licenses/apache-2.0/) | [TinyLlama-1.1B-Chat-v1.0.Q5\_K\_M.llamafile](https://huggingface.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile?download=true) |
| Rocket-3B | 1.89 GB | [cc-by-sa-4.0](https://creativecommons.org/licenses/by-sa/4.0/deed.en) | [rocket-3b.Q5\_K\_M.llamafile](https://huggingface.co/jartine/rocket-3B-llamafile/resolve/main/rocket-3b.Q5_K_M.llamafile?download=true) |
| Phi-2 | 1.96 GB | [MIT](https://huggingface.co/microsoft/phi-2/resolve/main/LICENSE) | [phi-2.Q5\_K\_M.llamafile](https://huggingface.co/jartine/phi-2-llamafile/resolve/main/phi-2.Q5_K_M.llamafile?download=true) |

| Model | Size | License | llamafile | other quants |
| --- | --- | --- | --- | --- |
| LLaVA 1.5 | 3.97 GB | [LLaMA 2](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) | [llava-v1.5-7b-q4.llamafile](https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-q4.llamafile?download=true) | [See HF repo](https://huggingface.co/jartine/llava-v1.5-7B-GGUF) |
| TinyLlama-1.1B | 2.05 GB | [Apache 2.0](https://choosealicense.com/licenses/apache-2.0/) | [TinyLlama-1.1B-Chat-v1.0.F16.llamafile](https://huggingface.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0.F16.llamafile?download=true) | [See HF repo](https://huggingface.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF) |
| Mistral-7B-Instruct | 3.85 GB | [Apache 2.0](https://choosealicense.com/licenses/apache-2.0/) | [mistral-7b-instruct-v0.2.Q4\_0.llamafile](https://huggingface.co/jartine/Mistral-7B-Instruct-v0.2-llamafile/resolve/main/mistral-7b-instruct-v0.2.Q4_0.llamafile?download=true) | [See HF repo](https://huggingface.co/jartine/Mistral-7B-Instruct-v0.2-llamafile) |
| Phi-3-mini-4k-instruct | 7.67 GB | [Apache 2.0](https://huggingface.co/jartine/Phi-3-mini-4k-instruct-llamafile/blob/main/LICENSE) | [Phi-3-mini-4k-instruct.F16.llamafile](https://huggingface.co/jartine/Phi-3-mini-4k-instruct-llamafile/resolve/main/Phi-3-mini-4k-instruct.F16.llamafile?download=true) | [See HF repo](https://huggingface.co/jartine/Phi-3-mini-4k-instruct-llamafile) |
| Mixtral-8x7B-Instruct | 30.03 GB | [Apache 2.0](https://choosealicense.com/licenses/apache-2.0/) | [mixtral-8x7b-instruct-v0.1.Q5\_K\_M.llamafile](https://huggingface.co/jartine/Mixtral-8x7B-Instruct-v0.1-llamafile/resolve/main/mixtral-8x7b-instruct-v0.1.Q5_K_M.llamafile?download=true) | [See HF repo](https://huggingface.co/jartine/Mixtral-8x7B-Instruct-v0.1-llamafile) |
| WizardCoder-Python-34B | 22.23 GB | [LLaMA 2](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) | [wizardcoder-python-34b-v1.0.Q5\_K\_M.llamafile](https://huggingface.co/jartine/WizardCoder-Python-34B-V1.0-llamafile/resolve/main/wizardcoder-python-34b-v1.0.Q5_K_M.llamafile?download=true) | [See HF repo](https://huggingface.co/jartine/WizardCoder-Python-34B-V1.0-llamafile) |
| WizardCoder-Python-13B | 7.33 GB | [LLaMA 2](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) | [wizardcoder-python-13b.llamafile](https://huggingface.co/jartine/wizardcoder-13b-python/resolve/main/wizardcoder-python-13b.llamafile?download=true) | [See HF repo](https://huggingface.co/jartine/wizardcoder-13b-python) |
| LLaMA-3-Instruct-70B | 37.25 GB | [llama3](https://huggingface.co/jartine/Meta-Llama-3-8B-Instruct-llamafile/blob/main/Meta-Llama-3-Community-License-Agreement.txt) | [Meta-Llama-3-70B-Instruct.Q4\_0.llamafile](https://huggingface.co/jartine/Meta-Llama-3-70B-Instruct-llamafile/resolve/main/Meta-Llama-3-70B-Instruct.Q4_0.llamafile?download=true) | [See HF repo](https://huggingface.co/jartine/Meta-Llama-3-70B-Instruct-llamafile) |
| LLaMA-3-Instruct-8B | 5.37 GB | [llama3](https://huggingface.co/jartine/Meta-Llama-3-8B-Instruct-llamafile/blob/main/Meta-Llama-3-Community-License-Agreement.txt) | [Meta-Llama-3-8B-Instruct.Q5\_K\_M.llamafile](https://huggingface.co/jartine/Meta-Llama-3-8B-Instruct-llamafile/resolve/main/Meta-Llama-3-8B-Instruct.Q5_K_M.llamafile?download=true) | [See HF repo](https://huggingface.co/jartine/Meta-Llama-3-8B-Instruct-llamafile) |
| Rocket-3B | 1.89 GB | [cc-by-sa-4.0](https://creativecommons.org/licenses/by-sa/4.0/deed.en) | [rocket-3b.Q5\_K\_M.llamafile](https://huggingface.co/jartine/rocket-3B-llamafile/resolve/main/rocket-3b.Q5_K_M.llamafile?download=true) | [See HF repo](https://huggingface.co/jartine/rocket-3B-llamafile) |

Here is an example for the Mistral command-line llamafile:

```sh
./mistral-7b-instruct-v0.2.Q5_K_M.llamafile -ngl 9999 --temp 0.7 -p '[INST]Write a story about llamas[/INST]'
./mistral-7b-instruct-v0.2.Q5_K_M.llamafile --temp 0.7 -p '[INST]Write a story about llamas[/INST]'
```

And here is an example for WizardCoder-Python command-line llamafile:

```sh
./wizardcoder-python-13b.llamafile -ngl 9999 --temp 0 -e -r '```\n' -p '```c\nvoid *memcpy_sse2(char *dst, const char *src, size_t size) {\n'
./wizardcoder-python-13b.llamafile --temp 0 -e -r '```\n' -p '```c\nvoid *memcpy_sse2(char *dst, const char *src, size_t size) {\n'
```

And here's an example for the LLaVA command-line llamafile:

```sh
./llava-v1.5-7b-q4.llamafile -ngl 9999 --temp 0.2 --image lemurs.jpg -e -p '### User: What do you see?\n### Assistant:'
./llava-v1.5-7b-q4.llamafile --temp 0.2 --image lemurs.jpg -e -p '### User: What do you see?\n### Assistant:'
```
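In each paired command above, the before/after difference is just the removal of the explicit `-ngl 9999` GPU-offload flag. If you still want to request full GPU offload, a sketch reusing the same flag from the removed lines:

```sh
# -ngl (--n-gpu-layers) caps how many layers go to the GPU; 9999 means "as many as fit"
./mistral-7b-instruct-v0.2.Q5_K_M.llamafile -ngl 9999 --temp 0.7 \
  -p '[INST]Write a story about llamas[/INST]'
```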

As before, macOS, Linux, and BSD users will need to use the "chmod"
@@ -281,7 +283,7 @@ For Windows users, here's an example for the Mistral LLM:
```sh
curl -L -o llamafile.exe https://github.com/Mozilla-Ocho/llamafile/releases/download/0.6/llamafile-0.6
curl -L -o mistral.gguf https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf
./llamafile.exe -m mistral.gguf -ngl 9999
./llamafile.exe -m mistral.gguf
```

Windows users may need to change `./llamafile.exe` to `.\llamafile.exe`
@@ -438,7 +440,7 @@ llama.cpp command line interface, utilizing WizardCoder-Python-13B
weights:

```sh
llamafile -ngl 9999 \
llamafile \
-m wizardcoder-python-13b-v1.0.Q8_0.gguf \
--temp 0 -r '}\n' -r '```\n' \
-e -p '```c\nvoid *memcpy(void *dst, const void *src, size_t size) {\n'
```

@@ -589,7 +591,7 @@ that describes the changes, and mention it in your Hugging Face commit.

## Documentation

There's a man page for each of the llamafile programs installed when you
There's a manual page for each of the llamafile programs installed when you
run `sudo make install`. The command manuals are also typeset as PDF
files that you can download from our GitHub releases page. Lastly, most
commands will display that information when passing the `--help` flag.
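For example, a quick sketch of pulling those docs up after `sudo make install` (assuming the pages are installed under the same names as the binaries):

```sh
man llamafile        # manual for the main program
man zipalign         # or any of the other installed tools
llamafile --help     # the same information straight from the command
```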
14 changes: 8 additions & 6 deletions build/config.mk
@@ -2,7 +2,7 @@
#── vi: set noet ft=make ts=8 sw=8 fenc=utf-8 :vi ────────────────────┘

PREFIX = /usr/local
COSMOCC = .cosmocc/3.3.3
COSMOCC = .cosmocc/3.3.6
TOOLCHAIN = $(COSMOCC)/bin/cosmo

AR = $(TOOLCHAIN)ar
@@ -13,9 +13,9 @@ MKDEPS = $(COSMOCC)/bin/mkdeps
INSTALL = install

ARFLAGS = rcsD
CCFLAGS = -g -O3 -fexceptions
CPPFLAGS_ = -iquote. -mcosmo -DGGML_MULTIPLATFORM -Wno-attributes
TARGET_ARCH = -Xx86_64-mavx -Xx86_64-mtune=alderlake
CCFLAGS = -g -O3 -fexceptions -fsignaling-nans
CPPFLAGS_ = -iquote. -mcosmo -DGGML_MULTIPLATFORM -Wno-attributes -DLLAMAFILE_DEBUG
TARGET_ARCH = -Xx86_64-mavx -Xx86_64-mtune=znver4

TMPDIR = o//tmp
IGNORE := $(shell mkdir -p $(TMPDIR))
@@ -50,5 +50,7 @@ clean:; rm -rf o
.PHONY: distclean
distclean:; rm -rf o .cosmocc

.cosmocc/3.3.3:
build/download-cosmocc.sh $@ 3.3.3 e4d0fa63cd79cc3bfff6c2d015f1776db081409907625aea8ad40cefc1996d08
.cosmocc/3.3.6:
build/download-cosmocc.sh $@ 3.3.6 26e3449357f31b82489774ef5c2d502a711bb711d4faf99a5fd6c96328a1c205
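The pinned toolchain moves from cosmocc 3.3.3 to 3.3.6, and `build/download-cosmocc.sh` checks the new SHA-256 before unpacking it. A rough manual equivalent, assuming the script's usual cosmo.zip mirror and zip layout (the script itself stays the authoritative path):

```sh
# Hand-rolled sketch of what download-cosmocc.sh automates (mirror URL is an assumption)
curl -fLo cosmocc-3.3.6.zip https://cosmo.zip/pub/cosmocc/cosmocc-3.3.6.zip
echo "26e3449357f31b82489774ef5c2d502a711bb711d4faf99a5fd6c96328a1c205  cosmocc-3.3.6.zip" | sha256sum -c
mkdir -p .cosmocc/3.3.6 && unzip -q cosmocc-3.3.6.zip -d .cosmocc/3.3.6
```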


20 changes: 19 additions & 1 deletion llama.cpp/BUILD.mk
@@ -25,7 +25,11 @@ include llama.cpp/imatrix/BUILD.mk
include llama.cpp/quantize/BUILD.mk
include llama.cpp/perplexity/BUILD.mk

$(LLAMA_CPP_OBJS): private CCFLAGS += -DGGML_MULTIPLATFORM
$(LLAMA_CPP_OBJS): private \
CCFLAGS += \
-DNDEBUG \
-DGGML_MULTIPLATFORM \
-DGGML_USE_LLAMAFILE

o/$(MODE)/llama.cpp/ggml-alloc.o \
o/$(MODE)/llama.cpp/ggml-backend.o \
@@ -39,6 +43,20 @@ o/$(MODE)/llama.cpp/ggml-alloc.o \
o/$(MODE)/llama.cpp/common.o: private \
CCFLAGS += -Os

o/$(MODE)/llama.cpp/ggml-quants.o: private CXXFLAGS += -Os
o/$(MODE)/llama.cpp/ggml-quants-amd-avx.o: private TARGET_ARCH += -Xx86_64-mtune=sandybridge
o/$(MODE)/llama.cpp/ggml-quants-amd-avx2.o: private TARGET_ARCH += -Xx86_64-mtune=skylake -Xx86_64-mf16c -Xx86_64-mfma -Xx86_64-mavx2
o/$(MODE)/llama.cpp/ggml-quants-amd-avx512.o: private TARGET_ARCH += -Xx86_64-mtune=cannonlake -Xx86_64-mf16c -Xx86_64-mfma -Xx86_64-mavx2 -Xx86_64-mavx512f

o/$(MODE)/llama.cpp/ggml-vector.o: private CXXFLAGS += -Os
o/$(MODE)/llama.cpp/ggml-vector-amd-avx.o: private TARGET_ARCH += -Xx86_64-mtune=sandybridge
o/$(MODE)/llama.cpp/ggml-vector-amd-fma.o: private TARGET_ARCH += -Xx86_64-mtune=bdver2 -Xx86_64-mfma
o/$(MODE)/llama.cpp/ggml-vector-amd-f16c.o: private TARGET_ARCH += -Xx86_64-mtune=ivybridge -Xx86_64-mf16c
o/$(MODE)/llama.cpp/ggml-vector-amd-avx2.o: private TARGET_ARCH += -Xx86_64-mtune=skylake -Xx86_64-mf16c -Xx86_64-mfma -Xx86_64-mavx2
o/$(MODE)/llama.cpp/ggml-vector-amd-avx512.o: private TARGET_ARCH += -Xx86_64-mtune=cannonlake -Xx86_64-mf16c -Xx86_64-mfma -Xx86_64-mavx2 -Xx86_64-mavx512f
o/$(MODE)/llama.cpp/ggml-vector-amd-avx512bf16.o: private TARGET_ARCH += -Xx86_64-mtune=znver4 -Xx86_64-mf16c -Xx86_64-mfma -Xx86_64-mavx2 -Xx86_64-mavx512f -Xx86_64-mavx512vl -Xx86_64-mavx512bf16
o/$(MODE)/llama.cpp/ggml-vector-arm82.o: private TARGET_ARCH += -Xaarch64-march=armv8.2-a+fp16

$(LLAMA_CPP_OBJS): llama.cpp/BUILD.mk

.PHONY: o/$(MODE)/llama.cpp
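These per-object `TARGET_ARCH` rules compile the same quant and vector kernels several times with different x86-64 (and ARMv8.2) feature flags so the matching variant can be dispatched at run time. A sketch of building a couple of those objects directly, with paths taken from the rules above and `MODE` left empty as in a default build (hence the `o//` prefix):

```sh
make -j8 \
  o//llama.cpp/ggml-quants-amd-avx2.o \
  o//llama.cpp/ggml-vector-amd-avx512.o
```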
8 changes: 4 additions & 4 deletions llama.cpp/README.llamafile
@@ -9,23 +9,23 @@ LICENSE
ORIGIN

https://github.com/ggerganov/llama.cpp/pull/4406/
fa046eafbc70bf97dcf39843af0323f19a8c9ac3
2024-03-22
c780e75305dba1f67691a8dc0e8bc8425838a452
2024-05-07

LOCAL MODIFICATIONS

- Count the number of cores correctly on Intel's Alderlake architecture
- Remove MAP_POPULATE because it makes mmap(tinyllama) block for 100ms
- Refactor ggml.c, llama.cpp, and llava to use llamafile_open() APIs
- Unify main, server, and llava-cli into single llamafile program
- Make cuBLAS / hipBLAS optional by introducing tinyBLAS library
- Add support to main() programs for Cosmo /zip/.args files
- Introduce pledge() SECCOMP sandboxing to improve security
- Call exit() rather than abort() when GGML_ASSERT() fails
- Clamp bf16/f32 values before passing to K quantizers
- Make GPU logger callback API safer and less generic
- Write log to /dev/null when main.log fails to open
- Use _rand64() rather than time() as default seed
- Make main and llava-cli print timings on ctrl-c
- Make embeddings CLI program shell scriptable
- Avoid bind() conflicts on port 8080 w/ server
- Use runtime dispatching for matmul quants
- Remove operating system #ifdef statements
