Some Ollama models apparently affected by llama.cpp BPE pretokenization issue #4126
Comments
Any update on this?
I found this post because I'm getting the same message and trying to find ways to deal with it: llm_load_vocab: missing pre-tokenizer type, using: 'default'
Will this llama.cpp merge, ggerganov/llama.cpp#6965, fix this issue?
Curious about this as well. Hopeful the updated llama.cpp will be merged and the models updated.
I'm having the same issue: llm_load_vocab: missing pre-tokenizer type, using: 'default'. Does anyone know what can be done about it, or can someone explain the issue to a "newbie" in Ollama / AI?
Seeing the same message. Running llama3:70b-instruct.
Same here, llama3:8b.
The same, using a derivative of llama3: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: missing pre-tokenizer type, using: 'default'
Coming from here: https://www.reddit.com/r/LocalLLaMA/comments/1cg0z1i/comment/l1su102/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button. That led me to put the attached patch in as "llm/patches/06-llama.cpp.diff" and then build Ollama (trying to pass the override-kv through from llm/ext_server/server.cpp was a bit tedious, since that override would be of type str, which is not handled in the linked version of llama.cpp [although upstream has a fix for it]). EDIT: Just saw that "llm/patches/05-default-pretokenizer.diff" in v0.1.39 does pretty much that (and more).
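For reference, the command-line equivalent of that override (on a llama.cpp build new enough to accept string overrides) would look roughly like --override-kv tokenizer.ggml.pre=str:llama-bpe, where the llama-bpe value assumes a Llama 3 based model; other architectures use other pre-tokenizer names.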
The crux of the matter is: all models have to be re-converted and then re-quantized. You can dive into the issues/PRs I initially posted to learn more, but that's the super-short version. Until the underlying models are re-converted upstream, you can convert and import one yourself. Follow the instructions here to learn how to import to Ollama from other formats (including those available on Hugging Face). Run:

> ollama show --modelfile gemma:instruct    # <modelname> can be any model in your library/cache

which prints something like:
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this, replace FROM with:
# FROM gemma:instruct
FROM /Users/andrew/.ollama/models/blobs/sha256-ef311de6af9db043d51ca4b1e766c28e0a1ac41d60420fed5e001dc470c64b77
TEMPLATE "<start_of_turn>user
{{ if .System }}{{ .System }} {{ end }}{{ .Prompt }}<end_of_turn>
<start_of_turn>model
{{ .Response }}<end_of_turn>
"
PARAMETER penalize_newline false
PARAMETER repeat_penalty 1
PARAMETER stop <start_of_turn>
PARAMETER stop <end_of_turn>
LICENSE """Gemma Terms of Use
Last modified: February 21, 2024
By using, reproducing, modifying, distributing, performing or displaying any portion or element of Gemma, Model Derivatives including via any Hosted Service, (each as defined below) (collectively, the "Gemma Services") or otherwise accepting the terms of this Agreement, you agree to be bound by this Agreement.
<license truncated for ease of reading>

Copy that entire output into your favorite text editor (e.g. nano, vim...) and save it as a new file; call it literally whatever you want. I have a folder in my home directory that's just random modelfiles I can use to import small changes quickly (I think it's easier than having to recreate them each time). Now replace the first line of that file with the path to your converted GGUF file:

FROM /path/to/your/models/model.gguf
TEMPLATE "<start_of_turn>user
{{ if .System }}{{ .System }} {{ end }}{{ .Prompt }}<end_of_turn>
<start_of_turn>model
{{ .Response }}<end_of_turn>
"
PARAMETER penalize_newline false
PARAMETER repeat_penalty 1
PARAMETER stop <start_of_turn>
PARAMETER stop <end_of_turn>
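From there (the quoted instructions stop just short of this step), the usual way to load the file is ollama create <whatever-name-you-want> -f /path/to/that/file, and then ollama run <whatever-name-you-want> to confirm the llm_load_vocab warning no longer appears.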
What is the issue?
See the following llama.cpp issues/PRs:
Using updated llama.cpp builds and having done a little digging under the hood on the BPE issue, this is an example of the verbose output when starting ollama serve (the llm_load_vocab: missing pre-tokenizer type, using: 'default' line quoted above). Calling python code essentially distills down to:
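A minimal sketch of what that detection does, assuming the approach in llama.cpp's convert-hf-to-gguf.py of hashing how the model's tokenizer splits a tricky test string (the test string and hashes below are illustrative placeholders, not the upstream values):

import hashlib
from transformers import AutoTokenizer

def detect_pre_tokenizer(model_dir: str) -> str:
    # A test string full of edge cases (digits, accents, emoji, odd whitespace)
    # whose tokenization differs between BPE pre-tokenizer regexes.
    chktxt = "Hello World!!  3333  éè  \U0001F600"  # placeholder, not the real upstream string

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    chkhsh = hashlib.sha256(str(tokenizer.encode(chktxt)).encode()).hexdigest()

    # Known hashes -> pre-tokenizer names written into the GGUF as tokenizer.ggml.pre.
    # (Placeholder hashes; the real table lives in convert-hf-to-gguf.py.)
    known = {
        "0ef9807a4087ebef...": "llama-bpe",
        "3dfb29bf2f3f7c88...": "deepseek-llm",
    }
    if chkhsh not in known:
        # Unrecognized tokenizer: refuse to guess, since the wrong pre-tokenizer
        # regex silently degrades generation quality.
        raise NotImplementedError(f"unknown pre-tokenizer, chkhsh={chkhsh}")
    return known[chkhsh]

Models converted before this check existed have no tokenizer.ggml.pre entry in their GGUF metadata, which is why the loader falls back to 'default' and prints the warning above.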
I think the fix will be re-converting and re-quantizing all of these models, which is what the folks in llama.cpp-world are doing now.
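Concretely, for a single model that roughly means re-running the conversion with an up-to-date llama.cpp checkout, something like python convert-hf-to-gguf.py /path/to/hf-model --outfile model-f16.gguf, then re-quantizing (e.g. ./quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M) and importing the result via a Modelfile as described in the comment above (paths and quant type here are placeholders).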
OS: macOS
GPU: Apple
CPU: Apple
Ollama version: 0.1.33