llava-cli outputs gibberish #6944

Closed

Any-Winter-4079 opened this issue Apr 27, 2024 · 9 comments

Comments
@Any-Winter-4079 commented Apr 27, 2024

I am using an M1, on commit 928e0b7.

When I run

./llava-cli -m ./models/llava-v1.6-mistral-7b/ggml-mistral-7b-q_5_k.gguf --mmproj ./models/llava-v1.6-mistral-7b/mmproj-mistral7b-f16-q6_k.gguf -p 'Describe the image.' --image ./models/llava-v1.6-mistral-7b/S.jpg -c 4096

I get:

to to a,, new do s is and d t, in in is, r is to and,, h to a and is is t.. for s has m d,., a. is,, m he un, and b st a to to
#. r. d t n is the to.' with, l the to., is is and t r and a t in, re h s d a is, is l in in as as is r in un,., to t t h a.. t on as a, st' > > the to a, t r is a the, d # is and the p r as t is, as, h has to do the in. in as c m. a is l the is to n st has on on r t, s' is new to, t a is, and/ he in g and/ is, is re, with at c is in' d,. at p c and/ is is.. n is h on a the and/,. is,- b the the on. at is is h do. and/ re, r in s, on for to to p b n,.. a as

I also tried xtuner/llava-phi-3-mini, with similar results.

./llava-cli -m ./models/llava-phi-3-mini/ggml-model-f16.gguf --mmproj ./models/llava-phi-3-mini/mmproj-model-f16.gguf -p 'Describe the image.' --image ./models/llava-v1.6-mistral-7b/S.jpg -c 4096

â } on }RESS, and thus,uminate, whereas, } }raint, but, }, }ioned, while, } }RESS, especially and hence, }ktion, }RESS,RESS,abeth,uminate,オ,RESS, }ività,RESS,ício,raint, }derr, }derr, and thus,RESS, }ución, }abeth, } }rache,RESS,abeth, }RESS, }RESS, }ionali,RESS, } }RESS, }eign, and thus,RESS, }abeth,RESS,ício,ioned,abeth,ício,derr,iry,RESS, }RESS, }RESS,otal,umm,RESS,RESS,オ, }RESS,annels, }iseconds, } } } }abeth,bose,abeth,ershell, }ktion, }RESS, }uminate,RESS, }idor, }RESS,ktion,abeth,ício,RESS,オ, }esses, perhapsRESS,abeth, } }abeth,RESS,RESS,RESS,uminate,derr,derr,bose,ività, makunivers, } }ício, }ioned,otal,ershell,abeth,RESS,RESS,RESS, } }`RESS,RESS,RESS,RESS,abeth,uminate,

Have there been any breaking changes to the llava-cli command? Can someone reproduce this issue as well?

@Any-Winter-4079 (Author)

This commit does work: 8a56075

The image presents a simple yet striking visual. Dominating the frame is a single, vertical line, painted in a vibrant shade of green. The line is slightly curved, adding a sense of dynamism to the composition. It stands out starkly against a solid blue background, creating a strong contrast that draws the viewer's attention. The line is positioned on the right side of the image, leaving the left side untouched and open. The image is devoid of any text or additional elements, making the green line the sole focus of this composition.

This one (the next commit) makes it output gibberish: f4dea7d

, whereas,RESS, and thus, }, } }RESS, but, }, } }RESS, and thus,RESS,RESS, } }RESS, and thus, } }RESS, } }RESS, } }RESS, } } } }RESS, and thus,RESS, } }RESS, } } }RESS,RESS, } }RESS, } } }RESS, }RESS,RESS,RESS, } }RESS,RESS, }RESS,RESS, } } }RESS,RESS, } }RESS,RESS,RESS,RESS,RESS,RESS, } } }RESS, } }RESS, } }RESS, } } } }RESS,RESS, } } }RESS,RESS,RESS,RESS, }RESS,RESS,RESS,RESS, } } } }RESS,RESS,RESS,RESS,RESS,RESS, } }RESS,RESS,RESS, }RESS,RESS,RESS,RESS, }RESS,RESS,RESS, } } }RESS,RESS, } } } } } } }RESS,RESS,RESS, }RESS, }RESS,RESS, }RESS,RESS,RESS, } } } } } } }RESS,RESS,RESS,RESS,RESS,RESS,RESS,

@Any-Winter-4079 (Author)

Alright, so for my use case I'll stick to commit 8a56075. If anyone can't reproduce this and needs more information about my Python version, packages, etc., I'll share it.

Hope this helps someone!

@turian commented May 7, 2024

Possibly related to the tokenization issues discussed in #7056 #7062 #7049 #7006

@ggerganov (Owner)

I can't reproduce. This works on latest master with M2 Ultra (both CPU and GPU):

make -j && ./llava-cli -m ./models/llava-7b-v1.6/ggml-model-f16.gguf --mmproj ./models/llava-7b-v1.6/mmproj-model-f16.gguf --image ~/Downloads/cat.png -p "What is the animal doing?" --temp 0.0 -ngl 99

@turian I don't think this is related to the tokenization changes, since LLaMA 1 and 2, Mistral and Phi-3 all use what we call the SPM tokenizer, i.e. no pre-tokenization is done. The tokenization changes only affect models using the BPE tokenizer.
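A quick way to check which tokenizer family a given GGUF uses is to read its tokenizer.ggml.model metadata key ("llama" corresponds to SPM, "gpt2" to BPE). A minimal sketch with the gguf Python package; the field-access pattern here is an assumption and may vary between gguf versions:

from gguf import GGUFReader

reader = GGUFReader("./models/ggml-vocab-tinyllama.gguf")  # any GGUF file
field = reader.fields["tokenizer.ggml.model"]
# the string bytes live in the part indexed by field.data[0]
print(bytes(field.parts[field.data[0]]).decode("utf-8"))   # "llama" -> SPM, "gpt2" -> BPE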

@turian commented May 7, 2024

@ggerganov Thank you for the explanation, but perhaps I need to open another bug report?

I have a Colab notebook using TinyLlama-1.1b-1 (SPM, according to llama.cpp output) showing that the llama-cpp-python tokenizer gives different output from the HF tokenizer. I am using TheBloke GGUFs, which are quite old.

HF tokens + HF model = perplexity ~6
llama tokens + llama model = perplexity ~15
HF tokens + llama model = perplexity ~6
./perplexity.cpp = perplexity ~15

So I'm still seeing that, with an old GGUF SPM model, the llama.cpp tokenization is different and bad.

Huggingface tokens: [1, 259, 13, 353, 4755, 350, 5059, 357, 353, 29871, 13, 29871, 13, 4755, 350, 5059, 357, 338, 385, 4223, 2706, 1919, 11456, 322, 24520, 11339, 869, 940, 750, 263, 17838, 732, 29899, 29992, 380, 23693, 6297, 373, 278, 11456, 3652, 450, 6682, 297, 29871, 29906, 29900, 29900, 29900, 869, 910, 471, 5643, 491, 263, 380, 23693, 6297, 297, 278, 1708, 2439, 787, 3971, 491, 11254, 27265, 575, 1919, 607, 471, 8560, 297, 29871, 29906, 29900, 29900, 29896, 472, 278, 7021, 9245, 15521, 869, 940, 750, 263, 17838, 6297, 297, 278, 11456, 3652, 26817, 2259, 897, 287, 297, 29871, 29906, 29900, 29900, 29906, 869, 512, 29871, 29906, 29900, 29900, 29946, 350, 5059, 357, 2982, 287, 263, 6297, 408, 376, 28050, 376, 297, 278, 12720, 376, 22040, 4518, 525, 29879, 13740, 376, 310, 278, 11456, 3652, 450, 6242, 383, 3568, 2056, 540, 5810, 1127, 19963, 29701, 4485, 3767, 549, 322, 360, 20400, 10968, 29875, 869, 940, 471, 4320, 297, 278, 29871, 29906, 29900, 29900, 29945, 24520, 1391, 1953, 310, 278, 14920, 390, 333, 2330, 1708, 29389, 29891, 7509, 1919, 607, 471, 8560, 472, 278, 4942, 398, 15521, 297, 349, 368, 21026, 322, 278, 7567, 631, 678, 542, 23167, 27561, 297, 4517, 869, 940, 471, 10624, 491, 2259, 323, 2593, 1384, 322, 5810, 1127, 19963, 4111, 806, 728, 1450, 1919, 1383, 1662, 796, 19924, 1919, 10686, 13272, 1919, 383, 3417, 261, 15846, 690, 1919, 19122, 347, 624, 11960, 322, 13298, 293, 6573, 869, 29871, 13, 512, 29871, 29906, 29900, 29900, 29953, 1919, 350, 5059, 357, 5810, 1127, 19963, 806, 728, 1450, 297, 278, 1708, 21353, 466, 575, 4034, 3971, 491, 4485, 390, 3496, 29131, 869, 940, 7470, 373, 263, 29871, 29906, 29900, 29900, 29953, 12720, 310, 278, 11456, 3652, 1919, 1938, 14359, 1919, 5643, 491, 263, 6297, 297, 278, 29871, 29906, 29900, 29900, 29955, 24520, 5802, 310, 1128, 304, 10837, 344, 10624, 491, 5875, 347, 390, 473, 446, 869, 1128, 304, 10837, 344, 471, 8560, 472, 24715, 15521, 297, 278, 4517, 6780, 820, 310, 26356, 414, 29885, 389, 322, 23004, 3391, 869, 350, 5059, 357, 5810, 1127, 297, 1023, 12298, 297, 29871, 29906, 29900, 29900, 29947, 1919, 8373, 4366, 6417, 495, 29891, 491, 2706, 28107, 3681, 951, 609, 29875, 1919, 322, 3872, 1989, 349, 3322, 10624, 491, 7137, 368, 6054, 18712, 869, 512, 2610, 29871, 29906, 29900, 29900, 29947, 1919, 350, 5059, 357, 1754, 263, 17838, 10097, 373, 263, 1023, 732, 29899, 29992, 760, 12720, 15232, 310, 278, 11456, 3652, 399, 5086, 278, 16992, 1919, 5643, 491, 385, 10097, 373, 278, 11456, 3652, 6298, 24759, 943, 297, 3979, 29871, 29906, 29900, 29900, 29947, 869, 940, 750, 263, 1162, 1038, 292, 6297, 297, 3006, 23238, 310, 278, 11456, 3652, 6960, 950, 1017, 297, 29871, 29906, 29900, 29896, 29900, 1919, 408, 376, 476, 10243, 383, 1026, 4630, 376, 869, 350, 5059, 357, 5810, 1127, 297, 278, 29871, 29906, 29900, 29896, 29896, 2706, 4702, 10278, 4314, 10624, 491, 3681, 951, 609, 29875, 869, 29871, 13, 29871, 13, 353, 353, 15825, 353, 353, 29871, 13, 29871, 13, 29871, 13, 353, 353, 353, 29871, 29906, 29900, 29900, 29900, 785]
Llama  GGUF tokens: [1, 259, 13, 353, 4755, 12476, 1896, 261, 353, 29871, 13, 29871, 13, 4755, 12476, 1896, 261, 338, 385, 4223, 2706, 1919, 11456, 322, 24520, 11339, 869, 940, 750, 263, 17838, 732, 29899, 29992, 5810, 5393, 6297, 373, 278, 11456, 3652, 450, 6682, 297, 29871, 29906, 29900, 29900, 29900, 869, 910, 471, 5643, 491, 263, 5810, 5393, 6297, 297, 278, 1708, 22167, 1983, 3971, 491, 11254, 14317, 29879, 1919, 607, 471, 8560, 297, 29871, 29906, 29900, 29900, 29896, 472, 278, 7021, 9245, 15521, 869, 940, 750, 263, 17838, 6297, 297, 278, 11456, 3652, 26817, 2259, 897, 287, 297, 29871, 29906, 29900, 29900, 29906, 869, 512, 29871, 29906, 29900, 29900, 29946, 12476, 1896, 261, 2982, 287, 263, 6297, 408, 376, 28050, 376, 297, 278, 12720, 376, 22040, 4518, 525, 29879, 13740, 376, 310, 278, 11456, 3652, 450, 6242, 14152, 29885, 2056, 540, 5810, 1127, 19963, 29701, 4485, 3767, 549, 322, 2452, 1416, 10968, 29875, 869, 940, 471, 4320, 297, 278, 29871, 29906, 29900, 29900, 29945, 24520, 5802, 29879, 310, 278, 14920, 21710, 11671, 1032, 1708, 29389, 29891, 7509, 1919, 607, 471, 8560, 472, 278, 16597, 29885, 15521, 297, 1858, 962, 2438, 322, 278, 7567, 631, 14542, 15519, 371, 27561, 297, 4517, 869, 940, 471, 10624, 491, 2259, 18439, 600, 1384, 322, 5810, 1127, 19963, 4111, 806, 728, 1450, 1919, 1383, 1662, 16753, 1362, 1919, 10686, 13272, 1919, 7347, 643, 15846, 690, 1919, 19122, 347, 7813, 880, 322, 13298, 293, 6573, 869, 29871, 13, 512, 29871, 29906, 29900, 29900, 29953, 1919, 12476, 1896, 261, 5810, 1127, 19963, 806, 728, 1450, 297, 278, 1708, 21353, 19642, 3527, 3971, 491, 4485, 28093, 264, 29131, 869, 940, 7470, 373, 263, 29871, 29906, 29900, 29900, 29953, 12720, 310, 278, 11456, 3652, 1919, 15460, 29879, 1919, 5643, 491, 263, 6297, 297, 278, 29871, 29906, 29900, 29900, 29955, 24520, 5802, 310, 1128, 304, 10837, 344, 10624, 491, 5875, 347, 15915, 29878, 446, 869, 1128, 304, 10837, 344, 471, 8560, 472, 24715, 15521, 297, 278, 4517, 6780, 820, 310, 26356, 414, 2415, 29882, 322, 23004, 3391, 869, 12476, 1896, 261, 5810, 1127, 297, 1023, 12298, 297, 29871, 29906, 29900, 29900, 29947, 1919, 8373, 4366, 6417, 495, 29891, 491, 2706, 28107, 3681, 10255, 2034, 1919, 322, 3872, 1989, 12129, 17608, 29882, 10624, 491, 7137, 368, 6054, 18712, 869, 512, 2610, 29871, 29906, 29900, 29900, 29947, 1919, 12476, 1896, 261, 1754, 263, 17838, 10097, 373, 263, 1023, 732, 29899, 29992, 760, 12720, 15232, 310, 278, 11456, 3652, 22552, 9292, 278, 16992, 1919, 5643, 491, 385, 10097, 373, 278, 11456, 3652, 6298, 24759, 943, 297, 3979, 29871, 29906, 29900, 29900, 29947, 869, 940, 750, 263, 1162, 1038, 292, 6297, 297, 3006, 23238, 310, 278, 11456, 3652, 6960, 950, 1017, 297, 29871, 29906, 29900, 29896, 29900, 1919, 408, 376, 16540, 1489, 29876, 13859, 14246, 2276, 376, 869, 12476, 1896, 261, 5810, 1127, 297, 278, 29871, 29906, 29900, 29896, 29896, 2706, 4702, 10278, 4314, 10624, 491, 3681, 10255, 2034, 869, 29871, 13, 29871, 13, 353, 353, 15825, 353, 353, 29871, 13, 29871, 13, 29871, 13, 353, 353, 353, 29871, 29906, 29900, 29900, 29900, 785, 29871, 29906]

Is this expected and why?
I can share a minimal code example here or in a new issue.
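Roughly, the comparison looks like this (a minimal sketch assuming llama-cpp-python and transformers are installed; the GGUF path, HF repo id, and sample-text file below are placeholders, not my exact setup):

import torch
from llama_cpp import Llama
from transformers import AutoModelForCausalLM, AutoTokenizer

text = open("wiki.test.raw").read()[:2000]  # placeholder sample text

# HF tokenization vs llama.cpp tokenization of the same text
hf_tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
hf_ids = hf_tok.encode(text)

llm = Llama(model_path="./tinyllama-1.1b.Q8_0.gguf", vocab_only=True)  # placeholder GGUF
cpp_ids = llm.tokenize(text.encode("utf-8"), add_bos=True)

print(len(hf_ids), hf_ids[:20])
print(len(cpp_ids), cpp_ids[:20])

# perplexity of the HF model under either token sequence
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
for ids in (hf_ids, cpp_ids):
    batch = torch.tensor([ids])
    with torch.no_grad():
        loss = model(batch, labels=batch).loss  # mean next-token cross-entropy
    print(torch.exp(loss).item())               # perplexity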

@Any-Winter-4079 (Author) commented May 7, 2024

> I can't reproduce. This works on latest master with M2 Ultra (both CPU and GPU):
>
> make -j && ./llava-cli -m ./models/llava-7b-v1.6/ggml-model-f16.gguf --mmproj ./models/llava-7b-v1.6/mmproj-model-f16.gguf --image ~/Downloads/cat.png -p "What is the animal doing?" --temp 0.0 -ngl 99
>
> @turian Don't think this is related to tokenization changes, since LLaMA 1 and 2, Mistral and Phi-3 all use what we call SPM tokenizer - i.e. no pre-tokenization is done. The tokenization changes only affect models using BPE tokenizer

The issue happened on the latest commit that day (928e0b7). Could you try that one?

, g m is d is,,' has r h and to, the a is the to un, and is c,' has do is with is l in is, in a is and has re, is at.. is, is he the b to > t to with. the on' d is at s is. with un and with at t t d un d s- l is st a., is for is to t to d. a with b has for is m d is t as the > is in in on to to as the a is to p s set new g is has is in he is d to the, pro to is is new t t for st for is., is, for. t. set b in l > st as s, the, has at t for # se and the in a is r, the t d is a,, l a has,. c re un and/ a r on in a., the a with for a, in,. for', a with,. for m a and/ d at n at > on is for is g do in on re,. re c r a un is-- is- new-- is, is s.. p

On today's latest commit, it works.

The image shows a pink X symbol. It's a simple, two-dimensional graphic with a solid color and a clear, bold outline. The X is centered in the image, with no additional context or objects present. The image is minimalistic, focusing solely on the X symbol. There's no text or other elements visible in the image.

And a previous commit (8a56075) had it working as well.

I guess the issue is solved now, although finding the source may be useful to prevent it from being reintroduced in the future. I'm pretty sure it was commit f4dea7d that introduced it.

Update: Actually, never mind, and apologies: it does work (even) on the referenced commits (928e0b7, f4dea7d). For some reason (arghhhh), I or some AI tool must have inadvertently removed make -j && from my command, so a stale binary was executed after git pull'ing. Apologies 🫣 and closing the issue 😓, and thank you for your work!

@ggerganov (Owner)

@turian You can open another issue, but I just verified that TinyLlama tokenization is correct using latest master.

Here are the steps:

  • Apply this patch:
diff --git a/convert-hf-to-gguf-update.py b/convert-hf-to-gguf-update.py
index a26f45a5..4fbec9f1 100755
--- a/convert-hf-to-gguf-update.py
+++ b/convert-hf-to-gguf-update.py
@@ -70,6 +70,7 @@ models = [
     {"name": "qwen2",          "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/Qwen/Qwen1.5-7B", },
     {"name": "olmo",           "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/allenai/OLMo-1.7-7B-hf", },
     {"name": "dbrx",           "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/databricks/dbrx-base", },
+    {"name": "tinyllama",      "tokt": TOKENIZER_TYPE.SPM, "repo": "https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0", },
 ]
 
 # make directory "models/tokenizers" if it doesn't exist
  • Run:
python3 convert-hf-to-gguf-update.py <hf_token>

python3 convert-hf-to-gguf.py models/tokenizers/tinyllama/ --outfile models/ggml-vocab-tinyllama.gguf --vocab-only

make -j tests && ./tests/test-tokenizer-0 ./models/ggml-vocab-tinyllama.gguf
llama_model_loader: loaded meta data with 23 key-value pairs and 0 tensors from ./models/ggml-vocab-tinyllama.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = tinyllama
llama_model_loader: - kv   2:                          llama.block_count u32              = 22
llama_model_loader: - kv   3:                       llama.context_length u32              = 2048
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 5632
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 4
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 32000
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  20:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {% for message in messages %}\n{% if m...
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_layer          = 22
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 5632
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 0.00 K
llm_load_print_meta: model size       = 0.00 MiB (nan BPW) 
llm_load_print_meta: general.name     = tinyllama
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llama_model_load: vocab only - skipping tensors
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1

src: ''
res: ''
tok: 

src: '	'
res: ' 	'
tok: 29871 12 

src: '	
'
res: ' 	
'
tok: 29871 12 13 

src: '
'
res: ' 
'
tok: 29871 13 

src: '

'
res: ' 

'
tok: 29871 13 13 

src: '


'
res: ' 


'
tok: 29871 13 13 13 

src: '
 

 


 				
  
   
    
     
🚀 (normal) 😶‍🌫️ (multiple emojis concatenated) ✅ 🦙🦙 3 33 333 3333 33333 333333 3333333 33333333 3.3 3..3 3...3 កាន់តែពិសេសអាច😁 ?我想在apple工作1314151天~ ------======= нещо на Български ''''''```````""""......!!!!!!?????? I've been 'told he's there, 'RE you sure? 'M not sure I'll make it, 'D you like some tea? We'Ve a'lL'
res: ' 
 

 


 				
  
   
    
     
🚀 (normal) 😶‍🌫️ (multiple emojis concatenated) ✅ 🦙🦙 3 33 333 3333 33333 333333 3333333 33333333 3.3 3..3 3...3 កាន់តែពិសេសអាច😁 ?我想在apple工作1314151天~ ------======= нещо на Български ''''''```````""""......!!!!!!?????? I've been 'told he's there, 'RE you sure? 'M not sure I'll make it, 'D you like some tea? We'Ve a'lL'
tok: 29871 13 29871 13 13 29871 13 13 13 29871 12 29871 12 12 29871 12 13 259 13 1678 13 268 13 418 13 243 162 157 131 313 8945 29897 29871 243 162 155 185 30722 243 162 143 174 30598 313 20787 953 3848 275 16125 630 29897 29871 31681 29871 243 162 169 156 243 162 169 156 29871 29941 29871 29941 29941 29871 29941 29941 29941 29871 29941 29941 29941 29941 29871 29941 29941 29941 29941 29941 29871 29941 29941 29941 29941 29941 29941 29871 29941 29941 29941 29941 29941 29941 29941 29871 29941 29941 29941 29941 29941 29941 29941 29941 29871 29941 29889 29941 29871 29941 636 29941 29871 29941 856 29941 29871 31849 31324 31934 228 162 142 228 161 146 228 162 133 228 161 153 228 161 186 31708 228 162 132 31708 228 161 165 31324 228 161 136 243 162 155 132 1577 30672 31522 30505 11548 31041 30732 29896 29941 29896 29946 29896 29945 29896 30408 30739 448 23648 2751 25512 1538 4851 665 1386 29713 1305 14550 4907 11120 16159 16159 16159 15945 15945 3045 636 6824 6824 6824 8773 8773 8773 306 29915 345 1063 525 29873 1025 540 29915 29879 727 29892 525 1525 366 1854 29973 525 29924 451 1854 306 29915 645 1207 372 29892 525 29928 366 763 777 23429 29973 1334 29915 29963 29872 263 29915 29880 29931 

src: '
 ='
res: ' 
 ='
tok: 29871 13 353 

src: ' '
res: '  '
tok: 259 

src: '  '
res: '   '
tok: 1678 

src: '   '
res: '    '
tok: 268 

src: '    Hello'
res: '     Hello'
tok: 268 15043 

src: '    Hello
    Hello'
res: '     Hello
    Hello'
tok: 268 15043 13 1678 15043 

src: '   Hello'
res: '    Hello'
tok: 1678 15043 

src: '  Hello'
res: '   Hello'
tok: 259 15043 

src: ' ('
res: '  ('
tok: 29871 313 

src: ' Hello'
res: '  Hello'
tok: 29871 15043 

src: ' Hello World'
res: '  Hello World'
tok: 29871 15043 2787 

src: ' Hello World!'
res: '  Hello World!'
tok: 29871 15043 2787 29991 

src: ' Hello world'
res: '  Hello world'
tok: 29871 15043 3186 

src: ' Hello, world!'
res: '  Hello, world!'
tok: 29871 15043 29892 3186 29991 

src: ' this is 🦙.cpp'
res: '  this is 🦙.cpp'
tok: 29871 445 338 29871 243 162 169 156 29889 8223 

src: '' era'
res: ' ' era'
tok: 525 3152 

src: '3'
res: ' 3'
tok: 29871 29941 

src: '33'
res: ' 33'
tok: 29871 29941 29941 

src: '333'
res: ' 333'
tok: 29871 29941 29941 29941 

src: '3333'
res: ' 3333'
tok: 29871 29941 29941 29941 29941 

src: '33333'
res: ' 33333'
tok: 29871 29941 29941 29941 29941 29941 

src: '333333'
res: ' 333333'
tok: 29871 29941 29941 29941 29941 29941 29941 

src: '3333333'
res: ' 3333333'
tok: 29871 29941 29941 29941 29941 29941 29941 29941 

src: '33333333'
res: ' 33333333'
tok: 29871 29941 29941 29941 29941 29941 29941 29941 29941 

src: '333333333'
res: ' 333333333'
tok: 29871 29941 29941 29941 29941 29941 29941 29941 29941 29941 

src: 'Führer'
res: ' Führer'
tok: 383 4000 261 

src: 'Hello'
res: ' Hello'
tok: 15043 

src: 'Hello World'
res: ' Hello World'
tok: 15043 2787 

src: 'Hello world'
res: ' Hello world'
tok: 15043 3186 

src: 'Hello, world!'
res: ' Hello, world!'
tok: 15043 29892 3186 29991 

src: 'Hello, y'all! How are you 😁 ?我想在apple工作1314151天~'
res: ' Hello, y'all! How are you 😁 ?我想在apple工作1314151天~'
tok: 15043 29892 343 29915 497 29991 1128 526 366 29871 243 162 155 132 1577 30672 31522 30505 11548 31041 30732 29896 29941 29896 29946 29896 29945 29896 30408 30739 

src: 'ied 4 ½ months'
res: ' ied 4 ½ months'
tok: 474 287 29871 29946 29871 30226 7378 

src: 'w048 7tuijk dsdfhu'
res: ' w048 7tuijk dsdfhu'
tok: 281 29900 29946 29947 29871 29955 9161 13535 18031 2176 6905 

src: 'нещо на Български'
res: ' нещо на Български'
tok: 1538 4851 665 1386 29713 1305 

src: 'កាន់តែពិសេសអាចខលចេញ'
res: ' កាន់តែពិសេសអាចខលចេញ'
tok: 29871 31849 31324 31934 228 162 142 228 161 146 228 162 133 228 161 153 228 161 186 31708 228 162 132 31708 228 161 165 31324 228 161 136 228 161 132 228 161 158 228 161 136 228 162 132 228 161 140 

src: '🚀 (normal) 😶‍🌫️ (multiple emojis concatenated) ✅ (only emoji that has its own token)'
res: ' 🚀 (normal) 😶‍🌫️ (multiple emojis concatenated) ✅ (only emoji that has its own token)'
tok: 29871 243 162 157 131 313 8945 29897 29871 243 162 155 185 30722 243 162 143 174 30598 313 20787 953 3848 275 16125 630 29897 29871 31681 313 6194 953 29877 2397 393 756 967 1914 5993 29897 

Tests passed

@turian commented May 8, 2024

@ggerganov My issue is that an OLD, previously converted TinyLlama GGUF a) has buggy tokenization and b) doesn't provide any warning.

Is there any workaround for getting old GGUF files to work, rather than creating new GGUF files from HF?

@teleprint-me (Contributor)

Maybe a version bump every time models break? We could do major.minor.revision and update the minor most of the time, and update the major for breaking changes?
