new tokenizer-verifier tool to check gguf tokenizer parameters #6988

Draft

anisse wants to merge 1 commit into master
Conversation

anisse commented Apr 29, 2024

This program verifies that a given gguf model file can tokenize every potentially valid character (codepoints U+0001 through U+10FFFF). Since llama.cpp currently raises an exception when tokenization is not possible, this tool helps verify that valid ASCII and UTF-8 input will always be tokenized properly.
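For illustration, here is a minimal sketch of the check (not the actual PR code; the codepoint_to_utf8 helper is invented for the example), assuming the llama.cpp C API of this era, i.e. llama_load_model_from_file and llama_tokenize:

// A minimal sketch of the verifier loop: load the vocabulary only,
// encode each codepoint as UTF-8, and count the ones the tokenizer rejects.
#include "llama.h"

#include <cstdint>
#include <cstdio>
#include <exception>
#include <string>
#include <vector>

// Encode a single Unicode codepoint as UTF-8 (helper made up for this sketch).
static std::string codepoint_to_utf8(uint32_t cp) {
    std::string s;
    if (cp < 0x80) {
        s += (char)  cp;
    } else if (cp < 0x800) {
        s += (char) (0xC0 |  (cp >>  6));
        s += (char) (0x80 | ( cp        & 0x3F));
    } else if (cp < 0x10000) {
        s += (char) (0xE0 |  (cp >> 12));
        s += (char) (0x80 | ((cp >>  6) & 0x3F));
        s += (char) (0x80 | ( cp        & 0x3F));
    } else {
        s += (char) (0xF0 |  (cp >> 18));
        s += (char) (0x80 | ((cp >> 12) & 0x3F));
        s += (char) (0x80 | ((cp >>  6) & 0x3F));
        s += (char) (0x80 | ( cp        & 0x3F));
    }
    return s;
}

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <model.gguf>\n", argv[0]);
        return 1;
    }

    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.vocab_only = true; // we only need the tokenizer, not the tensors

    llama_model * model = llama_load_model_from_file(argv[1], mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load model '%s'\n", argv[1]);
        return 1;
    }

    int failed = 0;
    for (uint32_t cp = 1; cp <= 0x10FFFF; cp++) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            continue; // UTF-16 surrogates are not valid codepoints
        }
        const std::string text = codepoint_to_utf8(cp);
        std::vector<llama_token> tokens(16);
        try {
            const int32_t n = llama_tokenize(model, text.c_str(), (int32_t) text.size(),
                                             tokens.data(), (int32_t) tokens.size(),
                                             /*add_special*/ false, /*parse_special*/ false);
            if (n < 0) {
                failed++; // negative return: token buffer too small (unexpected here)
            }
        } catch (const std::exception &) {
            failed++; // llama.cpp throws when tokenization is impossible (ggerganov#2580)
        }
    }
    printf("%d/1114111 potential unicode codepoints not tokenized\n", failed);

    llama_free_model(model);
    llama_backend_free();
    return 0;
}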

anisse (Author) commented Apr 29, 2024

Here is what it looks like for a model with a BPE tokenizer that can tokenize anything:

> ./build/bin/tokenizer-verifier  ../models/llama-2-7b-chat.Q2_K.gguf 2>/dev/null
0/127 7-bit ascii characters could not be tokenized
0/1114111 potential unicode codepoints not tokenized

And for another model where it fails because of an incomplete (or misconfigured?) tokenizer:

> ./build/bin/tokenizer-verifier  ../models/croissantllmchat-v0.1.Q8_0.gguf 2> /dev/null
0x1 -> Tokenization failed for char ''
0x2 -> Tokenization failed for char ''
0x3 -> Tokenization failed for char ''
0x4 -> Tokenization failed for char ''
0x5 -> Tokenization failed for char ''
0x6 -> Tokenization failed for char ''
0x7 -> Tokenization failed for char ''
0x8 -> Tokenization failed for char ''
0xb -> Tokenization failed for char ''
0xc -> Tokenization failed for char ''
0xe -> Tokenization failed for char ''
0xf -> Tokenization failed for char ''
0x10 -> Tokenization failed for char ''
0x11 -> Tokenization failed for char ''
0x12 -> Tokenization failed for char ''
0x13 -> Tokenization failed for char ''
0x14 -> Tokenization failed for char ''
0x15 -> Tokenization failed for char ''
0x16 -> Tokenization failed for char ''
0x17 -> Tokenization failed for char ''
0x18 -> Tokenization failed for char ''
0x19 -> Tokenization failed for char ''
0x1a -> Tokenization failed for char ''
0x1b -> Tokenization failed for char ''
0x1c -> Tokenization failed for char ''
0x1d -> Tokenization failed for char ''
0x1e -> Tokenization failed for char ''
0x1f -> Tokenization failed for char ''
0x7f -> Tokenization failed for char ''
29/127 7-bit ascii characters could not be tokenized
1113111/1114111 potential unicode codepoints not tokenized

Note that recent changes have made it very slow for llama3 (or maybe it's my gguf file?).

This program verifies that a given gguf model file can tokenize all
potentially valid characters. Since llama.cpp currently raises an
exception when tokenization is not possible[1], this tool helps
verify that valid ASCII and UTF-8 input will always be tokenized properly.

[1] ggerganov#2580
mofosyne added the enhancement ("New feature or request") and review complexity : low ("Trivial changes to code that most beginner devs can tackle") labels on May 9, 2024
mofosyne (Collaborator) left a comment

I noticed that the other tools in the examples folder have a README.md; I think we need one for this as well, since it is a tool with a specific purpose.

mofosyne (Collaborator) commented May 13, 2024

$ ./build/bin/tokenizer-verifier ./models/ggml-vocab-aquila.gguf 
WARNING: Behavior may be unexpected when allocating 0 bytes for ggml_calloc!
llama_model_loader: loaded meta data with 18 key-value pairs and 0 tensors from ./models/ggml-vocab-aquila.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = D:\Diverses\models
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  11:                      tokenizer.ggml.tokens arr[str,100008]  = ["<|endoftext|>", "!", "\"", "#", "$"...
llama_model_loader: - kv  12:                      tokenizer.ggml.scores arr[f32,100008]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  13:                  tokenizer.ggml.token_type arr[i32,100008]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  14:                      tokenizer.ggml.merges arr[str,99743]   = ["Ġ Ġ", "ä ¸", "Ġ t", "ï ¼", "...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 0
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:                                             
llm_load_vocab: ************************************        
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!        
llm_load_vocab: CONSIDER REGENERATING THE MODEL             
llm_load_vocab: ************************************        
llm_load_vocab:                                             
llm_load_vocab: mismatch in special tokens definition ( 9/100008 vs 8/100008 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 100008
llm_load_print_meta: n_merges         = 99743
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = all F32 (guessed)
llm_load_print_meta: model params     = 0.00 K
llm_load_print_meta: model size       = 0.00 MiB (-nan BPW) 
llm_load_print_meta: general.name     = D:\Diverses\models
llm_load_print_meta: BOS token        = 0 '<|endoftext|>'
llm_load_print_meta: EOS token        = 0 '<|endoftext|>'
llm_load_print_meta: UNK token        = 0 '<|endoftext|>'
llm_load_print_meta: LF token         = 129 'Ä'
llama_model_load: vocab only - skipping tensors
0/127 7-bit ascii characters could not be tokenized

... hangs around here?

Also doesn't seem to handle missing models well, ends up segfaulting.

$ ./build/bin/tokenizer-verifier ./models/ggml-vocab-aquil3 
llama_model_load: error loading model: llama_model_loader: failed to load model from ./models/ggml-vocab-aquil3

llama_load_model_from_file: failed to load model
Segmentation fault (core dumped)

This PR might need a bit more adjusting.

mofosyne marked this pull request as draft on May 13, 2024
anisse (Author) commented May 14, 2024

Thanks a lot for your review @mofosyne; I'll add a README in the next iteration.

... hangs around here?

I don't think it hung; rather, it illustrates the issue I mentioned earlier:

Note that recent changes have made it very slow for llama3 (or maybe it's my gguf file?).

Something made the tokenization very slow, but I don't know what. I might bisect it to find the issue.

Also doesn't seem to handle missing models well, ends up segfaulting.

If you look at the code, it loads the model in a very simple and straightforward way, just like the tokenize example; I'll check why it segfaults, but I wouldn't be surprised if it ends up exposing an actual issue in the llama.cpp API.
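As a first guess (just a sketch, assuming the model is loaded as in the example above), the crash could come from using the pointer returned by llama_load_model_from_file without checking it:

llama_model * model = llama_load_model_from_file(argv[1], mparams);
if (model == NULL) {
    // llama_load_model_from_file returns NULL when loading fails;
    // continuing with a NULL model segfaults later in llama_tokenize.
    fprintf(stderr, "failed to load model '%s'\n", argv[1]);
    return 1;
}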
