
tests : add test-tokenizer-0.sh #7036

Merged: 15 commits merged into master from gg/add-tokenizer-test-script on May 4, 2024

Conversation

@ggerganov (Owner) commented May 2, 2024

Add a more extensive tokenizer test that takes a text file, tokenizes it using both transformers and llama.cpp, and compares the results.

# run once
python3 convert-hf-to-gguf-update.py <hf_token>

# tests OK
./tests/test-tokenizer-0.sh llama-spm ./build/wikitext-2-raw/wiki.train.raw
./tests/test-tokenizer-0.sh llama-bpe ./build/wikitext-2-raw/wiki.train.raw
./tests/test-tokenizer-0.sh gpt-2     ./build/wikitext-2-raw/wiki.train.raw
./tests/test-tokenizer-0.sh phi-3     ./build/wikitext-2-raw/wiki.train.raw
./tests/test-tokenizer-0.sh starcoder ./build/wikitext-2-raw/wiki.train.raw
./tests/test-tokenizer-0.sh falcon    ./build/wikitext-2-raw/wiki.train.raw
./tests/test-tokenizer-0.sh refact    ./build/wikitext-2-raw/wiki.train.raw

# tests Fail
./tests/test-tokenizer-0.sh deepseek-llm   ./build/wikitext-2-raw/wiki.train.raw
./tests/test-tokenizer-0.sh deepseek-coder ./build/wikitext-2-raw/wiki.train.raw
./tests/test-tokenizer-0.sh mpt            ./build/wikitext-2-raw/wiki.train.raw

Need to find the reason why the tokenization differs in the failing cases. For example, the DeepSeek models fail like this:

make -j tests/test-tokenizer-0 && ./tests/test-tokenizer-0 ./models/ggml-vocab-deepseek-coder.gguf

src: 'Führer'
res: 'Führer'
tok: 37 2864 71 6247 
main : failed test:    'Führer'
main : detokenized to: 'Führer' instead of 'Führer'
main : expected tokens:     37 'F',  32009 'ü',     71 'h',   6247 'rer', 
main : got tokens:          37 'F',   2864 'ü',     71 'h',   6247 'rer', 
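
For reference, the transformers side of this failure can be reproduced directly (a minimal sketch, assuming the transformers package and access to the model repo linked further down; add_special_tokens=False keeps BOS out of the comparison):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base")
print(tok.encode("Führer", add_special_tokens=False))
# [37, 32009, 71, 6247] <- transformers picks the added token 32009 for 'ü',
# while llama.cpp currently produces [37, 2864, 71, 6247]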

Added a script for generating the Unicode ranges in unicode-data.cpp:

python3 scripts/gen-unicode-data.py
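
The core job of such a generator is collapsing per-codepoint properties into contiguous ranges. A minimal sketch of the idea (not the actual script, whose details may differ), using Python's unicodedata:

import unicodedata

def category_ranges(prefix: str) -> list[tuple[int, int]]:
    # collect contiguous [first, last] runs of code points whose general
    # category starts with the given prefix (e.g. "N" for numbers)
    ranges, start, prev = [], None, None
    for cpt in range(0x110000):
        if unicodedata.category(chr(cpt)).startswith(prefix):
            if start is None:
                start = cpt
            prev = cpt
        elif start is not None:
            ranges.append((start, prev))
            start = None
    if start is not None:
        ranges.append((start, prev))
    return ranges

print(category_ranges("N")[:3])  # [(48, 57), (178, 179), (185, 185)]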

@ggerganov ggerganov force-pushed the gg/add-tokenizer-test-script branch from 9998b08 to ce7d3a0 on May 2, 2024 05:52
github-actions bot commented May 2, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 547 iterations 🚀

Details (performance-related PRs only):
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8554.04ms p(95)=20486.45ms fails=, finish reason: stop=484 truncated=63
  • Prompt processing (pp): avg=99.29tk/s p(95)=413.58tk/s
  • Token generation (tg): avg=33.5tk/s p(95)=49.64tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=gg/add-tokenizer-test-script commit=7e11d409fa2fc1868fa04c5e02d905b8499f2a66

[Benchmark charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, and requests_processing, each plotted over the 10-minute, 547-iteration run.]

@ggerganov (Owner, Author) commented:

I think there is a bug in the way we handle added tokens. I'm experimenting with DeepSeek-Coder:

https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-base

llama.cpp tokenizes ü to 2864, which is OK, but there is also the added token 32009, which the transformers tokenizer selects instead:

    {
      "id": 32009,
      "content": "ü",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": false
    },

If I remove this added token from the tokenizer.config then the transformers tokenization also outputs 2864. So this means we are not handling the added tokens in the same way.

Any ideas how to fix this?

@CISC (Contributor) commented May 3, 2024

If I remove this added token from the tokenizer.config then the transformers tokenization also outputs 2864. So this means we are not handling the added tokens in the same way.

I believe the issue is this: added tokens are always looked up first.

Any ideas how to fix this?

AFAICT the only way to fix this is to add the added tokens to the GGUF separately, which will be especially complicated if the added tokens are merged into the middle of the existing vocab (otherwise, just adding an index to the beginning of the added tokens would be enough).
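
A toy illustration of that lookup order (this is not HF's implementation, just the shape of the behavior): the input is split on added-token strings before BPE ever runs, so an added token always shadows an equivalent regular token.

import re
from typing import Callable, Iterator

def tokenize(text: str, added: dict[str, int],
             bpe_encode: Callable[[str], list[int]]) -> Iterator[int]:
    # split on added tokens first, longest match preferred
    pattern = "|".join(re.escape(t) for t in sorted(added, key=len, reverse=True))
    for piece in re.split(f"({pattern})", text):
        if piece in added:
            yield added[piece]            # added-token hit wins unconditionally
        elif piece:
            yield from bpe_encode(piece)  # only the leftover pieces reach BPE

# with added = {"ü": 32009}, "Führer" splits into "F", "ü", "hrer",
# so 32009 is emitted where llama.cpp currently produces 2864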

@ggerganov (Owner, Author) commented:

From what I've found, the problem seems to be that, as part of the pre-tokenization, we perform a byte-to-unicode mapping here:

llama.cpp/unicode.cpp

Lines 213 to 217 in 3275e60

std::string encoded_token;
for (char & c : text_utf) {
    encoded_token += unicode_byte_to_utf8(c);
}
bpe_encoded_words.emplace_back(encoded_token);

llama.cpp/unicode.cpp

Lines 151 to 173 in 3275e60

static std::unordered_map<uint8_t, std::string> unicode_byte_to_utf8_map() {
    std::unordered_map<uint8_t, std::string> map;
    for (int ch = u'!'; ch <= u'~'; ++ch) {
        assert(0 <= ch && ch < 256);
        map[ch] = unicode_cpt_to_utf8(ch);
    }
    for (int ch = u'¡'; ch <= u'¬'; ++ch) {
        assert(0 <= ch && ch < 256);
        map[ch] = unicode_cpt_to_utf8(ch);
    }
    for (int ch = u'®'; ch <= u'ÿ'; ++ch) {
        assert(0 <= ch && ch < 256);
        map[ch] = unicode_cpt_to_utf8(ch);
    }
    auto n = 0;
    for (int ch = 0; ch < 256; ++ch) {
        if (map.find(ch) == map.end()) {
            map[ch] = unicode_cpt_to_utf8(256 + n);
            ++n;
        }
    }
    return map;
}

This converts the string ü to the string Ã¼. This new string is exactly the token 2864, which detokenizes back to ü via the llama_decode_text() function. The problem is that we don't even consider the token 32009, because ü is not present in the pre-tokenized string.
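
To see the transform concretely, here is the same mapping sketched in Python (mirroring unicode_byte_to_utf8_map() above):

def byte_to_unicode(b: int) -> str:
    # bytes in the three printable ranges map to themselves
    visible = set(range(0x21, 0x7F)) | set(range(0xA1, 0xAD)) | set(range(0xAE, 0x100))
    if b in visible:
        return chr(b)
    # every other byte is shifted to a code point above U+00FF
    return chr(0x100 + sum(1 for x in range(b) if x not in visible))

print("".join(byte_to_unicode(b) for b in "ü".encode("utf-8")))  # -> 'Ã¼'

After this remapping, the literal string ü no longer appears anywhere in the text handed to BPE, which is why the added token is never considered.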

@teleprint-me (Contributor) commented:

Why is the upper limit set to 256? Isn't that the ASCII range?

The range of valid Unicode code points runs from U+0000 to U+10FFFF, which covers more than a million possible code points.
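
A quick way to check that limit from Python:

import sys

print(hex(sys.maxunicode))  # 0x10ffff, the highest valid Unicode code point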

@ggerganov ggerganov force-pushed the gg/add-tokenizer-test-script branch from 74fa6cd to 5f30e30 on May 4, 2024 05:12
@ggerganov (Owner, Author) commented:

Why is the upper limit set to 256? Isn't that the ASCII range?

This seems to be some strategy to reduce the vocab size:

https://github.com/openai/gpt-2/blob/master/src/encoder.py#L8-L28
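
For reference, the bytes_to_unicode function from the linked encoder.py looks like this (lightly reformatted, and renamed here to gpt_bytes_to_unicode so the comparison in the comment below can call it): every byte gets a printable character, so the BPE merges operate on reversible unicode strings.

from functools import lru_cache

@lru_cache()
def gpt_bytes_to_unicode() -> dict[int, str]:
    # printable bytes keep their own code point ...
    bs = list(range(ord("!"), ord("~") + 1)) + \
         list(range(ord("¡"), ord("¬") + 1)) + \
         list(range(ord("®"), ord("ÿ") + 1))
    cs = bs[:]
    # ... the remaining bytes are assigned successive code points from U+0100 up
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))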

@ggerganov ggerganov merged commit 92139b9 into master May 4, 2024
58 of 63 checks passed
@ggerganov ggerganov deleted the gg/add-tokenizer-test-script branch May 4, 2024 05:32
@teleprint-me (Contributor) commented May 4, 2024

I find it fascinating how we have a tendency to over-complicate simple ideas. I'm all too guilty of this myself.

# simplified function definition
from functools import lru_cache


@lru_cache()
def bytes_to_unicode(size: int = 256) -> dict[int, str]:
    """
    Generate a dictionary mapping each byte to its corresponding Unicode character.

    :param size: The total number of bytes in the encoding space (default is 256).

    :return: A dictionary containing mappings between bytes and their respective Unicode characters.
    """

    # list of visible characters:
    # (ord("!"), ord("~") + 1); (ord("¡"), ord("¬") + 1); (ord("®"), ord("ÿ") + 1);
    visible = set(range(33, 127)) | set(range(161, 173)) | set(range(174, 256))

    mapping: dict[int, str] = {}
    n = 0  # number of non-visible bytes seen so far
    for byte in range(size):
        if byte in visible:
            # "visible" characters map to themselves
            mapping[byte] = chr(byte)
        else:
            # the n-th non-visible byte is shifted past the encoding space,
            # matching GPT-2's scheme (0 -> U+0100, 1 -> U+0101, ...)
            mapping[byte] = chr(size + n)
            n += 1
    return mapping

where the upper limit is defined as upper_limit = 2**8 = 256. Keeping size as a parameter makes this extendable by choice, allowing a variable upper limit depending on the size of the input.

Output

Get the mapping and verify it against the original:

from pprint import pprint  # pretty-print the output

mapping = bytes_to_unicode()
gpt_mapping = gpt_bytes_to_unicode()  # GPT-2's original implementation, quoted earlier
for key in mapping:
    assert mapping[key] == gpt_mapping[key]

pprint(mapping)

Mapping output:

18:30:48 | ~
  λ python -i /tmp/bytes_to_unicode.py
{0: 'Ā',
 1: 'ā',
 2: 'Ă',
 3: 'ă',
 4: 'Ą',
 5: 'ą',
 ...
 32: 'Ġ',
 33: '!',
 34: '"',
 ...
 250: 'ú',
 251: 'û',
 252: 'ü',
 253: 'ý',
 254: 'þ',
 255: 'ÿ'}
>>>

I'm looking into it though.

nopperl pushed a commit to nopperl/llama.cpp that referenced this pull request May 5, 2024
* tests : add test-tokenizer-0.sh

* unicode : add all unicode number ranges

* starcoder : fix pre-tokenizer

* tests : add test that fails with DeepSeek tokenizers

* falcon : fix regex

* unicode : regenerate unicode tables

* refact : add tokenizer model

* lint : fix

* tests : disable failing tests

ggml-ci

* refact : add tests files

ggml-ci

* convert : print -> logging

ggml-ci

* lint : fix

* unicode : digit -> number

* phi-3 : update
teleprint-me pushed a commit to teleprint-me/llama.cpp that referenced this pull request May 7, 2024
teleprint-me pushed a commit to teleprint-me/llama.cpp that referenced this pull request May 7, 2024
@DOGEwbx commented May 8, 2024

I think there is a bug in the way we handle added tokens. [...] Any ideas how to fix this?
@ggerganov

Hi, I found that the current llama.cpp cannot pass the unit tests for the DeepSeek models. The problem you mentioned looks like an issue that Hugging Face tokenizers has already solved: huggingface/tokenizers#1392

For the newly published DeepSeek-V2 and DeepSeek-Coder v1.5, these added tokens have been removed.

@ggerganov (Owner, Author) commented:

@DOGEwbx Thanks - will try deepseek-coder v1.5 then. DS v2 will probably take some time to support #7118

Labels: high priority (Very important issue)
Projects: none yet
Development: successfully merging this pull request may close these issues (none yet)

4 participants