
llama3 custom regex split #6965

Merged: 88 commits into ggerganov:master on May 9, 2024

Conversation

jaime-m-p (Collaborator)

Implementation of unicode_regex_split_custom_llama3().

@jaime-m-p (Collaborator, Author)

As you can see in the commits:

  • Renamed unicode_ranges_whitespace to unicode_ranges_separator to align with the Unicode category name.
  • Renamed CODEPOINT_TYPE_WHITESPACE to CODEPOINT_TYPE_SEPARATOR to align with the Unicode category name.
  • Then I built a new table unicode_ranges_whitespace matching the \s codepoints, used in unicode_cpt_is_whitespace().
  • Rebuilt the table unicode_ranges_control, fixing the problem with the Unicode BOM (see below).

Finally, unicode_regex_split_custom_llama3() is working again, fixing the \s regexes.
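
For reference, this is roughly how such a range table can be queried. A minimal sketch, assuming the table is a sorted list of inclusive {first, last} pairs as emitted by gen-unicode-data.py; the binary search is my illustration, not necessarily what unicode.cpp does:

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sorted table of inclusive {first, last} codepoint ranges.
static const std::vector<std::pair<uint32_t, uint32_t>> unicode_ranges_whitespace = {
    {0x0009, 0x000D}, {0x0020, 0x0020}, /* ... */ {0x3000, 0x3000},
};

static bool unicode_cpt_is_whitespace(uint32_t cpt) {
    // find the first range whose upper bound is >= cpt, then check the lower bound
    auto it = std::lower_bound(
        unicode_ranges_whitespace.begin(), unicode_ranges_whitespace.end(), cpt,
        [](const std::pair<uint32_t, uint32_t> & range, uint32_t c) { return range.second < c; });
    return it != unicode_ranges_whitespace.end() && it->first <= cpt;
}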

In gen-unicode-data.py, the bytes.decode() call fails for the Unicode BOM \uFEFF:

bytes([0xFF,0xFD,0,0]).decode("utf-32")  # Ok, returns '\ufdff'
bytes([0xFF,0xFE,0,0]).decode("utf-32")  # Fail, returns empty string '', expected '\ufeff'
bytes([0xFF,0xFF,0,0]).decode("utf-32")  # Ok, returns '\uffff'
bytes([0xFE,0xFE,0,0]).decode("utf-32")  # Ok, returns '\ufefe'

So I used ord() and chr() to encode and decode codepoints instead.

@jaime-m-p (Collaborator, Author)

I'm also looking into generating the table unicode_nfd_ranges.
I noticed that only the first codepoint of each decomposition is used, so there is no need to store it in a multimap.
But somehow the table is much longer now than before.
When normalizing, many consecutive codepoints map to the same NFD codepoint, so it's possible to encode the table with ranges: range_start, range_last, codepoint_nfd.
Doing so, the table becomes smaller (fewer than 1500 entries).
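
A minimal sketch of what that range-encoded table and its lookup could look like (struct and table names are illustrative, not the generated code):

#include <algorithm>
#include <cstdint>
#include <vector>

// One entry covers a whole run of codepoints sharing the same NFD codepoint.
struct range_nfd {
    uint32_t first;  // range_start
    uint32_t last;   // range_last (inclusive)
    uint32_t nfd;    // first codepoint of the NFD decomposition
};

// Hypothetical generated table, sorted by 'first' (fewer than 1500 entries).
static const std::vector<range_nfd> unicode_ranges_nfd = {
    {0x00C0, 0x00C5, 0x0041},  // À..Å -> A
    /* ... */
};

static uint32_t unicode_cpt_nfd(uint32_t cpt) {
    // binary search: first range whose upper bound is >= cpt
    auto it = std::lower_bound(
        unicode_ranges_nfd.begin(), unicode_ranges_nfd.end(), cpt,
        [](const range_nfd & range, uint32_t c) { return range.last < c; });
    return (it != unicode_ranges_nfd.end() && it->first <= cpt) ? it->nfd : cpt;
}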

@jaime-m-p (Collaborator, Author)

> We don't have to use wikitext specifically - if we generate or find a more diverse dataset, we can use that for tokenization tests

Maybe we can generate a not-so-random corpus that tries to cover all token categories.
Something like generating tokens from a regex ...
Let me think about this, but I'm not sure how much it can grow.
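
For example, a generator could bucket codepoints by category using the existing unicode_cpt_type() and sample words from each bucket. A rough sketch, where unicode_cpt_to_utf8() is assumed to be the existing codepoint-to-UTF-8 helper:

#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <vector>

int         unicode_cpt_type(uint32_t cp);     // existing category lookup
std::string unicode_cpt_to_utf8(uint32_t cp);  // assumed existing codepoint -> UTF-8 helper

// Buckets can be built once by scanning the codepoint space (skipping surrogates):
//   for (uint32_t cp = 0; cp <= 0x10FFFF; ++cp) buckets[unicode_cpt_type(cp)].push_back(cp);

// Sample one word of 'len' codepoints from a single category bucket.
static std::string gen_word(std::mt19937 & rng, const std::vector<uint32_t> & bucket, size_t len) {
    std::uniform_int_distribution<size_t> pick(0, bucket.size() - 1);
    std::string word;
    for (size_t i = 0; i < len; ++i) {
        word += unicode_cpt_to_utf8(bucket[pick(rng)]);
    }
    return word;
}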

@jaime-m-p (Collaborator, Author)

Instead of doing:

#define CODEPOINT_TYPE_UNIDENTIFIED 0
#define CODEPOINT_TYPE_NUMBER       1
#define CODEPOINT_TYPE_LETTER       2
#define CODEPOINT_TYPE_SEPARATOR    3
#define CODEPOINT_TYPE_ACCENT_MARK  4
#define CODEPOINT_TYPE_PUNCTUATION  5
#define CODEPOINT_TYPE_SYMBOL       6
#define CODEPOINT_TYPE_CONTROL      7

int unicode_cpt_type(uint32_t cp);
int unicode_cpt_type(const std::string & utf8);

What do you think about using bitfields?

struct unicode_codepoint_flags {
    // CODEPOINT_TYPE_*
    uint8_t is_undefined   : 1;
    uint8_t is_number      : 1;  // regex: \p{N}
    uint8_t is_letter      : 1;  // regex: \p{L}
    uint8_t is_separator   : 1;  // regex: \p{Z}
    uint8_t is_accent_mark : 1;  // regex: \p{M}
    uint8_t is_punctuation : 1;  // regex: \p{P}
    uint8_t is_symbol      : 1;  // regex: \p{S}
    uint8_t is_control     : 1;  // regex: \p{C}
    // helper flags
    uint8_t is_whitespace  : 1;  // regex: \s
    uint8_t is_lowercase   : 1;
    uint8_t is_uppercase   : 1;
    uint8_t is_nfd         : 1;
};

unicode_codepoint_flags unicode_cpt_flags(uint32_t cp);
unicode_codepoint_flags unicode_cpt_flags(const std::string & utf8);

This way we can track the Unicode category and additional useful flags like \s in parallel.
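
For illustration, querying the flags could look like this (a sketch against the proposed struct, assuming a table-backed unicode_cpt_flags()):

#include <cstdint>

// Illustrative helpers only; they assume the unicode_cpt_flags() declared above.
bool matches_s(uint32_t cp) {        // regex: \s
    return unicode_cpt_flags(cp).is_whitespace;
}

bool matches_L_or_N(uint32_t cp) {   // regex: [\p{L}\p{N}]
    const auto flags = unicode_cpt_flags(cp);
    return flags.is_letter || flags.is_number;
}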

@ggerganov (Owner)

Let's change the target branch to master

@reneleonhardt (Contributor)

@jaime-m-p A good solution for brute-force tests is coverage-guided fuzzing, see https://llvm.org/docs/LibFuzzer.html. Problems found this way can later be converted into unit tests.

A regex engine is forbidden because parsing by hand and syncing Unicode tables is easier, but a huge library with dozens of files is an acceptable single-header/single-source solution? 🤔
https://github.com/llvm/llvm-project/tree/main/compiler-rt/lib/fuzzer
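
For what it's worth, a libFuzzer harness for the tokenizer can be tiny. A sketch, where tokenize_round_trip() stands in for whatever tokenize/detokenize entry point the test exercises:

#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical wrapper: tokenize, detokenize, and assert the round trip inside.
void tokenize_round_trip(const std::string & text);

extern "C" int LLVMFuzzerTestOneInput(const uint8_t * data, size_t size) {
    tokenize_round_trip(std::string(reinterpret_cast<const char *>(data), size));
    return 0;  // libFuzzer keeps inputs that increase coverage
}

Built with clang++ -fsanitize=fuzzer,address, any crash reduces to a small input that can be checked in as a unit test.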

@ggerganov changed the base branch from gg/bpe-preprocess to master on May 8, 2024 at 11:52
@ggerganov (Owner)

> What do you think about using bitfields?

Yes, seems like a good idea

p.s. I changed the target branch to master

@jaime-m-p (Collaborator, Author) commented May 8, 2024

There are still models failing (bert, phi-3), but they are not related to this PR (llama-3).

> p.s. I changed the target branch to master

OK, I'm going to open another PR with a new gen-unicode-data.py proposal using bitfields.

@jaime-m-p closed this on May 8, 2024
@jaime-m-p reopened this on May 8, 2024
@ggerganov self-requested a review on May 9, 2024 at 06:41
@ggerganov (Owner) left a comment

I discovered a failing case using llama-bpe: Cửa Việt

src: 'Cửa Việt'
res: 'Cửa Việt'
tok: 34 91163 11655 26298 83 
main : failed test:    'Cửa Việt'
main : detokenized to: 'Cửa Việt' instead of 'Cửa Việt'
main : expected tokens:     34 'C',  91163 'ửa', 101798 ' Việt', 
main : got tokens:          34 'C',  91163 'ửa',  11655 ' Vi',  26298 'ệ',     83 't', 

This also fails on master

Other than that, I think this PR is great - nice job!
I couldn't run the random tokenizer tests because I'm on a Mac atm, but I'll give them a try later on my Linux box.

Will merge this after the CI is green

@mofosyne added the enhancement and review complexity: medium labels on May 9, 2024
@mofosyne (Collaborator) commented May 9, 2024

gg's intent is to merge after the CI is green... all checks have passed... merging

@mofosyne merged commit 43248e5 into ggerganov:master on May 9, 2024
65 checks passed
@jaime-m-p (Collaborator, Author)

> I discovered a failing case using llama-bpe: Cửa Việt

Oh, that one was not easy to fix.

The core of the problem is this missing if:
https://github.com/huggingface/tokenizers/blob/25aee8b88c8de3c5a52e2f9cb6281d6df00ad516/tokenizers/src/models/bpe/model.rs#L466

When the ignore_merges variable is true, it first searches for the whole word in the vocab, skipping all BPE merges.
We ended up with three sub-tokens instead of one, because there is no way to build 'Cửa Việt' by merging sub-tokens from tokenizer.json.

The ignore_merges variable is a config value inside tokenizer.json:

  "model": {
    "type": "BPE",
    "dropout": null,
    "unk_token": null,
    "continuing_subword_prefix": null,
    "end_of_word_suffix": null,
    "fuse_unk": false,
    "byte_fallback": false,
    "ignore_merges": true,
    ...
  }

I'm not sure if we need to recover this variable from the GGUF file or just hardcode it to true.

This is a patch to check that it fixes the problem (llama-3 and gpt-2 passed the brute-force tests):

diff --git a/llama.cpp b/llama.cpp
index da7c0d9f..73deb462 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -12298,6 +12298,12 @@ struct llm_tokenizer_bpe {
             int index = 0;
             size_t offset = 0;
 
+            static const bool ignore_merges = true;
+            if(ignore_merges && vocab.token_to_id.find(word) != vocab.token_to_id.end()) {
+                symbols.emplace_back(llm_symbol{-1, -1, word.c_str(), word.size()});
+                offset = word.size();
+            }
+
             while (offset < word.size()) {
                 llm_symbol sym;
                 size_t char_len = std::min(word.size() - offset, (size_t) ::utf8_len(word[offset]));

Also, I realized my brute-force test is wrong. My intention is to check all vocab words, but I'm stripping whitespace... fix to come.
