
llama3 custom regex split #6965

Merged: 88 commits into ggerganov:master on May 9, 2024

Conversation

jaime-m-p (Collaborator)

Implementation of unicode_regex_split_custom_llama3().

@jaime-m-p (Collaborator, Author)

As you can see in the commits:

  • Renamed unicode_ranges_whitespace to unicode_ranges_separator to align with the Unicode category name.
  • Renamed CODEPOINT_TYPE_WHITESPACE to CODEPOINT_TYPE_SEPARATOR to align with the Unicode category name.
  • Then I built a new table unicode_ranges_whitespace matching the \s codepoints, used in unicode_cpt_is_whitespace().
  • Rebuilt the table unicode_ranges_control, fixing the problem with the Unicode BOM (see below).

Finally, unicode_regex_split_custom_llama3() is working again, fixing the \s regexes.
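
For reference, this is roughly how such a range table can be queried. A minimal sketch, assuming the table is a sorted list of inclusive {first, last} pairs as emitted by gen-unicode-data.py; the binary search is my illustration, not necessarily what unicode.cpp does:

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sorted table of inclusive {first, last} codepoint ranges.
static const std::vector<std::pair<uint32_t, uint32_t>> unicode_ranges_whitespace = {
    {0x0009, 0x000D}, {0x0020, 0x0020}, /* ... */ {0x3000, 0x3000},
};

static bool unicode_cpt_is_whitespace(uint32_t cpt) {
    // find the first range whose upper bound is >= cpt, then check the lower bound
    auto it = std::lower_bound(
        unicode_ranges_whitespace.begin(), unicode_ranges_whitespace.end(), cpt,
        [](const std::pair<uint32_t, uint32_t> & range, uint32_t c) { return range.second < c; });
    return it != unicode_ranges_whitespace.end() && it->first <= cpt;
}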

In gen-unicode-data.py, the bytes.decode() call fails for the Unicode BOM \uFEFF:

bytes([0xFF,0xFD,0,0]).decode("utf-32")  # Ok, returns '\ufdff'
bytes([0xFF,0xFE,0,0]).decode("utf-32")  # Fail, returns empty string '', expected '\ufeff'
bytes([0xFF,0xFF,0,0]).decode("utf-32")  # Ok, returns '\uffff'
bytes([0xFE,0xFE,0,0]).decode("utf-32")  # Ok, returns '\ufefe'

So I used ord() and chr() to encode and decode codepoints instead.

@jaime-m-p (Collaborator, Author)

I'm also looking into generating the table unicode_nfd_ranges.
I noticed that only the first codepoint of each decomposition is used, so there is no need to store it in a multimap.
But somehow the table is much longer now than before.
When normalizing, many consecutive codepoints map to the same NFD codepoint, so it's possible to encode the table with ranges: range_start, range_last, codepoint_nfd.
Doing so, the table becomes smaller (fewer than 1500 entries).
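
A minimal sketch of what that range-encoded table and its lookup could look like (struct and table names are illustrative, not the generated code):

#include <algorithm>
#include <cstdint>
#include <vector>

// One entry covers a whole run of codepoints sharing the same NFD codepoint.
struct range_nfd {
    uint32_t first;  // range_start
    uint32_t last;   // range_last (inclusive)
    uint32_t nfd;    // first codepoint of the NFD decomposition
};

// Hypothetical generated table, sorted by 'first' (fewer than 1500 entries).
static const std::vector<range_nfd> unicode_ranges_nfd = {
    {0x00C0, 0x00C5, 0x0041},  // À..Å -> A
    /* ... */
};

static uint32_t unicode_cpt_nfd(uint32_t cpt) {
    // binary search: first range whose upper bound is >= cpt
    auto it = std::lower_bound(
        unicode_ranges_nfd.begin(), unicode_ranges_nfd.end(), cpt,
        [](const range_nfd & range, uint32_t c) { return range.last < c; });
    return (it != unicode_ranges_nfd.end() && it->first <= cpt) ? it->nfd : cpt;
}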

@jaime-m-p (Collaborator, Author)

> We don't have to use wikitext specifically - if we generate or find a more diverse dataset, we can use that for tokenization tests

Maybe we can generate a not-so-random corpus that tries to cover all token categories.
Something like generating tokens from a regex ...
Let me think about this, but I'm not sure how much it can grow.
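
For example, a generator could bucket codepoints by category using the existing unicode_cpt_type() and sample words from each bucket. A rough sketch, where unicode_cpt_to_utf8() is assumed to be the existing codepoint-to-UTF-8 helper:

#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <vector>

int         unicode_cpt_type(uint32_t cp);     // existing category lookup
std::string unicode_cpt_to_utf8(uint32_t cp);  // assumed existing codepoint -> UTF-8 helper

// Buckets can be built once by scanning the codepoint space (skipping surrogates):
//   for (uint32_t cp = 0; cp <= 0x10FFFF; ++cp) buckets[unicode_cpt_type(cp)].push_back(cp);

// Sample one word of 'len' codepoints from a single category bucket.
static std::string gen_word(std::mt19937 & rng, const std::vector<uint32_t> & bucket, size_t len) {
    std::uniform_int_distribution<size_t> pick(0, bucket.size() - 1);
    std::string word;
    for (size_t i = 0; i < len; ++i) {
        word += unicode_cpt_to_utf8(bucket[pick(rng)]);
    }
    return word;
}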

@jaime-m-p (Collaborator, Author)

Instead of doing:

#define CODEPOINT_TYPE_UNIDENTIFIED 0
#define CODEPOINT_TYPE_NUMBER       1
#define CODEPOINT_TYPE_LETTER       2
#define CODEPOINT_TYPE_SEPARATOR    3
#define CODEPOINT_TYPE_ACCENT_MARK  4
#define CODEPOINT_TYPE_PUNCTUATION  5
#define CODEPOINT_TYPE_SYMBOL       6
#define CODEPOINT_TYPE_CONTROL      7

int unicode_cpt_type(uint32_t cp);
int unicode_cpt_type(const std::string & utf8);

What do you think about using bitfields?

struct unicode_codepoint_flags {
    // CODEPOINT_TYPE_*
    uint8_t is_undefined   : 1;
    uint8_t is_number      : 1;  // regex: \p{N}
    uint8_t is_letter      : 1;  // regex: \p{L}
    uint8_t is_separator   : 1;  // regex: \p{Z}
    uint8_t is_accent_mark : 1;  // regex: \p{M}
    uint8_t is_punctuation : 1;  // regex: \p{P}
    uint8_t is_symbol      : 1;  // regex: \p{S}
    uint8_t is_control     : 1;  // regex: \p{C}
    // helper flags
    uint8_t is_whitespace  : 1;  // regex: \s
    uint8_t is_lowercase   : 1;
    uint8_t is_uppercase   : 1;
    uint8_t is_nfd         : 1;
};

unicode_codepoint_flags unicode_cpt_flags(uint32_t cp);
unicode_codepoint_flags unicode_cpt_flags(const std::string & utf8);

This way we can track the Unicode category and additional useful flags like \s in parallel.
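
For illustration, querying the flags could look like this (a sketch against the proposed struct, assuming a table-backed unicode_cpt_flags()):

#include <cstdint>

// Illustrative helpers only; they assume the unicode_cpt_flags() declared above.
bool matches_s(uint32_t cp) {        // regex: \s
    return unicode_cpt_flags(cp).is_whitespace;
}

bool matches_L_or_N(uint32_t cp) {   // regex: [\p{L}\p{N}]
    const auto flags = unicode_cpt_flags(cp);
    return flags.is_letter || flags.is_number;
}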

@ggerganov (Owner)

Let's change the target branch to master

@reneleonhardt (Contributor)

@jaime-m-p A good solution for brute-force tests is coverage-guided fuzzing, see https://llvm.org/docs/LibFuzzer.html. Problems found this way can later be converted into unit tests.

A regex engine is forbidden because parsing by hand and syncing Unicode tables is easier, but a huge library with dozens of files is an acceptable single-header/single-source solution? 🤔
https://github.com/llvm/llvm-project/tree/main/compiler-rt/lib/fuzzer
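
For what it's worth, a libFuzzer harness for the tokenizer can be tiny. A sketch, where tokenize_round_trip() stands in for whatever tokenize/detokenize entry point the test exercises:

#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical wrapper: tokenize, detokenize, and assert the round trip inside.
void tokenize_round_trip(const std::string & text);

extern "C" int LLVMFuzzerTestOneInput(const uint8_t * data, size_t size) {
    tokenize_round_trip(std::string(reinterpret_cast<const char *>(data), size));
    return 0;  // libFuzzer keeps inputs that increase coverage
}

Built with clang++ -fsanitize=fuzzer,address, any crash reduces to a small input that can be checked in as a unit test.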

@ggerganov changed the base branch from gg/bpe-preprocess to master on May 8, 2024 at 11:52
@ggerganov (Owner)

> What do you think about using bitfields?

Yes, seems like a good idea

p.s. I changed the target branch to master

@jaime-m-p (Collaborator, Author) commented May 8, 2024

There are still models failing (bert, phi-3), but they are not related to this PR (llama-3).

> p.s. I changed the target branch to master

OK, I'm going to open another PR with a new gen-unicode-data.py proposal using bitfields.

@jaime-m-p closed this on May 8, 2024
@jaime-m-p reopened this on May 8, 2024
@ggerganov self-requested a review on May 9, 2024 at 06:41
@ggerganov (Owner) left a comment

I discovered a failing case using llama-bpe: Cửa Việt

src: 'Cửa Việt'
res: 'Cửa Việt'
tok: 34 91163 11655 26298 83 
main : failed test:    'Cửa Việt'
main : detokenized to: 'Cửa Việt' instead of 'Cửa Việt'
main : expected tokens:     34 'C',  91163 'ửa', 101798 ' Việt', 
main : got tokens:          34 'C',  91163 'ửa',  11655 ' Vi',  26298 'ệ',     83 't', 

This also fails on master

Other than that, I think this PR is great - nice job!
I couldn't run the random tokenizer tests because I'm on a Mac atm, but I'll give them a try later on my Linux box.

Will merge this after the CI is green

@mofosyne added the enhancement and review complexity: medium labels on May 9, 2024
@mofosyne (Collaborator) commented May 9, 2024

gg's intent is to merge after the CI is green... all checks have passed... merging

@mofosyne merged commit 43248e5 into ggerganov:master on May 9, 2024
65 checks passed
@jaime-m-p (Collaborator, Author)

> I discovered a failing case using llama-bpe: Cửa Việt

Oh, that one was not easy to fix.

The core of the problem is this missing if:
https://github.com/huggingface/tokenizers/blob/25aee8b88c8de3c5a52e2f9cb6281d6df00ad516/tokenizers/src/models/bpe/model.rs#L466

When the ignore_merges variable is true, it first searches for the whole word in the vocab, skipping all BPE merges.
We ended up with three sub-tokens instead of one, because there is no way to build 'Cửa Việt' by merging sub-tokens from tokenizer.json.

The ignore_merges variable is a config value inside tokenizer.json:

  "model": {
    "type": "BPE",
    "dropout": null,
    "unk_token": null,
    "continuing_subword_prefix": null,
    "end_of_word_suffix": null,
    "fuse_unk": false,
    "byte_fallback": false,
    "ignore_merges": true,
    ...
  }

I'm not sure if we need to recover this variable from the GGUF file or just hardcode it to true.

This is a patch to check that it fixes the problem (llama-3 and gpt-2 passed the brute-force tests):

diff --git a/llama.cpp b/llama.cpp
index da7c0d9f..73deb462 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -12298,6 +12298,12 @@ struct llm_tokenizer_bpe {
             int index = 0;
             size_t offset = 0;
 
+            static const bool ignore_merges = true;
+            if(ignore_merges && vocab.token_to_id.find(word) != vocab.token_to_id.end()) {
+                symbols.emplace_back(llm_symbol{-1, -1, word.c_str(), word.size()});
+                offset = word.size();
+            }
+
             while (offset < word.size()) {
                 llm_symbol sym;
                 size_t char_len = std::min(word.size() - offset, (size_t) ::utf8_len(word[offset]));

Also, I realized my brute-force test is wrong. My intention is to check all vocab words, but I'm stripping whitespace... fix to come.
