
llama3 custom regex split #6965

Merged — 88 commits, May 9, 2024

Commits (88)
6fbab2d
merged the changes from deepseeker models to main branch
jaggzh Feb 12, 2024
d2cfc22
Moved regex patterns to unicode.cpp and updated unicode.h
dragnil1 Mar 22, 2024
54f93eb
Moved header files
dragnil1 Mar 22, 2024
1c924e4
Resolved issues
dragnil1 Mar 23, 2024
4056dc5
added and refactored unicode_regex_split and related functions
dragnil1 Mar 31, 2024
c8e7d95
Updated/merged the deepseek coder pr
jaggzh Feb 12, 2024
4c3e882
Refactored code
dragnil1 Apr 13, 2024
a5710a4
Adding unicode regex mappings
dragnil1 Apr 15, 2024
7e308ed
Adding unicode regex function
dragnil1 Apr 15, 2024
feeaf4f
Added needed functionality, testing remains
dragnil1 Apr 15, 2024
7535803
Fixed issues
dragnil1 Apr 15, 2024
36d9832
Fixed issue with gpt2 regex custom preprocessor
dragnil1 Apr 17, 2024
06d3e69
unicode : fix? unicode_wstring_to_utf8
ggerganov Apr 26, 2024
c56e19d
lint : fix whitespaces
ggerganov Apr 26, 2024
7a44e44
tests : add tokenizer tests for numbers
ggerganov Apr 26, 2024
d999cf6
unicode : remove redundant headers
ggerganov Apr 26, 2024
aeafb43
tests : remove and rename tokenizer test scripts
ggerganov Apr 26, 2024
e1b2bf7
tests : add sample usage
ggerganov Apr 26, 2024
ed42711
gguf-py : reader prints warnings on duplicate keys
ggerganov Apr 26, 2024
4907e41
llama : towards llama3 tokenization support (wip)
ggerganov Apr 26, 2024
e8c206b
unicode : shot in the dark to fix tests on Windows
ggerganov Apr 26, 2024
e989176
unicode : first try custom implementations
ggerganov Apr 26, 2024
e3f6dc7
Merge branch 'master' into gg/bpe-preprocess
ggerganov Apr 26, 2024
9b4d63a
convert : add "tokenizer.ggml.pre" GGUF KV (wip)
ggerganov Apr 26, 2024
43e12ce
llama : use new pre-tokenizer type
ggerganov Apr 26, 2024
1b9b79d
convert : fix pre-tokenizer type writing
ggerganov Apr 26, 2024
8791e94
lint : fix
ggerganov Apr 26, 2024
a774d70
make : add test-tokenizer-0-llama-v3
ggerganov Apr 26, 2024
c160818
wip
ggerganov Apr 26, 2024
96965f6
models : add llama v3 vocab file
ggerganov Apr 27, 2024
ad92983
llama : adapt punctuation regex + add llama 3 regex
ggerganov Apr 27, 2024
4434c9d
minor
ggerganov Apr 27, 2024
a22645c
unicode : set bomb
ggerganov Apr 27, 2024
2affd0b
unicode : set bomb
ggerganov Apr 27, 2024
ce5485a
unicode : always use std::wregex
ggerganov Apr 27, 2024
91eaa41
unicode : support \p{N}, \p{L} and \p{P} natively
ggerganov Apr 27, 2024
581c4a0
unicode : try fix windows
ggerganov Apr 27, 2024
b97add5
unicode : category support via std::regex
ggerganov Apr 28, 2024
d63cc90
Merge branch 'master' into gg/bpe-preprocess
ggerganov Apr 28, 2024
e972e6c
unicode : clean-up
ggerganov Apr 28, 2024
ee6d1b3
unicode : simplify
ggerganov Apr 28, 2024
e11fe2f
llama3 custom regex split
Apr 28, 2024
7642973
convert : add convert-hf-to-gguf-update.py
ggerganov Apr 28, 2024
4e3e6d8
lint : update
ggerganov Apr 28, 2024
1c888eb
convert : add falcon
ggerganov Apr 28, 2024
1545550
unicode : normalize signatures
ggerganov Apr 28, 2024
491f233
lint : fix
ggerganov Apr 28, 2024
e8dd4a1
lint : fix
ggerganov Apr 28, 2024
02fd977
convert : remove unused functions
ggerganov Apr 28, 2024
0f9058c
convert : add comments
ggerganov Apr 28, 2024
7808150
convert : exercise contractions
ggerganov Apr 28, 2024
5cc4b2c
Using char32_t for codepoints
Apr 28, 2024
7b1210f
lint : fix
ggerganov Apr 28, 2024
6e4d2af
already exists unicode_tolower()
Apr 28, 2024
2a48873
Typing
Apr 28, 2024
0cf9ed3
Restore BOM
Apr 28, 2024
ef4cca9
cmake : refactor test targets
ggerganov Apr 29, 2024
43708d2
tests : refactor vocab tests
ggerganov Apr 29, 2024
c68d259
tests : add more vocabs and tests
ggerganov Apr 29, 2024
af05268
unicode : cleanup
ggerganov Apr 29, 2024
c21ab18
scripts : ignore new update script in check-requirements.sh
ggerganov Apr 29, 2024
866e394
Merge branch 'ggerganov:gg/bpe-preprocess' into gg/bpe-preprocess
jaime-m-p Apr 29, 2024
a0c870d
Fix merge
Apr 29, 2024
120cf37
models : add phi-3, mpt, gpt-2, starcoder
ggerganov Apr 29, 2024
9a7d430
tests : disable obsolete
ggerganov Apr 29, 2024
6d6ce93
tests : use faster bpe test
ggerganov Apr 29, 2024
3202676
llama : more prominent warning for old BPE models
ggerganov Apr 29, 2024
80cb312
tests : disable test-tokenizer-1-bpe due to slowness
ggerganov Apr 29, 2024
b66cdd1
Merge remote-tracking branch 'upstream/gg/bpe-preprocess' into gg/bpe…
Apr 29, 2024
5c38f6e
Move unused variable value
Apr 29, 2024
1d8fcc0
GPT2 custom regex split
Apr 29, 2024
2cd1eb0
Add alternative regex for custom aplit llama3
jaime-m-p Apr 30, 2024
0c6d820
Style
Apr 30, 2024
3e3e283
Add bruteforce random tests for token encoding
May 3, 2024
4d441e4
wip: fixing unicode codepoint ranges
May 3, 2024
798b576
Merge remote-tracking branch 'upstream/master' into gg/bpe-preprocess
May 4, 2024
69a49ac
Fix merge
May 4, 2024
8fd849e
Unicode tables: separator, lowercase, uppercase and whitespace
May 4, 2024
67832e5
llama3 custom regex split: fix \s
May 4, 2024
edf375d
Restore BOM
May 4, 2024
a5fa2fe
Style
May 7, 2024
def3d13
wip: generate NDF table
May 7, 2024
7761f8e
Ignore special tokens for testing
May 7, 2024
70ca1fe
Clean gen-unicode-data.py
May 8, 2024
77cbb79
Refactor random tokenizer test
May 8, 2024
ea47119
Merge branch 'master' into gg/bpe-preprocess
jaime-m-p May 8, 2024
8de8b6d
lint : fix
ggerganov May 9, 2024
12a7b69
tests : add fail test for llama-bpe
ggerganov May 9, 2024
1 change: 1 addition & 0 deletions convert-hf-to-gguf-update.py
@@ -257,6 +257,7 @@ def get_vocab_base_pre(self, tokenizer) -> str:
             "3333333",
             "33333333",
             "333333333",
+            # "Cửa Việt", # llama-bpe fails on this
             chktxt,
         ]

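The probe strings above feed `get_vocab_base_pre`, which fingerprints a tokenizer by encoding a fixed set of test strings and hashing the result. A minimal sketch of that idea, with `fake_encode` as a hypothetical stand-in for a real `tokenizer.encode()` (the helper names here are illustrative, not the script's actual API):

```python
from hashlib import sha256

def fake_encode(text):
    # Placeholder tokenizer: any deterministic text -> token-id mapping works
    # for demonstrating the fingerprinting idea.
    return [ord(c) % 251 for c in text]

def vocab_fingerprint(texts):
    # Encode every probe string and hash the concatenated token ids, so two
    # tokenizers that pre-tokenize differently produce different digests.
    ids = [fake_encode(t) for t in texts]
    return sha256(str(ids).encode()).hexdigest()

probes = ["3333333", "33333333", "333333333"]
print(vocab_fingerprint(probes)[:16])
```

The digest is stable for a given tokenizer behavior, which is what lets a lookup table map fingerprints to pre-tokenizer type names.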
2 changes: 1 addition & 1 deletion llama.cpp
@@ -12488,7 +12488,7 @@ struct llm_tokenizer_wpm {
                     continue;
                 }
                 code = unicode_tolower(code);
-                if (type == CODEPOINT_TYPE_WHITESPACE) {
+                if (type == CODEPOINT_TYPE_SEPARATOR) {
                     code = ' ';
                 }
                 std::string s = unicode_cpt_to_utf8(code);
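The rename matters because the regex class `\s` (whitespace) and the Unicode separator categories `Z*` are not the same set: tab and newline match `\s` but carry general category `Cc` (control), while U+00A0 NO-BREAK SPACE is a `Zs` separator. A quick check of those categories:

```python
import unicodedata

# '\t' and '\n' are whitespace in the \s sense but are category Cc, not Z*;
# ' ' and U+00A0 are genuine Zs separators.
for ch in ('\t', '\n', ' ', '\u00a0'):
    print('U+%04X' % ord(ch), unicodedata.category(ch))
```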
78 changes: 38 additions & 40 deletions scripts/gen-unicode-data.py
@@ -1,31 +1,14 @@
 import regex
 
 
-def cpt_to_utf8_str(cpt):
-    if cpt <= 0xFF:
-        return bytes([cpt, 0, 0, 0])
-    elif cpt <= 0xFFFF:
-        return bytes([cpt & 0xFF, cpt >> 8, 0, 0])
-    elif cpt <= 0xFFFFFF:
-        return bytes([cpt & 0xFF, (cpt >> 8) & 0xFF, (cpt >> 16) & 0xFF, 0])
-    else:
-        return bytes([cpt & 0xFF, (cpt >> 8) & 0xFF, (cpt >> 16) & 0xFF, cpt >> 24])
-
-
-def is_match(codepoint, regex_expr):
-    try:
-        res = regex.match(regex_expr, cpt_to_utf8_str(codepoint).decode('utf-32'))
-        return res is not None
-    except Exception:
-        return False
-
-
 def get_matches(regex_expr):
     regex_expr_compiled = regex.compile(regex_expr)
     unicode_ranges = []
     current_range = None
 
     for codepoint in range(0x110000):
-        if is_match(codepoint, regex_expr):
+        char = chr(codepoint)
+        if regex_expr_compiled.match(char):
             if current_range is None:
                 current_range = [codepoint, codepoint]
             else:
@@ -40,27 +23,42 @@ def get_matches(regex_expr):
     return unicode_ranges
 
 
-def print_cat(cat, ranges):
-    print("const std::vector<std::pair<uint32_t, uint32_t>> unicode_ranges_{} = {{".format(cat))  # noqa: NP100
-    cnt = 0
-    for start, end in ranges:
-        if cnt % 4 != 0:
-            print(" ", end="")  # noqa: NP100
-        print("{{0x{:08X}, 0x{:08X}}},".format(start, end), end="")  # noqa: NP100
-        if cnt % 4 == 3:
-            print("")  # noqa: NP100
-        cnt += 1
-
-    if cnt % 4 != 0:
-        print("")  # noqa: NP100
+def print_cat(mode, cat, ranges):
+    if mode == "range":
+        print("const std::vector<std::pair<uint32_t, uint32_t>> unicode_ranges_{} = {{".format(cat))  # noqa: NP100
+    if mode == "map":
+        print("const std::map<uint32_t, uint32_t> unicode_map_{} = {{".format(cat))  # noqa: NP100
+    for i, values in enumerate(ranges):
+        end = ",\n" if (i % 4 == 3 or i + 1 == len(ranges)) else ", "
+        values = ["0x%08X" % value for value in values]
+        print("{" + ", ".join(values) + "}", end=end)  # noqa: NP100
     print("};")  # noqa: NP100
     print("")  # noqa: NP100
 
 
-print_cat("number", get_matches(r'\p{N}'))
-print_cat("letter", get_matches(r'\p{L}'))
-print_cat("whitespace", get_matches(r'\p{Z}'))
-print_cat("accent_mark", get_matches(r'\p{M}'))
-print_cat("punctuation", get_matches(r'\p{P}'))
-print_cat("symbol", get_matches(r'\p{S}'))
-print_cat("control", get_matches(r'\p{C}'))
+print_cat("range", "number", get_matches(r'\p{N}'))
+print_cat("range", "letter", get_matches(r'\p{L}'))
+print_cat("range", "separator", get_matches(r'\p{Z}'))
+print_cat("range", "accent_mark", get_matches(r'\p{M}'))
+print_cat("range", "punctuation", get_matches(r'\p{P}'))
+print_cat("range", "symbol", get_matches(r'\p{S}'))
+print_cat("range", "control", get_matches(r'\p{C}'))
+
+print_cat("range", "whitespace", get_matches(r'\s'))
+
+
+map_lowercase = []
+map_uppercase = []
+for codepoint in range(0x110000):
+    char = chr(codepoint)
+    lower = ord(char.lower()[0])
+    upper = ord(char.upper()[0])
+    if codepoint != lower:
+        map_lowercase.append((codepoint, lower))
+    if codepoint != upper:
+        map_uppercase.append((codepoint, upper))
+print_cat("map", "lowercase", map_lowercase)
+print_cat("map", "uppercase", map_uppercase)
+
+
+# TODO: generate unicode_map_nfd
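The script emits sorted, non-overlapping `{start, end}` range tables so that membership tests (e.g. "is this codepoint a number?") can be answered with a binary search instead of a regex engine. A small sketch of such a lookup — the two ranges below (ASCII and Arabic-Indic digits) are just illustrative entries, not the full generated table:

```python
import bisect

# Sorted, non-overlapping codepoint ranges, as in the generated C++ tables.
ranges = [(0x0030, 0x0039), (0x0660, 0x0669)]
starts = [s for s, _ in ranges]

def in_ranges(cpt):
    # Find the last range whose start is <= cpt, then check its end.
    i = bisect.bisect_right(starts, cpt) - 1
    return i >= 0 and cpt <= ranges[i][1]

print(in_ranges(ord('7')))   # True
print(in_ranges(ord('A')))   # False
```

This mirrors the design choice in the PR: ship static tables and do O(log n) lookups at tokenization time, rather than depend on `std::regex` Unicode-property support that varies across platforms.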