command-r : add BPE pre-tokenization (#7063)
* Add BPE pre-tokenization for Command-R/R+.
* Bump transformers convert requirement.
* command-r : add individual digits regex

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
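The "individual digits" item means that each digit in a run is pre-tokenized on its own; the expected token IDs in this commit show "333" encoding to three copies of the same ID. A minimal Python sketch of that idea follows. The pattern and the name `split_digits` are illustrative assumptions, not the regex this commit adds to llama.cpp.

```python
import re

# Hypothetical pattern: either a single digit, or a run of non-digit characters.
_DIGIT_SPLIT = re.compile(r"\d|\D+")


def split_digits(text: str) -> list[str]:
    """Split text so that every digit becomes its own piece."""
    return _DIGIT_SPLIT.findall(text)


if __name__ == "__main__":
    print(split_digits("333"))
    print(split_digits("w048 7tuijk"))
```

Because the two alternatives together cover every character, joining the pieces always reproduces the input, which is the property a pre-tokenizer split needs.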
1 parent 6fbd432 commit 889bdd7
Showing 9 changed files with 168 additions and 1 deletion.
Binary file not shown.
@@ -0,0 +1,106 @@
ied 4 ½ months
__ggml_vocab_test__
Führer
__ggml_vocab_test__

__ggml_vocab_test__

__ggml_vocab_test__

__ggml_vocab_test__

__ggml_vocab_test__

__ggml_vocab_test__


__ggml_vocab_test__



__ggml_vocab_test__




__ggml_vocab_test__


__ggml_vocab_test__
Hello world
__ggml_vocab_test__
Hello world
__ggml_vocab_test__
Hello World
__ggml_vocab_test__
Hello World
__ggml_vocab_test__
Hello World!
__ggml_vocab_test__
Hello, world!
__ggml_vocab_test__
Hello, world!
__ggml_vocab_test__
this is 🦙.cpp
__ggml_vocab_test__
w048 7tuijk dsdfhu
__ggml_vocab_test__
нещо на Български
__ggml_vocab_test__
កាន់តែពិសេសអាចខលចេញ
__ggml_vocab_test__
🚀 (normal) 😶🌫️ (multiple emojis concatenated) ✅ (only emoji that has its own token)
__ggml_vocab_test__
Hello
__ggml_vocab_test__
Hello
__ggml_vocab_test__
Hello
__ggml_vocab_test__
Hello
__ggml_vocab_test__
Hello
__ggml_vocab_test__
Hello
Hello
__ggml_vocab_test__
(
__ggml_vocab_test__

=
__ggml_vocab_test__
' era
__ggml_vocab_test__
Hello, y'all! How are you 😁 ?我想在apple工作1314151天~
__ggml_vocab_test__
3
__ggml_vocab_test__
33
__ggml_vocab_test__
333
__ggml_vocab_test__
3333
__ggml_vocab_test__
33333
__ggml_vocab_test__
333333
__ggml_vocab_test__
3333333
__ggml_vocab_test__
33333333
__ggml_vocab_test__
333333333
__ggml_vocab_test__











🚀 (normal) 😶🌫️ (multiple emojis concatenated) ✅ 🦙🦙 3 33 333 3333 33333 333333 3333333 33333333 3.3 3..3 3...3 កាន់តែពិសេសអាច😁 ?我想在apple工作1314151天~ ------======= нещо на Български ''''''```````""""......!!!!!!?????? I've been 'told he's there, 'RE you sure? 'M not sure I'll make it, 'D you like some tea? We'Ve a'lL
__ggml_vocab_test__
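The input file above stores one test string per block, each terminated by a `__ggml_vocab_test__` line. Below is a minimal sketch of splitting such a file into its test strings, assuming the separator is the sentinel on a line of its own. `split_test_cases` is a hypothetical helper, not a function from the repository.

```python
# Assumption: every test string is followed by the sentinel on its own line.
SENTINEL = "\n__ggml_vocab_test__\n"


def split_test_cases(raw: str) -> list[str]:
    """Return the test strings stored in a sentinel-delimited input file."""
    cases = raw.split(SENTINEL)
    # The sentinel after the last test leaves one empty piece at the end.
    if cases and cases[-1] == "":
        cases.pop()
    return cases


if __name__ == "__main__":
    sample = (
        "Hello world\n__ggml_vocab_test__\n"
        "3\n__ggml_vocab_test__\n"
        "\n =\n__ggml_vocab_test__\n"
    )
    for case in split_test_cases(sample):
        print(repr(case))
```

Splitting on the sentinel including its surrounding newlines keeps newlines that belong to a test string, so whitespace-only and multi-line cases survive intact.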
@@ -0,0 +1,43 @@
2536 228 27 228 22957 6983
45 193433

228
1667
1742
205
206
2126
11516
34777
28339 3845
46609 3845
28339 3930
46609 3930
46609 3930 8
28339 19 3845 8
46609 19 3845 8
2075 1801 11254 107 255 21 19317
94 23 27 31 228 30 21213 20752 39267 6405 9980
4929 40071 2196 3236 8750 1764 37097 41168
38111 230 174833 38111 249 86325 241 38111 245 86325 232 38111 252 38111 123 38111 261 165 24629 38111 261 38111 103 174833 38111 235 38111 231 38111 257 38111 235 165 24629 38111 239
2226 256 230 1737 18258 16 80503 122 35927 2226 242 112 57462 1737 54457 223165 106230 2096 16 48389 1737 10203 109160 1875 2222 2517 3342 12523 16
28339
46609
228 46609
1667 46609
1742 46609
1742 46609 1856 46609
1737
206 1857
14 4515
28339 19 1770 14 1954 8 4070 1955 1933 80503 231 5691 12081 13336 2648 29325 14315 24 26 24 27 24 28 24 5123 18372
26
26 26
26 26 26
26 26 26 26
26 26 26 26 26
26 26 26 26 26 26
26 26 26 26 26 26 26
26 26 26 26 26 26 26 26
26 26 26 26 26 26 26 26 26
127731 51628 205 57788 18494 97469 126134 206 2226 256 230 1737 18258 16 80503 122 35927 2226 242 112 57462 1737 54457 223165 106230 2096 16 48389 11254 107 255 2226 107 255 228 26 228 26 26 228 26 26 26 228 26 26 26 26 228 26 26 26 26 26 228 26 26 26 26 26 26 228 26 26 26 26 26 26 26 228 26 26 26 26 26 26 26 26 228 26 21 26 228 26 2271 26 228 26 3834 26 182018 230 174833 38111 249 86325 241 38111 245 86325 232 38111 252 38111 123 38111 261 165 24629 38111 261 38111 103 174833 38111 235 188568 231 5691 12081 13336 2648 29325 14315 24 26 24 27 24 28 24 5123 18372 8391 158343 3512 40071 2196 3236 8750 1764 37097 41168 29721 32797 25646 3802 4975 4975 116167 57178 10251 154048 27292 1767 5125 2632 2155 91 2378 1919 1914 2782 19 2155 3354 1933 5470 38 2155 52 2068 5470 1767 4961 3059 1894 19 2155 43 1933 3026 2725 23186 38 2930 14 20676 1671 14 83 51
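The expected-output file above holds one line of space-separated token IDs per test string, in the same order as the inputs; the digit runs "3", "33", "333" map to one, two and three copies of ID 26. The sketch below parses that layout. `parse_expected_ids` is a hypothetical name, and the sample values are copied from the lines above.

```python
def parse_expected_ids(raw: str) -> list[list[int]]:
    """Parse one test case per line into a list of integer token IDs.

    An empty line yields an empty list, i.e. a test that produces no tokens.
    """
    return [[int(tok) for tok in line.split()] for line in raw.splitlines()]


if __name__ == "__main__":
    sample = "28339 3845\n\n26\n26 26\n26 26 26"
    expected = parse_expected_ids(sample)
    print(expected)
    # Each digit of a run is a separate token, so run length equals token count.
    for run_length, ids in enumerate(expected[2:], start=1):
        assert ids == [26] * run_length
```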
@@ -1,5 +1,5 @@
 numpy~=1.24.4
 sentencepiece~=0.1.98
-transformers>=4.35.2,<5.0.0
+transformers>=4.40.1,<5.0.0
 gguf>=0.1.0
 protobuf>=4.21.0,<5.0.0
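The bumped pin accepts transformers releases from 4.40.1 up to, but not including, 5.0.0. A rough sketch of checking a version string against that range with only the standard library is given below. `version_tuple` and `satisfies_bumped_pin` are hypothetical helpers, and the comparison ignores pre-release suffixes, so it is a simplification of full version-specifier semantics.

```python
import re


def version_tuple(version: str) -> tuple[int, int, int]:
    """Leading numeric components of a version string, padded to three parts."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"not a version string: {version!r}")
    parts = tuple(int(p) for p in match.group().split("."))
    # Pad so that "5" compares as (5, 0, 0) rather than as the shorter (5,).
    padded = (parts + (0, 0, 0))[:3]
    return (padded[0], padded[1], padded[2])


def satisfies_bumped_pin(version: str) -> bool:
    """True if version falls within >=4.40.1,<5.0.0."""
    return (4, 40, 1) <= version_tuple(version) < (5, 0, 0)


if __name__ == "__main__":
    for candidate in ("4.35.2", "4.40.1", "4.41.0", "5.0.0"):
        print(candidate, satisfies_bumped_pin(candidate))
```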