Commit

falcon : fix regex
ggerganov committed May 2, 2024
1 parent 3a461db commit 3275e60
Showing 1 changed file with 1 addition and 2 deletions.
3 changes: 1 addition & 2 deletions llama.cpp
@@ -12212,14 +12212,13 @@ struct llm_tokenizer_bpe {
 "\\s?\\p{L}+",
 "\\s?\\p{P}+",
 "[一-龥ࠀ-一가-퟿]+",
-"\\p{N}+",
+"\\p{N}",
 });
 break;
 case LLAMA_VOCAB_PRE_TYPE_FALCON:
 word_collection = unicode_regex_split(text, {
 "[\\p{P}\\$\\+<=>\\^~\\|]+",
 "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
-"\\p{N}+",
 "[0-9][0-9][0-9]",
 });
 break;
