Hey, so I've noticed that when using the `/tokenize` endpoint with mistral-7b models, a space gets prepended to the content. E.g. tokenizing `The` returns the ID for ` The` (with a leading space), and subsequently, trying to tokenize ` The` actually returns the IDs for ` ` and ` The`, which is a real headache.
After digging around for quite a while, I noticed that the `tokenizer.json` file that's included with the `.safetensors` weights has the following code:
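(The snippet didn't paste in above; for reference, the normalizer section of Mistral-7B's `tokenizer.json` typically looks like the following in standard Hugging Face checkpoints — exact fields may vary by tokenizer version:)

```json
"normalizer": {
  "type": "Sequence",
  "normalizers": [
    { "type": "Prepend", "prepend": "▁" },
    { "type": "Replace", "pattern": { "String": " " }, "content": "▁" }
  ]
}
```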
I was wondering if this is the cause of my problem, and if so, whether there's any way to disable this normalization step for the `/tokenize` endpoint in llama.cpp.
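If that normalizer is the culprit, the behavior is easy to reproduce in isolation. Below is a toy model of what a SentencePiece-style `Sequence` normalizer does (assuming the usual `Prepend "▁"` then `Replace " " → "▁"` steps — this is a sketch for illustration, not llama.cpp's actual code):

```python
def normalize(text: str) -> str:
    """Toy model of the SentencePiece-style normalizer:
    first Prepend '▁' (the space marker), then Replace every ' ' with '▁'."""
    return ("▁" + text).replace(" ", "▁")

# "The" normalizes to "▁The", so the lookup matches the
# space-prefixed token even though no space was sent.
print(normalize("The"))    # ▁The

# Sending " The" back yields "▁▁The": the prepended marker plus the
# one produced from the explicit space -- hence the extra leading ID.
print(normalize(" The"))   # ▁▁The
```

This is why round-tripping detokenized text through `/tokenize` keeps accumulating leading-space tokens.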