Hey, so I've noticed that when using the `/tokenize` endpoint with mistral-7b models, a space gets prepended to the content. E.g. tokenizing `The` returns the ID for ` The` (with a leading space), and subsequently, trying to tokenize ` The` actually returns the IDs for ` ` and ` The`, which is a real headache.
After digging around for quite a while, I noticed that the `tokenizer.json` file that's included with the `.safetensors` weights has the following code:
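(The snippet didn't paste in above; for reference, the normalizer section of Mistral-7B's `tokenizer.json` typically looks like the following in standard Hugging Face checkpoints — exact fields may vary by tokenizer version:)

```json
"normalizer": {
  "type": "Sequence",
  "normalizers": [
    { "type": "Prepend", "prepend": "▁" },
    { "type": "Replace", "pattern": { "String": " " }, "content": "▁" }
  ]
}
```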
I was wondering if this is the cause of my problem, and if so, whether there's any way to disable this normalization step for the `/tokenize` endpoint in llama.cpp.
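If that normalizer is the culprit, the behavior is easy to reproduce in isolation. Below is a toy model of what a SentencePiece-style `Sequence` normalizer does (assuming the usual `Prepend "▁"` then `Replace " " → "▁"` steps — this is a sketch for illustration, not llama.cpp's actual code):

```python
def normalize(text: str) -> str:
    """Toy model of the SentencePiece-style normalizer:
    first Prepend '▁' (the space marker), then Replace every ' ' with '▁'."""
    return ("▁" + text).replace(" ", "▁")

# "The" normalizes to "▁The", so the lookup matches the
# space-prefixed token even though no space was sent.
print(normalize("The"))    # ▁The

# Sending " The" back yields "▁▁The": the prepended marker plus the
# one produced from the explicit space -- hence the extra leading ID.
print(normalize(" The"))   # ▁▁The
```

This is why round-tripping detokenized text through `/tokenize` keeps accumulating leading-space tokens.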