Add BPE tokenizers #62

Open. Wants to merge 7 commits into base: main.

angeloskath
Member

This PR is built on top of #61; I will rebase to simplify the diff once that is merged. Please ignore the "replace"-related code here.

The changes in this PR support an implementation of BPE tokenizers. It adds functionality to the Trie and a BPEMerges data structure, which is a thin wrapper on top of a map of maps. There are no backwards-incompatible changes except in read_trie_from_spm, which no longer changes the space character; this results in tokenizations closer to SPM even when not using BPE.
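
For readers unfamiliar with the "map of maps" idea, here is a minimal Python sketch of what such a wrapper could look like. It is an illustration only, not the code in this PR: the method names (add, can_merge, merge) and the use of integer ids are assumptions.

```python
class BPEMerges:
    """Hypothetical sketch: (left id, right id) -> merged id via nested dicts."""

    def __init__(self):
        # Outer key: left token id. Inner key: right token id. Value: merged id.
        self._merges = {}

    def add(self, left, right, merged):
        self._merges.setdefault(left, {})[right] = merged

    def can_merge(self, left, right):
        return right in self._merges.get(left, {})

    def merge(self, left, right):
        """Return the merged id, or None when the pair has no merge rule."""
        return self._merges.get(left, {}).get(right)


merges = BPEMerges()
merges.add(0, 1, 2)  # e.g. "a" (0) + "b" (1) -> "ab" (2)
```

Nesting by the left id means a lookup for a pair is two hash lookups and needs no tuple or string key to be built per query.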

Trie

  • Most things now work with iterators internally, which removes a number of std::vector<char> creations and copies.
  • Add search_longest_prefix, an operation the Trie is well suited for.
  • Add the ability to set the id when inserting.
  • Change the container that holds the keys from a vector to an unordered_map to support the above.
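
The Trie changes listed above can be sketched in Python as follows. This is a simplified illustration under assumptions, not a port of the C++ code: the class layout, the key_id parameter name and the (id, length) return value are all hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.id = None  # set only on nodes that terminate a key


class Trie:
    def __init__(self):
        self.root = TrieNode()
        # id -> key. A dict (not a list) because explicit ids need not be contiguous.
        self.keys = {}

    def insert(self, key, key_id=None):
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        if node.id is None:
            if key_id is None:
                key_id = max(self.keys, default=-1) + 1  # next unused id
            node.id = key_id
            self.keys[key_id] = key
        return node.id

    def search_longest_prefix(self, text, start=0):
        """Return (id, length) of the longest key prefixing text[start:], or (None, 0)."""
        node = self.root
        best = (None, 0)
        # Walk by index so no substring or buffer copy is made.
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.id is not None:
                best = (node.id, i - start + 1)
        return best


trie = Trie()
trie.insert("a", key_id=10)
trie.insert("ab", key_id=20)
trie.insert("abcd", key_id=30)
print(trie.search_longest_prefix("abcx"))  # (20, 2)
```

The walk stops at the first character with no child, and the last terminal node seen along the way is the longest matching key.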

BPE

  • BPETokenizer::tokenize is the most interesting function. It is not the prettiest implementation, but it is fast and beats SPM on my laptop. There is possible room for improvement in lines 135-160, where we find neighbors with a linear search.
  • read_bpe_from_spm, ironically, implements a small BPE in Python to extract the merges from the file.
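
As a reference point for the discussion above, this is the textbook greedy BPE loop in Python, with a linear scan over neighboring tokens at each step. It is a sketch and not the implementation in BPETokenizer::tokenize; the function name, the string tokens and the rank-based merges dict are assumptions made for the example.

```python
def bpe_tokenize(text, merges):
    """Greedy BPE: repeatedly merge the adjacent pair with the lowest rank.

    `merges` maps (left, right) string pairs to a rank; lower ranks merge first.
    """
    tokens = list(text)  # start from single characters
    while len(tokens) > 1:
        # Linear scan over neighbors for the best-ranked mergeable pair.
        best_rank, best_i = None, None
        for i in range(len(tokens) - 1):
            rank = merges.get((tokens[i], tokens[i + 1]))
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_i = rank, i
        if best_i is None:
            break  # no adjacent pair can be merged
        # Replace the two neighbors with their concatenation.
        tokens[best_i : best_i + 2] = [tokens[best_i] + tokens[best_i + 1]]
    return tokens


example_merges = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
print(bpe_tokenize("lower", example_merges))  # ['low', 'er']
```

Each pass rescans the whole sequence, so this version is quadratic in the input length; a priority queue over candidate pairs is the usual way to avoid the rescans.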

TL;DR

The following implements SPM tokenization, so far with results identical to those of SPM or HF.

symbols, merges = read_bpe_from_spm("tokenizer.model")
ds = (
    ds
    .pad("text", 0, 1, 0, ord(" "))
    .replace("text", " ", "\u2581")
    .tokenize_bpe("text", symbols, merges)
)
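
To make the two preprocessing steps in the pipeline above concrete, here is a plain-Python equivalent on a single string. It assumes the pad arguments mean (key, dimension, left padding, right padding, pad value), so one space is prepended; spm_preprocess is a hypothetical helper name used only for this illustration.

```python
def spm_preprocess(text):
    """Prepend one space, then map every space to U+2581 (the SPM word marker)."""
    # Assumed to mirror pad("text", 0, 1, 0, ord(" ")): one pad element on the left.
    padded = " " + text
    # Mirrors replace("text", " ", "\u2581").
    return padded.replace(" ", "\u2581")


print(spm_preprocess("hello world"))  # ▁hello▁world
```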
