
convert.py: When --vocab-only is passed, generate false but valid params #7027

Merged
merged 2 commits into ggerganov:master on May 8, 2024

Conversation

@20kdc (Contributor) commented May 1, 2024

Vocab files presently include their source model hyperparameter information.

'Faking it' allows vocab and model creation solely from tokenizer.model or similar.

An example of how this might be used in the style of baby-llama (this should be considered under the same MIT license as the rest of this PR):
example.zip

Particular applications of these custom vocabs may be non-language, non-safety-critical uses of LLMs where the versatility of the LLM is useful (I was thinking virtual pets, personally), or small models being trained to work with languages where dedicating more tokenization effort to the language may help boost performance (for much the same reasons that tokens are used in the first place).

Note that if --pad-vocab is given, then this would alter the vocab based on the real params, so the params must be loaded in this case.
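
For illustration, here is a minimal sketch of the 'faking it' idea in this PR, written against an assumed Params-like structure (the field names and placeholder values below are illustrative assumptions, not convert.py's exact dataclass): only n_vocab carries real information, and every other hyperparameter is filled with a small dummy value so the resulting vocab file still loads.

# Hypothetical sketch; field names are illustrative, not convert.py's exact dataclass.
from dataclasses import dataclass

@dataclass
class FakeParams:
    n_vocab: int              # tied to the tokenizer, so this value is real
    n_embd: int = 1           # everything below is a dummy to keep loaders happy
    n_layer: int = 1
    n_ctx: int = 1
    n_ff: int = 1
    n_head: int = 1
    n_head_kv: int = 1
    f_norm_eps: float = 1e-5

def fake_params_for_vocab_only(vocab_size: int) -> FakeParams:
    # "False but valid": structurally complete params derived from the tokenizer alone.
    return FakeParams(n_vocab=vocab_size)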

…ams to allow vocab creation solely from tokenizer.model

An example of how this might be used in the style of baby-llama will be attached with this PR.
@teleprint-me (Contributor) commented May 2, 2024

You can extract the vocab and build the model from the extracted vocab already. Is this PR supposed to create a "dummy" vocab from an existing one? If so, why not create a script that gives more fine-grained control to specify the desired params?

I think this idea has potential; I'm just not sure if this is the way to go about it. Just my two cents; take what I'm saying with a grain of salt.

@20kdc (Contributor, Author) commented May 2, 2024

You can extract the vocab and build the model from the extracted vocab already. Is this PR supposed to create a "dummy" vocab from an existing one? If so, why not create a script that gives more fine-grained control to specify the desired params?

No, this is the opposite. I highly recommend looking at the example ZIP to see what it's doing; in short, the example ZIP does not come with any model or model params.
This is for synthesis of entirely original vocabs and models from those vocabs.

@teleprint-me (Contributor) commented May 2, 2024

Is there an uncompressed version? I don't download random zip files.

Better yet, instructions on how to reproduce the results.

I can pull your branch in and test that way after reviewing the changes more in depth.

@20kdc (Contributor, Author) commented May 2, 2024

test.txt:

#
 #
  #
   #
  #
 #
#
 #
  #
   #
  #
 #
#
 #

(...repeat ad infinitum)

build-vocab:

#!/bin/sh
spm_train --input vocab-src.txt --model_prefix tokenizer --vocab_size 261 --byte_fallback true
../llama.cpp/convert.py . --vocab-only --vocab-type spm --outfile vocab.gguf

(edit: oops, the mv was on the wrong line and I missed it because I hadn't cleaned out the test environment)
(edit 2: turns out the .vocab file is not actually what's wanted and the mv is useless. fixed now)

train:

#!/bin/sh
../llama.cpp/build/bin/train-text-from-scratch --vocab-model vocab.gguf --train-data test.txt

vocab-src.txt:

#
 #
  #
   #

  #
 #

#
   #

The scripts are the instructions, because these require a lot of fiddly command-line options and manually typing them every time is very annoying.

@teleprint-me (Contributor) commented May 3, 2024

Yeah, I'm sold. I think this is an excellent idea. I think we could probably create a custom script for this.

I'm thinking maybe we can do a convert-spm-to-gguf.py. It can take a tokenizer.model as input and then output a custom tokenizer.gguf. What do you think?

Something to note, as an aside, is that I think this is a great short-term solution, but a more robust long-term solution might be to properly implement a custom tokenizer. This is a great start, though.

@20kdc (Contributor, Author) commented May 4, 2024

I'm thinking maybe we can do a convert-spm-to-gguf.py. It can take a tokenizer.model as input and then output a custom tokenizer.gguf. What do you think?

The problem with this approach is that it doesn't support all tokenizers that can be imported, which may be detrimental. SPM was used because it was the simplest to configure, but any tokenizer should be usable.

Something to note, as an aside, is that I think this is a great short-term solution, but a more robust long-term solution might be to properly implement a custom tokenizer. This is a great start, though.

There's definitely merit to this, but it would interfere when someone wants to use, say, sentencepiece training for their custom model.

@slaren (Collaborator) commented May 4, 2024

Would it make more sense to modify the training examples to accept the vocab only models without hparams?

@20kdc (Contributor, Author) commented May 4, 2024

Would it make more sense to modify the training examples to accept the vocab only models without hparams?

If that's doable, then it would, but that's potentially a format break anyway...
I assumed that if it were as simple as that, then vocab-only conversions wouldn't include hparams in the first place; looking at the code kinda solidified this.
As it is, that can always be a follow-up to this PR.

@teleprint-me (Contributor) commented:

@20kdc

The problem with this approach is that it doesn't support all tokenizers that can be imported, which may be detrimental. SPM was used because it was the simplest to configure, but any tokenizer should be usable.

I see what you mean. That wasn't my intention. I was excited in the moment when I thought about the potential. I think loading any vocabulary would be ideal, which is the point of the conversion script. I was thinking that it would be nice to be able to specify the hyperparameters from the CLI.

There's definitely merit to this, but it would interfere when someone wants to use, say, sentencepiece training for their custom model.

The tokenizers are baked into the converted GGUFs so they can be used for inference after being trained/finetuned. So maybe I misunderstood?

My understanding is that the vocab is extracted from the source vocabulary and then converted to a GGUF-compatible format. This allows us to train and finetune with the extracted vocab.

The idea with the conversion script is that we can take a custom sentencepiece tokenizer (or any other tokenizer, vocab, etc.) and convert it to a proper GGUF to use for training and finetuning. I just thought working with spm models first, to experiment with, would be easier.

In any case, the GGUF format bakes the tokenizer into the model, which is convenient. It's a detail I really appreciate.

@20kdc (Contributor, Author) commented May 5, 2024

I see what you mean. That wasn't my intention. I was excited in the moment when I thought about the potential. I think loading any vocabulary would be ideal, which is the point of the conversion script. I was thinking that it would be nice to be able to specify the hyperparameters from the CLI.

train-text-from-scratch ignores most hyperparameters from the vocab model (notably, it does not ignore n_vocab, but this is of course tied to the tokenizer); the other hyperparameters are specified to train-text-from-scratch. Vocab model 'parameters' are just dummies to keep the model loader happy. So there's no need to specify these hyperparameters during vocab conversion.
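
To make that split concrete, here is a hypothetical Python sketch of how a trainer could treat a vocab-only model: take n_vocab (and the tokenizer) from the vocab file, and everything else from its own configuration. The function and dictionary keys are assumptions for illustration, not llama.cpp's actual API.

# Hypothetical sketch; names are illustrative, not llama.cpp's API.
def init_training_hparams(vocab_file_hparams: dict, trainer_config: dict) -> dict:
    hparams = dict(trainer_config)                      # n_embd, n_layer, n_ctx, ... come from the trainer
    hparams["n_vocab"] = vocab_file_hparams["n_vocab"]  # tied to the tokenizer, taken from the vocab file
    # Any other hyperparameters stored in the vocab file are dummies and are ignored here.
    return hparams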

@teleprint-me (Contributor) commented:

Vocab model 'parameters' are just dummies to keep the model loader happy. So there's no need to specify these hyperparameters during vocab conversion.

We do need them for inference. Regardless, I agree with you. I think we're on the same page. As you suggested, it's probably best handled in another PR.

@ggerganov merged commit ad211ed into ggerganov:master May 8, 2024
22 checks passed
@ggerganov (Owner) commented May 8, 2024

While working on an unrelated PR, I found what looks like a much simpler solution, as @slaren suggested earlier:

JoanFM@b7ede48#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348efR3803-R3808

@@ -3800,6 +3800,12 @@ static void llm_load_hparams(
 
     // get hparams kv
     ml.get_key(LLM_KV_VOCAB_SIZE,           hparams.n_vocab,       false) || ml.get_arr_n(LLM_KV_TOKENIZER_LIST, hparams.n_vocab);
+
+    // everything past this point is not vocab-related
+    if (hparams.vocab_only) {
+        return;
+    }
+
     ml.get_key(LLM_KV_CONTEXT_LENGTH,       hparams.n_ctx_train);
     ml.get_key(LLM_KV_EMBEDDING_LENGTH,     hparams.n_embd);
     ml.get_key(LLM_KV_FEED_FORWARD_LENGTH,  hparams.n_ff);

This change would allow loading vocab-only models without any changes to the convert script. @20kdc If you could give this a try and confirm that it works, it might be better to revert the changes from this PR.

@20kdc (Contributor, Author) commented May 8, 2024

Most of the changes in this PR are necessary for the convert script to function at all when no hyperparameters are given.
If the PR is fully reverted, then a vocab-only input model conversion will fail at params = Params.load(model_plus).
The main thing that would change given the above C++-side change is that OutputFile.write_vocab_only would no longer need full hyperparameters (but it would still require params.n_vocab for vocab padding logic).
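
As a rough illustration of why params.n_vocab would still be needed, here is a hypothetical sketch of padding logic in the spirit of --pad-vocab; the function and placeholder token naming are assumptions, not convert.py's actual implementation.

# Hypothetical sketch of vocab padding; not convert.py's actual code.
def pad_vocab(tokens: list, n_vocab: int) -> list:
    # Append placeholder tokens until the list matches the declared vocab size,
    # so the written vocab stays consistent with params.n_vocab.
    while len(tokens) < n_vocab:
        tokens.append(f"<dummy{len(tokens):05}>")
    return tokens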

@teleprint-me (Contributor) commented:

@ggerganov Is that used in gguf when writing the output file? Is there a way to leverage this?
