
FSDP Finetuned Model-optimizer and tokenizer #476

Closed
waterluck opened this issue Apr 30, 2024 · 4 comments
Labels
question Further information is requested

Comments


waterluck commented Apr 30, 2024

Thanks for the tutorials! I have several small questions about model fine-tuning and usage.

When doing full-parameter fine-tuning using FSDP only:

Q1: Should we set save_optimizer to True or not?
I first set it to True and the checkpoints became very large: fine-tuning on 10K PAWS-X samples produced __0_0.distcp through __3_0.distcp at 9.4 GB each, plus two extra optimizer files like optimizer-llama-2-7b-0.pt at 25 GB each.
When I set it to False, I got four distcp files (__0_0.distcp through __3_0.distcp) at 3.14 GB each.
I'm unsure whether it's normal for the files to be that large, and whether save_optimizer is necessary.

Q2: Do the llama2-xB-hf and llama2-xB-hf-chat models use the same tokenizer?
There is no tokenizer.model file in the fine-tuned output, and I noticed that the tokenizer files of these two models appear to be the same size in the official repository.
I want to know whether their tokenizers are really the same, especially the tokenizer.model in the model files.
Also, can we use a fast tokenizer with Llama 2?

Q3: When doing SFT with Llama on a classification task with a single target label, does it make any difference if we don't train on the input, i.e. set the input token labels to -100 (as in the sketch below)?
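(A minimal sketch of what I mean by masking the input; the token ids here are made up for illustration:)

```python
import torch

# Made-up token ids, just to illustrate the masking; in practice these come
# from the tokenizer applied to the prompt and to the target label.
prompt_ids = [1, 518, 25580, 29962, 3323]   # the input/prompt tokens
target_ids = [7865, 2]                      # the single target label tokens

input_ids = torch.tensor(prompt_ids + target_ids)
labels = input_ids.clone()

# "Not training on the input" = set the prompt positions to -100 so that the
# cross-entropy loss (ignore_index=-100) only scores the target tokens.
labels[: len(prompt_ids)] = -100
```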

Thanks if you can take a look at these questions.


mreso commented May 2, 2024

Hi @waterluck

Q1: What looks a bit weird to me is that the __0_X.distcp files get bigger when you store the optimizer as well. I will need to look into this to confirm whether that is correct or an error.
Whether saving the optimizer is necessary depends on your use case. If you want to continue training from that point, the optimizer state can be useful because you're not doing a cold start: the optimizer state contains the current optimization direction on the surface that is being optimized. If you are not planning to continue training, you can skip saving it.
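As a rough back-of-envelope for why the optimizer checkpoint alone is so large (assuming full-parameter training of a 7B model with AdamW keeping fp32 exp_avg/exp_avg_sq states; exact numbers depend on dtype and sharding):

```python
# Rough size estimate, assuming a 7B-parameter model and an AdamW optimizer
# that keeps two fp32 state tensors (exp_avg, exp_avg_sq) per parameter.
params = 7e9
bytes_per_fp32 = 4

model_fp32_gb = params * bytes_per_fp32 / 1e9          # ~28 GB of weights in fp32
adamw_state_gb = 2 * params * bytes_per_fp32 / 1e9     # ~56 GB of optimizer state

print(f"model weights (fp32):  ~{model_fp32_gb:.0f} GB")
print(f"AdamW optimizer state: ~{adamw_state_gb:.0f} GB")
# Split across a handful of checkpoint files, tens-of-GB optimizer shards are
# in the expected ballpark, which is why save_optimizer=True grows the
# checkpoint so much.
```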
Q2: The tokenizers for these models are equivalent, as the chat variant is a fine-tuned version of the base model. If you use AutoTokenizer, it will automatically select the fast tokenizer if one is available.
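For example (a minimal sketch; the repo id is the public Hugging Face one, adjust to your local path as needed):

```python
from transformers import AutoTokenizer

# The chat variant shares the base model's tokenizer, so either repo id
# should give you identical tokenization.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

print(type(tok).__name__)   # LlamaTokenizerFast when a fast tokenizer is available
print(tok("Hello world"))   # input_ids + attention_mask
```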

Hope that helps.

mreso added the question label on May 2, 2024
waterluck (Author) commented

Hi @mreso, thanks for the confirmation! Also, regarding the overall fine-tuning process: when I run it several times with identical parameter settings, the loss at each epoch differs noticeably between runs. I checked that all parameters are the same and I didn't change the random seed (which I think is fixed to 42). Is this expected, or are there other steps in the code that can introduce randomness?

[screenshot: per-epoch training loss from repeated runs with identical settings]


mreso commented May 3, 2024

Some ops use non-deterministic algorithms, so some fluctuation is expected. See https://pytorch.org/docs/stable/notes/randomness.html for how to disable non-deterministic behavior, but be aware that this will have an impact on your training performance.
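A minimal sketch of what that looks like (following the linked doc; the cuBLAS env var is required by some CUDA ops once deterministic algorithms are enforced and must be set before the first CUDA call):

```python
import os
import random

import numpy as np
import torch

# Required by some cuBLAS ops when deterministic algorithms are enforced;
# set it before any CUDA work happens.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

# Warn (or error, with warn_only=False) whenever a non-deterministic op is
# hit; this typically slows training down, and a few ops have no
# deterministic implementation at all.
torch.use_deterministic_algorithms(True, warn_only=True)
torch.backends.cudnn.benchmark = False
```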

waterluck (Author) commented

Great! Thanks for your answer, it helps a lot.
