
FSDP Finetuned Model-optimizer and tokenizer #476

Closed
waterluck opened this issue Apr 30, 2024 · 4 comments
Labels
question Further information is requested

Comments


waterluck commented Apr 30, 2024

Thanks for the tutorials! I have several small questions about model fine-tuning and usage.

When doing full-parameter fine-tuning using FSDP only:

Q1: Should we set save_optimizer to True or not?
I first set it to True and the checkpoints became very large: fine-tuning on 10K PAWS-X samples produced __0_0.distcp through __3_0.distcp at 9.4 GB each, plus two extra optimizer files like optimizer-llama-2-7b-0.pt at 25 GB each.
When I set it to False, I got four distcp files (__0_0.distcp through __3_0.distcp) at 3.14 GB each.
I'm unsure whether it's normal for the files to be that large, and whether save_optimizer is necessary.

Q2: Do the llama2-xB-hf and llama2-xB-hf-chat models use the same tokenizer?
There is no tokenizer.model file in the fine-tuned output, and I noticed that the tokenizer files of these two models appear to be the same size in the official repository.
I want to know whether their tokenizers are really the same, especially the tokenizer.model in the model files.
Also, can we use a fast tokenizer with Llama 2?

Q3: When doing SFT with Llama on a classification task with a single target label, does it make any difference if we don't train on the input, i.e. set the input token labels to -100 (as in the sketch below)?
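(A minimal sketch of what I mean by masking the input; the token ids here are made up for illustration:)

```python
import torch

# Made-up token ids, just to illustrate the masking; in practice these come
# from the tokenizer applied to the prompt and to the target label.
prompt_ids = [1, 518, 25580, 29962, 3323]   # the input/prompt tokens
target_ids = [7865, 2]                      # the single target label tokens

input_ids = torch.tensor(prompt_ids + target_ids)
labels = input_ids.clone()

# "Not training on the input" = set the prompt positions to -100 so that the
# cross-entropy loss (ignore_index=-100) only scores the target tokens.
labels[: len(prompt_ids)] = -100
```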

Thanks if you can take a look at these questions.


mreso commented May 2, 2024

Hi @waterluck

Q1: What looks a bit weird to me is that the __0_X.distcp files get bigger when you store the optimizer as well. I will need to look into this to confirm whether that is correct or an error.
Whether saving the optimizer is necessary depends on your use case. If you want to continue training from that point, the optimizer state can be useful because you're not doing a cold start: the optimizer state contains the current optimization direction on the surface that is being optimized. If you are not planning to continue training, you can skip saving it.
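As a rough back-of-envelope for why the optimizer checkpoint alone is so large (assuming full-parameter training of a 7B model with AdamW keeping fp32 exp_avg/exp_avg_sq states; exact numbers depend on dtype and sharding):

```python
# Rough size estimate, assuming a 7B-parameter model and an AdamW optimizer
# that keeps two fp32 state tensors (exp_avg, exp_avg_sq) per parameter.
params = 7e9
bytes_per_fp32 = 4

model_fp32_gb = params * bytes_per_fp32 / 1e9          # ~28 GB of weights in fp32
adamw_state_gb = 2 * params * bytes_per_fp32 / 1e9     # ~56 GB of optimizer state

print(f"model weights (fp32):  ~{model_fp32_gb:.0f} GB")
print(f"AdamW optimizer state: ~{adamw_state_gb:.0f} GB")
# Split across a handful of checkpoint files, tens-of-GB optimizer shards are
# in the expected ballpark, which is why save_optimizer=True grows the
# checkpoint so much.
```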
Q2: The tokenizers for these models are equivalent, as the chat variant is a fine-tuned version of the base model. If you use AutoTokenizer, it will automatically select the fast tokenizer if one is available.
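For example (a minimal sketch; the repo id is the public Hugging Face one, adjust to your local path as needed):

```python
from transformers import AutoTokenizer

# The chat variant shares the base model's tokenizer, so either repo id
# should give you identical tokenization.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

print(type(tok).__name__)   # LlamaTokenizerFast when a fast tokenizer is available
print(tok("Hello world"))   # input_ids + attention_mask
```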

Hope that helps.

mreso added the question label on May 2, 2024
waterluck (Author) commented

Hi @mreso, thanks for the confirmation! Also, regarding the overall fine-tuning process: when I run it several times with identical parameter settings, the loss at each epoch differs noticeably between runs. I checked that all parameters are the same and I didn't change the random seed (which I think is fixed to 42). Is this expected, or are there other steps in the code that can introduce randomness?

[screenshot: per-epoch training loss from repeated runs with identical settings]


mreso commented May 3, 2024

Some ops use non-deterministic algorithms, so some fluctuation is expected. See https://pytorch.org/docs/stable/notes/randomness.html for how to disable non-deterministic behavior, but be aware that this will have an impact on your training performance.
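A minimal sketch of what that looks like (following the linked doc; the cuBLAS env var is required by some CUDA ops once deterministic algorithms are enforced and must be set before the first CUDA call):

```python
import os
import random

import numpy as np
import torch

# Required by some cuBLAS ops when deterministic algorithms are enforced;
# set it before any CUDA work happens.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

# Warn (or error, with warn_only=False) whenever a non-deterministic op is
# hit; this typically slows training down, and a few ops have no
# deterministic implementation at all.
torch.use_deterministic_algorithms(True, warn_only=True)
torch.backends.cudnn.benchmark = False
```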

waterluck (Author) commented

Great! Thanks for your answer, it helps a lot.
