Resuming finetuning uses second to last epoch #238

Open · SimonDemarty opened this issue May 17, 2024 · 0 comments

SimonDemarty commented May 17, 2024

Hello there,

First of all, thank you for the great model!
I noticed something strange while finetuning the model: resuming finetuning actually seems to restart one epoch before the checkpoint specified.

Replicate

I finetuned the model using:
accelerate launch --mixed_precision=fp16 --num_processes=1 train_finetune_accelerate.py --config_path <path/to/config.yml>

So far everything went well, until the finetuning crashed (this was to be expected with the parameters I chose). The last epoch running before the crash was [11/100]:

Epoch [11/100], Step [36/685], Loss: 0.29078, Disc Loss: 3.66850, Dur Loss: 0.34166, CE Loss: 0.01789, Norm Loss: 0.43014, F0 Loss: 1.61470, LM Loss: 1.13319, Gen Loss: 7.04091, Sty Loss: 0.08961, Diff Loss: 0.47579, DiscLM Loss: 0.00000, GenLM Loss: 0.00000, SLoss: 0.00000, S2S Loss: 0.04884, Mono Loss: 0.07268
Time elasped: 85.4826250076294
Epoch [11/100], Step [37/685], Loss: 0.22887, Disc Loss: 3.83498, Dur Loss: 0.33926, CE Loss: 0.01526, Norm Loss: 0.20141, F0 Loss: 0.96320, LM Loss: 0.77488, Gen Loss: 6.66455, Sty Loss: 0.08425, Diff Loss: 0.59685, DiscLM Loss: 0.00000, GenLM Loss: 0.00000, SLoss: 0.00000, S2S Loss: 0.00954, Mono Loss: 0.11665
Time elasped: 87.4490795135498

At this point, since the crash happened during the 11th epoch, the last completed one was the 10th epoch, saved as epoch_2nd_00009.pth (the filename uses the zero-based epoch counter).
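
For reference, here is a minimal sketch of how that filename is likely produced from the zero-based epoch counter (log_dir and the variable names are assumptions for illustration; only the epoch_2nd_%05d.pth pattern is taken from the observed filenames):

import os.path as osp

log_dir = "Models/LJSpeech"  # hypothetical output directory from config.yml
epoch = 9                    # zero-based loop counter: 9 == the 10th completed epoch

# the pattern matches the checkpoint names the script writes out
checkpoint_path = osp.join(log_dir, 'epoch_2nd_%05d.pth' % epoch)
print(checkpoint_path)       # Models/LJSpeech/epoch_2nd_00009.pth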

Then I modified the config.yml, setting the following parameters:

pretrained_model: "path/to/epoch_2nd_00009.pth"
load_only_params: false
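
With load_only_params set to false, the loader should also restore the optimizer state and return the epoch stored in the checkpoint, which the training loop then resumes from. A hedged sketch of that resume path (the function name, dictionary keys, and return values are assumptions inferred from the script's behavior, not verbatim from the repo):

import torch

def load_checkpoint(model, optimizer, path, load_only_params=True):
    state = torch.load(path, map_location='cpu')
    model.load_state_dict(state['net'])
    if load_only_params:
        # fresh finetuning run: keep optimizer state and counters at zero
        return 0, 0
    optimizer.load_state_dict(state['optimizer'])
    # resumed run: hand back the counters recorded at save time
    return state['epoch'], state['iters']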

Back in the terminal, I reran the command:
accelerate launch --mixed_precision=fp16 --num_processes=1 train_finetune_accelerate.py --config_path <path/to/config.yml>

What is now displayed is:

Epoch [10/100], Step [1/685], Loss: 0.31872, Disc Loss: 3.71658, Dur Loss: 0.43845, CE Loss: 0.01963, Norm Loss: 0.27553, F0 Loss: 1.27736, LM Loss: 0.93941, Gen Loss: 7.35116, Sty Loss: 0.00000, Diff Loss: 0.00000, DiscLM Loss: 0.00000, GenLM Loss: 0.00000, SLoss: 0.00000, S2S Loss: 0.05938, Mono Loss: 0.05683
Time elasped: 2.4969496726989746

I waited a bit to see which epoch would be stored: no checkpoint with a new name appeared, so I assume it saved over epoch_2nd_00009.pth.

Conclusion

This means that resuming finetuning probably loads the right checkpoint (the one I pointed to in config.yml), but resumes under the wrong epoch number (i.e. 10 instead of 11). It might therefore also apply the wrong schedule parameters: with diff_epoch = 10, the diffusion loss would only kick in at [11/100], if I understand correctly.
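
If that reading is right, the fix would be a one-line change to how the training loop seeds its epoch counter from the checkpoint. A hedged, self-contained sketch of the suspected off-by-one (variable names are assumptions; diff_epoch = 10 is taken from the config, and the printed output matches the logs above):

epochs = 100
diff_epoch = 10    # epoch gate for the diffusion loss, from config.yml
start_epoch = 9    # zero-based epoch read back from epoch_2nd_00009.pth

# suspected current behavior: the loop re-runs the stored epoch
for epoch in range(start_epoch, epochs):
    use_diff = epoch >= diff_epoch   # gate fires one epoch late as well
    print('Epoch [%d/%d], diff loss active: %s' % (epoch + 1, epochs, use_diff))
    break                            # prints "Epoch [10/100], diff loss active: False"

# suspected fix: the stored epoch is already complete, so resume after it
for epoch in range(start_epoch + 1, epochs):
    use_diff = epoch >= diff_epoch
    print('Epoch [%d/%d], diff loss active: %s' % (epoch + 1, epochs, use_diff))
    break                            # prints "Epoch [11/100], diff loss active: True"

The zeroed Sty Loss and Diff Loss in the resumed log are consistent with this: at displayed epoch [10/100] the counter sits below diff_epoch, so those losses are skipped even though they were already active before the crash.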
