Hello there,
First of all, thank you for the great model!
I noticed something strange while finetuning the model: it seems that resuming finetuning actually resumes one epoch before the one specified.
Replicate
I finetuned the model using:
accelerate launch --mixed_precision=fp16 --num_process=1 train_finetune_accelerate.py --config_path <path/to/config.yml>
So far, everything went well until the finetuning crashed (this was to be expected with the parameters I chose). The last epoch running before it crashed was [11/100]:
At this point, since the run was on the 11th epoch, the last completed one was the 10th epoch, saved as epoch_2nd_00009.pth.
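(I assume the checkpoint filename uses a 0-indexed epoch counter, i.e. something like epoch_2nd_{epoch:05d}.pth with epoch = 9 for the 10th completed epoch; I have not checked the saving code, so this naming is a guess on my part.)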
Then I modified the config.yml and set the following parameters:
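For reference, the change was roughly of this shape (a minimal sketch; the key names pretrained_model and load_only_params are my reading of the usual fine-tuning config, and the path is illustrative, not my actual one):

    pretrained_model: "<path/to/Models>/epoch_2nd_00009.pth"  # point resuming at the last saved checkpoint (illustrative path)
    load_only_params: false                                   # also restore optimizer/step state instead of loading weights only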
Back in the terminal, I reran the command:
accelerate launch --mixed_precision=fp16 --num_process=1 train_finetune_accelerate.py --config_path <path/to/config.yml>
What is now displayed is:
I waited a bit and saw that no new epoch checkpoint was stored; I am assuming it stored epoch_2nd_00009.pth again.
Conclusion
This means that resuming finetuning probably loads the right checkpoint (the one I pointed to in the config.yml), but resumes under the wrong epoch number (i.e. 10 instead of 11). It might therefore also apply the wrong stage parameters (e.g. diff_epoch=10 would take effect during [11/100], if I understand correctly).
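For illustration, here is a minimal sketch of the kind of resume logic that would produce this off-by-one (my guess, not the actual code from train_finetune_accelerate.py):

    # hypothetical sketch, not the real training script
    loaded_epoch = 9                         # counter stored in epoch_2nd_00009.pth (10th completed epoch)
    start_epoch = loaded_epoch               # off-by-one: should arguably be loaded_epoch + 1
    for epoch in range(start_epoch, 100):
        print(f"[{epoch + 1}/100]")          # shows [10/100] again instead of [11/100]
        # ... train one epoch ...
        ckpt = f"epoch_2nd_{epoch:05d}.pth"  # first save re-creates epoch_2nd_00009.pth

If that is what happens, the already-completed 10th epoch is trained a second time, and any epoch-dependent settings (like diff_epoch) are applied one epoch later than intended.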