
Peak Performance of Single vs. Multi-Speaker TTS Models: Seeking Insights and References #185

ikpark09 opened this issue on Sep 14, 2023
Hello. First of all, I'd like to thank everyone who has helped me along the way. Thanks to you, I've been able to make progress in my studies, build a VITS TTS model, and synthesize voices.

I have a question about how a model's peak performance depends on its training data. By "peak performance," I mean the model's ability to read a wide variety of texts accurately.

Here's what I've observed during my training:
A single-speaker model trained with 12 hours of Speaker A's data, when fine-tuned with 2 hours of Speaker B's data, seems to reach a lower peak performance than the original model trained solely on Speaker A's data. (For fine-tuning, Speaker B's recordings used different text than Speaker A's. Would fine-tuning work better if Speaker B recorded the same text as Speaker A?)
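
For concreteness, this is roughly what I mean by fine-tuning: resume from the Speaker A checkpoint and keep training on Speaker B's data only. The sketch below uses a toy model, random tensors, and a hypothetical checkpoint path in place of the real VITS networks and filelists; it shows the general pattern, not my exact script.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the acoustic model; the real VITS generator is much
# larger, but the fine-tuning pattern is the same.
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))

# 1. Load the Speaker A checkpoint (hypothetical path; in practice this
#    would be the generator checkpoint saved by the VITS training script).
ckpt = torch.load("speaker_a_model.pth", map_location="cpu")
model.load_state_dict(ckpt["model"])

# 2. Continue training on Speaker B's 2 hours only. Random tensors stand
#    in here for real (text, mel-spectrogram) batches from Speaker B.
speaker_b_loader = DataLoader(
    TensorDataset(torch.randn(64, 80), torch.randn(64, 80)), batch_size=16
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
for inputs, targets in speaker_b_loader:
    loss = nn.functional.l1_loss(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```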

Therefore, I believe securing a base model with high peak performance is crucial. Training with more high-quality single-speaker data would obviously help, but in practice it has been difficult to acquire additional data.

Suppose we have models trained in the following ways:

- model1: a single-speaker model trained with 10 hours of Speaker A's data.
- model2: a multi-speaker model trained from model1 plus 2 hours of Speaker B's data and 2 hours of Speaker C's data.
- model3: model1 fine-tuned with 2 hours of Speaker B's data.

I suspect the order of peak performance would be model2 > model1 > model3. Is this correct?
I've tried to find clear explanations or evidence for this, such as papers or articles, but haven't been successful.

My concern is that when training a multi-speaker model, even though there is more data overall, each speaker's data is kept separate by its speaker ID.
Does that mean each speaker's data fails to contribute to the model's overall peak performance?
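
To make the concern concrete, here is a toy sketch of how I understand speaker conditioning (assumed and heavily simplified; the real VITS model conditions several components on the speaker embedding). Only the small embedding table is per-speaker; every other weight is shared, so in principle each speaker's utterances should still train the shared encoder/decoder:

```python
import torch
import torch.nn as nn

class TinyMultiSpeakerTTS(nn.Module):
    """Toy multi-speaker conditioning sketch (not the real VITS code)."""

    def __init__(self, n_speakers: int = 3, text_dim: int = 128, spk_dim: int = 32):
        super().__init__()
        self.text_encoder = nn.Linear(text_dim, 256)      # shared across speakers
        self.spk_emb = nn.Embedding(n_speakers, spk_dim)  # the only per-speaker part
        self.decoder = nn.Linear(256 + spk_dim, 80)       # shared across speakers

    def forward(self, text_feats: torch.Tensor, speaker_ids: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.text_encoder(text_feats))     # [batch, 256]
        s = self.spk_emb(speaker_ids)                     # [batch, spk_dim]
        return self.decoder(torch.cat([h, s], dim=-1))    # [batch, 80]

model = TinyMultiSpeakerTTS()

# One batch mixing speakers A (id 0), B (id 1), and C (id 2): the gradient
# from every utterance flows into the shared encoder/decoder regardless of
# which speaker ID it carries; only the embedding rows are speaker-specific.
text_feats = torch.randn(4, 128)
speaker_ids = torch.tensor([0, 1, 1, 2])
out = model(text_feats, speaker_ids)
print(out.shape)  # torch.Size([4, 80])
```

If this picture is right, B's and C's data should still improve the shared weights, and the speaker IDs would only isolate the embedding rows; I'd like to confirm whether that is what actually happens in practice.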

Testing and comparing all these cases would require a lot of resources and time, so I'm seeking assistance here. Does anyone have experience or related materials to share on this topic?
