The cloned voice is far from the reference speaker #220

Open
aicoder2048 opened this issue May 6, 2024 · 3 comments

Comments

@aicoder2048

Hi,

I am trying out OpenVoice (v1). It runs without errors, but the cloned voice is far from the reference speaker. Sometimes I provide a male reference speaker MP3 and get back a female voice.

I ran the code from "demo_part1.ipynb" and changed only the reference speaker's MP3.

I suspect the torch/embedding version is not compatible, and I am using:
(Speech2Rag) OpenVoice> pip show torch
Name: torch
Version: 2.1.2+cu121
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: C:\Users\Sean2092\miniconda3\Lib\site-packages
Requires: filelock, fsspec, jinja2, networkx, sympy, typing-extensions
Required-by: pytorch-lightning, torchaudio, torchmetrics, torchvision
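
For anyone comparing environments, here is a minimal sketch that gathers the installed versions of several packages in one place instead of running `pip show` for each. The package list is only an assumption about what might matter here; nothing in this thread confirms which versions OpenVoice expects.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version or None} for each requested distribution."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


if __name__ == "__main__":
    # Hypothetical list of packages worth reporting alongside torch.
    wanted = ["torch", "torchaudio", "numpy", "librosa"]
    for name, version in installed_versions(wanted).items():
        print(f"{name}: {version or 'not installed'}")
```

Pasting this output into the issue makes it easier for others to compare against an environment where cloning works.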

Could someone who has gotten this working help out? I am sure I have something wrong, in the libraries or the settings, but I cannot figure out what it might be. Please help.

Thanks a lot,
Sean

@aicoder2048

Does source_se need to come from audio of the same person's voice as the source audio used for inference in order to get closer or better clone quality?

@aicoder2048

I got the following warnings. Could any of them cause the clone similarity to degrade drastically?

  1. UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  2. UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
  3. UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED


@aicoder2048

Does source_se need to come from audio of the same person's voice as the source audio used for inference in order to get closer or better clone quality?

I tried using the same (base-speaker) person's voice/MP3 both for extracting the "source_se/tone color embedding" and as the "source audio for inference", with a third male voice/MP3 as the reference speaker. The resulting cloned audio, which is sometimes female with a bit of noise, is still far from the male reference audio. Very bizarre!

So my conclusion from the experiment is that source_se and the source audio for inference don't have to come from the same person, or at least that it makes no difference to clone similarity.

Just a couple of cents to share ... have fun
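
One way to put a number on "far from the reference", rather than judging by ear, is to compare embeddings with cosine similarity. The sketch below assumes each tone color embedding has already been flattened to a plain list of floats (for a torch tensor, `.flatten().tolist()` would do that). The example vectors are made up, and no threshold for "similar enough" is implied.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Made-up vectors standing in for flattened embeddings.
    reference = [0.2, -0.1, 0.7, 0.4]
    cloned = [0.1, -0.3, 0.6, 0.5]
    print(f"similarity: {cosine_similarity(reference, cloned):.3f}")
```

Comparing the reference embedding against an embedding extracted from the cloned output, across several source_se choices, would show whether the choice changes anything measurable.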

Sean
