The cloned voice is far from the reference speaker #220

aicoder2048 · 2024-05-06T01:06:32Z

Hi,

I am trying out OpenVoice (v1). It works mechanically, but the cloned voice is far from the reference speaker. Sometimes I supply an mp3 of a male reference speaker and get back a female voice.

I ran the code from "demo_part1.ipynb" and changed only the reference speaker's mp3.

I suspect the torch/embedding version is incompatible. I am using:
(Speech2Rag) OpenVoice> pip show torch
Name: torch
Version: 2.1.2+cu121
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: C:\Users\Sean2092\miniconda3\Lib\site-packages
Requires: filelock, fsspec, jinja2, networkx, sympy, typing-extensions
Required-by: pytorch-lightning, torchaudio, torchmetrics, torchvision
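
The `pip show` output above reports a location under the base `miniconda3` directory even though the prompt shows the `(Speech2Rag)` environment, so it may be worth confirming which interpreter and package versions the notebook kernel actually loads. This may or may not be related to the similarity problem. A minimal sketch using only the standard library; `describe_environment` is a hypothetical helper name, not part of OpenVoice:

```python
import sys
from importlib import metadata


def describe_environment(packages):
    """Return the interpreter path and the installed version of each package.

    `describe_environment` is an illustrative helper, not an OpenVoice API.
    """
    report = {"python": sys.executable}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not visible to this interpreter.
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    # Run this inside the notebook kernel, not just in the shell.
    for key, value in describe_environment(["torch", "torchaudio"]).items():
        print(f"{key}: {value}")
```

If the interpreter path printed from inside the notebook differs from the environment you expect, the versions shown by `pip show` in the shell may not be the ones the demo is using.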

Could someone who has gotten this working help out? I am sure I have something wrong, in the libraries or the settings, but I cannot figure out what. Please help.

Thanks a lot,
Sean

aicoder2048 · 2024-05-06T01:47:22Z

Does source_se need to come from audio of the same person's voice as the source audio used for inference in order to get closer or better clone quality?

aicoder2048 · 2024-05-06T02:11:37Z

I got the following warnings. Could any of them cause the clone similarity to degrade drastically?

UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED

aicoder2048 · 2024-05-06T02:18:12Z

Does source_se need to come from audio of the same person's voice as the source audio used for inference in order to get closer or better clone quality?

I tried using the same (base-speaker) person's voice/mp3 for both the "source_se/tone color embedding" and the "source audio for inference", and a third, male voice/mp3 as the reference speaker. The resulting cloned audio, which is sometimes female with a bit of noise, is still far from the male reference audio. Very bizarre!

So my conclusion from the experiment is that source_se and the source audio for inference don't have to come from the same person, or at least that it makes no difference to clone similarity.
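
One way to make "far from the reference" measurable, rather than judged by ear, is to compare speaker embeddings numerically. A minimal sketch, assuming the tone color embeddings of the reference audio and the cloned output have each been flattened into plain lists of floats (how to obtain them from OpenVoice is not shown here, and the vectors below are made-up toy values). Cosine similarity is a common choice for comparing embeddings; it is an assumption here, not a metric prescribed by OpenVoice:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (made-up values).
reference = [0.2, -0.5, 0.1, 0.7]
cloned = [0.1, -0.4, 0.2, 0.6]
print(round(cosine_similarity(reference, cloned), 3))
```

Values near 1 mean the two embeddings point in nearly the same direction. Comparing the score across experiments (for example, with and without a matching source_se) would show whether a change actually moves the output closer to the reference.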

Just a couple of cents to share ... have fun

Sean
