
Conditional-Diffusion-Audio

This repository contains the code for various experiments on conditional diffusion-based Text-To-Speech generation.

Table of Contents

  1. Goal
  2. Experiments
  3. Speaker disentanglement in the ImageBind and CLAP latent spaces
  4. Prerequisites
  5. Installation
  6. Running the experiments

Goal

Generate expressive speech samples given an input transcript and a guidance vector. We experiment with controlling the diffusion process via cross-attention over CLAP / ImageBind speaker embeddings. We first embed the original audio signal with a pretrained embedding model, obtaining a speech embedding that we show to be disentangled from the content of the speech. The speech content is then controlled by the input transcript or the original spectrogram. By employing a stochastic diffusion process for speech generation, we aim for diversity in the generated samples while retaining flexibility over speaker identity through conditional embedding guidance.
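Conceptually, the conditioning can be wired up as in the minimal sketch below. It assumes the bundled audio-diffusion-pytorch package keeps the upstream DiffusionModel / UNetV0 interface; the channel layout, audio length, and the single 512-dimensional speaker-embedding token are illustrative placeholders rather than the settings used in these experiments.

```python
import torch
from audio_diffusion_pytorch import DiffusionModel, UNetV0, VDiffusion, VSampler

# Illustrative model: a 1D U-Net diffusion model whose deeper blocks
# cross-attend over an external conditioning embedding.
model = DiffusionModel(
    net_t=UNetV0,
    in_channels=1,                        # mono waveform / single-channel input
    channels=[8, 32, 64, 128, 256, 512],  # illustrative channel layout
    factors=[1, 4, 4, 4, 2, 2],
    items=[1, 2, 2, 2, 2, 2],
    attentions=[0, 0, 0, 1, 1, 1],
    attention_heads=8,
    attention_features=64,
    cross_attentions=[0, 0, 0, 1, 1, 1],  # cross-attend in the deeper blocks
    embedding_features=512,               # must match the CLAP embedding size
    embedding_max_length=1,               # a single speaker-embedding token
    use_embedding_cfg=True,               # classifier-free guidance on the embedding
    diffusion_t=VDiffusion,
    sampler_t=VSampler,
)

# Speaker embedding from CLAP / ImageBind, shaped [batch, tokens, features].
speaker_emb = torch.randn(1, 1, 512)

# Training step: diffusion loss on the target audio, conditioned on the embedding.
audio = torch.randn(1, 1, 2**16)
loss = model(audio, embedding=speaker_emb, embedding_mask_proba=0.1)
loss.backward()

# Sampling: start from noise and guide generation with the same embedding.
noise = torch.randn(1, 1, 2**16)
sample = model.sample(noise, embedding=speaker_emb, embedding_scale=5.0, num_steps=50)
```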

Experiments

We ran the following experiments:

  • Spectrograms + CLAP Embeddings -> Audio (pipeline diagram)

  • Sentence Embedding + CLAP Embeddings -> Audio (pipeline diagram)

  • Phoneme Embedding + CLAP Embeddings -> Audio (pipeline diagram)

Additionally, we show that while spectrogram-to-spectrogram generation with audio diffusion works, deploying a pretrained vocoder (e.g. VITS) does not allow for speaker identity transfer and novel voice generation. More details on these issues can be found in the extensive report by the MSc student Mathias Vogel.
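One simple way to combine a content signal (sentence or phoneme embeddings) with the speaker embedding is to stack them as separate cross-attention tokens. This is only an illustrative assumption about how the two conditioning streams could be merged, not necessarily what the training scripts do; shapes and feature sizes are placeholders.

```python
import torch

# Hypothetical shapes: content tokens from a sentence/phoneme encoder and a
# single speaker token from CLAP, projected to a shared feature size.
content_tokens = torch.randn(1, 64, 512)   # [batch, num_content_tokens, features]
speaker_token = torch.randn(1, 1, 512)     # [batch, 1, features]

# Concatenate along the token axis so the U-Net cross-attends over both.
conditioning = torch.cat([content_tokens, speaker_token], dim=1)  # [1, 65, 512]
```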

Speaker disentanglement in the ImageBind and CLAP latent spaces

We show that the audio embeddings of pretrained models such as CLAP and ImageBind preserve speaker identity while being agnostic to the actual speech content.

Proving the robustness of these models to content changes in the input audio allows us to demonstrate that the embeddings can be regarded as a style representation of individual voices.

This enables clustering in the embedding space and, most importantly, makes these embedding spaces strong candidates for conditioning audio generation.

Code for speaker embedding with ImageBind and CLAP can be found in the ImageBind subdirectory.
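For reference, here is a minimal sketch of extracting a CLAP audio embedding with the Hugging Face transformers implementation. The checkpoint name, input file, and preprocessing are illustrative; the code in the ImageBind subdirectory may use different checkpoints and wrappers.

```python
import torch
import torchaudio
from transformers import ClapModel, ClapProcessor

# Example checkpoint; the repository may use a different CLAP variant.
model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# Load an utterance and resample to CLAP's expected 48 kHz input rate.
waveform, sr = torchaudio.load("speaker_utterance.wav")
waveform = torchaudio.functional.resample(waveform, sr, 48_000).mean(dim=0)

inputs = processor(audios=waveform.numpy(), sampling_rate=48_000, return_tensors="pt")
with torch.no_grad():
    audio_embed = model.get_audio_features(**inputs)  # [1, 512]

# L2-normalise so cosine similarity can be used to compare speakers.
audio_embed = audio_embed / audio_embed.norm(dim=-1, keepdim=True)
```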

Here are also some results of speaker clustering (an illustrative way to reproduce such a projection is sketched after the plots):

  • CLAP (speaker clustering plot)

  • ImageBind (speaker clustering plot)
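A similar visualisation can be produced by projecting per-utterance embeddings with t-SNE and colouring them by speaker. The sketch below uses random placeholder data and is not the repository's actual plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# embeddings: [num_utterances, embed_dim] CLAP / ImageBind audio embeddings,
# speaker_ids: integer speaker label per utterance (placeholder data here).
embeddings = np.random.randn(200, 512)
speaker_ids = np.random.randint(0, 10, size=200)

# Project the high-dimensional embeddings to 2D for inspection.
points = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embeddings)

plt.figure(figsize=(6, 6))
plt.scatter(points[:, 0], points[:, 1], c=speaker_ids, cmap="tab10", s=10)
plt.title("Speaker clustering of audio embeddings (t-SNE)")
plt.savefig("speaker_clusters.png", dpi=150)
```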

Prerequisites

  • Python >= 3.7
  • PyTorch with CUDA (tested on 1.13)
  • VITS installation including monotonic_align

Installation

pip install torch torchvision torchaudio
pip install -r requirements.txt
pip install -e ./audio-diffusion-pytorch
pip install -e ./a-unet

Running the experiments

Set up a config file in the configs folder. You can then run the training with the following:

python train_vocoder.py --config $PATH_TO_CONFIG_FILE
