
Why not use a VQVAE like Stable Diffusion? #40

Open
daidaiershidi opened this issue Jun 20, 2023 · 11 comments

Comments

@daidaiershidi

Thank you for this interesting work. I'm curious: have you tried a VQVAE?

@daidaiershidi (Author)

I tried it out and found that the VQVAE downsampling rate is 8, meaning a sequence of length T becomes a latent feature of length T/8 after the encoder (a minimal sketch of such an encoder is below). However, I ran into some issues:

  1. The FID score is not good enough. At epoch 5999, my VQVAE only reaches 0.44, while the VAE in the paper achieves 0.24.

  2. When I use the VQVAE to train the diffusion part, convergence is poor, and the model may not converge at all; for example, the final FID score is greater than 1.
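For context, a downsampling rate of 8 typically comes from stacking three stride-2 convolutions. A minimal PyTorch sketch of such a 1-D motion encoder (the layer widths and the 263-dim HumanML3D pose feature are illustrative assumptions, not the actual T2M-GPT/MLD architecture):

```python
import torch
import torch.nn as nn

class MotionEncoder1D(nn.Module):
    """Toy 1-D encoder: three stride-2 convs give a temporal
    downsampling rate of 2^3 = 8, so a motion of length T maps
    to a latent sequence of length T // 8."""
    def __init__(self, pose_dim=263, latent_dim=512):  # 263 = HumanML3D features (assumed)
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(pose_dim, 256, kernel_size=4, stride=2, padding=1),   # T   -> T/2
            nn.ReLU(),
            nn.Conv1d(256, 512, kernel_size=4, stride=2, padding=1),        # T/2 -> T/4
            nn.ReLU(),
            nn.Conv1d(512, latent_dim, kernel_size=4, stride=2, padding=1), # T/4 -> T/8
        )

    def forward(self, x):
        # x: (batch, pose_dim, T) -> (batch, latent_dim, T // 8)
        return self.net(x)

# Sanity check: T = 64 frames -> 8 latent steps.
z = MotionEncoder1D()(torch.randn(2, 263, 64))
assert z.shape[-1] == 64 // 8
```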

@ChenFengYe (Owner) commented Jun 20, 2023

Interesting problem. We have already implemented the VQVAE shown below; please refer to our MotionGPT project.

In our experiments, MotionGPT achieves 0.067 FID on the motion reconstruction task, which is quite different from your evaluations.

[figure: VQVAE]

@daidaiershidi (Author)

I guess it's because I used a different backbone than T2M-GPT :(

@daidaiershidi (Author)

I hope I'm not disturbing you, but is the dataloader pipeline exactly the same as in T2M-GPT? And when will the MotionGPT code be released?

@ChenFengYe (Owner)

Of course, we will release MotionGPT just like this motion-latent-diffusion project. It could take a week to a month to set everything up. You are right that the VQVAE part is quite similar to T2M-GPT's.

@daidaiershidi (Author)

Have you tried VQVAE+diffusion? My VQVAE performance is fine, but the VQVAE+diffusion result is very poor. Why might that be? :(

@ChenFengYe (Owner)

Diffusion models are originally designed for continuous data, like RGB values in images, while a VQVAE outputs discrete codebook representations. I guess you need some "discrete" diffusion model to support the idea of "VQVAE+diffusion".
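To make the "continuous data" assumption concrete: the standard DDPM forward process is q(x_t | x_0) = N(sqrt(ᾱ_t) x_0, (1 − ᾱ_t) I), i.e., it interpolates the sample toward Gaussian noise, which only makes sense for continuous x_0. A minimal sketch (the beta schedule here is an illustrative choice):

```python
import torch

def ddpm_forward(x0, t, alphas_cumprod):
    """Standard DDPM forward process:
    x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps, eps ~ N(0, I).
    Interpolating x0 toward Gaussian noise assumes x0 is continuous;
    applied to integer codebook indices, the intermediate x_t is not a
    valid index, which is why discrete diffusion variants (e.g.
    multinomial or absorbing diffusion) replace the Gaussian kernel
    with categorical transition kernels."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1)
    eps = torch.randn_like(x0)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps, eps

# Illustrative linear beta schedule with 1000 steps.
betas = torch.linspace(1e-4, 2e-2, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

x0 = torch.randn(4, 512, 8)          # continuous latents: well-defined
t = torch.randint(0, 1000, (4,))
xt, eps = ddpm_forward(x0, t, alphas_cumprod)
```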

@ChenFengYe ChenFengYe reopened this Jun 28, 2023
@daidaiershidi (Author)

I am still a beginner in the field of motion generation; thank you very much for your answer.

@daidaiershidi (Author)

> Diffusion models are originally designed for continuous data, like RGB values in images, while a VQVAE outputs discrete codebook representations. I guess you need some "discrete" diffusion model to support the idea of "VQVAE+diffusion".

I am very curious: what is the difference between an image and a motion sequence? Images are continuous in 2-D space, and motion sequences are continuous in the temporal dimension. So why does Stable Diffusion (VQVAE + diffusion) work well on images?

In fact, I found that when the latent embedding produced by diffusion passes through the quantization layer, it suffers from index collapse. I think this may be because the distribution produced by the VQVAE encoder is too hard for diffusion to learn. By 'distribution' here I don't mean a continuous representation that the VQVAE can produce, but rather that all the discrete representations, viewed together as a batch of data, form a distribution that diffusion has to learn.

I did not use a GAN discriminator or a perceptual loss like VQGAN does; if I had, I guess the VQVAE encoder might have produced a distribution that is easier for diffusion to learn. Also, SD uses a large-scale dataset, so the distribution produced by its VQVAE encoder is smoother, and diffusion learns it more easily. But this is all just my opinion as a beginner; I would like to ask what you think.
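One way to test the index-collapse hypothesis is to quantize the diffusion samples against the trained codebook and measure how many codes are actually selected. A minimal PyTorch sketch (the function name and tensor shapes are illustrative assumptions, not code from this repo):

```python
import torch

@torch.no_grad()
def codebook_usage(latents, codebook):
    """latents: (N, D) continuous vectors sampled by the diffusion model.
    codebook: (K, D) VQVAE embedding table.
    Returns the fraction of codebook entries ever selected; a value
    near 1/K means almost all samples snap to one code (index collapse)."""
    # Nearest-neighbour assignment, as in the VQVAE quantization layer.
    dists = torch.cdist(latents, codebook)  # (N, K) pairwise distances
    idx = dists.argmin(dim=-1)              # (N,) chosen code indices
    used = torch.bincount(idx, minlength=codebook.shape[0]) > 0
    return used.float().mean().item()

# Example: 10k sampled latents against a 512-entry codebook.
usage = codebook_usage(torch.randn(10_000, 512), torch.randn(512, 512))
print(f"codebook usage: {usage:.2%}")
```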

@MingCongSu

Hi @daidaiershidi, have you figured out why VQVAE+diffusion works badly on motion?

Recently I have also been working on this scenario (i.e., training a VQVAE, then using it to train a latent diffusion model). I also followed the paper and used the latent extracted from the encoder, before the vector quantization layer. However, the sampled motion looks frozen, even though the training process looks fine.

Do you have any ideas about this? @ChenFengYe @daidaiershidi
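One thing that may be worth checking in this setting (a guess, not something confirmed in this thread): Stable Diffusion rescales its VAE latents by a constant factor (0.18215) so they are roughly unit-variance before diffusion; if pre-quantization VQVAE latents have a very different scale, the noise schedule can mismatch the signal and samples may decode to near-static motion. A sketch of estimating such a factor from the training set, with hypothetical `encoder` and `dataloader` names:

```python
import torch

@torch.no_grad()
def estimate_scale_factor(encoder, dataloader, device="cuda"):
    """Estimate 1 / std of the pre-quantization latents, analogous to
    Stable Diffusion's 0.18215 latent scale factor. Multiply latents by
    this before diffusion training and divide by it before decoding.
    `encoder` and `dataloader` are your own VQVAE encoder and motion
    data loader (hypothetical names, not from this repo)."""
    stats = []
    for motion in dataloader:
        z = encoder(motion.to(device))  # latents before quantization
        stats.append(z.flatten())
    return 1.0 / torch.cat(stats).std().item()
```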

@boyuaner

> Hi @daidaiershidi, have you figured out why VQVAE+diffusion works badly on motion?
>
> Recently I have also been working on this scenario (i.e., training a VQVAE, then using it to train a latent diffusion model). I also followed the paper and used the latent extracted from the encoder, before the vector quantization layer. However, the sampled motion looks frozen, even though the training process looks fine.
>
> Do you have any ideas about this? @ChenFengYe @daidaiershidi

Hi @MingCongSu, I recently ran into the same dilemma (frozen motions). Do you have any insights on this?
