
Why not use a VQVAE like Stable Diffusion? #40

Open
daidaiershidi opened this issue Jun 20, 2023 · 11 comments

Comments

@daidaiershidi

Thank you for this interesting work. I'm curious: have you tried a VQVAE?

@daidaiershidi (Author)

I tried it out and found that the VQVAE downsampling rate is 8, meaning a sequence of length T becomes a latent feature of length T/8 after the encoder (a minimal sketch of such an encoder is below). However, I ran into some issues:

  1. The FID score is not good enough. At epoch 5999, my VQVAE only reaches 0.44, while the VAE in the paper achieves 0.24.

  2. When I use the VQVAE to train the diffusion part, convergence is poor, and the model may not converge at all; for example, the final FID score is greater than 1.
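For context, a downsampling rate of 8 typically comes from stacking three stride-2 convolutions. A minimal PyTorch sketch of such a 1-D motion encoder (the layer widths and the 263-dim HumanML3D pose feature are illustrative assumptions, not the actual T2M-GPT/MLD architecture):

```python
import torch
import torch.nn as nn

class MotionEncoder1D(nn.Module):
    """Toy 1-D encoder: three stride-2 convs give a temporal
    downsampling rate of 2^3 = 8, so a motion of length T maps
    to a latent sequence of length T // 8."""
    def __init__(self, pose_dim=263, latent_dim=512):  # 263 = HumanML3D features (assumed)
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(pose_dim, 256, kernel_size=4, stride=2, padding=1),   # T   -> T/2
            nn.ReLU(),
            nn.Conv1d(256, 512, kernel_size=4, stride=2, padding=1),        # T/2 -> T/4
            nn.ReLU(),
            nn.Conv1d(512, latent_dim, kernel_size=4, stride=2, padding=1), # T/4 -> T/8
        )

    def forward(self, x):
        # x: (batch, pose_dim, T) -> (batch, latent_dim, T // 8)
        return self.net(x)

# Sanity check: T = 64 frames -> 8 latent steps.
z = MotionEncoder1D()(torch.randn(2, 263, 64))
assert z.shape[-1] == 64 // 8
```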

@ChenFengYe (Owner) commented Jun 20, 2023

Interesting problem. We have already implemented the VQVAE shown below; please refer to our MotionGPT project.

In our experiments, MotionGPT achieves 0.067 FID on the motion reconstruction task, which is quite different from your evaluations.

[figure: VQVAE]

@daidaiershidi (Author)

I guess it's because I used a different backbone than T2M-GPT :(

@daidaiershidi (Author)

I hope I'm not disturbing you, but is the dataloader pipeline exactly the same as in T2M-GPT? And when will the MotionGPT code be released?

@ChenFengYe (Owner)

Of course, we will release MotionGPT just like this motion-latent-diffusion project. It could take a week to a month to set everything up. You are right that the VQVAE part is quite similar to T2M-GPT's.

@daidaiershidi (Author)

Have you tried VQVAE+diffusion? My VQVAE performance is fine, but the VQVAE+diffusion result is very poor. Why might that be? :(

@ChenFengYe (Owner)

Diffusion models are originally designed for continuous data, like RGB values in images, while a VQVAE outputs discrete codebook representations. I guess you need some "discrete" diffusion model to support the idea of "VQVAE+diffusion".
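To make the "continuous data" assumption concrete: the standard DDPM forward process is q(x_t | x_0) = N(sqrt(ᾱ_t) x_0, (1 − ᾱ_t) I), i.e., it interpolates the sample toward Gaussian noise, which only makes sense for continuous x_0. A minimal sketch (the beta schedule here is an illustrative choice):

```python
import torch

def ddpm_forward(x0, t, alphas_cumprod):
    """Standard DDPM forward process:
    x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps, eps ~ N(0, I).
    Interpolating x0 toward Gaussian noise assumes x0 is continuous;
    applied to integer codebook indices, the intermediate x_t is not a
    valid index, which is why discrete diffusion variants (e.g.
    multinomial or absorbing diffusion) replace the Gaussian kernel
    with categorical transition kernels."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1)
    eps = torch.randn_like(x0)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps, eps

# Illustrative linear beta schedule with 1000 steps.
betas = torch.linspace(1e-4, 2e-2, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

x0 = torch.randn(4, 512, 8)          # continuous latents: well-defined
t = torch.randint(0, 1000, (4,))
xt, eps = ddpm_forward(x0, t, alphas_cumprod)
```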

@ChenFengYe ChenFengYe reopened this Jun 28, 2023
@daidaiershidi (Author)

I am still a beginner in the field of motion generation; thank you very much for your answer.

@daidaiershidi (Author)

> Diffusion models are originally designed for continuous data, like RGB values in images, while a VQVAE outputs discrete codebook representations. I guess you need some "discrete" diffusion model to support the idea of "VQVAE+diffusion".

I am very curious: what is the difference between an image and a motion sequence? Images are continuous in 2-D space, and motion sequences are continuous in the temporal dimension. So why does Stable Diffusion (VQVAE + diffusion) work well on images?

In fact, I found that when the latent embedding produced by diffusion passes through the quantization layer, it suffers from index collapse. I think this may be because the distribution produced by the VQVAE encoder is too hard for diffusion to learn. By 'distribution' here I don't mean a continuous representation that the VQVAE can produce, but rather that all the discrete representations, viewed together as a batch of data, form a distribution that diffusion has to learn.

I did not use a GAN discriminator or a perceptual loss like VQGAN does; if I had, I guess the VQVAE encoder might have produced a distribution that is easier for diffusion to learn. Also, SD uses a large-scale dataset, so the distribution produced by its VQVAE encoder is smoother, and diffusion learns it more easily. But this is all just my opinion as a beginner; I would like to ask what you think.
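One way to test the index-collapse hypothesis is to quantize the diffusion samples against the trained codebook and measure how many codes are actually selected. A minimal PyTorch sketch (the function name and tensor shapes are illustrative assumptions, not code from this repo):

```python
import torch

@torch.no_grad()
def codebook_usage(latents, codebook):
    """latents: (N, D) continuous vectors sampled by the diffusion model.
    codebook: (K, D) VQVAE embedding table.
    Returns the fraction of codebook entries ever selected; a value
    near 1/K means almost all samples snap to one code (index collapse)."""
    # Nearest-neighbour assignment, as in the VQVAE quantization layer.
    dists = torch.cdist(latents, codebook)  # (N, K) pairwise distances
    idx = dists.argmin(dim=-1)              # (N,) chosen code indices
    used = torch.bincount(idx, minlength=codebook.shape[0]) > 0
    return used.float().mean().item()

# Example: 10k sampled latents against a 512-entry codebook.
usage = codebook_usage(torch.randn(10_000, 512), torch.randn(512, 512))
print(f"codebook usage: {usage:.2%}")
```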

@MingCongSu

Hi @daidaiershidi, have you figured out why VQVAE+diffusion works badly on motion?

Recently I have also been working on this scenario (i.e., training a VQVAE, then using it to train a latent diffusion model). I also followed the paper and used the latent extracted from the encoder, before the vector quantization layer. However, the sampled motion looks frozen, even though the training process looks fine.

Do you have any ideas about this? @ChenFengYe @daidaiershidi
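One thing that may be worth checking in this setting (a guess, not something confirmed in this thread): Stable Diffusion rescales its VAE latents by a constant factor (0.18215) so they are roughly unit-variance before diffusion; if pre-quantization VQVAE latents have a very different scale, the noise schedule can mismatch the signal and samples may decode to near-static motion. A sketch of estimating such a factor from the training set, with hypothetical `encoder` and `dataloader` names:

```python
import torch

@torch.no_grad()
def estimate_scale_factor(encoder, dataloader, device="cuda"):
    """Estimate 1 / std of the pre-quantization latents, analogous to
    Stable Diffusion's 0.18215 latent scale factor. Multiply latents by
    this before diffusion training and divide by it before decoding.
    `encoder` and `dataloader` are your own VQVAE encoder and motion
    data loader (hypothetical names, not from this repo)."""
    stats = []
    for motion in dataloader:
        z = encoder(motion.to(device))  # latents before quantization
        stats.append(z.flatten())
    return 1.0 / torch.cat(stats).std().item()
```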

@boyuaner

> Hi @daidaiershidi, have you figured out why VQVAE+diffusion works badly on motion?
>
> Recently I have also been working on this scenario (i.e., training a VQVAE, then using it to train a latent diffusion model). I also followed the paper and used the latent extracted from the encoder, before the vector quantization layer. However, the sampled motion looks frozen, even though the training process looks fine.
>
> Do you have any ideas about this? @ChenFengYe @daidaiershidi

Hi @MingCongSu, I recently ran into the same dilemma (frozen motions). Do you have any insights on this?
