
Support Lumina T2I 5B flow matching T2I DiT model #7909

Open · AmericanPresidentJimmyCarter (Contributor) opened this issue May 10, 2024 · 5 comments
Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers

Sora unveils the potential of scaling Diffusion Transformer (DiT) for generating photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet it still lacks sufficient implementation details. In this technical report, we introduce the Lumina-T2X family – a series of Flow-based Large Diffusion Transformers (Flag-DiT) equipped with zero-initialized attention, as a unified framework designed to transform noise into images, videos, multi-view 3D objects, and audio clips conditioned on text instructions. By tokenizing the latent spatial temporal space and incorporating learnable placeholders such as [nextline] and [nextframe] tokens, Lumina-T2X seamlessly unifies the representations of different modalities across various spatial-temporal resolutions. This unified approach enables training within a single framework for different modalities and allows for flexible generation of multimodal data at any resolution, aspect ratio, and length during inference. Advanced techniques like RoPE, RMSNorm, and flow matching enhance the stability, flexibility, and scalability of Flag-DiT, enabling models of Lumina-T2X to scale up to 7 billion parameters and extend the context window to 128K tokens. This is particularly beneficial for creating ultra-high-definition images with our Lumina-T2I model and long 720p videos with our Lumina-T2V model. Remarkably, Lumina-T2I, powered by a 5-billion-parameter Flag-DiT, requires only 35% of the training computational costs of a 600-million-parameter naive DiT (PixArt-α), indicating that increasing the number of parameters significantly accelerates convergence of generative models without compromising visual quality. Our further comprehensive analysis underscores Lumina-T2X’s preliminary capability in resolution extrapolation, high-resolution editing, generating consistent 3D views, and synthesizing videos with seamless transitions. 
Code and a series of checkpoints will be successively released to facilitate future research at https://github.com/Alpha-VLLM/Lumina-T2X. We expect that the open-sourcing of Lumina-T2X will further foster creativity, transparency, and diversity in the generative AI community.

https://huggingface.co/Alpha-VLLM/Lumina-T2I/blob/main/README.md
https://arxiv.org/pdf/2405.05945
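For readers unfamiliar with the flow-matching objective the abstract refers to, here is a toy sketch of one training step: interpolate linearly between noise and data, and regress the model's output onto the constant velocity along that path. The shapes and the tiny stand-in "model" are illustrative only, not Lumina's actual code.

```python
import torch

model = torch.nn.Linear(4, 4)  # stand-in for the DiT backbone
x1 = torch.randn(8, 4)         # data latents
x0 = torch.randn(8, 4)         # Gaussian noise
t = torch.rand(8, 1)           # uniform timesteps in [0, 1]

xt = (1 - t) * x0 + t * x1     # point on the linear interpolation path
target_velocity = x1 - x0      # constant velocity along that path

# Flow-matching loss: mean squared error against the target velocity
loss = torch.mean((model(xt) - target_velocity) ** 2)
loss.backward()
```

In practice the model is also conditioned on `t` and the text embedding; the linear path above is the rectified-flow special case of flow matching.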

bghira (Contributor) commented May 12, 2024

I was working on this support, but after seeing the results of the model I'm not sure it's ready to be added yet:

[image: sample generation showing heavy residual noise]

There's a lot of residual noise, a lot. It reminds me of PixArt Sigma's similar issues.

JincanDeng commented:

I meet a similar problem when using the EDM training method. Do you have any idea how the noise arises?

bghira (Contributor) commented May 20, 2024

@JincanDeng how are you doing caption dropout? Zeros, or an empty ("") prompt encoded by both text encoders?

JincanDeng commented:
@bghira I use an empty ("") prompt with 0.1 probability for dropout.
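A minimal sketch of that empty-prompt caption dropout, as described (the function name and default probability are illustrative, not from any specific trainer):

```python
import random

def apply_caption_dropout(caption: str, p: float = 0.1) -> str:
    """With probability p, replace the caption with an empty prompt
    (which is then encoded by the text encoders like any other prompt)."""
    return "" if random.random() < p else caption
```

The dropped caption is still run through the text encoders, so the unconditional embedding is the encoding of `""` rather than an all-zeros tensor.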

bghira (Contributor) commented May 21, 2024

SDXL relies on zeroing the unconditional embedding space; try using torch.zeros_like().
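A minimal sketch of that alternative, assuming `prompt_embeds` is the text-encoder output for the caption (shapes and names are illustrative):

```python
import torch

# Toy stand-in for text-encoder output: (batch, seq_len, hidden_dim)
prompt_embeds = torch.randn(2, 77, 768)

# Zeroed unconditional embeddings with the same shape/dtype/device,
# instead of encoding an empty "" prompt through the text encoders.
uncond_embeds = torch.zeros_like(prompt_embeds)

# During caption dropout, swap in the zeroed embeddings with probability p.
p = 0.1
if torch.rand(()) < p:
    prompt_embeds = uncond_embeds
```

The difference from the empty-prompt approach is that the unconditional branch sees an exactly-zero tensor rather than whatever the text encoders produce for `""`.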
