errors happened in the inference process #338

Closed
reich208github opened this issue Apr 27, 2024 · 3 comments

@reich208github

Hi guys,

After I run the inference command:

torchrun --standalone --nproc_per_node 1 scripts/inference.py configs/opensora/inference/16x256x256.py --ckpt-path OpenSora-v1-HQ-16x256x256.pth --prompt-path ./assets/texts/t2v_samples.txt

errors are raised, and the complete output is as follows:

[2024-04-27 14:23:05,034] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
Config (path: configs/opensora/inference/16x256x256.py): {'num_frames': 16, 'fps': 8, 'image_size': (256, 256), 'model': {'type': 'STDiT-XL/2', 'space_scale': 0.5, 'time_scale': 1.0, 'enable_flashattn': True, 'enable_layernorm_kernel': True, 'from_pretrained': 'OpenSora-v1-HQ-16x256x256.pth'}, 'vae': {'type': 'VideoAutoencoderKL', 'from_pretrained': 'stabilityai/sd-vae-ft-ema', 'micro_batch_size': 4}, 'text_encoder': {'type': 't5', 'from_pretrained': 'DeepFloyd/t5-v1_1-xxl', 'model_max_length': 120}, 'scheduler': {'type': 'iddpm', 'num_sampling_steps': 100, 'cfg_scale': 7.0}, 'dtype': 'fp16', 'batch_size': 1, 'seed': 42, 'prompt_path': './assets/texts/t2v_samples.txt', 'save_dir': './outputs/samples/', 'multi_resolution': False}
/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/colossalai/initialize.py:48: UserWarning: config is deprecated and will be removed soon.
warnings.warn("config is deprecated and will be removed soon.")
[04/27/24 14:23:14] INFO colossalai - colossalai - INFO: /root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/colossalai/initialize.py:67 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, world size: 1
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:28<00:00, 14.11s/it]
Missing keys: []
Unexpected keys: []
0%| | 0/100 [00:03<?, ?it/s]
Traceback (most recent call last):
File "/home/yilinchen/Open-Sora/scripts/inference.py", line 112, in
main()
File "/home/yilinchen/Open-Sora/scripts/inference.py", line 93, in main
samples = scheduler.sample(
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/init.py", line 72, in sample
samples = self.p_sample_loop(
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/gaussian_diffusion.py", line 434, in p_sample_loop
for sample in self.p_sample_loop_progressive(
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/gaussian_diffusion.py", line 485, in p_sample_loop_progressive
out = self.p_sample(
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/gaussian_diffusion.py", line 388, in p_sample
out = self.p_mean_variance(
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/respace.py", line 94, in p_mean_variance
return super().p_mean_variance(self._wrap_model(model), *args, **kwargs)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/gaussian_diffusion.py", line 267, in p_mean_variance
model_output = model(x, t, **model_kwargs)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/respace.py", line 127, in call
return self.model(x, new_ts, **kwargs)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/init.py", line 89, in forward_with_cfg
model_out = model.forward(combined, timestep, y, **kwargs)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/models/stdit/stdit.py", line 267, in forward
x = auto_grad_checkpoint(block, x, y, t0, y_lens, tpe)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/acceleration/checkpoint.py", line 24, in auto_grad_checkpoint
return module(*args, **kwargs)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/models/stdit/stdit.py", line 98, in forward
x_s = self.attn(x_s)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/models/layers/blocks.py", line 152, in forward
from flash_attn import flash_attn_func
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/flash_attn/init.py", line 3, in
from flash_attn.flash_attn_interface import (
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 10, in
import flash_attn_2_cuda as flash_attn_cuda
ImportError: /root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops15sum_IntList_out4callERKNS_6TensorEN3c1016OptionalArrayRefIlEEbSt8optionalINS5_10ScalarTypeEERS2_
[2024-04-27 14:24:15,201] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 14511) of binary: /root/anaconda3/envs/env_open_sora/bin/python
Traceback (most recent call last):
File "/root/anaconda3/envs/env_open_sora/bin/torchrun", line 8, in
sys.exit(main())
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

scripts/inference.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-04-27_14:24:15
host : yilinchen-X10SRA
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 14511)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
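
For reference, the failing import can be reproduced on its own, outside the inference script. This is a minimal sketch (assuming the same env_open_sora conda environment), not part of the repo:

import torch
print(torch.__version__, torch.version.cuda)  # e.g. 2.1.2+cu121 12.1

try:
    # flash_attn_2_cuda is the compiled extension that raises the undefined-symbol error above
    import flash_attn_2_cuda
    print("flash_attn_2_cuda imported successfully")
except ImportError as err:
    # an "undefined symbol" here typically means the flash-attn wheel was built
    # against a different PyTorch ABI than the torch currently installed
    print("flash-attn / PyTorch ABI mismatch:", err)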

My installed packages related to opensora-1.0.0 are as follows:

apex 0.1
flash-attn 2.5.6
ninja 1.11.1.1
torch 2.1.2+cu121
torchaudio 2.1.2+cu121
torchvision 0.16.2+cu121
xformers 0.0.23.post1
packaging 24.0

The system CUDA version and the CUDA version PyTorch was built with are the same, both 12.1:

(env_open_sora) root@yilinchen-X10SRA:/home/yilinchen/Open-Sora# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
(env_open_sora) root@yilinchen-X10SRA:/home/yilinchen/Open-Sora# python
Python 3.10.14 (main, Mar 21 2024, 16:24:04) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import torch
>>> print(torch.version.cuda)
12.1

My GPU is an RTX 4090:

Product Name : NVIDIA GeForce RTX 4090
Product Brand : GeForce
Product Architecture : Ada Lovelace

The installation of opensora-1.0.0 was successful:

Successfully installed opensora-1.0.0

All files are in the Open-Sora directory. In addition, I put the downloaded .pth files directly in Open-Sora, not in any sub-directory of it. These .pth files include:

OpenSora-v1-16x256x256.pth
OpenSora-v1-HQ-16x512x512.pth
OpenSora-v1-HQ-16x256x256.pth

Regarding the problems mentioned above, could anyone help me fix them?

Thanks a lot~

@JThh
Contributor

JThh commented Apr 28, 2024

Can you pip install --upgrade flash-attn --no-build-isolation?
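
If the upgrade resolves the undefined-symbol error, a quick sanity check of the import is possible. This is a minimal sketch; the tensor shapes are arbitrary fp16 placeholders on CUDA, not values from Open-Sora:

import torch
from flash_attn import flash_attn_func  # the import that fails in opensora/models/layers/blocks.py

# batch=1, seqlen=16, heads=8, head_dim=64; flash-attn expects fp16/bf16 tensors on the GPU
q = torch.randn(1, 16, 8, 64, dtype=torch.float16, device="cuda")
k = torch.randn(1, 16, 8, 64, dtype=torch.float16, device="cuda")
v = torch.randn(1, 16, 8, 64, dtype=torch.float16, device="cuda")
out = flash_attn_func(q, k, v)
print(out.shape)  # should print torch.Size([1, 16, 8, 64])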

@reich208github
Author

Can you pip install --upgrade flash-attn --no-build-isolation?

OK, after running it to upgrade flash-attn from 2.5.6 to 2.5.8, videos can be created now! Thank you so much, my friend!


github-actions bot commented May 6, 2024

This issue is stale because it has been open for 7 days with no activity.
