errors happened in the inference process #338

Closed
reich208github opened this issue Apr 27, 2024 · 3 comments

@reich208github

Hi guys,

After I run the inference command:

torchrun --standalone --nproc_per_node 1 scripts/inference.py configs/opensora/inference/16x256x256.py --ckpt-path OpenSora-v1-HQ-16x256x256.pth --prompt-path ./assets/texts/t2v_samples.txt

errors are raised, and the complete output is as follows:

[2024-04-27 14:23:05,034] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
Config (path: configs/opensora/inference/16x256x256.py): {'num_frames': 16, 'fps': 8, 'image_size': (256, 256), 'model': {'type': 'STDiT-XL/2', 'space_scale': 0.5, 'time_scale': 1.0, 'enable_flashattn': True, 'enable_layernorm_kernel': True, 'from_pretrained': 'OpenSora-v1-HQ-16x256x256.pth'}, 'vae': {'type': 'VideoAutoencoderKL', 'from_pretrained': 'stabilityai/sd-vae-ft-ema', 'micro_batch_size': 4}, 'text_encoder': {'type': 't5', 'from_pretrained': 'DeepFloyd/t5-v1_1-xxl', 'model_max_length': 120}, 'scheduler': {'type': 'iddpm', 'num_sampling_steps': 100, 'cfg_scale': 7.0}, 'dtype': 'fp16', 'batch_size': 1, 'seed': 42, 'prompt_path': './assets/texts/t2v_samples.txt', 'save_dir': './outputs/samples/', 'multi_resolution': False}
/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/colossalai/initialize.py:48: UserWarning: config is deprecated and will be removed soon.
warnings.warn("config is deprecated and will be removed soon.")
[04/27/24 14:23:14] INFO colossalai - colossalai - INFO: /root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/colossalai/initialize.py:67 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, world size: 1
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:28<00:00, 14.11s/it]
Missing keys: []
Unexpected keys: []
0%| | 0/100 [00:03<?, ?it/s]
Traceback (most recent call last):
File "/home/yilinchen/Open-Sora/scripts/inference.py", line 112, in
main()
File "/home/yilinchen/Open-Sora/scripts/inference.py", line 93, in main
samples = scheduler.sample(
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/init.py", line 72, in sample
samples = self.p_sample_loop(
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/gaussian_diffusion.py", line 434, in p_sample_loop
for sample in self.p_sample_loop_progressive(
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/gaussian_diffusion.py", line 485, in p_sample_loop_progressive
out = self.p_sample(
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/gaussian_diffusion.py", line 388, in p_sample
out = self.p_mean_variance(
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/respace.py", line 94, in p_mean_variance
return super().p_mean_variance(self._wrap_model(model), *args, **kwargs)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/gaussian_diffusion.py", line 267, in p_mean_variance
model_output = model(x, t, **model_kwargs)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/respace.py", line 127, in call
return self.model(x, new_ts, **kwargs)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/schedulers/iddpm/init.py", line 89, in forward_with_cfg
model_out = model.forward(combined, timestep, y, **kwargs)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/models/stdit/stdit.py", line 267, in forward
x = auto_grad_checkpoint(block, x, y, t0, y_lens, tpe)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/acceleration/checkpoint.py", line 24, in auto_grad_checkpoint
return module(*args, **kwargs)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/models/stdit/stdit.py", line 98, in forward
x_s = self.attn(x_s)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/opensora/models/layers/blocks.py", line 152, in forward
from flash_attn import flash_attn_func
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/flash_attn/init.py", line 3, in
from flash_attn.flash_attn_interface import (
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 10, in
import flash_attn_2_cuda as flash_attn_cuda
ImportError: /root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops15sum_IntList_out4callERKNS_6TensorEN3c1016OptionalArrayRefIlEEbSt8optionalINS5_10ScalarTypeEERS2_
[2024-04-27 14:24:15,201] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 14511) of binary: /root/anaconda3/envs/env_open_sora/bin/python
Traceback (most recent call last):
File "/root/anaconda3/envs/env_open_sora/bin/torchrun", line 8, in
sys.exit(main())
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/envs/env_open_sora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

scripts/inference.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-04-27_14:24:15
host : yilinchen-X10SRA
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 14511)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
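
For reference, the failing import can be reproduced on its own, outside the inference script. This is a minimal sketch (assuming the same env_open_sora conda environment), not part of the repo:

import torch
print(torch.__version__, torch.version.cuda)  # e.g. 2.1.2+cu121 12.1

try:
    # flash_attn_2_cuda is the compiled extension that raises the undefined-symbol error above
    import flash_attn_2_cuda
    print("flash_attn_2_cuda imported successfully")
except ImportError as err:
    # an "undefined symbol" here typically means the flash-attn wheel was built
    # against a different PyTorch ABI than the torch currently installed
    print("flash-attn / PyTorch ABI mismatch:", err)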

My installed packages related to opensora-1.0.0 are as follows:

apex 0.1
flash-attn 2.5.6
ninja 1.11.1.1
torch 2.1.2+cu121
torchaudio 2.1.2+cu121
torchvision 0.16.2+cu121
xformers 0.0.23.post1
packaging 24.0

The system CUDA version and the CUDA version PyTorch was built with are the same, both 12.1:

(env_open_sora) root@yilinchen-X10SRA:/home/yilinchen/Open-Sora# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
(env_open_sora) root@yilinchen-X10SRA:/home/yilinchen/Open-Sora# python
Python 3.10.14 (main, Mar 21 2024, 16:24:04) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import torch
>>> print(torch.version.cuda)
12.1

My GPU is an RTX 4090:

Product Name : NVIDIA GeForce RTX 4090
Product Brand : GeForce
Product Architecture : Ada Lovelace

The installation of opensora-1.0.0 was successful:

Successfully installed opensora-1.0.0

All files are in the Open-Sora directory. In addition, I put the downloaded .pth files directly in Open-Sora, not in any sub-directory of it. These .pth files include:

OpenSora-v1-16x256x256.pth
OpenSora-v1-HQ-16x512x512.pth
OpenSora-v1-HQ-16x256x256.pth

Regarding the problems mentioned above, could anyone help me fix them?

Thanks a lot~

@JThh
Contributor

JThh commented Apr 28, 2024

Can you pip install --upgrade flash-attn --no-build-isolation?
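
If the upgrade resolves the undefined-symbol error, a quick sanity check of the import is possible. This is a minimal sketch; the tensor shapes are arbitrary fp16 placeholders on CUDA, not values from Open-Sora:

import torch
from flash_attn import flash_attn_func  # the import that fails in opensora/models/layers/blocks.py

# batch=1, seqlen=16, heads=8, head_dim=64; flash-attn expects fp16/bf16 tensors on the GPU
q = torch.randn(1, 16, 8, 64, dtype=torch.float16, device="cuda")
k = torch.randn(1, 16, 8, 64, dtype=torch.float16, device="cuda")
v = torch.randn(1, 16, 8, 64, dtype=torch.float16, device="cuda")
out = flash_attn_func(q, k, v)
print(out.shape)  # should print torch.Size([1, 16, 8, 64])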

@reich208github
Author

Can you pip install --upgrade flash-attn --no-build-isolation?

OK, after running it to upgrade flash-attn from 2.5.6 to 2.5.8, videos can be created now! Thank you so much, my friend!


github-actions bot commented May 6, 2024

This issue is stale because it has been open for 7 days with no activity.
