
RuntimeError When Enabling Accuracy Checks in DALLE2_pytorch Training on GPU. #2241

Open
cjxjxjx opened this issue Apr 25, 2024 · 2 comments
cjxjxjx commented Apr 25, 2024

Issue Description
I encounter a RuntimeError related to gradient computation when enabling accuracy checks while training DALLE2_pytorch in a GPU Docker environment. The training runs without issues when the --accuracy flag is not used.

Steps to Reproduce
python install.py DALLE2_pytorch
python run.py DALLE2_pytorch -d cuda -t train --accuracy

Expected Behavior
The training process should run without errors and perform accuracy checks without causing runtime errors.

Actual Behavior
The script executes successfully without the --accuracy flag.
However, when the accuracy check is enabled, it fails with the following error message:

fp64 golden ref were not generated for DALLE2_pytorch. Setting accuracy check to cosine
element 0 of tensors does not require grad and does not have a grad_fn
Traceback (most recent call last):
  File "/benchmark/torchbenchmark/util/env_check.py", line 635, in check_accuracy
    correct_result = run_n_iterations(
  File "/benchmark/torchbenchmark/util/env_check.py", line 504, in run_n_iterations
    _model_iter_fn(mod, inputs, contexts, optimizer, collect_outputs=False)
  File "/benchmark/torchbenchmark/util/env_check.py", line 497, in _model_iter_fn
    return forward_and_backward_pass(
  File "/benchmark/torchbenchmark/util/env_check.py", line 480, in forward_and_backward_pass
    DummyGradScaler().scale(loss).backward(retain_graph=True)
  File "/venv_cuda/pytorch/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/venv_cuda/pytorch/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
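For context, PyTorch raises this RuntimeError whenever `backward()` is called on a tensor that is not attached to the autograd graph, for example because it was computed under `torch.no_grad()` or from detached tensors or parameters with `requires_grad=False`. A minimal sketch, independent of TorchBench, that reproduces the same message:

```python
import torch

x = torch.randn(3, requires_grad=True)

# Computing the loss under no_grad() detaches it from the autograd graph:
# the result has requires_grad=False and no grad_fn.
with torch.no_grad():
    loss = (x * 2.0).sum()

try:
    loss.backward()
except RuntimeError as e:
    # Prints: element 0 of tensors does not require grad and does not have a grad_fn
    print(e)
```

This suggests that somewhere in the accuracy-check path the DALLE2 loss ends up detached from the graph, though where exactly that happens in the model or harness is not established by this sketch.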

Additional Context
PyTorch version: 2.2.2
CUDA version: 12.4.0.041

xuzhao9 (Contributor) commented Apr 25, 2024

I can confirm that this can be reproduced in the docker environment.
@FindHao Can you help take a look at this issue?

FindHao (Contributor) commented Apr 25, 2024

@xuzhao9 The problem also occurs on the previous version of TorchBench (ghcr.io/pytorch/torchbench:dev20230619). It looks like it has been present since DALLE2 was first included in TorchBench. I'm not sure whether we can fix it on our side or whether it needs a fix in the upstream repo, since we have limited control over the model's `__init__.py`. I'll give it a try.
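One possible harness-side mitigation, sketched below as an assumption rather than a confirmed fix (`safe_backward` is a hypothetical helper, not an existing TorchBench function), would be to guard the backward call in the accuracy-check path so a detached loss is skipped instead of crashing:

```python
import torch

def safe_backward(loss: torch.Tensor) -> bool:
    """Run backward() only when the loss is attached to the autograd graph.

    Returns True if a backward pass was actually performed.
    """
    if not loss.requires_grad:
        # No grad_fn: calling backward() here would raise the RuntimeError
        # from the report, so skip it instead.
        return False
    # retain_graph=True mirrors the call in env_check.py's
    # forward_and_backward_pass.
    loss.backward(retain_graph=True)
    return True
```

This only papers over the symptom, of course; if the loss is unexpectedly detached, the real fix is likely in the model's setup code, as discussed above.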
