
RuntimeError When Enabling Accuracy Checks in DALLE2_pytorch Training on GPU. #2241

Open
cjxjxjx opened this issue Apr 25, 2024 · 2 comments
cjxjxjx commented Apr 25, 2024

Issue Description
I encounter a RuntimeError related to gradient computation when enabling accuracy checks while training DALLE2_pytorch in a GPU Docker environment. The training runs without issues when the --accuracy flag is not used.

Steps to Reproduce
python install.py DALLE2_pytorch
python run.py DALLE2_pytorch -d cuda -t train --accuracy

Expected Behavior
The training process should run without errors and perform accuracy checks without causing runtime errors.

Actual Behavior
The script executes successfully without the --accuracy flag.
However, when the accuracy check is enabled, it fails with the following error message:

fp64 golden ref were not generated for DALLE2_pytorch. Setting accuracy check to cosine
element 0 of tensors does not require grad and does not have a grad_fn
Traceback (most recent call last):
  File "/benchmark/torchbenchmark/util/env_check.py", line 635, in check_accuracy
    correct_result = run_n_iterations(
  File "/benchmark/torchbenchmark/util/env_check.py", line 504, in run_n_iterations
    _model_iter_fn(mod, inputs, contexts, optimizer, collect_outputs=False)
  File "/benchmark/torchbenchmark/util/env_check.py", line 497, in _model_iter_fn
    return forward_and_backward_pass(
  File "/benchmark/torchbenchmark/util/env_check.py", line 480, in forward_and_backward_pass
    DummyGradScaler().scale(loss).backward(retain_graph=True)
  File "/venv_cuda/pytorch/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/venv_cuda/pytorch/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
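For context, PyTorch raises this RuntimeError whenever `backward()` is called on a tensor that is not attached to the autograd graph, for example because it was computed under `torch.no_grad()` or from detached tensors or parameters with `requires_grad=False`. A minimal sketch, independent of TorchBench, that reproduces the same message:

```python
import torch

x = torch.randn(3, requires_grad=True)

# Computing the loss under no_grad() detaches it from the autograd graph:
# the result has requires_grad=False and no grad_fn.
with torch.no_grad():
    loss = (x * 2.0).sum()

try:
    loss.backward()
except RuntimeError as e:
    # Prints: element 0 of tensors does not require grad and does not have a grad_fn
    print(e)
```

This suggests that somewhere in the accuracy-check path the DALLE2 loss ends up detached from the graph, though where exactly that happens in the model or harness is not established by this sketch.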

Additional Context
PyTorch version: 2.2.2
CUDA version: 12.4.0.041

xuzhao9 (Contributor) commented Apr 25, 2024

I can confirm that this can be reproduced in the docker environment.
@FindHao Can you help take a look at this issue?

FindHao (Contributor) commented Apr 25, 2024

@xuzhao9 The problem also occurs on the previous version of TorchBench (ghcr.io/pytorch/torchbench:dev20230619). It looks like it has been present since DALLE2 was first included in TorchBench. I'm not sure whether we can fix it on our side or whether it needs a fix in the upstream repo, since we have limited control over the model's `__init__.py`. I'll give it a try.
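One possible harness-side mitigation, sketched below as an assumption rather than a confirmed fix (`safe_backward` is a hypothetical helper, not an existing TorchBench function), would be to guard the backward call in the accuracy-check path so a detached loss is skipped instead of crashing:

```python
import torch

def safe_backward(loss: torch.Tensor) -> bool:
    """Run backward() only when the loss is attached to the autograd graph.

    Returns True if a backward pass was actually performed.
    """
    if not loss.requires_grad:
        # No grad_fn: calling backward() here would raise the RuntimeError
        # from the report, so skip it instead.
        return False
    # retain_graph=True mirrors the call in env_check.py's
    # forward_and_backward_pass.
    loss.backward(retain_graph=True)
    return True
```

This only papers over the symptom, of course; if the loss is unexpectedly detached, the real fix is likely in the model's setup code, as discussed above.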
