Issue Description
I get a RuntimeError related to gradient computation when I enable accuracy checks while training DALLE2_pytorch in a GPU Docker environment. Training runs without issues when the --accuracy flag is not used.
Steps to Reproduce
python install.py DALLE2_pytorch
python run.py DALLE2_pytorch -d cuda -t train --accuracy
Expected Behavior
The training process should run without errors and perform accuracy checks without causing runtime errors.
Actual Behavior
The script executes successfully without the --accuracy flag.
However, when the accuracy check is enabled, it fails with the following error message:
fp64 golden ref were not generated for DALLE2_pytorch. Setting accuracy check to cosine
element 0 of tensors does not require grad and does not have a grad_fn
Traceback (most recent call last):
  File "/benchmark/torchbenchmark/util/env_check.py", line 635, in check_accuracy
    correct_result = run_n_iterations(
  File "/benchmark/torchbenchmark/util/env_check.py", line 504, in run_n_iterations
    _model_iter_fn(mod, inputs, contexts, optimizer, collect_outputs=False)
  File "/benchmark/torchbenchmark/util/env_check.py", line 497, in _model_iter_fn
    return forward_and_backward_pass(
  File "/benchmark/torchbenchmark/util/env_check.py", line 480, in forward_and_backward_pass
    DummyGradScaler().scale(loss).backward(retain_graph=True)
  File "/venv_cuda/pytorch/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/venv_cuda/pytorch/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
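For context, PyTorch raises this error whenever backward() is called on a tensor that is not attached to an autograd graph. The sketch below reproduces the same message in isolation. It is not taken from TorchBench or DALLE2_pytorch, the helper name try_backward is hypothetical, and it does not establish why the loss in this model ends up without a grad_fn.

```python
import torch


def try_backward(loss: torch.Tensor) -> str:
    """Call backward() on `loss`; return "ok" or the RuntimeError message."""
    try:
        loss.backward()
        return "ok"
    except RuntimeError as exc:
        return str(exc)


# Loss built only from tensors that do not track gradients:
# it has requires_grad=False and no grad_fn, so backward() fails.
untracked_loss = (torch.ones(3) * 2.0).sum()

# Same computation, but the input tracks gradients, so the loss has a grad_fn.
tracked_loss = (torch.ones(3, requires_grad=True) * 2.0).sum()

print(try_backward(untracked_loss))  # the "does not require grad" message
print(try_backward(tracked_loss))    # ok
```

If the loss passed to forward_and_backward_pass resembles untracked_loss here (for example because it was computed under torch.no_grad() or from detached outputs), the backward call in the traceback would fail in the same way. That is an assumption about this model, not a confirmed diagnosis.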
Additional Context
PyTorch version: 2.2.2
CUDA version: 12.4.0.041
@xuzhao9 The problem also occurs on the previous version of TorchBench (ghcr.io/pytorch/torchbench:dev20230619). It appears to have been present since DALLE2 was first included in TorchBench. I'm not sure whether we can fix it on our side or whether it has to be fixed in the upstream repo, since we have limited control over the model's __init__.py. I'll give it a try.