INTERNAL ASSERT FAILED #203

Open
Qicheng-WANG opened this issue May 2, 2023 · 5 comments

Comments

@Qicheng-WANG

Hi there,
When I ran the quick test "python3 -m tutel.examples.helloworld --batch_size=16", it failed with the following error:
RuntimeError: (true) == (fp != nullptr)INTERNAL ASSERT FAILED at "/ssdisk2/tutel/tutel/custom/custom_kernel.cpp":46, please report a bug to PyTorch. CHECK_EQ fails.
Could you help me fix it? Thanks.

@Qicheng-WANG
Author

It also showed:
[image]
I am using an NVIDIA 3090 with CUDA 11.3.

@ghostplant
Contributor

  1. Does print(torch.cuda.get_arch_list()) include sm_86?
  2. Can you try export USE_NVRTC=1 before running the example?
  3. Are you sure there is no other old CUDA installed so that an old nvcc command was wrongly called for this compilation?
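
The three checks above can be gathered into one short script. This is only a minimal sketch and not part of tutel: the helper names `capability_to_arch`, `arch_is_listed`, and `nvcc_on_path` are made up for illustration. The exact-match test is deliberately stricter than CUDA's real compatibility rules, so a `False` result means "look closer", not "definitely unsupported".

```python
import shutil
import subprocess


def capability_to_arch(major, minor):
    """Format a (major, minor) compute capability as an 'sm_XY' tag."""
    return f"sm_{major}{minor}"


def arch_is_listed(capability, arch_list):
    """True if the exact sm_XY tag for this capability appears in arch_list."""
    return capability_to_arch(*capability) in arch_list


def nvcc_on_path():
    """Return (path, `nvcc --version` output) for the first nvcc on PATH, or None."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return nvcc, result.stdout


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None  # PyTorch missing: only the nvcc check below will run

    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        print("PyTorch built for CUDA:", torch.version.cuda)
        print("device arch:", capability_to_arch(*cap))
        print("arch list:", archs)
        print("device arch listed:", arch_is_listed(cap, archs))

    # Shows which nvcc would actually be picked up (question 3).
    print("nvcc:", nvcc_on_path())
```

If the printed nvcc path points into an older CUDA directory than the one PyTorch was built for, that mismatch is worth fixing before trying anything else.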

@monster119120

> 1. Does print(torch.cuda.get_arch_list()) include sm_86?
> 2. Can you try export USE_NVRTC=1 before running the example?
> 3. Are you sure there is no other old CUDA installed so that an old nvcc command was wrongly called for this compilation?

Hi! I am running tutel on a Jetson Nano B01 (4GB version).
I hit the same error: "RuntimeError: (true) == (fp != nullptr)INTERNAL ASSERT FAILED at "/ssdisk2/tutel/tutel/custom/custom_kernel.cpp".

On the Nano:
1. print(torch.cuda.get_arch_list()) gives ['sm_53', 'sm_62', 'sm_72'].
2. I set export USE_NVRTC=1, but a different error occurred.
3. My nvcc version is 10.2.3.
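
When the USE_NVRTC=1 run fails with a different error, capturing its full output makes the report easier to act on. The sketch below is one way to do that from Python. It is an assumption-laden illustration, not a tutel utility: `nvrtc_env` and `run_helloworld` are hypothetical helper names, and the command is the helloworld invocation quoted at the top of this thread.

```python
import os
import subprocess
import sys


def nvrtc_env(base_env):
    """Return a copy of base_env with USE_NVRTC=1 set; base_env is left untouched."""
    env = dict(base_env)
    env["USE_NVRTC"] = "1"
    return env


def run_helloworld(batch_size=16):
    """Run the helloworld example with USE_NVRTC=1.

    Returns (exit code, combined stdout and stderr as one string).
    """
    cmd = [
        sys.executable,
        "-m",
        "tutel.examples.helloworld",
        f"--batch_size={batch_size}",
    ]
    result = subprocess.run(
        cmd,
        env=nvrtc_env(os.environ),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # keep the traceback in order with normal output
        text=True,
    )
    return result.returncode, result.stdout
```

Calling `code, log = run_helloworld()` and pasting `log` into the issue gives the complete traceback rather than a summary of it.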

@ghostplant
Contributor

This is a PyTorch + CUDA problem, not a tutel one. You need a PyTorch build based on at least cu117/cu118, so that torch.cuda.get_arch_list() includes sm_86.
You also need to update your CUDA SDK (e.g. to 12.0), since NVIDIA's newer GPUs are not compatible with older NVCC releases.

@ghostplant
Contributor

ghostplant commented Aug 12, 2023

CUDA 10.2.3 is too old: it cannot support any GPU newer than the V100 generation (sm_7x). CUDA 11 should support A100-class GPUs, and CUDA 12 should support H100-class GPUs. After upgrading the CUDA SDK, please also reinstall a PyTorch build based on at least cu118.
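
The toolkit-versus-architecture reasoning above can be turned into a quick check. What follows is a rough sketch under stated assumptions: the `MIN_TOOLKIT` table is a partial list written from memory of NVIDIA's release notes and should be verified against them, and `parse_nvcc_release` and `toolkit_can_target` are hypothetical helper names. It only looks at the nvcc release and says nothing about which architectures a given PyTorch build was compiled for.

```python
import re

# Assumed minimum CUDA toolkit release able to compile for each architecture.
# Partial and unverified: check NVIDIA's release notes before relying on it.
MIN_TOOLKIT = {
    "sm_70": (9, 0),   # V100
    "sm_75": (10, 0),  # Turing
    "sm_80": (11, 0),  # A100
    "sm_86": (11, 1),  # RTX 30 series, e.g. 3090
}


def parse_nvcc_release(version_text):
    """Extract (major, minor) from `nvcc --version` output, or None if not found."""
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def toolkit_can_target(release, arch):
    """True/False if the table covers arch, None if arch is not in the table."""
    minimum = MIN_TOOLKIT.get(arch)
    if minimum is None:
        return None
    return release >= minimum
```

For example, feeding the `nvcc --version` text of a 10.2 toolkit through `parse_nvcc_release` and then calling `toolkit_can_target(release, "sm_80")` reports that this toolkit predates A100 support.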
