
Libfabric Error with NCCL 2.19+ #278

Open
sean-smith opened this issue Apr 19, 2024 · 0 comments
Labels
Troubleshooting Tips These are informational to make it easier to troubleshoot common issues.

@sean-smith
Contributor

If you see the following warnings in your job output after enabling libfabric logging with FI_LOG_LEVEL=info:

libfabric:652244:1713524816::core:core:cuda_set_sync_memops():207<warn> Failed to perform cuPointerSetAttribute: CUDA_ERROR_NOT_SUPPORTED:operation not supported
libfabric:652244:1713524816::efa:mr:efa_mr_hmem_setup():254<warn> unable to set memops for cuda ptr
libfabric:652244:1713524816::efa:mr:efa_mr_regattr():1014<warn> Unable to register MR: Invalid argument

you can resolve it by setting the following flag:

export FI_EFA_SET_CUDA_SYNC_MEMOPS=0
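A minimal sketch of applying the workaround in a job launch script, assuming every rank inherits the environment of the shell that starts it. FI_LOG_LEVEL is libfabric's logging variable. Per the libfabric EFA provider documentation, FI_EFA_SET_CUDA_SYNC_MEMOPS controls whether CU_POINTER_ATTRIBUTE_SYNC_MEMOPS is set on CUDA buffers during memory registration. The commented-out `srun python train.py` line is a hypothetical placeholder, and some launchers (e.g. mpirun) need their own option to forward environment variables to remote ranks.

```shell
#!/usr/bin/env bash
set -euo pipefail

# warn-level logging is enough to surface the cuda_set_sync_memops() messages above.
export FI_LOG_LEVEL=warn

# Workaround: do not set CU_POINTER_ATTRIBUTE_SYNC_MEMOPS when the EFA
# provider registers CUDA memory.
export FI_EFA_SET_CUDA_SYNC_MEMOPS=0

# Echo the effective settings so they are recorded in the job log.
env | grep -E '^FI_(LOG_LEVEL|EFA_SET_CUDA_SYNC_MEMOPS)=' | sort

# Hypothetical launch command; replace with your own launcher and script.
# srun python train.py
```

Printing the variables at startup makes it easy to confirm from the job log that the workaround was actually in effect for a given run.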

This affects the following versions:

  • EFA 1.26.0
  • NCCL 2.19+
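To check whether an existing job log shows this failure, a small helper can search for the warning signature quoted above. This is a hypothetical sketch: the function name `check_sync_memops_error` and its messages are illustrative, and it only matches the two log strings shown in this issue.

```shell
#!/usr/bin/env bash
# Hypothetical helper: scan a job log for the libfabric warning signature above.
# Prints a hint and returns 0 if found; prints "not found" and returns 1 otherwise.
check_sync_memops_error() {
  local log_file="$1"
  # In ERE, \( and \) match literal parentheses in "cuda_set_sync_memops()".
  if grep -qE 'cuda_set_sync_memops\(\)|unable to set memops for cuda ptr' "$log_file"; then
    echo "found: set FI_EFA_SET_CUDA_SYNC_MEMOPS=0 and rerun"
    return 0
  fi
  echo "not found"
  return 1
}
```

Usage: `check_sync_memops_error slurm-1234.out`, where the log file name is a placeholder for your own job output.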
@sean-smith added the Troubleshooting Tips label Apr 25, 2024