Script to probe the nccl libraries that PyTorch uses #267

Open
wants to merge 4 commits into
base: main

verdimrc
Contributor

@verdimrc verdimrc commented Apr 16, 2024

Issue #, if available: close #252

Description of changes: Probe which NCCL stack PyTorch actually uses.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@verdimrc verdimrc force-pushed the probe-pt-dlopen-nccl-aws-ofi-nccl branch from cf57d36 to 07a509c Compare April 19, 2024 07:01
@perifaws
Contributor

Should we cancel this, or do we move forward with it?

@verdimrc verdimrc changed the title Script to probe the nccl libraries that PyTorch dlopens Script to probe the nccl libraries that PyTorch uses May 2, 2024
@verdimrc verdimrc force-pushed the probe-pt-dlopen-nccl-aws-ofi-nccl branch from b6f4189 to 6463723 Compare May 2, 2024 10:02
@verdimrc
Contributor Author

verdimrc commented May 2, 2024

Ready for review.

@verdimrc verdimrc marked this pull request as ready for review May 2, 2024 10:02
Contributor

@sean-smith sean-smith left a comment

I'd like to see this produce more structured data, i.e., allow it to be used within other Python scripts.

The scripts in this folder disambiguate the exact NCCL libraries that a PyTorch application actually
uses, in the presence of potentially multiple installed versions.

## 1. Motivation
Contributor

Include usage at the top

By now, hopefully you're convinced on the need for a runtime probe to pinpoint the version of NCCL
loaded by PyTorch.

## 2. Howto
Contributor

How to

Via Slurm: `srun -l -N1 ./probe-pt-nccl-aws-libs.sh`

```console
$ srun -l -N1 ./probe-pt-nccl-aws-libs.sh
```
Contributor

Can this be used as a library within another script? For example, I want to use it in my efa-versions.py to grab the NCCL version, but invoking the shell script and grepping the output is less than ideal.

0: cat /opt/amazon/efa_installed_packages:
0: # EFA installer version: 1.30.0
# Debug packages installed: no
0: # Packages installed:
Contributor

Can we exclude this?


echo
echo "cat /opt/amazon/efa_installed_packages:"
cat /opt/amazon/efa_installed_packages
Contributor

remove

@verdimrc
Contributor Author

verdimrc commented May 3, 2024

> I'd like to see this give more structured data i.e allow it to be used within other python scripts

My quick 2c: for this purpose, it's best to re-implement the logic in pure Python, because the bulk of the work is done by strace. Repurposing a shell script as a "library" that another Python script calls is just too spaghetti, and a maintenance nightmare.

Plus, with Python you'll have access to a proper representation of the structured object, and I think this is a better way to get a machine-readable version for Python land.

My shell script is meant for human readability, and in its current form it is far from the automation purpose you might have in mind.
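
To make the pure-Python direction a bit more concrete, here is a minimal sketch. It is not the script in this PR. It assumes the caller has already captured strace output (for example with `strace -f -e trace=open,openat`), and that the libraries of interest can be recognized by file name: `libnccl.so*` for NCCL, `libnccl-net*` for the aws-ofi-nccl plugin, and `libfabric.so*` for libfabric. The names `LoadedLib`, `parse_strace_opens`, and `to_records` are hypothetical and exist only for illustration.

```python
import re
from dataclasses import asdict, dataclass
from typing import List

# Assumption: each relevant strace line looks roughly like
#   [pid 42] openat(AT_FDCWD, "/some/dir/libnccl.so.2", O_RDONLY|O_CLOEXEC) = 3
# Failed opens end with a negative return value, e.g. "= -1 ENOENT (...)".
_OPEN_RE = re.compile(
    r'\bopen(?:at)?\((?:[^,"]+,\s*)?"(?P<path>[^"]+)"[^)]*\)\s*=\s*(?P<ret>-?\d+)'
)

# Hypothetical classification by file name; adjust to the libraries you care about.
_KINDS = {
    "nccl": "libnccl.so",
    "nccl_net_plugin": "libnccl-net",
    "libfabric": "libfabric.so",
}


@dataclass(frozen=True)
class LoadedLib:
    kind: str  # one of the keys in _KINDS
    path: str  # path exactly as it appears in the strace output


def parse_strace_opens(strace_text: str) -> List[LoadedLib]:
    """Return the libraries of interest that were opened successfully."""
    seen = set()
    libs = []
    for line in strace_text.splitlines():
        match = _OPEN_RE.search(line)
        if match is None or int(match.group("ret")) < 0:
            continue  # not an open call, or the open failed
        path = match.group("path")
        name = path.rsplit("/", 1)[-1]
        for kind, needle in _KINDS.items():
            if needle in name and path not in seen:
                seen.add(path)  # report each path once, in first-seen order
                libs.append(LoadedLib(kind=kind, path=path))
                break
    return libs


def to_records(libs: List[LoadedLib]) -> List[dict]:
    """Convert the results to plain dicts, e.g. for json.dumps()."""
    return [asdict(lib) for lib in libs]
```

Another script could then call `parse_strace_opens(text)` on the captured output and work with the resulting `LoadedLib` records directly, or pass `to_records(...)` to `json.dumps`, instead of grepping human-oriented shell output.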

Successfully merging this pull request may close these issues.

PyTorch-based utility to recommend env vars.
3 participants