How do you specify which GPU to run an MLPerf benchmark on with CM? #1246

kevinnegy opened this issue May 7, 2024 · 6 comments

@kevinnegy
I know that in CM you can specify --device=cuda to use GPUs instead of CPUs, but how do you choose between GPUs if you have multiple on your system?

I know onnxruntime and pytorch have their own ways of specifying GPUs, but I'm not sure which scripts to modify for each benchmark. I'm hoping there is a global CM argument for GPU index that can be applied across benchmarks. If one exists, could someone point me to it?
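For concreteness, this is roughly what I mean by the per-framework ways of picking a GPU (device index 1 and the model path are just placeholders, not taken from any MLPerf script):

```python
import torch
import onnxruntime as ort

# PyTorch: select a specific GPU by index when creating the device.
device = torch.device("cuda:1")

# ONNX Runtime: pass device_id as a CUDAExecutionProvider option.
# "model.onnx" is a placeholder path, not an actual MLPerf model file.
session = ort.InferenceSession(
    "model.onnx",
    providers=[("CUDAExecutionProvider", {"device_id": 1})],
)
```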

Thank you.

@arjunsuresh

@kevinnegy Are you trying to run the reference implementations using CM? CM doesn't currently have such a flag, but one could easily be added. The problem is that the underlying implementation for each framework, across all the benchmarks, must also support it. Since the reference implementations are meant as a "reference" and not really for benchmarking, we haven't seen such a request so far. Are you targeting a specific benchmark? If so, this can be done.

@kevinnegy

@arjunsuresh Yes, I'm trying to run the reference implementations using CM. My understanding was that MLPerf (including the reference implementations) could be used for benchmarking GPUs to measure performance. Is that not correct?

I had hoped to benchmark the 9 reference workloads in the mlperf inference repo, but at the very least getting just BERT and RNNT to have the GPU option would be super helpful.

I really appreciate any help you can provide.

@arjunsuresh

@kevinnegy Reference implementations are not well suited for benchmarking systems because most of them lack basic optimizations such as batching and multi-GPU support. If you want to benchmark Nvidia GPUs, the Nvidia implementation is the way to go, and it is supported in CM. All of the benchmarks should take no more than a day or two to complete, except DLRM, which needs days just to get the dataset.

@kevinnegy

@arjunsuresh Thank you for the suggestion. I'm assuming this is what you had in mind for the Nvidia implementation of BERT with CM?

As for specifying which GPU, I was able to brute force RNNT with pytorch by switching every reference to torch.device("cuda:0") in the RNNT python scripts to whatever device I want. I was unable to do the same with the reference BERT with onnxruntime. Is there any hacky method like this to pick a device for BERT (reference onnxruntime and Nvidia-implementation tensorrt) just so I can get up and running, in case a global CM parameter takes a while to implement?
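For reference, the RNNT hack looked roughly like this (MLPERF_GPU_INDEX is just a name I made up for illustration; the reference scripts simply hardcode cuda:0):

```python
import os
import torch

# Sketch of the hand edit described above: instead of hardcoding
# torch.device("cuda:0") in each RNNT script, read the GPU index from an
# environment variable (MLPERF_GPU_INDEX is a made-up name, not a real
# MLPerf or CM variable) and fall back to GPU 0.
gpu_index = int(os.environ.get("MLPERF_GPU_INDEX", "0"))
device = torch.device(f"cuda:{gpu_index}")
print(device)  # e.g. cuda:1 when MLPERF_GPU_INDEX=1 is exported
```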

@arjunsuresh

@kevinnegy yes, that's correct regarding the Nvidia implementation.

The Nvidia implementation runs inside Docker, and I believe that by passing the appropriate docker run flag you can enable/disable the required GPUs. By default, the implementation uses all the GPUs it sees inside the docker container. If you can wait until the middle of next week, I should be able to give you this option via CM so that it works with the 4.0 inference submissions - this is in progress now.

For onnxruntime, adding --env.CUDA_VISIBLE_DEVICES="1" to the CM run command should make it run on device 1. But the reference implementations are at least 10x slower than the Nvidia TensorRT implementation, so this may not be a worthwhile exercise.
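Note that with CUDA_VISIBLE_DEVICES set, the process only sees the listed GPU and it gets renumbered as device 0. A quick sanity check like this (assuming PyTorch is available in the environment) should confirm which GPU the benchmark actually uses:

```python
# With CUDA_VISIBLE_DEVICES="1" exported, the process should see exactly
# one GPU, exposed to the framework as cuda:0.
import torch

print(torch.cuda.device_count())      # expected: 1
print(torch.cuda.get_device_name(0))  # name of the physical GPU with index 1
```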

@kevinnegy

@arjunsuresh Awesome! That environment variable worked for me! And yes, I can wait for the 4.0 CM option. Thanks so much!
