Update organization and tag to V1 #150

perifaws · 2024-02-22T17:07:03Z

Addresses #149

Updates naming for JAX and for single-digit examples, and adds a new directory for best practices (the EFA cheat sheet was previously in architectures).

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

mhuguesaws · 2024-02-22T20:04:41Z

Please stop numbering things. It is useless and will make adding and removing things much harder.

Why is AMI 1. and container 2.? What's the logic?
efa_version.sh is in 5 best practices. So best practices come last?

mhuguesaws · 2024-02-23T16:57:21Z

Here is my proposal

- docs/
- core_infra/
- orchestrators/
  - aws-parallelcluster/
  - sagemaker-hyperpod/
    - slurm/
      - lifecycle-scripts/
  - amazon-eks/
  - aws-batch/
- observability/
- ml-frameworks/
  - [FRAMEWORK_NAME]/
    - slurm/
    - kubernetes/
    - Dockerfile
    - README
- ml-micro-benchmarks/
- infra-validation/
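
For anyone who wants to try this layout locally, here is a minimal sketch that scaffolds it in a temporary directory. The `LAYOUT` dict and the `scaffold` helper are hypothetical names used only for illustration, not something in the repo, and `FRAMEWORK_NAME` is kept as a literal placeholder.

```python
import tempfile
from pathlib import Path

# Proposed layout from the list above. A dict is a directory, None is an
# empty file. "FRAMEWORK_NAME" is a placeholder, not a real framework.
LAYOUT = {
    "docs": {},
    "core_infra": {},
    "orchestrators": {
        "aws-parallelcluster": {},
        "sagemaker-hyperpod": {"slurm": {"lifecycle-scripts": {}}},
        "amazon-eks": {},
        "aws-batch": {},
    },
    "observability": {},
    "ml-frameworks": {
        "FRAMEWORK_NAME": {
            "slurm": {},
            "kubernetes": {},
            "Dockerfile": None,
            "README": None,
        },
    },
    "ml-micro-benchmarks": {},
    "infra-validation": {},
}


def scaffold(root: Path, tree: dict) -> list[Path]:
    """Create the directories and empty files described by tree under root."""
    created = []
    for name, child in tree.items():
        path = root / name
        if child is None:
            # Leaf file: make sure the parent exists, then create it empty.
            path.parent.mkdir(parents=True, exist_ok=True)
            path.touch()
        else:
            # Directory: create it, then recurse into its children.
            path.mkdir(parents=True, exist_ok=True)
            created.extend(scaffold(path, child))
        created.append(path)
    return created


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for p in sorted(scaffold(Path(tmp), LAYOUT)):
            print(p.relative_to(tmp))
```

Because entries are keyed by name rather than by a numeric prefix, adding or removing a directory is a one-line change to the dict.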

mhuguesaws · 2024-02-23T17:03:11Z

@awsankur if you can comment here.

awsankur · 2024-02-23T22:02:36Z

I like this structure. A couple of comments:

1. The observability solution will depend on the orchestrator, so we should have an observability section as part of each orchestrator. Ideally, observability is enabled automatically when we build a cluster, or can be enabled in a few steps on a cluster we have already built.
2. I am assuming [FRAMEWORK_NAME] is a test case in our current structure. I think we can bring more clarity here. So we can have a given set of [FRAMEWORK_NAMES] which include:
   a. NVIDIA [NeMo, NeMo-Multimodal, BioNeMo, etc.]; we can add DALI and MONAI in the future
   b. MosaicML [MPT, etc.]
   c. PyTorch [DDP, FSDP, etc.]
   d. SM [DataParallel, Model Parallel, FSDP, etc.]
   e. TensorFlow
   f. JAX

Within each FRAMEWORK_NAME we can have Dockerfiles, sbatch scripts, Kubernetes YAML, and other necessary files for each model name.
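
A rough sketch of what the vendor grouping could look like on disk. The lowercase directory slugs, the `FRAMEWORKS` mapping, and the `framework_paths` helper are all assumptions for illustration; none of these names have been agreed on.

```python
from pathlib import PurePosixPath

# Vendor -> entries, taken from the list above. The lowercase slugs are
# hypothetical; the thread does not settle on directory names.
FRAMEWORKS = {
    "nvidia": ["nemo", "nemo-multimodal", "bionemo"],
    "mosaicml": ["mpt"],
    "pytorch": ["ddp", "fsdp"],
    "sm": ["data-parallel", "model-parallel", "fsdp"],
    "tensorflow": [],
    "jax": [],
}


def framework_paths(root: str = "ml-frameworks") -> list[str]:
    """Return one path per vendor entry, or the vendor directory if it has none."""
    paths = []
    for vendor, entries in FRAMEWORKS.items():
        base = PurePosixPath(root) / vendor
        if not entries:
            paths.append(str(base))
        for entry in entries:
            paths.append(str(base / entry))
    return paths


if __name__ == "__main__":
    print("\n".join(framework_paths()))
```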

Thoughts?

mhuguesaws · 2024-02-23T22:12:32Z

> I like this structure. A couple of comments:
>
> 1. The observability solution will depend on the orchestrator, so we should have an observability section as part of each orchestrator. Ideally, observability is enabled automatically when we build a cluster, or can be enabled in a few steps on a cluster we have already built.
> 2. I am assuming [FRAMEWORK_NAME] is a test case in our current structure. I think we can bring more clarity here. So we can have a given set of [FRAMEWORK_NAMES] which include:
>    a. NVIDIA [NeMo, NeMo-Multimodal, BioNeMo, etc.]; we can add DALI and MONAI in the future
>    b. MosaicML [MPT, etc.]
>    c. PyTorch [DDP, FSDP, etc.]
>    d. SM [DataParallel, Model Parallel, FSDP, etc.]
>    e. TensorFlow
>    f. JAX
>
> Within each FRAMEWORK_NAME we can have Dockerfiles, sbatch scripts, Kubernetes YAML, and other necessary files for each model name.
>
> Thoughts?

Love 2. organize by "vendor"

For 1., I don't think we'll go beyond Grafana + Prometheus at this point. We can organize the observability section by orchestrator, since the deployment and setup will be different.

perifaws · 2024-02-23T22:14:24Z

@mhuguesaws how about CloudWatch or profilers like Nsight?

mhuguesaws · 2024-02-23T22:16:35Z

> @mhuguesaws how about CloudWatch or profilers like Nsight?

Profiler in profiler ;)

awsankur · 2024-02-23T22:22:22Z

We should add Nsight

mhuguesaws · 2024-02-23T22:22:54Z

> We should add Nsight

profilers.

KeitaW · 2024-03-11T22:37:40Z

I was wondering whether observability should live under orchestrators or have a subdirectory per orchestrator.

Update organization and tag

c8e202c

perifaws requested review from verdimrc, sean-smith, KeitaW, mhuguesaws and awsankur February 22, 2024 17:07

mhuguesaws mentioned this pull request Apr 11, 2024

Move nccl into micro-benchmarks #256

Merged
