AWS SageMaker HyperPod Distributed Training Reference Architectures

Important

🚨 We recommend following the official Amazon SageMaker HyperPod Workshop to deploy clusters; it contains detailed instructions and the latest best practices. The deployment steps below are no longer kept up to date with the latest best practices.

1. Architectures

SageMaker HyperPod lets you create customized clusters, typically with one or more head and login nodes, multiple compute nodes (typically P4/P5 or Trn1 instances), and optionally a shared FSx for Lustre file system. When configured with Slurm, SageMaker HyperPod provides resiliency tools that automatically identify and replace unhealthy compute nodes. Additionally, HyperPod clusters have access to SageMaker training tools, such as the SageMaker Model and Data Parallel packages, and are automatically configured for EFA.

The example that follows walks through setting up a SageMaker HyperPod cluster with an attached FSx for Lustre volume.

2. Prerequisites

Tip

For the latest deployment instructions, follow the Amazon SageMaker HyperPod Workshop.

Before creating a cluster, we need to install the latest AWS CLI and set up the appropriate IAM role, VPC, FSx for Lustre volume, and S3 bucket.

2.1. Create an IAM role

First, we need to create a role with SageMaker and related resources as trusted entities.

aws iam create-role \
    --role-name AmazonSagemakerExecutionRole \
    --assume-role-policy-document file://0.AmazonSageMakerClustersExecutionRoleTrustedEntities.json

Next, we create a role policy...

aws iam create-policy \
    --policy-name AmazonSagemakerExecutionPolicy \
    --policy-document file://1.AmazonSageMakerClustersExecutionRolePolicy.json

...and attach it to the role we just created.

POLICY=$(aws iam list-policies --query 'Policies[?PolicyName==`AmazonSagemakerExecutionPolicy`]' | jq '.[0].Arn' |  tr -d '"')
aws iam attach-role-policy \
    --role-name AmazonSagemakerExecutionRole \
    --policy-arn $POLICY
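
To confirm the policy is attached, you can list the role's attached policies:

aws iam list-attached-role-policies \
    --role-name AmazonSagemakerExecutionRole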

2.2. Create an S3 bucket

Next, we'll need an S3 bucket. This bucket will store the lifecycle scripts used to set up and configure our cluster.

# generate a unique name for the bucket
BUCKET="sagemaker-lifecycle-$(python3 -S -c 'import uuid; print(str(uuid.uuid4().hex)[:10])')"

# create the bucket
aws s3 mb s3://${BUCKET}

2.3. Create VPC (Optional)

Now we can create a VPC. This is only necessary if you want to attach your HyperPod cluster to VPC-specific resources, for example a shared FSx for Lustre volume.

You can create a VPC using the configuration in 2.SageMakerVPC.yaml, which is also available via 1-Click Deploy 🚀

Feel free to change the stack and VPC names. Make sure to select an availability zone that supports your preferred instance type (see Find an Amazon EC2 instance type). Leave both the S3 and DynamoDB endpoints set to True. You can leave the IAM role blank.

Wait for this CloudFormation stack to finish deploying before continuing to the next step (it takes about 5 minutes).
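
If you'd rather wait from the command line, the CloudFormation wait helper blocks until stack creation completes (a sketch assuming you kept the stack name SageMakerVPC):

aws cloudformation wait stack-create-complete --stack-name SageMakerVPC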

2.4. Create FSx for Lustre (Optional)

FSx for Lustre provides a shared, high-performance file system that's accessible from all nodes in your HyperPod cluster.

Similar to the VPC we just created, you can create an FSx for Lustre volume using 3.FSxLustre.yaml, or via 1-Click Deploy 🚀

Change the name, capacity, throughput, and compression configurations as you wish. Select the latest Lustre version (2.15 by default). For NetworkStack, make sure the value matches the name (case-sensitive) of the VPC CloudFormation stack from the previous step. Once again, you can leave the IAM role blank.

Your FSx for Lustre volume will take about 10 minutes to deploy. In the meantime, we can set up our lifecycle scripts.
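
You can poll the volume's state from the CLI; it's ready once its lifecycle reaches AVAILABLE (a sketch assuming this is the only file system in your account and region):

aws fsx describe-file-systems --query 'FileSystems[0].Lifecycle' --output text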

3. Cluster setup

Now that we have all our infrastructure in place, we can create a cluster.

3.1 Lifecycle scripts

Lifecycle scripts tell SageMaker HyperPod how to set up your cluster. HyperPod clusters can be launched as plain EC2 clusters with nothing installed, or created with configurations and users customized to fit a particular machine learning development workflow. We provide a base configuration to get started, which creates a basic Slurm cluster. Below is a brief description of what each script does.

| Script | Description |
| ------ | ----------- |
| add_users.sh | [Optional] Creates POSIX users specified in the file shared_users.txt. |
| config.py | Configuration file for the lifecycle scripts. |
| lifecycle_script.py | The main entry point; sets everything else up. |
| mount_fsx.sh | Mounts an FSx for Lustre file system. |
| on_create.sh | Entry point for clusters; calls lifecycle_script.py. |
| provisioning_parameters.json | Defines the scheduler type (Slurm), sets up the partitions, and specifies the FSx for Lustre file system to attach. We'll modify this in a later step. |
| setup_mariadb_accounting.sh | Sets up Slurm accounting with a local MariaDB server running on the head node. |
| setup_rds_accounting.sh | Sets up Slurm accounting with an RDS endpoint. |
| setup_sssd.py | Sets up Active Directory/LDAP integration with SSSD. |
| shared_users_sample.txt | Sample of how to specify users for the add_users.sh script. |
| start_slurm.sh | Starts the Slurm scheduler daemon. |

If you want to use Docker containers, you can install Docker, Enroot, and Pyxis by setting Config.enable_docker_enroot_pyxis in config.py to True (the default).

# Basic configuration parameters
class Config:

    # Set true if you want to install Docker/Enroot/Pyxis.
    enable_docker_enroot_pyxis = True

You can edit lifecycle_script.py for further customization. For example, if you'd like to install Miniconda as part of your lifecycle scripts, you can add the script under utils and call it using ExecuteBashScript in lifecycle_script.py.

For now, let's use the base configuration as provided. Upload the scripts to the bucket you created earlier; this is the same S3 bucket and prefix that the cluster configuration will reference later.

aws s3 cp --recursive LifecycleScripts/base-config s3://${BUCKET}/LifecycleScripts/base-config
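
You can verify the upload with:

aws s3 ls --recursive s3://${BUCKET}/LifecycleScripts/base-config/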

If you created an FSx for Lustre volume in the previous section, we'll need to update one file in the lifecycle scripts to attach it to our cluster.

First, get the DNSName and MountName for your file system. These can be found either in the AWS console or with the CLI command

aws fsx describe-file-systems
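
If you'd rather extract just those two values, a --query expression works (a sketch assuming a single file system; note that MountName lives under LustreConfiguration):

aws fsx describe-file-systems \
    --query 'FileSystems[0].{DNSName: DNSName, MountName: LustreConfiguration.MountName}'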

Add both to your provisioning_parameters.json file. For example,

{
  "version": "1.0.0",
  "workload_manager": "slurm",
  "controller_group": "controller-machine",
  "login_group": "my-login-group",
  "worker_groups": [
    {
      "instance_group_name": "compute-nodes",
      "partition_name": "dev"
    }
  ],
"fsx_dns_name": "fs-12345678a90b01cde.fsx.us-west-2.amazonaws.com",
"fsx_mountname": "1abcdefg"
}

Make sure the instance_group_name matches the InstanceGroupName in the HyperPod cluster config we'll create next.

Copy the updated provisioning_parameters.json to S3:

aws s3 cp LifecycleScripts/base-config/provisioning_parameters.json s3://${BUCKET}/LifecycleScripts/base-config/

Lifecycle scripts can be reused across multiple clusters. This is particularly handy if you want to move the work saved on your FSx for Lustre volume to a new cluster.

3.2 Cluster configuration

Next we can configure our actual cluster.

In this case, we'll create a cluster with one ml.c5.xlarge controller node and four ml.trn1.32xlarge compute nodes, using the IAM role we created earlier. Note that InstanceGroupName for the controller node needs to match controller_group in your provisioning_parameters.json, while InstanceGroupName for your compute nodes needs to match instance_group_name under worker_groups, also in provisioning_parameters.json.

ROLE=$(aws iam list-roles --query 'Roles[?RoleName==`AmazonSagemakerExecutionRole`]' | jq '.[0].Arn' |  tr -d '"')
cat > cluster-config.json << EOL
[
  {
    "InstanceGroupName": "controller-machine",
    "InstanceType": "ml.c5.xlarge",
    "InstanceCount": 1,
    "LifeCycleConfig": {
      "SourceS3Uri": "s3://${BUCKET}/LifecycleScripts/base-config/",
      "OnCreate": "on_create.sh"
    },
    "ExecutionRole": "${ROLE}",
    "ThreadsPerCore": 1
  },
  {
    "InstanceGroupName": "compute-nodes",
    "InstanceType": "ml.trn1.32xlarge",
    "InstanceCount": 4,
    "LifeCycleConfig": {
      "SourceS3Uri": "s3://${BUCKET}/LifecycleScripts/base-config/",
      "OnCreate": "on_create.sh"
    },
    "ExecutionRole": "${ROLE}",
    "ThreadsPerCore": 1
  }
]
EOL

Finally, if you created a VPC and FSx for Lustre volume, we need a VPC configuration so that your HyperPod cluster is created in the correct VPC.

cat > vpc-config.json << EOL
{
  "SecurityGroupIds": ["$SECURITY_GROUP"],
  "Subnets":["$SUBNET_ID"]
}
EOL

Your SUBNET_ID can be found using

aws fsx describe-file-systems

{
    "FileSystems": [
        {
            ...
            "StorageType": "SSD",
            "VpcId": "vpc-0123456789012345a",
            "SubnetIds": [
                "subnet-01a2bc3456d78efgh"
            ],
            ...


Your SECURITY_GROUP was created by the VPC CloudFormation stack and begins with SageMakerVPC-SecurityGroup. Using the VPC ID from the aws fsx describe-file-systems output, you can find your SECURITY_GROUP with the CLI command

aws ec2 describe-security-groups \
    --filters 'Name=group-name,Values="SageMakerVPC-SecurityGroup*"' \
              'Name=vpc-id,Values=vpc-0123456789012345a'
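
To set the variables used by vpc-config.json programmatically, something like the following works (a sketch assuming a single FSx file system and the SageMakerVPC-SecurityGroup* naming above):

# subnet and VPC of the FSx for Lustre volume
SUBNET_ID=$(aws fsx describe-file-systems --query 'FileSystems[0].SubnetIds[0]' --output text)
VPC_ID=$(aws fsx describe-file-systems --query 'FileSystems[0].VpcId' --output text)
# security group created by the VPC stack, scoped to that VPC
SECURITY_GROUP=$(aws ec2 describe-security-groups \
    --filters "Name=group-name,Values=SageMakerVPC-SecurityGroup*" "Name=vpc-id,Values=${VPC_ID}" \
    --query 'SecurityGroups[0].GroupId' --output text)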

3.3 Launch a new cluster

Now that everything is in place, we can launch our cluster by running the following command from the 5.sagemaker-hyperpod directory.

aws sagemaker create-cluster \
    --cluster-name ml-cluster \
    --instance-groups file://cluster-config.json \
    --region us-west-2 \
    --vpc-config file://vpc-config.json

You can see the current state of the cluster with

aws sagemaker describe-cluster --cluster-name ml-cluster --region us-west-2
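
If you just want the status field, you can query it directly:

aws sagemaker describe-cluster --cluster-name ml-cluster --region us-west-2 \
    --query 'ClusterStatus' --output text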

Or list all your HyperPod clusters with

aws sagemaker list-clusters

You can see information on all your HyperPod cluster nodes with

aws sagemaker list-cluster-nodes --cluster-name ml-cluster --region us-west-2

3.4 SSH into your HyperPod cluster

To log into your HyperPod cluster, you need the cluster ID (from the cluster ARN), the instance ID of your controller node, and the instance group name of your controller group. You can find your HyperPod cluster ID with

aws sagemaker describe-cluster --cluster-name ml-cluster --region us-west-2

{
    "ClusterArn": "arn:aws:sagemaker:us-west-2:123456789012:cluster/2hd31rmi9mde",
    "ClusterName": "ml-cluster",

In this case, the cluster ID is 2hd31rmi9mde.
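
Since the cluster ID is the final path component of the ARN, you can also capture it in one step (a sketch using awk):

CLUSTER_ID=$(aws sagemaker describe-cluster --cluster-name ml-cluster --region us-west-2 \
    --query 'ClusterArn' --output text | awk -F/ '{print $NF}')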

Get your controller machine instance ID with

aws sagemaker list-cluster-nodes --cluster-name ml-cluster --region us-west-2

{
    "NextToken": "",
    "ClusterNodeSummaries": [
        {
            "InstanceGroupName": "controller-machine",
            "InstanceId": "i-09e7576cbc230c181",
            "InstanceType": "ml.c5.xlarge",
            "LaunchTime": "2023-11-26T15:28:20.665000-08:00",
            "InstanceStatus": {
                "Status": "Running",
                "Message": ""
            }
        },
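
Alternatively, you can filter the controller's instance ID out of that response with a --query expression (assuming the controller-machine group name used above):

INSTANCE_ID=$(aws sagemaker list-cluster-nodes --cluster-name ml-cluster --region us-west-2 \
    --query "ClusterNodeSummaries[?InstanceGroupName=='controller-machine'].InstanceId" --output text)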

And login with

CLUSTER_ID=2hd31rmi9mde
CONTROLLER_GROUP=controller-machine
INSTANCE_ID=i-09e7576cbc230c181
TARGET_ID=sagemaker-cluster:${CLUSTER_ID}_${CONTROLLER_GROUP}-${INSTANCE_ID}
aws ssm start-session --target $TARGET_ID

To make this process easier, we've included an easy-ssh.sh script that takes your HyperPod cluster name and logs you in.

./easy-ssh.sh ml-cluster

If you used the base-config lifecycle scripts, you should be able to log in as ubuntu and run Slurm commands.

sudo su ubuntu
sinfo

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
dev*         up   infinite      4   idle ip-10-1-4-190,ip-10-1-5-138,ip-10-1-18-53,ip-10-1-20-15
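
As a quick sanity check that all compute nodes accept jobs (assuming the four-node dev partition above), run a trivial command across them:

# should print one hostname per compute node
srun -N 4 hostname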

You'll also find your FSx for Lustre volume mounted at /fsx.

3.5 Runtime validation before running workloads

We've included a runtime validation script, hyperpod-precheck.py, which checks the runtime environment before you run any production workloads.

To run the script on multiple nodes at once, use srun:

# this runs on 8 nodes
srun -N 8 python3 hyperpod-precheck.py

Follow the mitigations listed in this table if one of the checks fails:

| Test | Description | Failure mitigation |
| ---- | ----------- | ------------------ |
| check_if_docker_installed | The lifecycle scripts ensure that Docker is installed on all nodes; this checks that Docker is available on all compute nodes. | Run the lifecycle scripts manually: cd /tmp/sagemaker-lifecycle-* && cd src/utils/ && srun -N <number of nodes> bash install_docker.sh |
| check_enroot_runtime_path | Makes sure ENROOT_RUNTIME_PATH points to the right directory. | Follow the steps in the workshop (Cluster Setup > Docker Setup > Enroot). |
| check_docker_data_root | The Docker data root should be at /opt/dlami/nvme/data-root. | Run the lifecycle scripts manually: cd /tmp/sagemaker-lifecycle-* && cd src/utils/ && srun -N <number of nodes> bash install_docker.sh |
| check_if_fsx_mounted | df -h should show /fsx as mounted. | Speak to AWS; the provisioning parameters ensure this, so if /fsx is not mounted the issue needs investigation. |
| check_if_pyxis_installed | Pyxis is a container plugin for Slurm; it should be installed by default through the lifecycle scripts when the cluster is provisioned. | Run the lifecycle scripts manually: cd /tmp/sagemaker-lifecycle-* && cd src/utils/ && srun -N <number of nodes> bash install_enroot_pyxis.sh |
| check_slurmd_service_status | Checks that slurmd is running on all compute instances. | Slurm can sometimes fail due to an underlying error. If this check fails, SSH into the affected host and run sudo systemctl status slurmd to find the reason, then restart it with sudo systemctl start slurmd. If it fails again, check sudo journalctl -xe to see what went wrong. |
| check_if_user_directory_on_fsx | Checks that users share the /fsx file system mount. | The multi-user setup creates home directories under /fsx/; follow those steps here. If the user directory doesn't exist on nodes that have been replaced, run a variant of srun -N 2 usermod -d /fsx/ubuntu ubuntu for your nodes (replace ubuntu with the username). |
| nvidia_cli_installed | The NVIDIA Container CLI is installed via the Docker lifecycle scripts; it's unlikely this will be an issue. | Go to this page and look for the command that runs the nvidia-container-cli installation. Create a script from those steps and use sbatch or srun to execute it across all compute nodes. |

You can also validate the scripts you plan to launch. This ensures you're not using unsupported operations in them.

# Run a check on a specific sbatch script that launches training
python3 hyperpod-precheck.py -f ../../3.test_cases/1.megatron-lm/2.distributed-training.sbatch

3.6 Patching your HyperPod cluster

Run update-cluster-software to update existing HyperPod clusters with software and security patches provided by the SageMaker HyperPod service. For more details, see Update the SageMaker HyperPod platform software of a cluster in the Amazon SageMaker Developer Guide.

aws sagemaker update-cluster-software --cluster-name ml-cluster --region us-west-2

Note that this API replaces the instance root volume and cleans up the data on it. Back up your work before running it. We've included a script, patching-backup.sh, that can back up and restore the data via Amazon S3.

# back up data to an S3 bucket before patching
sudo bash patching-backup.sh --create <s3-backup-bucket-path>
# restore data from an S3 bucket after patching
sudo bash patching-backup.sh --restore <s3-backup-bucket-path>

3.7 Deleting your HyperPod cluster

When you're done with your HyperPod cluster, you can delete it with

aws sagemaker delete-cluster --cluster-name ml-cluster --region us-west-2

Your FSx for Lustre volume will retain anything saved to it, and can be reattached to a future cluster.