Amazon SageMaker HyperPod introduces continuous provisioning for Slurm clusters
Amazon SageMaker HyperPod now supports continuous provisioning for Slurm-orchestrated clusters, enhancing flexibility and efficiency for large-scale AI/ML training workloads. This feature allows seamless scaling and immediate job initiation, reducing delays and manual intervention.
Amazon SageMaker HyperPod now supports continuous provisioning for clusters managed by the Slurm orchestrator, giving enterprise customers more flexibility and efficiency when running large-scale AI/ML training workloads. Customers running Slurm-based clusters need to start training quickly, scale seamlessly, perform maintenance without interrupting operations, and have detailed visibility into cluster activity. Previously, if any instance group could not be fully provisioned, the entire cluster creation or scaling operation failed, which required manual intervention and caused delays.
With continuous provisioning for Slurm, SageMaker HyperPod provisions any remaining capacity automatically in the background, so training jobs can begin immediately on the instances that are already available. Provisioning is prioritized: HyperPod launches the Slurm controller node first, then launches login and worker nodes in parallel, so the cluster becomes operational quickly. HyperPod also retries failed node launches asynchronously and adds nodes to the Slurm cluster as they become available, so clusters reach their desired scale without manual effort.
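The ordering and retry behavior described above can be illustrated with a small toy simulation. This is only a conceptual sketch, not HyperPod's implementation: the node names, the failures_before_success counts, and the helper functions launch_node and provision are all hypothetical, and real launches are replaced by a simple yield to the event loop.

```python
import asyncio


async def launch_node(name, failures_before_success, joined):
    """Simulate launching one node, retrying until the launch succeeds.

    failures_before_success is a made-up knob that stands in for
    temporary capacity shortages.
    """
    attempt = 0
    while True:
        attempt += 1
        await asyncio.sleep(0)  # yield control; stands in for launch latency
        if attempt > failures_before_success:
            joined.append(name)  # node joins the cluster once it is available
            return attempt
        # Launch failed: loop and retry without blocking other nodes.


async def provision(nodes):
    """Launch the controller first, then login and worker nodes in parallel."""
    joined = []
    # Step 1: the controller node is launched before anything else.
    for name, fails in nodes["controller"]:
        await launch_node(name, fails, joined)
    # Step 2: login and worker nodes are launched concurrently.
    rest = nodes["login"] + nodes["worker"]
    await asyncio.gather(*(launch_node(n, f, joined) for n, f in rest))
    return joined


# Hypothetical cluster layout: (node name, failed launches before success).
EXAMPLE_NODES = {
    "controller": [("controller-1", 0)],
    "login": [("login-1", 0)],
    "worker": [("worker-1", 0), ("worker-2", 2)],
}

if __name__ == "__main__":
    print(asyncio.run(provision(EXAMPLE_NODES)))
```

In this toy model the controller always appears first in the joined list, and a worker whose launch fails a few times still joins eventually while the other nodes are already usable.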
Scaling operations are also concurrent and non-blocking across instance groups, so a capacity shortage in one group does not hold up scaling in the others. Together, these improvements help customers reduce the time to start training, use resources more efficiently, and focus on innovation rather than infrastructure management.
This feature is available for new SageMaker HyperPod clusters using the Slurm orchestrator. To enable continuous provisioning, set the NodeProvisioningMode parameter to “Continuous” when creating a new HyperPod cluster with the CreateCluster API. Continuous provisioning can also be enabled when creating new clusters through the AWS CLI and the SageMaker AI console. This feature is available in all AWS Regions where Amazon SageMaker HyperPod is supported. For more details on continuous provisioning for Slurm clusters, refer to the Amazon SageMaker HyperPod User Guide.
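As a rough illustration, a CreateCluster request with continuous provisioning enabled might be assembled as follows. This is a minimal sketch under stated assumptions: the cluster name, instance group values, S3 URI, script name, and IAM role ARN are placeholders, the helper build_create_cluster_request is hypothetical, and the exact set of required fields should be checked against the CreateCluster API reference. The sketch only builds the request dictionary; the actual call is shown in a comment because it needs AWS credentials and an SDK version that accepts NodeProvisioningMode.

```python
def build_create_cluster_request(cluster_name, instance_groups):
    """Build keyword arguments for a CreateCluster call (hypothetical helper)."""
    return {
        "ClusterName": cluster_name,
        "InstanceGroups": instance_groups,
        # The setting named in the announcement.
        "NodeProvisioningMode": "Continuous",
    }


# Placeholder instance group; every value here is an assumption for illustration.
example_groups = [
    {
        "InstanceGroupName": "worker-group",
        "InstanceType": "ml.g5.8xlarge",
        "InstanceCount": 4,
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://example-bucket/lifecycle/",
            "OnCreate": "on_create.sh",
        },
        "ExecutionRole": "arn:aws:iam::111122223333:role/ExampleHyperPodRole",
    }
]

request = build_create_cluster_request("example-slurm-cluster", example_groups)

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("sagemaker").create_cluster(**request)
```

Keeping the request construction separate from the API call makes it easy to inspect or unit test the payload before any capacity is requested.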