Amazon EKS introduces dynamic resource allocation for Elastic Fabric Adapter

Amazon EKS now supports Dynamic Resource Allocation for Elastic Fabric Adapter, enhancing inter-node communication for AI, machine learning, and HPC workloads. The new EFA DRA driver is recommended for new deployments on Kubernetes version 1.34 or later.

Amazon Elastic Kubernetes Service (Amazon EKS) now supports Dynamic Resource Allocation (DRA) for Elastic Fabric Adapter (EFA), making it easier to set up high-performance inter-node communication and Remote Direct Memory Access (RDMA) for artificial intelligence, machine learning, and High Performance Computing (HPC) workloads. The capability is delivered through the EFA DRA driver, which is based on the upstream DRANET project and enables EFA interface sharing and topology-aware allocation for workloads running on Kubernetes.

The EFA DRA driver can allocate EFA interfaces and accelerator devices that share the same PCIe root or device group. As a result, inter-node traffic is routed through the network interface nearest to each NVIDIA GPU, AWS Trainium, or AWS Inferentia device on the node. The driver also lets multiple workloads on the same node share EFA interfaces, which improves interface utilization.
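To make the idea of topology-aware allocation concrete, the following is a minimal, illustrative sketch in Python. It is not the EFA DRA driver's implementation or API. The `Device` type, the `pair_by_pcie_root` function, and the device names and PCIe root identifiers are all hypothetical, chosen only to show how accelerators might be matched to an EFA interface on the same PCIe root, and how one interface can serve more than one accelerator.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Toy model of a device on a node (all fields are hypothetical)."""
    name: str       # e.g. "gpu0" or "efa0"
    kind: str       # "accelerator" or "efa"
    pcie_root: str  # identifier of the PCIe root the device sits under


def pair_by_pcie_root(devices):
    """Map each accelerator to an EFA interface on the same PCIe root."""
    # Index EFA interfaces by the PCIe root they are attached to.
    efa_by_root = defaultdict(list)
    for dev in devices:
        if dev.kind == "efa":
            efa_by_root[dev.pcie_root].append(dev.name)

    pairs = {}
    for dev in devices:
        if dev.kind != "accelerator":
            continue
        candidates = efa_by_root.get(dev.pcie_root)
        if not candidates:
            raise ValueError(
                f"no EFA interface shares a PCIe root with {dev.name}"
            )
        # Interfaces may be shared, so several accelerators under the
        # same root can map to the same EFA interface.
        pairs[dev.name] = sorted(candidates)[0]
    return pairs


# Hypothetical node inventory: two PCIe roots, three accelerators, two EFAs.
NODE_DEVICES = [
    Device("gpu0", "accelerator", "pci0000:10"),
    Device("gpu1", "accelerator", "pci0000:10"),
    Device("gpu2", "accelerator", "pci0000:20"),
    Device("efa0", "efa", "pci0000:10"),
    Device("efa1", "efa", "pci0000:20"),
]

PAIRS = pair_by_pcie_root(NODE_DEVICES)
```

In this toy inventory, `gpu0` and `gpu1` sit under the same PCIe root as `efa0` and therefore both map to it, while `gpu2` maps to `efa1`. In a real cluster the matching is performed by the Kubernetes scheduler and the driver rather than by application code.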

The EFA DRA driver is recommended for new deployments on Amazon EKS clusters running Kubernetes version 1.34 or later, whether they use EKS managed node groups or self-managed nodes. It is available in all AWS Regions where Amazon EKS is offered. The existing EFA device plugin continues to be supported and is the suggested option for use with Karpenter and Amazon EKS Auto Mode.

For more information, see the section on managing EFA devices on Amazon EKS in the Amazon EKS User Guide.