Source: examples/aws_efa
Using AWS Elastic Fabric Adapter (EFA) on HyperPod/EKS with SkyPilot#
Elastic Fabric Adapter (EFA) is AWS's alternative to NVIDIA InfiniBand that enables high-bandwidth, low-latency inter-node communication. It is particularly useful for distributed AI training and inference, which require high network bandwidth across nodes.
TL;DR: enable EFA with SkyPilot#
You can enable EFA on AWS HyperPod/EKS clusters with a simple additional setting in your SkyPilot YAML:
config:
  kubernetes:
    pod_config:
      spec:
        containers:
          - resources:
              limits:
                vpc.amazonaws.com/efa: 4
              requests:
                vpc.amazonaws.com/efa: 4
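The same snippet can also apply to every launch on the cluster instead of a single task by placing it in your SkyPilot config file; a minimal sketch, assuming the default config path ~/.sky/config.yaml (there is no top-level config: key in that file):

kubernetes:
  pod_config:
    spec:
      containers:
        - resources:
            limits:
              vpc.amazonaws.com/efa: 4
            requests:
              vpc.amazonaws.com/efa: 4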
Enable EFA with HyperPod/EKS#
On HyperPod (backed by EKS), EFA is enabled by default, and you don’t need to do anything.
On EKS, you may need to enable EFA by following the official AWS documentation.
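For instance, the EFA Kubernetes device plugin is commonly installed as a DaemonSet via Helm; a minimal sketch, assuming the eks-charts repository and a kube-system install (defer to the AWS documentation above for the authoritative steps):

# Install the AWS EFA Kubernetes device plugin (exposes vpc.amazonaws.com/efa)
helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm install aws-efa-k8s-device-plugin eks/aws-efa-k8s-device-plugin --namespace kube-system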
To check if EFA is enabled, run:
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,INSTANCETYPE:.metadata.labels.node\.kubernetes\.io/instance-type,GPU:.status.allocatable.nvidia\.com/gpu,EFA:.status.allocatable.vpc\.amazonaws\.com/efa"
You should see output similar to:
NAME INSTANCETYPE GPU EFA
hyperpod-i-0beea7c849d1dc614 ml.p4d.24xlarge 8 4
hyperpod-i-0da69b9076c7ff6a4 ml.p4d.24xlarge 8 4
...
Access HyperPod and run distributed jobs with SkyPilot#
To access HyperPod and run distributed jobs with SkyPilot, see the SkyPilot HyperPod example.
Adding EFA configurations in SkyPilot YAML#
To enable EFA, add the following section to your SkyPilot YAML:
config:
  kubernetes:
    pod_config:
      spec:
        containers:
          - resources:
              limits:
                vpc.amazonaws.com/efa: 4
              requests:
                vpc.amazonaws.com/efa: 4
This section is important for EFA integration:
- config.kubernetes.pod_config: Provides Kubernetes-specific pod configuration
- spec.containers[0].resources: Defines resource requirements
- limits.vpc.amazonaws.com/efa: 4: Limits the Pod to use 4 EFA devices
- requests.vpc.amazonaws.com/efa: 4: Requests 4 EFA devices for the Pod
The vpc.amazonaws.com/efa resource type is exposed by the AWS EFA device plugin in Kubernetes.
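If the EFA column from the kubectl command above shows <none>, you can check whether the device plugin is running; a quick sketch, assuming it was installed into kube-system (the DaemonSet name may differ in your cluster):

kubectl get daemonset --namespace kube-system | grep -i efa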
To see how many EFA interfaces are available on each instance type that supports EFA, see the Network cards list in the Amazon EC2 User Guide.
Check the following table for the GPU and EFA count mapping for AWS instance types:
| Instance Type | GPU Type | #EFA |
|---|---|---|
| p4d.24xlarge | A100:8 | 4 |
| p4de.24xlarge | A100:8 | 4 |
| p5.48xlarge | H100:8 | 32 |
| p5e.48xlarge | H200:8 | 32 |
| p5en.48xlarge | H200:8 | 16 |
| g5.8xlarge | A10G:1 | 1 |
| g5.12xlarge | A10G:4 | 1 |
| g5.16xlarge | A10G:1 | 1 |
| g5.24xlarge | A10G:4 | 1 |
| g5.48xlarge | A10G:8 | 1 |
| g4dn.8xlarge | T4:1 | 1 |
| g4dn.12xlarge | T4:4 | 1 |
| g4dn.16xlarge | T4:1 | 1 |
| g4dn.metal | T4:8 | 1 |
| g6.8xlarge | L4:1 | 1 |
| g6.12xlarge | L4:4 | 1 |
| g6.16xlarge | L4:1 | 1 |
| g6.24xlarge | L4:4 | 1 |
| g6.48xlarge | L4:8 | 1 |
| g6e.8xlarge | L40S:1 | 1 |
| g6e.12xlarge | L40S:4 | 1 |
| g6e.16xlarge | L40S:1 | 1 |
| g6e.24xlarge | L40S:4 | 2 |
| g6e.48xlarge | L40S:8 | 4 |
Update the EFA count in nccl_efa.yaml to match the instance type and GPUs you use.
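For example, switching the included nccl_efa.yaml from p4d.24xlarge (A100:8, 4 EFA) to p5.48xlarge (H100:8, 32 EFA) would roughly mean the following changes, per the table above:

resources:
  cloud: kubernetes
  accelerators: H100:8
config:
  kubernetes:
    pod_config:
      spec:
        containers:
          - resources:
              limits:
                vpc.amazonaws.com/efa: 32
              requests:
                vpc.amazonaws.com/efa: 32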
Running NCCL test with EFA using SkyPilot#
See nccl_efa.yaml for the complete SkyPilot YAML configuration. The image_id provides an environment with NCCL (NVIDIA Collective Communications Library) and EFA (Elastic Fabric Adapter) support pre-installed.
To run the NCCL test with EFA support:
sky launch -c efa nccl_efa.yaml
SkyPilot will:
- Schedule the job on a Kubernetes cluster with EFA-enabled nodes
- Launch Pods with the required EFA devices
- Execute the NCCL performance test with EFA networking
- Output performance metrics showing the benefits of EFA for distributed training
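Once launched, you can follow the test output and manage the cluster with standard SkyPilot commands, for example:

# Stream the job output (the all_reduce_perf results)
sky logs efa
# Check the cluster's status
sky status
# Tear the cluster down when finished
sky down efa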
NOTE: You can turn off EFA in nccl_efa.yaml by passing an env:
sky launch -c efa --env USE_EFA=false nccl_efa.yaml
Benchmark results#
We compare the performance with and without EFA using NCCL test reports on the same HyperPod cluster (2x p4d.24xlarge, i.e., 2x A100:8). The Speed-up column is calculated as busbw EFA (GB/s) / busbw Non-EFA (GB/s); for example, at 2 GB, 77.35 / 4.13 ≈ 18.7x.
| Message Size | busbw EFA (GB/s) | busbw Non-EFA (GB/s) | Speed-up |
|---|---|---|---|
| 8 B | 0 | 0 | - |
| 16 B | 0 | 0 | - |
| 32 B | 0 | 0 | - |
| 64 B | 0 | 0 | - |
| 128 B | 0 | 0 | - |
| 256 B | 0 | 0 | - |
| 512 B | 0.01 | 0.01 | 1x |
| 1 KB | 0.01 | 0.01 | 1x |
| 2 KB | 0.02 | 0.02 | 1x |
| 4 KB | 0.04 | 0.05 | 0.8x |
| 8 KB | 0.08 | 0.06 | 1.3x |
| 16 KB | 0.14 | 0.06 | 2.3x |
| 32 KB | 0.25 | 0.17 | 1.4x |
| 64 KB | 0.49 | 0.23 | 2.1x |
| 128 KB | 0.97 | 0.45 | 2.1x |
| 256 KB | 1.86 | 0.68 | 2.7x |
| 512 KB | 3.03 | 1.01 | 3x |
| 1 MB | 4.61 | 1.65 | 2.8x |
| 2 MB | 6.5 | 1.75 | 3.7x |
| 4 MB | 8.91 | 2.39 | 3.7x |
| 8 MB | 10.5 | 2.91 | 3.6x |
| 16 MB | 19.03 | 3.22 | 5.9x |
| 32 MB | 31.85 | 3.58 | 8.9x |
| 64 MB | 44.37 | 3.85 | 11.5x |
| 128 MB | 54.94 | 3.87 | 14.2x |
| 256 MB | 65.46 | 3.94 | 16.6x |
| 512 MB | 71.83 | 4.04 | 17.7x |
| 1 GB | 75.34 | 4.08 | 18.4x |
| 2 GB | 77.35 | 4.13 | 18.7x |
What stands out#
| Range | Observation |
|---|---|
| ≤ 256 KB | Virtually no difference: bandwidth is dominated by software/latency overhead and doesn't reach the network's limits. |
| 512 KB – 16 MB | EFA gradually pulls ahead, hitting ~3–6x by a few MB. |
| ≥ 32 MB | The fabric really kicks in: ≥ 8x at 32 MB, climbing to ~18x for 1–2 GB messages. Non-EFA tops out around 4 GB/s, while EFA pushes ≈ 77 GB/s. |
EFA provides much higher throughput than the traditional TCP transport. Enabling it significantly improves inter-instance communication, which can speed up distributed AI training and inference.
Included files#
nccl_efa.yaml
name: nccl-test-efa

resources:
  cloud: kubernetes
  accelerators: A100:8
  cpus: 90+
  image_id: docker:public.ecr.aws/hpc-cloud/nccl-tests:latest

num_nodes: 2

envs:
  USE_EFA: "true"

run: |
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    echo "Head node"

    # Total number of processes, NP should be the total number of GPUs in the cluster
    NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

    # Append :${SKYPILOT_NUM_GPUS_PER_NODE} to each IP as slots
    nodes=""
    for ip in $SKYPILOT_NODE_IPS; do
      nodes="${nodes}${ip}:${SKYPILOT_NUM_GPUS_PER_NODE},"
    done
    nodes=${nodes::-1}
    echo "All nodes: ${nodes}"

    # Set environment variables
    export PATH=$PATH:/usr/local/cuda-12.2/bin:/opt/amazon/efa/bin:/usr/bin
    export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:/opt/amazon/openmpi/lib:/opt/nccl/build/lib:/opt/amazon/efa/lib:/opt/aws-ofi-nccl/install/lib:/usr/local/nvidia/lib:$LD_LIBRARY_PATH
    export NCCL_HOME=/opt/nccl
    export CUDA_HOME=/usr/local/cuda-12.2
    export NCCL_DEBUG=INFO
    export NCCL_BUFFSIZE=8388608
    export NCCL_P2P_NET_CHUNKSIZE=524288
    export NCCL_TUNER_PLUGIN=/opt/aws-ofi-nccl/install/lib/libnccl-ofi-tuner.so

    if [ "${USE_EFA}" == "true" ]; then
      export FI_PROVIDER="efa"
    else
      export FI_PROVIDER=""
    fi

    /opt/amazon/openmpi/bin/mpirun \
      --allow-run-as-root \
      --tag-output \
      -H $nodes \
      -np $NP \
      -N $SKYPILOT_NUM_GPUS_PER_NODE \
      --bind-to none \
      -x FI_PROVIDER \
      -x PATH \
      -x LD_LIBRARY_PATH \
      -x NCCL_DEBUG=INFO \
      -x NCCL_BUFFSIZE \
      -x NCCL_P2P_NET_CHUNKSIZE \
      -x NCCL_TUNER_PLUGIN \
      --mca pml ^cm,ucx \
      --mca btl tcp,self \
      --mca btl_tcp_if_exclude lo,docker0,veth_def_agent \
      /opt/nccl-tests/build/all_reduce_perf \
      -b 8 \
      -e 2G \
      -f 2 \
      -g 1 \
      -c 5 \
      -w 5 \
      -n 100
  else
    echo "Worker nodes"
  fi

config:
  kubernetes:
    pod_config:
      spec:
        containers:
          - resources:
              limits:
                vpc.amazonaws.com/efa: 4
              requests:
                vpc.amazonaws.com/efa: 4