Source: examples/aws_efa
Using Elastic Fabric Adapter (EFA) on AWS with SkyPilot#
Elastic Fabric Adapter (EFA) is AWS's alternative to NVIDIA InfiniBand, providing high-bandwidth, low-latency inter-node communication. It is especially useful for distributed AI training and inference, which require high network bandwidth across nodes.
Using EFA on HyperPod/EKS with SkyPilot#
TL;DR: enable EFA with SkyPilot#
You can enable EFA on AWS HyperPod/EKS clusters with a simple additional setting in your SkyPilot YAML:
config:
  kubernetes:
    pod_config:
      spec:
        containers:
          - resources:
              limits:
                vpc.amazonaws.com/efa: 4
              requests:
                vpc.amazonaws.com/efa: 4
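If you want every task to request EFA devices instead of adding this to each task YAML, the same kubernetes.pod_config block can typically also be placed in SkyPilot's global config file (~/.sky/config.yaml, without the top-level config: key). This is a sketch; check the SkyPilot config documentation for the exact schema supported by your version:
# ~/.sky/config.yaml (sketch; verify against the SkyPilot config reference)
kubernetes:
  pod_config:
    spec:
      containers:
        - resources:
            limits:
              vpc.amazonaws.com/efa: 4
            requests:
              vpc.amazonaws.com/efa: 4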
Enable EFA with HyperPod/EKS#
On HyperPod (backed by EKS), EFA is enabled by default, and you don’t need to do anything.
On EKS, you may need to enable EFA by following the official AWS documentation.
To check if EFA is enabled, run:
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,INSTANCETYPE:.metadata.labels.node\.kubernetes\.io/instance-type,GPU:.status.allocatable.nvidia\.com/gpu,EFA:.status.allocatable.vpc\.amazonaws\.com/efa"
You should see output similar to:
NAME INSTANCETYPE GPU EFA
hyperpod-i-0beea7c849d1dc614 ml.p4d.24xlarge 8 4
hyperpod-i-0da69b9076c7ff6a4 ml.p4d.24xlarge 8 4
...
Access HyperPod and run distributed jobs with SkyPilot#
To access HyperPod and run distributed jobs with SkyPilot, see the SkyPilot HyperPod example.
Adding EFA configurations in SkyPilot YAML#
To enable EFA, add the following section to your SkyPilot YAML:
config:
  kubernetes:
    pod_config:
      spec:
        containers:
          - resources:
              limits:
                vpc.amazonaws.com/efa: 4
              requests:
                vpc.amazonaws.com/efa: 4
This section is important for EFA integration:
- config.kubernetes.pod_config: Provides Kubernetes-specific pod configuration
- spec.containers[0].resources: Defines resource requirements for the container
- limits.vpc.amazonaws.com/efa: 4: Limits the Pod to use 4 EFA devices
- requests.vpc.amazonaws.com/efa: 4: Requests 4 EFA devices for the Pod
The vpc.amazonaws.com/efa resource type is exposed by the AWS EFA device plugin in Kubernetes.
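If EFA does not show up in the nodes' allocatable resources, you can check whether the device plugin DaemonSet is running. The plugin is commonly deployed in kube-system, but the exact name and namespace depend on how it was installed, so a broad search is used here:
kubectl get daemonsets -A | grep -i efa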
To see how many EFA devices are available on each EFA-capable instance type, see the Network cards list in the Amazon EC2 User Guide.
The following table shows the GPU and EFA count mapping for AWS instance types:
| Instance Type | GPU Type | #EFA |
|---|---|---|
| p4d.24xlarge | A100:8 | 4 |
| p4de.24xlarge | A100:8 | 4 |
| p5.48xlarge | H100:8 | 32 |
| p5e.48xlarge | H200:8 | 32 |
| p5en.48xlarge | H200:8 | 16 |
| g5.8xlarge | A10G:1 | 1 |
| g5.12xlarge | A10G:4 | 1 |
| g5.16xlarge | A10G:1 | 1 |
| g5.24xlarge | A10G:4 | 1 |
| g5.48xlarge | A10G:8 | 1 |
| g4dn.8xlarge | T4:1 | 1 |
| g4dn.12xlarge | T4:4 | 1 |
| g4dn.16xlarge | T4:1 | 1 |
| g4dn.metal | T4:8 | 1 |
| g6.8xlarge | L4:1 | 1 |
| g6.12xlarge | L4:4 | 1 |
| g6.16xlarge | L4:1 | 1 |
| g6.24xlarge | L4:4 | 1 |
| g6.48xlarge | L4:8 | 1 |
| g6e.8xlarge | L40S:1 | 1 |
| g6e.12xlarge | L40S:4 | 1 |
| g6e.16xlarge | L40S:1 | 1 |
| g6e.24xlarge | L40S:4 | 2 |
| g6e.48xlarge | L40S:8 | 4 |
Update the EFA count in nccl_efa.yaml to match the instance type and GPUs you use.
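For example, if you run on p5.48xlarge nodes (H100:8 with 32 EFA devices per node, per the table above), the sketch below shows the adjusted pod_config; adapt the count to your instance type. You would also change accelerators: A100:8 to H100:8 in nccl_efa.yaml.
config:
  kubernetes:
    pod_config:
      spec:
        containers:
          - resources:
              limits:
                vpc.amazonaws.com/efa: 32
              requests:
                vpc.amazonaws.com/efa: 32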
Running NCCL test with EFA using SkyPilot#
Check nccl_efa.yaml for the complete SkyPilot cluster YAML configuration.
The image_id provides the environment setup for NCCL (NVIDIA Collective Communications Library) and EFA (Elastic Fabric Adapter).
To run the NCCL test with EFA support:
sky launch -c efa nccl_efa.yaml
SkyPilot will:
- Schedule the job on a Kubernetes cluster with EFA-enabled nodes
- Launch Pods with the required EFA devices
- Execute the NCCL performance test with EFA networking
- Output performance metrics showing the benefits of EFA for distributed training
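After the test finishes, you can view its output and tear down the cluster with standard SkyPilot commands (efa is the cluster name from the launch command above):
sky logs efa
sky down efa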
NOTE: You can turn off EFA in nccl_efa.yaml by passing an environment variable:
sky launch -c efa --env USE_EFA=false nccl_efa.yaml
Benchmark results#
We compare the performance with and without EFA using NCCL test reports on the same HyperPod cluster (2x p4d.24xlarge, i.e. 2xA100:8).
The Speed-up column is calculated as busbw EFA (GB/s) / busbw Non-EFA (GB/s).
| Message Size | busbw EFA (GB/s) | busbw Non-EFA (GB/s) | Speed-up |
|---|---|---|---|
| 8 B | 0 | 0 | - |
| 16 B | 0 | 0 | - |
| 32 B | 0 | 0 | - |
| 64 B | 0 | 0 | - |
| 128 B | 0 | 0 | - |
| 256 B | 0 | 0 | - |
| 512 B | 0.01 | 0.01 | 1 x |
| 1 KB | 0.01 | 0.01 | 1 x |
| 2 KB | 0.02 | 0.02 | 1 x |
| 4 KB | 0.04 | 0.05 | 0.8 x |
| 8 KB | 0.08 | 0.06 | 1.3 x |
| 16 KB | 0.14 | 0.06 | 2.3 x |
| 32 KB | 0.25 | 0.17 | 1.4 x |
| 64 KB | 0.49 | 0.23 | 2.1 x |
| 128 KB | 0.97 | 0.45 | 2.1 x |
| 256 KB | 1.86 | 0.68 | 2.7 x |
| 512 KB | 3.03 | 1.01 | 3 x |
| 1 MB | 4.61 | 1.65 | 2.8 x |
| 2 MB | 6.5 | 1.75 | 3.7 x |
| 4 MB | 8.91 | 2.39 | 3.7 x |
| 8 MB | 10.5 | 2.91 | 3.6 x |
| 16 MB | 19.03 | 3.22 | 5.9 x |
| 32 MB | 31.85 | 3.58 | 8.9 x |
| 64 MB | 44.37 | 3.85 | 11.5 x |
| 128 MB | 54.94 | 3.87 | 14.2 x |
| 256 MB | 65.46 | 3.94 | 16.6 x |
| 512 MB | 71.83 | 4.04 | 17.7 x |
| 1 GB | 75.34 | 4.08 | 18.4 x |
| 2 GB | 77.35 | 4.13 | 18.7 x |
What stands out#
| Range | Observation |
|---|---|
| ≤ 256 KB | Virtually no difference: bandwidth is dominated by software/latency overhead and doesn't reach the network's limits. |
| 512 KB – 16 MB | EFA gradually pulls ahead, hitting ~3–6x by a few MB. |
| ≥ 32 MB | The fabric really kicks in: ≥ 8x at 32 MB, climbing to ~18x for 1–2 GB messages. Non-EFA tops out around 4 GB/s, while EFA pushes ≈ 77 GB/s. |
EFA provides much higher throughput than the traditional TCP transport. Enabling it significantly improves inter-instance communication, which can speed up distributed AI training and inference.
Using EFA on AWS VM#
For the instance types listed in the GPU and EFA count mapping table in the Adding EFA configurations in SkyPilot YAML section, EFA can be enabled by setting resources.network_tier: best in the task YAML:
resources:
  network_tier: best
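For example, efa_vm.yaml (included below) combines this setting with an explicit instance type:
resources:
  infra: aws
  instance_type: p4d.24xlarge
  network_tier: best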
To run the NCCL test with EFA support on AWS VMs:
sky launch -c efa efa_vm.yaml
Check efa_vm.yaml for the complete SkyPilot cluster YAML configuration.
Benchmark results#
We compare the performance with and without EFA using NCCL test reports with the same resources (2x p4d.24xlarge, i.e. 2xA100:8).
The Speed-up 1 EFA vs 1 ENI column is calculated as busbw with 1 EFA Interface (GB/s) / busbw Non-EFA with 1 Network Interface (GB/s), and the Speed-up 4 EFA vs 1 ENI column is calculated as busbw with 4 EFA Interfaces (GB/s) / busbw Non-EFA with 1 Network Interface (GB/s).
| Message Size | busbw with 4 EFA Interfaces (GB/s) | busbw with 1 EFA Interface (GB/s) | busbw Non-EFA with 1 Network Interface (GB/s) | Speed-up 1 EFA vs 1 ENI | Speed-up 4 EFA vs 1 ENI |
|---|---|---|---|---|---|
| 8 B | 0 | 0 | 0 | - | - |
| 16 B | 0 | 0 | 0 | - | - |
| 32 B | 0 | 0 | 0 | - | - |
| 64 B | 0 | 0 | 0 | - | - |
| 128 B | 0 | 0 | 0 | - | - |
| 256 B | 0 | 0 | 0 | - | - |
| 512 B | 0.01 | 0.01 | 0.01 | 1 x | 1 x |
| 1 KB | 0.01 | 0.01 | 0.02 | 0.5 x | 0.5 x |
| 2 KB | 0.02 | 0.02 | 0.03 | 0.6 x | 0.6 x |
| 4 KB | 0.04 | 0.04 | 0.05 | 0.8 x | 0.8 x |
| 8 KB | 0.09 | 0.08 | 0.08 | 1.0 x | 1.1 x |
| 16 KB | 0.16 | 0.15 | 0.10 | 1.5 x | 1.6 x |
| 32 KB | 0.28 | 0.26 | 0.16 | 1.6 x | 1.7 x |
| 64 KB | 0.54 | 0.50 | 0.29 | 1.7 x | 1.8 x |
| 128 KB | 1.07 | 0.81 | 0.45 | 1.8 x | 2.4 x |
| 256 KB | 2.02 | 1.23 | 0.74 | 1.6 x | 2.7 x |
| 512 KB | 3.28 | 1.70 | 0.85 | 2.0 x | 3.8 x |
| 1 MB | 4.97 | 2.34 | 1.52 | 1.5 x | 3.2 x |
| 2 MB | 6.77 | 2.80 | 2.35 | 1.2 x | 2.9 x |
| 4 MB | 9.28 | 4.96 | 3.79 | 1.3 x | 2.4 x |
| 8 MB | 10.99 | 8.21 | 5.37 | 1.5 x | 2.0 x |
| 16 MB | 19.63 | 11.84 | 6.46 | 1.8 x | 3.0 x |
| 32 MB | 32.62 | 14.93 | 7.27 | 2.0 x | 4.5 x |
| 64 MB | 46.11 | 17.27 | 6.83 | 2.5 x | 6.7 x |
| 128 MB | 57.79 | 18.67 | 7.93 | 2.3 x | 7.3 x |
| 256 MB | 67.20 | 19.59 | 8.03 | 2.4 x | 8.3 x |
| 512 MB | 72.90 | 19.99 | 8.14 | 2.4 x | 8.9 x |
| 1 GB | 76.14 | 20.19 | 8.15 | 2.5 x | 9.3 x |
| 2 GB | 77.90 | 20.31 | 8.20 | 2.5 x | 9.5 x |
From the above benchmark results, we can see that:
- EFA brings little benefit for small messages, but gains grow with message size.
- Bandwidth scales near-linearly with multiple EFAs, reaching ~78 GB/s with 4 interfaces.
- On p4d.24xlarge (A100×8), 4 EFAs deliver up to ~9.5× higher bandwidth vs a single ENI.
So EFA is critical for scalable, high-throughput training workloads.
Included files#
efa_container.yaml
# This example is used to test the NCCL performance
# with EFA using the container runtime on an AWS VM.
name: nccl-efa-container

resources:
  infra: aws
  instance_type: p4d.24xlarge
  image_id: docker:public.ecr.aws/hpc-cloud/nccl-tests:latest
  network_tier: best

num_nodes: 2

run: |
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    echo "Head node"

    # Total number of processes, NP should be the total number of GPUs in the cluster
    NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

    # Append :${SKYPILOT_NUM_GPUS_PER_NODE} to each IP as slots
    nodes=""
    for ip in $SKYPILOT_NODE_IPS; do
      nodes="${nodes}${ip}:${SKYPILOT_NUM_GPUS_PER_NODE},"
    done
    nodes=${nodes::-1}
    echo "All nodes: ${nodes}"

    /opt/amazon/openmpi/bin/mpirun \
      --allow-run-as-root \
      --tag-output \
      -H $nodes \
      -np $NP \
      -N $SKYPILOT_NUM_GPUS_PER_NODE \
      --bind-to none \
      -x LD_LIBRARY_PATH \
      -x NCCL_DEBUG=INFO \
      --mca pml ^cm,ucx \
      --mca btl tcp,self \
      --mca btl_tcp_if_exclude lo,docker0,veth_def_agent \
      --mca plm_rsh_args "-p 10022" \
      /opt/nccl-tests/build/all_reduce_perf \
      -b 8 \
      -e 2G \
      -f 2 \
      -g 1 \
      -c 5 \
      -w 5 \
      -n 100
  else
    echo "Worker nodes"
  fi
efa_vm.yaml
# This example is used to test the NCCL performance
# with EFA on AWS VMs.
name: nccl-efa-vm

resources:
  infra: aws
  instance_type: p4d.24xlarge
  network_tier: best

num_nodes: 2

run: |
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    echo "Head node"

    # Total number of processes, NP should be the total number of GPUs in the cluster
    NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

    # Append :${SKYPILOT_NUM_GPUS_PER_NODE} to each IP as slots
    nodes=""
    for ip in $SKYPILOT_NODE_IPS; do
      nodes="${nodes}${ip}:${SKYPILOT_NUM_GPUS_PER_NODE},"
    done
    nodes=${nodes::-1}
    echo "All nodes: ${nodes}"

    /opt/amazon/openmpi/bin/mpirun \
      --allow-run-as-root \
      --tag-output \
      -H $nodes \
      -np $NP \
      -N $SKYPILOT_NUM_GPUS_PER_NODE \
      --bind-to none \
      -x LD_LIBRARY_PATH \
      -x NCCL_DEBUG=INFO \
      --mca pml ^cm,ucx \
      --mca btl tcp,self \
      --mca btl_tcp_if_exclude lo,docker0,veth_def_agent \
      /usr/local/cuda/efa/test-cuda-12.8/all_reduce_perf \
      -b 8 \
      -e 2G \
      -f 2 \
      -g 1 \
      -c 5 \
      -w 5 \
      -n 100
  else
    echo "Worker nodes"
  fi
nccl_efa.yaml
# This example is used to test the NCCL performance
# with EFA on HyperPod/EKS.
name: nccl-efa-eks

resources:
  infra: k8s
  accelerators: A100:8
  cpus: 90+
  image_id: docker:public.ecr.aws/hpc-cloud/nccl-tests:latest

num_nodes: 2

envs:
  USE_EFA: "true"

run: |
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    echo "Head node"

    # Total number of processes, NP should be the total number of GPUs in the cluster
    NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

    # Append :${SKYPILOT_NUM_GPUS_PER_NODE} to each IP as slots
    nodes=""
    for ip in $SKYPILOT_NODE_IPS; do
      nodes="${nodes}${ip}:${SKYPILOT_NUM_GPUS_PER_NODE},"
    done
    nodes=${nodes::-1}
    echo "All nodes: ${nodes}"

    # Set environment variables
    export PATH=$PATH:/usr/local/cuda-12.2/bin:/opt/amazon/efa/bin:/usr/bin
    export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:/opt/amazon/openmpi/lib:/opt/nccl/build/lib:/opt/amazon/efa/lib:/opt/aws-ofi-nccl/install/lib:/usr/local/nvidia/lib:$LD_LIBRARY_PATH
    export NCCL_HOME=/opt/nccl
    export CUDA_HOME=/usr/local/cuda-12.2
    export NCCL_DEBUG=INFO
    export NCCL_BUFFSIZE=8388608
    export NCCL_P2P_NET_CHUNKSIZE=524288
    export NCCL_TUNER_PLUGIN=/opt/aws-ofi-nccl/install/lib/libnccl-ofi-tuner.so

    if [ "${USE_EFA}" == "true" ]; then
      export FI_PROVIDER="efa"
    else
      export FI_PROVIDER=""
    fi

    /opt/amazon/openmpi/bin/mpirun \
      --allow-run-as-root \
      --tag-output \
      -H $nodes \
      -np $NP \
      -N $SKYPILOT_NUM_GPUS_PER_NODE \
      --bind-to none \
      -x FI_PROVIDER \
      -x PATH \
      -x LD_LIBRARY_PATH \
      -x NCCL_DEBUG=INFO \
      -x NCCL_BUFFSIZE \
      -x NCCL_P2P_NET_CHUNKSIZE \
      -x NCCL_TUNER_PLUGIN \
      --mca pml ^cm,ucx \
      --mca btl tcp,self \
      --mca btl_tcp_if_exclude lo,docker0,veth_def_agent \
      /opt/nccl-tests/build/all_reduce_perf \
      -b 8 \
      -e 2G \
      -f 2 \
      -g 1 \
      -c 5 \
      -w 5 \
      -n 100
  else
    echo "Worker nodes"
  fi

config:
  kubernetes:
    pod_config:
      spec:
        containers:
          - resources:
              limits:
                vpc.amazonaws.com/efa: 4
              requests:
                vpc.amazonaws.com/efa: 4