Source: examples/gcp_gpu_direct_tcpx
Using GCP GPUDirect-TCPX on A3 VM with SkyPilot#
GPUDirect-TCPX is a high-performance networking technology that enables direct communication between GPUs and network interfaces. By bypassing the CPU and system memory, it significantly enhances network performance for A3 VMs, particularly for large data transfers.
When deploying `a3-highgpu-8g` or `a3-edgegpu-8g` VMs, combining GPUDirect-TCPX with Google Virtual NIC (gVNIC) delivers optimal network performance with minimal latency between applications and the network infrastructure.
This example demonstrates how to run NCCL tests on a GCP cluster with `a3-highgpu-8g` VMs, comparing performance with and without GPUDirect-TCPX enabled.
TL;DR: enable GPUDirect-TCPX with SkyPilot#
Enable GPUDirect-TCPX on GCP clusters with `a3-highgpu-8g` or `a3-edgegpu-8g` VMs by adding a single configuration parameter to your SkyPilot YAML:
```yaml
config:
  gcp:
    enable_gpu_direct: true
```
With `enable_gpu_direct: true`, SkyPilot automatically:
- Creates a dedicated network infrastructure:
  - 1 management VPC
  - 4 data VPCs
  - Corresponding subnets for each VPC
- Provisions VMs with GPUDirect-TCPX support:
  - Launches VMs with the specified instance type
  - Uses GPUDirect-TCPX-compatible images
  - Installs necessary GPU drivers
  - Deploys the GPUDirect-TCPX Receive Data Path Manager service
  - Configures NVIDIA Collective Communications Library (NCCL) and the GPUDirect-TCPX plugin
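To sanity-check the provisioned networking, you can list the VPC networks and subnets in your GCP project after the cluster comes up. This is an optional verification step, not part of the example, and the exact names of the auto-created VPCs depend on your SkyPilot version:

```bash
# List all VPC networks in the current project; the management VPC and
# the four data VPCs created for GPUDirect-TCPX should appear here.
gcloud compute networks list

# List the subnets associated with those VPCs.
gcloud compute networks subnets list
```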
Running NCCL Tests with GPUDirect-TCPX#
The complete configuration is available in gpu_direct_tcpx.yaml. The configuration includes:
- `image_id`: pre-configured environment for NCCL testing
- `instance_type`: set to `a3-highgpu-8g`
To run the NCCL test with GPUDirect-TCPX:
```bash
sky launch -c tcpx gpu_direct_tcpx.yaml
```
SkyPilot will:
- Deploy a GCP cluster with GPUDirect-TCPX-enabled nodes
- Execute NCCL performance tests
- Output detailed performance metrics
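Once the launch finishes, an optional sanity check is to confirm the cluster is up before looking at the test output (the exact columns shown may vary by SkyPilot version):

```bash
# Show the status of the cluster launched above.
sky status tcpx
```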
Successful GPUDirect-TCPX activation is confirmed by these log entries:
```
NCCL INFO NET/GPUDirectTCPX ver. 3.1.8.
NCCL INFO NET/GPUDirectTCPX : GPUDirectTCPX enable: 1
```
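One way to look for these lines, assuming the cluster name `tcpx` from the launch command above, is to grep the job logs (the exact wording can differ across NCCL plugin versions):

```bash
# Search the latest job's output for the GPUDirectTCPX initialization messages.
sky logs tcpx | grep "NET/GPUDirectTCPX"
```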
Note: To run tests without GPUDirect-TCPX, use:
```bash
sky launch -c tcpx --env USE_GPU_DIRECT=false gpu_direct_tcpx.yaml
```
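If the cluster is already running, you can also re-run only the test job without re-provisioning by using `sky exec` instead of `sky launch`; this sketch assumes the `tcpx` cluster from above is still up:

```bash
# Re-run the NCCL test on the existing cluster with GPUDirect-TCPX disabled.
sky exec tcpx --env USE_GPU_DIRECT=false gpu_direct_tcpx.yaml
```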
Performance Benchmark Results#
We conducted performance comparisons using NCCL tests on a GCP cluster of two `a3-highgpu-8g` instances (8x NVIDIA H100 each, 16 GPUs in total). The speed-up is calculated as:

Speed-up = busbw GPUDirect-TCPX (GB/s) / busbw Non-GPUDirect-TCPX (GB/s)
| Message Size | busbw GPUDirect-TCPX (GB/s) | busbw Non-GPUDirect-TCPX (GB/s) | Speed-up |
|---|---|---|---|
| 8 B | 0 | 0 | - |
| 16 B | 0 | 0 | - |
| 32 B | 0 | 0 | - |
| 64 B | 0 | 0 | - |
| 128 B | 0 | 0 | - |
| 256 B | 0 | 0 | - |
| 512 B | 0 | 0 | - |
| 1 KB | 0.01 | 0.01 | 1x |
| 2 KB | 0.01 | 0.01 | 1x |
| 4 KB | 0.01 | 0.02 | 0.5x |
| 8 KB | 0.02 | 0.04 | 0.5x |
| 16 KB | 0.04 | 0.09 | 0.4x |
| 32 KB | 0.09 | 0.12 | 0.7x |
| 64 KB | 0.11 | 0.17 | 0.6x |
| 128 KB | 0.19 | 0.15 | 1.2x |
| 256 KB | 0.35 | 0.23 | 1.5x |
| 512 KB | 0.65 | 0.47 | 1.4x |
| 1 MB | 1.33 | 0.95 | 1.4x |
| 2 MB | 2.43 | 1.87 | 1.3x |
| 4 MB | 4.8 | 3.64 | 1.3x |
| 8 MB | 9.21 | 7.1 | 1.3x |
| 16 MB | 17.16 | 8.83 | 1.9x |
| 32 MB | 30.08 | 12.07 | 2.5x |
| 64 MB | 45.31 | 12.48 | 3.6x |
| 128 MB | 61.58 | 16.27 | 3.8x |
| 256 MB | 67.82 | 20.93 | 3.2x |
| 512 MB | 67.09 | 19.93 | 3.3x |
| 1 GB | 66.2 | 20.09 | 3.3x |
| 2 GB | 65.72 | 19.39 | 3.4x |
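As a worked example of the speed-up formula above: at a 128 MB message size, the measured bus bandwidth is 61.58 GB/s with GPUDirect-TCPX and 16.27 GB/s without, giving a speed-up of 61.58 / 16.27 ≈ 3.8x.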
Key Performance Insights#
| Message Size Range | Performance Characteristics |
|---|---|
| ≤ 128 KB | Minimal benefit: GPUDirect-TCPX may introduce slight overhead for small messages, with comparable or lower bandwidth than non-GPUDirect mode |
| 256 KB – 8 MB | Moderate improvement: 1.3–1.5x speed-up, with the performance crossover point at 128–256 KB |
| ≥ 16 MB | Significant advantage: 1.9–3.8x speed-up, with GPUDirect-TCPX reaching 65–67 GB/s at the largest message sizes versus ~20 GB/s without it |
GPUDirect-TCPX’s direct GPU-to-NIC communication path eliminates CPU and system memory bottlenecks, delivering superior throughput for large-scale data transfers. This makes it particularly effective for distributed deep learning workloads and high-performance computing applications.
Included files#
gpu_direct_tcpx.yaml
```yaml
name: nccl-gpu-direct-tcpx

resources:
  cloud: gcp
  instance_type: a3-highgpu-8g
  image_id: docker:us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx

num_nodes: 2

envs:
  USE_GPU_DIRECT: "true"

setup: |
  # Check if /usr/local/tcpx/lib64/libnccl.so.2 is present
  # to ensure the user-data script has completed.
  while [ ! -f /usr/local/tcpx/lib64/libnccl.so.2 ]; do
    echo "Waiting for user-data script to complete"
    sleep 10
  done
  # Remount the directories with exec permissions
  sudo mount -o remount,exec /usr/local/tcpx/lib64
  sudo mount -o remount,exec /usr/local/nvidia/lib64
  sudo mount -o remount,exec /usr/local/nvidia/bin

run: |
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    echo "Head node"

    # Total number of processes; NP should be the total number of GPUs in the cluster
    NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

    # Append :${SKYPILOT_NUM_GPUS_PER_NODE} to each IP as slots
    nodes=""
    for ip in $SKYPILOT_NODE_IPS; do
      nodes="${nodes}${ip}:${SKYPILOT_NUM_GPUS_PER_NODE},"
    done
    nodes=${nodes::-1}
    echo "All nodes: ${nodes}"

    # Set environment variables
    export PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:$PATH
    export NCCL_SOCKET_IFNAME=eth0
    export NCCL_CROSS_NIC=0
    export NCCL_ALGO=Ring
    export NCCL_PROTO=Simple
    export NCCL_NSOCKS_PERTHREAD=4
    export NCCL_SOCKET_NTHREADS=1
    export NCCL_NET_GDR_LEVEL=PIX
    export NCCL_DYNAMIC_CHUNK_SIZE=524288
    export NCCL_P2P_PXN_LEVEL=0
    export NCCL_P2P_NET_CHUNKSIZE=524288
    export NCCL_P2P_PCI_CHUNKSIZE=524288
    export NCCL_P2P_NVL_CHUNKSIZE=1048576
    export NCCL_BUFFSIZE=8388608
    export NCCL_MAX_NCHANNELS=8
    export NCCL_MIN_NCHANNELS=8
    export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
    export NCCL_GPUDIRECTTCPX_SOCKET_IFNAME=eth1,eth2,eth3,eth4
    export NCCL_GPUDIRECTTCPX_CTRL_DEV=eth0
    export NCCL_GPUDIRECTTCPX_TX_BINDINGS="eth1:8-21,112-125;eth2:8-21,112-125;eth3:60-73,164-177;eth4:60-73,164-177"
    export NCCL_GPUDIRECTTCPX_RX_BINDINGS="eth1:22-35,126-139;eth2:22-35,126-139;eth3:74-87,178-191;eth4:74-87,178-191"
    export NCCL_GPUDIRECTTCPX_PROGRAM_FLOW_STEERING_WAIT_MICROS=50000
    export NCCL_GPUDIRECTTCPX_UNIX_CLIENT_PREFIX="/run/tcpx"
    export NCCL_GPUDIRECTTCPX_FORCE_ACK=0

    if [ "${USE_GPU_DIRECT}" == "true" ]; then
      export LD_LIBRARY_PATH=/usr/local/nvidia/lib64:/usr/local/tcpx/lib64
    else
      # Use the default NCCL library
      export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/nvidia/lib64
    fi

    mpirun \
      --allow-run-as-root \
      --tag-output \
      -H $nodes \
      -np $NP \
      -N $SKYPILOT_NUM_GPUS_PER_NODE \
      -x PATH \
      -x LD_LIBRARY_PATH \
      -x NCCL_SOCKET_IFNAME \
      -x NCCL_CROSS_NIC \
      -x NCCL_ALGO \
      -x NCCL_PROTO \
      -x NCCL_NSOCKS_PERTHREAD \
      -x NCCL_SOCKET_NTHREADS \
      -x NCCL_MAX_NCHANNELS \
      -x NCCL_MIN_NCHANNELS \
      -x NCCL_DYNAMIC_CHUNK_SIZE \
      -x NCCL_P2P_NET_CHUNKSIZE \
      -x NCCL_P2P_PCI_CHUNKSIZE \
      -x NCCL_P2P_NVL_CHUNKSIZE \
      -x NCCL_BUFFSIZE \
      -x CUDA_VISIBLE_DEVICES \
      -x NCCL_GPUDIRECTTCPX_SOCKET_IFNAME \
      -x NCCL_GPUDIRECTTCPX_CTRL_DEV \
      -x NCCL_GPUDIRECTTCPX_TX_BINDINGS \
      -x NCCL_GPUDIRECTTCPX_RX_BINDINGS \
      -x NCCL_GPUDIRECTTCPX_PROGRAM_FLOW_STEERING_WAIT_MICROS \
      -x NCCL_GPUDIRECTTCPX_UNIX_CLIENT_PREFIX \
      -x NCCL_GPUDIRECTTCPX_FORCE_ACK \
      -x NCCL_NET_GDR_LEVEL \
      -x NCCL_P2P_PXN_LEVEL \
      -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=ENV \
      --mca btl tcp,self \
      --mca btl_tcp_if_include eth0 \
      --mca plm_rsh_args "-p 10022" \
      /third_party/nccl-tests-mpi/build/all_reduce_perf \
        -b 8 \
        -e 2G \
        -f 2 \
        -g 1 \
        -c 1 \
        -w 5 \
        -n 20
  else
    echo "Worker nodes"
  fi

config:
  gcp:
    enable_gpu_direct: true
    managed_instance_group:
      run_duration: 36000
      provision_timeout: 900
```