Source: examples/gcp_gpu_direct_tcpx
Using GCP GPUDirect-TCPX on A3 VM with SkyPilot#
GPUDirect-TCPX is a high-performance networking technology that enables direct communication between GPUs and network interfaces. By bypassing the CPU and system memory, it significantly enhances network performance for A3 VMs, particularly for large data transfers.
When deploying `a3-highgpu-8g` or `a3-edgegpu-8g` VMs, combining GPUDirect-TCPX with Google Virtual NIC (gVNIC) delivers optimal network performance with minimal latency between applications and the network infrastructure.
This example demonstrates how to run NCCL tests on a GCP cluster with `a3-highgpu-8g` VMs, comparing performance with and without GPUDirect-TCPX enabled.
TL;DR: enable GPUDirect-TCPX with SkyPilot#
Enable GPUDirect-TCPX on GCP clusters with `a3-highgpu-8g` or `a3-edgegpu-8g` VMs by adding a single configuration parameter to your SkyPilot YAML:
```yaml
config:
  gcp:
    enable_gpu_direct: true
```
With `enable_gpu_direct: true`, SkyPilot automatically:
- Creates a dedicated network infrastructure:
  - 1 management VPC
  - 4 data VPCs
  - Corresponding subnets for each VPC
- Provisions VMs with GPUDirect-TCPX support:
  - Launches VMs with the specified instance type
  - Uses GPUDirect-TCPX-compatible images
  - Installs necessary GPU drivers
  - Deploys the GPUDirect-TCPX Receive Data Path Manager service
  - Configures NVIDIA Collective Communications Library (NCCL) and the GPUDirect-TCPX plugin
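To sanity-check the provisioned networking, you can list the VPC networks and subnets in your GCP project after the cluster comes up. This is an optional verification step, not part of the example, and the exact names of the auto-created VPCs depend on your SkyPilot version:

```bash
# List all VPC networks in the current project; the management VPC and
# the four data VPCs created for GPUDirect-TCPX should appear here.
gcloud compute networks list

# List the subnets associated with those VPCs.
gcloud compute networks subnets list
```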
Running NCCL Tests with GPUDirect-TCPX#
The complete configuration is available in gpu_direct_tcpx.yaml. The configuration includes:
- `image_id`: pre-configured environment for NCCL testing
- `instance_type`: set to `a3-highgpu-8g`
To run the NCCL test with GPUDirect-TCPX:
```bash
sky launch -c tcpx gpu_direct_tcpx.yaml
```
SkyPilot will:
- Deploy a GCP cluster with GPUDirect-TCPX-enabled nodes
- Execute NCCL performance tests
- Output detailed performance metrics
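Once the launch finishes, an optional sanity check is to confirm the cluster is up before looking at the test output (the exact columns shown may vary by SkyPilot version):

```bash
# Show the status of the cluster launched above.
sky status tcpx
```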
Successful GPUDirect-TCPX activation is confirmed by these log entries:
```
NCCL INFO NET/GPUDirectTCPX ver. 3.1.8.
NCCL INFO NET/GPUDirectTCPX : GPUDirectTCPX enable: 1
```
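One way to look for these lines, assuming the cluster name `tcpx` from the launch command above, is to grep the job logs (the exact wording can differ across NCCL plugin versions):

```bash
# Search the latest job's output for the GPUDirectTCPX initialization messages.
sky logs tcpx | grep "NET/GPUDirectTCPX"
```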
Note: To run tests without GPUDirect-TCPX, use:
```bash
sky launch -c tcpx --env USE_GPU_DIRECT=false gpu_direct_tcpx.yaml
```
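If the cluster is already running, you can also re-run only the test job without re-provisioning by using `sky exec` instead of `sky launch`; this sketch assumes the `tcpx` cluster from above is still up:

```bash
# Re-run the NCCL test on the existing cluster with GPUDirect-TCPX disabled.
sky exec tcpx --env USE_GPU_DIRECT=false gpu_direct_tcpx.yaml
```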
Performance Benchmark Results#
We conducted performance comparisons using NCCL tests on a GCP cluster of two `a3-highgpu-8g` instances (8x NVIDIA H100 each, 16 GPUs in total). The speed-up is calculated as:

Speed-up = busbw GPUDirect-TCPX (GB/s) / busbw Non-GPUDirect-TCPX (GB/s)
| Message Size | busbw GPUDirect-TCPX (GB/s) | busbw Non-GPUDirect-TCPX (GB/s) | Speed-up |
|---|---|---|---|
| 8 B | 0 | 0 | - |
| 16 B | 0 | 0 | - |
| 32 B | 0 | 0 | - |
| 64 B | 0 | 0 | - |
| 128 B | 0 | 0 | - |
| 256 B | 0 | 0 | - |
| 512 B | 0 | 0 | - |
| 1 KB | 0.01 | 0.01 | 1x |
| 2 KB | 0.01 | 0.01 | 1x |
| 4 KB | 0.01 | 0.02 | 0.5x |
| 8 KB | 0.02 | 0.04 | 0.5x |
| 16 KB | 0.04 | 0.09 | 0.4x |
| 32 KB | 0.09 | 0.12 | 0.7x |
| 64 KB | 0.11 | 0.17 | 0.6x |
| 128 KB | 0.19 | 0.15 | 1.2x |
| 256 KB | 0.35 | 0.23 | 1.5x |
| 512 KB | 0.65 | 0.47 | 1.4x |
| 1 MB | 1.33 | 0.95 | 1.4x |
| 2 MB | 2.43 | 1.87 | 1.3x |
| 4 MB | 4.8 | 3.64 | 1.3x |
| 8 MB | 9.21 | 7.1 | 1.3x |
| 16 MB | 17.16 | 8.83 | 1.9x |
| 32 MB | 30.08 | 12.07 | 2.5x |
| 64 MB | 45.31 | 12.48 | 3.6x |
| 128 MB | 61.58 | 16.27 | 3.8x |
| 256 MB | 67.82 | 20.93 | 3.2x |
| 512 MB | 67.09 | 19.93 | 3.3x |
| 1 GB | 66.2 | 20.09 | 3.3x |
| 2 GB | 65.72 | 19.39 | 3.4x |
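As a worked example of the speed-up formula above: at a 128 MB message size, the measured bus bandwidth is 61.58 GB/s with GPUDirect-TCPX and 16.27 GB/s without, giving a speed-up of 61.58 / 16.27 ≈ 3.8x.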
Key Performance Insights#
| Message Size Range | Performance Characteristics |
|---|---|
| ≤ 128 KB | Minimal benefit: GPUDirect-TCPX may introduce slight overhead for small messages, with comparable or lower bandwidth than non-GPUDirect mode |
| 256 KB – 8 MB | Moderate improvement: 1.3–1.5x speed-up, with the performance crossover point at 128–256 KB |
| ≥ 16 MB | Significant advantage: 1.9–3.8x speed-up, with GPUDirect-TCPX reaching 65–67 GB/s at the largest message sizes versus ~20 GB/s without it |
GPUDirect-TCPX’s direct GPU-to-NIC communication path eliminates CPU and system memory bottlenecks, delivering superior throughput for large-scale data transfers. This makes it particularly effective for distributed deep learning workloads and high-performance computing applications.
Included files#
gpu_direct_tcpx.yaml
```yaml
name: nccl-gpu-direct-tcpx

resources:
  cloud: gcp
  instance_type: a3-highgpu-8g
  image_id: docker:us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx

num_nodes: 2

envs:
  USE_GPU_DIRECT: "true"

setup: |
  # Check if /usr/local/tcpx/lib64/libnccl.so.2 is present
  # to ensure the user-data script has completed.
  while [ ! -f /usr/local/tcpx/lib64/libnccl.so.2 ]; do
    echo "Waiting for user-data script to complete"
    sleep 10
  done
  # Remount the directories with exec permissions
  sudo mount -o remount,exec /usr/local/tcpx/lib64
  sudo mount -o remount,exec /usr/local/nvidia/lib64
  sudo mount -o remount,exec /usr/local/nvidia/bin

run: |
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    echo "Head node"

    # Total number of processes; NP should be the total number of GPUs in the cluster
    NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

    # Append :${SKYPILOT_NUM_GPUS_PER_NODE} to each IP as slots
    nodes=""
    for ip in $SKYPILOT_NODE_IPS; do
      nodes="${nodes}${ip}:${SKYPILOT_NUM_GPUS_PER_NODE},"
    done
    nodes=${nodes::-1}
    echo "All nodes: ${nodes}"

    # Set environment variables
    export PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:$PATH
    export NCCL_SOCKET_IFNAME=eth0
    export NCCL_CROSS_NIC=0
    export NCCL_ALGO=Ring
    export NCCL_PROTO=Simple
    export NCCL_NSOCKS_PERTHREAD=4
    export NCCL_SOCKET_NTHREADS=1
    export NCCL_NET_GDR_LEVEL=PIX
    export NCCL_DYNAMIC_CHUNK_SIZE=524288
    export NCCL_P2P_PXN_LEVEL=0
    export NCCL_P2P_NET_CHUNKSIZE=524288
    export NCCL_P2P_PCI_CHUNKSIZE=524288
    export NCCL_P2P_NVL_CHUNKSIZE=1048576
    export NCCL_BUFFSIZE=8388608
    export NCCL_MAX_NCHANNELS=8
    export NCCL_MIN_NCHANNELS=8
    export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
    export NCCL_GPUDIRECTTCPX_SOCKET_IFNAME=eth1,eth2,eth3,eth4
    export NCCL_GPUDIRECTTCPX_CTRL_DEV=eth0
    export NCCL_GPUDIRECTTCPX_TX_BINDINGS="eth1:8-21,112-125;eth2:8-21,112-125;eth3:60-73,164-177;eth4:60-73,164-177"
    export NCCL_GPUDIRECTTCPX_RX_BINDINGS="eth1:22-35,126-139;eth2:22-35,126-139;eth3:74-87,178-191;eth4:74-87,178-191"
    export NCCL_GPUDIRECTTCPX_PROGRAM_FLOW_STEERING_WAIT_MICROS=50000
    export NCCL_GPUDIRECTTCPX_UNIX_CLIENT_PREFIX="/run/tcpx"
    export NCCL_GPUDIRECTTCPX_FORCE_ACK=0

    if [ "${USE_GPU_DIRECT}" == "true" ]; then
      export LD_LIBRARY_PATH=/usr/local/nvidia/lib64:/usr/local/tcpx/lib64
    else
      # Use the default NCCL library
      export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/nvidia/lib64
    fi

    mpirun \
      --allow-run-as-root \
      --tag-output \
      -H $nodes \
      -np $NP \
      -N $SKYPILOT_NUM_GPUS_PER_NODE \
      -x PATH \
      -x LD_LIBRARY_PATH \
      -x NCCL_SOCKET_IFNAME \
      -x NCCL_CROSS_NIC \
      -x NCCL_ALGO \
      -x NCCL_PROTO \
      -x NCCL_NSOCKS_PERTHREAD \
      -x NCCL_SOCKET_NTHREADS \
      -x NCCL_MAX_NCHANNELS \
      -x NCCL_MIN_NCHANNELS \
      -x NCCL_DYNAMIC_CHUNK_SIZE \
      -x NCCL_P2P_NET_CHUNKSIZE \
      -x NCCL_P2P_PCI_CHUNKSIZE \
      -x NCCL_P2P_NVL_CHUNKSIZE \
      -x NCCL_BUFFSIZE \
      -x CUDA_VISIBLE_DEVICES \
      -x NCCL_GPUDIRECTTCPX_SOCKET_IFNAME \
      -x NCCL_GPUDIRECTTCPX_CTRL_DEV \
      -x NCCL_GPUDIRECTTCPX_TX_BINDINGS \
      -x NCCL_GPUDIRECTTCPX_RX_BINDINGS \
      -x NCCL_GPUDIRECTTCPX_PROGRAM_FLOW_STEERING_WAIT_MICROS \
      -x NCCL_GPUDIRECTTCPX_UNIX_CLIENT_PREFIX \
      -x NCCL_GPUDIRECTTCPX_FORCE_ACK \
      -x NCCL_NET_GDR_LEVEL \
      -x NCCL_P2P_PXN_LEVEL \
      -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=ENV \
      --mca btl tcp,self \
      --mca btl_tcp_if_include eth0 \
      --mca plm_rsh_args "-p 10022" \
      /third_party/nccl-tests-mpi/build/all_reduce_perf \
        -b 8 \
        -e 2G \
        -f 2 \
        -g 1 \
        -c 1 \
        -w 5 \
        -n 20
  else
    echo "Worker nodes"
  fi

config:
  gcp:
    enable_gpu_direct: true
    managed_instance_group:
      run_duration: 36000
      provision_timeout: 900
```