Source: examples/gcp_gpu_direct_tcpx

Using High-Performance GPU Networking on GCP and GKE with SkyPilot#

SkyPilot supports advanced GPU networking technologies on both GCP VMs and GKE clusters, enabling high-performance inter-GPU communication for distributed deep learning and HPC workloads. This includes support for:

  • GPUDirect-TCPX: a3-highgpu-8g, a3-edgegpu-8g (H100)

  • GPUDirect-RDMA: a3-ultragpu-8g (H200), a4-highgpu-8g (B200)

NCCL Test Example YAMLs#

We offer example YAMLs for running NCCL tests to verify high-performance GPU networking on GCP, covering both VM-based and GKE-based deployments:

| Configuration | Target Platform | GPU Networking Technology | VM Types |
|---|---|---|---|
| nccl_tcpx_gcpvm_h100.yaml | GCP VM | GPUDirect-TCPX | a3-highgpu-8g, a3-edgegpu-8g |
| nccl_tcpx_gke_h100.yaml | GKE | GPUDirect-TCPX | a3-highgpu-8g, a3-edgegpu-8g |
| nccl_rdma_gke_h200.yaml | GKE | GPUDirect-RDMA | a3-ultragpu-8g |

GKE: Using High-Performance GPU Networking on GKE#

SkyPilot supports advanced GPU networking on GKE clusters, including GPUDirect-TCPX and GPUDirect-RDMA, by simply setting network_tier: best:

resources:
  ...
  network_tier: best  # Turn on GPUDirect if available
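
For example, a minimal end-to-end task sketch (the accelerator type, node count, and run command below are illustrative, not taken from the example YAMLs):

resources:
  cloud: kubernetes
  accelerators: H200:8
  network_tier: best  # Turn on GPUDirect if available

num_nodes: 2

run: |
  nvidia-smi topo -m  # Inspect the GPU/NIC topology on each node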

To make sure your cluster is set up correctly, refer to the GKE documentation and set up the cluster with the appropriate networking configuration.

In addition to creating a node pool with a fixed size to request the desired GPU instances, you can also use Dynamic Workload Scheduler (DWS) on GKE to provision the nodes; refer to Using DWS on GKE for more details.

After setting up the GKE cluster, you can run the appropriate NCCL tests for your GPUs:

GPUDirect-RDMA on a3-ultragpu-8g (H200)#

We validated GPUDirect-RDMA performance on a3-ultragpu-8g instances with H200 GPUs. Testing was conducted on a 2-node cluster with 16x H200 GPUs (8 per node) using the configuration in nccl_rdma_gke_h200.yaml.

The scaling curves and bandwidth measurements closely match Google’s official benchmarks shown in their documentation.

Running nccl_rdma_gke_h200.yaml on SkyPilot#

$ sky launch -c nccl nccl_rdma_gke_h200.yaml
...
(head, rank=0, pid=3808) All nodes: 10.100.9.12:8,10.100.10.12:8
(worker1, rank=1, pid=2769, ip=10.100.10.12) Worker nodes
(head, rank=0, pid=3808) [1,0]<stdout>:# nThread 1 nGpus 1 minBytes 1024 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 100 agg iters: 1 validation: 1 graph: 0
(head, rank=0, pid=3808) [1,0]<stdout>:#
(head, rank=0, pid=3808) [1,0]<stdout>:# Using devices
(head, rank=0, pid=3808) [1,0]<stdout>:#  Rank  0 Group  0 Pid   7774 on nc-d87e1263-head device  0 [0000:8f:00] NVIDIA H200
(head, rank=0, pid=3808) [1,0]<stdout>:#  Rank  1 Group  0 Pid   7775 on nc-d87e1263-head device  1 [0000:90:00] NVIDIA H200
(head, rank=0, pid=3808) [1,0]<stdout>:#  Rank  2 Group  0 Pid   7776 on nc-d87e1263-head device  2 [0000:96:00] NVIDIA H200
(head, rank=0, pid=3808) [1,0]<stdout>:#  Rank  3 Group  0 Pid   7778 on nc-d87e1263-head device  3 [0000:97:00] NVIDIA H200
(head, rank=0, pid=3808) [1,0]<stdout>:#  Rank  4 Group  0 Pid   7783 on nc-d87e1263-head device  4 [0000:c4:00] NVIDIA H200
(head, rank=0, pid=3808) [1,0]<stdout>:#  Rank  5 Group  0 Pid   7786 on nc-d87e1263-head device  5 [0000:c5:00] NVIDIA H200
(head, rank=0, pid=3808) [1,0]<stdout>:#  Rank  6 Group  0 Pid   7789 on nc-d87e1263-head device  6 [0000:cb:00] NVIDIA H200
(head, rank=0, pid=3808) [1,0]<stdout>:#  Rank  7 Group  0 Pid   7792 on nc-d87e1263-head device  7 [0000:cc:00] NVIDIA H200
(head, rank=0, pid=3808) [1,0]<stdout>:#  Rank  8 Group  0 Pid   4926 on nc-d87e1263-worker1 device  0 [0000:8f:00] NVIDIA H200
(head, rank=0, pid=3808) [1,0]<stdout>:#  Rank  9 Group  0 Pid   4927 on nc-d87e1263-worker1 device  1 [0000:90:00] NVIDIA H200
(head, rank=0, pid=3808) [1,0]<stdout>:#  Rank 10 Group  0 Pid   4928 on nc-d87e1263-worker1 device  2 [0000:96:00] NVIDIA H200
(head, rank=0, pid=3808) [1,0]<stdout>:#  Rank 11 Group  0 Pid   4929 on nc-d87e1263-worker1 device  3 [0000:97:00] NVIDIA H200
(head, rank=0, pid=3808) [1,0]<stdout>:#  Rank 12 Group  0 Pid   4931 on nc-d87e1263-worker1 device  4 [0000:c4:00] NVIDIA H200
(head, rank=0, pid=3808) [1,0]<stdout>:#  Rank 13 Group  0 Pid   4934 on nc-d87e1263-worker1 device  5 [0000:c5:00] NVIDIA H200
(head, rank=0, pid=3808) [1,0]<stdout>:#  Rank 14 Group  0 Pid   4937 on nc-d87e1263-worker1 device  6 [0000:cb:00] NVIDIA H200
(head, rank=0, pid=3808) [1,0]<stdout>:#  Rank 15 Group  0 Pid   4940 on nc-d87e1263-worker1 device  7 [0000:cc:00] NVIDIA H200
(head, rank=0, pid=3808) [1,0]<stdout>:#
(head, rank=0, pid=3808) [1,0]<stdout>:#                                                              out-of-place                       in-place
(head, rank=0, pid=3808) [1,0]<stdout>:#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
(head, rank=0, pid=3808) [1,0]<stdout>:#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
(head, rank=0, pid=3808) [1,0]<stdout>:        1024            16     float    none      -1    28.39    0.04    0.03      0    28.07    0.04    0.03      0
(head, rank=0, pid=3808) [1,0]<stdout>:        2048            32     float    none      -1    28.29    0.07    0.07      0    28.24    0.07    0.07      0
(head, rank=0, pid=3808) [1,0]<stdout>:        4096            64     float    none      -1    28.85    0.14    0.13      0    28.64    0.14    0.13      0
(head, rank=0, pid=3808) [1,0]<stdout>:        8192           128     float    none      -1    34.12    0.24    0.23      0    32.47    0.25    0.24      0
(head, rank=0, pid=3808) [1,0]<stdout>:       16384           256     float    none      -1    33.28    0.49    0.46      0    33.29    0.49    0.46      0
(head, rank=0, pid=3808) [1,0]<stdout>:       32768           512     float    none      -1    34.80    0.94    0.88      0    34.86    0.94    0.88      0
(head, rank=0, pid=3808) [1,0]<stdout>:       65536          1024     float    none      -1    40.62    1.61    1.51      0    41.22    1.59    1.49      0
(head, rank=0, pid=3808) [1,0]<stdout>:      131072          2048     float    none      -1    36.17    3.62    3.40      0    40.87    3.21    3.01      0
(head, rank=0, pid=3808) [1,0]<stdout>:      262144          4096     float    none      -1    41.81    6.27    5.88      0    37.99    6.90    6.47      0
(head, rank=0, pid=3808) [1,0]<stdout>:      524288          8192     float    none      -1    43.98   11.92   11.18      0    45.96   11.41   10.69      0
(head, rank=0, pid=3808) [1,0]<stdout>:     1048576         16384     float    none      -1    58.46   17.94   16.82      0    54.81   19.13   17.93      0
(head, rank=0, pid=3808) [1,0]<stdout>:     2097152         32768     float    none      -1    68.40   30.66   28.74      0    70.87   29.59   27.74      0
(head, rank=0, pid=3808) [1,0]<stdout>:     4194304         65536     float    none      -1    76.56   54.78   51.36      0    76.13   55.09   51.65      0
(head, rank=0, pid=3808) [1,0]<stdout>:     8388608        131072     float    none      -1    86.92   96.51   90.47      0    85.77   97.81   91.70      0
(head, rank=0, pid=3808) [1,0]<stdout>:    16777216        262144     float    none      -1    116.2  144.43  135.41      0    114.9  146.00  136.87      0
(head, rank=0, pid=3808) [1,0]<stdout>:    33554432        524288     float    none      -1    174.4  192.45  180.42      0    172.4  194.66  182.49      0
(head, rank=0, pid=3808) [1,0]<stdout>:    67108864       1048576     float    none      -1    278.1  241.27  226.19      0    270.5  248.10  232.59      0
(head, rank=0, pid=3808) [1,0]<stdout>:   134217728       2097152     float    none      -1    499.6  268.67  251.88      0    483.8  277.44  260.10      0
(head, rank=0, pid=3808) [1,0]<stdout>:   268435456       4194304     float    none      -1    885.5  303.16  284.21      0    870.7  308.30  289.03      0
(head, rank=0, pid=3808) [1,0]<stdout>:   536870912       8388608     float    none      -1   1575.6  340.75  319.45      0   1568.7  342.24  320.85      0
(head, rank=0, pid=3808) [1,0]<stdout>:  1073741824      16777216     float    none      -1   3123.9  343.72  322.23      0   3079.7  348.65  326.86      0
(head, rank=0, pid=3808) [1,0]<stdout>:  2147483648      33554432     float    none      -1   6229.6  344.72  323.18      0   6107.6  351.61  329.63      0
(head, rank=0, pid=3808) [1,0]<stdout>:  4294967296      67108864     float    none      -1    12416  345.92  324.30      0    12133  354.00  331.87      0
(head, rank=0, pid=3808) [1,0]<stdout>:  8589934592     134217728     float    none      -1    24724  347.44  325.72      0    24214  354.75  332.58      0
(head, rank=0, pid=3808) [1,0]<stdout>:# Out of bounds values : 0 OK
(head, rank=0, pid=3808) [1,0]<stdout>:# Avg bus bandwidth    : 122.073
(head, rank=0, pid=3808) [1,0]<stdout>:#

Comparing with raw NCCL test pods from GCP documentation#

$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/gpudirect-rdma/nccl-test-a4.yaml
...
$ kubectl exec nccl-test-host-1 -it -- /usr/local/gib/scripts/run_nccl_tests.sh -t all_gather -b 1K -e 8G nccl-host-1 nccl-host-2
...
# nThread 1 nGpus 1 minBytes 1024 maxBytes 8589934592 step: 2(factor) warmup iters: 50 iters: 100 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  13581 on nccl-test-host-1 device  0 [0000:8f:00] NVIDIA H200
#  Rank  1 Group  0 Pid  13779 on nccl-test-host-1 device  1 [0000:90:00] NVIDIA H200
#  Rank  2 Group  0 Pid  13629 on nccl-test-host-1 device  2 [0000:96:00] NVIDIA H200
#  Rank  3 Group  0 Pid  13795 on nccl-test-host-1 device  3 [0000:97:00] NVIDIA H200
#  Rank  4 Group  0 Pid  13790 on nccl-test-host-1 device  4 [0000:c4:00] NVIDIA H200
#  Rank  5 Group  0 Pid  13750 on nccl-test-host-1 device  5 [0000:c5:00] NVIDIA H200
#  Rank  6 Group  0 Pid  13751 on nccl-test-host-1 device  6 [0000:cb:00] NVIDIA H200
#  Rank  7 Group  0 Pid  13754 on nccl-test-host-1 device  7 [0000:cc:00] NVIDIA H200
#  Rank  8 Group  0 Pid  13708 on nccl-test-host-2 device  0 [0000:8f:00] NVIDIA H200
#  Rank  9 Group  0 Pid  13749 on nccl-test-host-2 device  1 [0000:90:00] NVIDIA H200
#  Rank 10 Group  0 Pid  13728 on nccl-test-host-2 device  2 [0000:96:00] NVIDIA H200
#  Rank 11 Group  0 Pid  13735 on nccl-test-host-2 device  3 [0000:97:00] NVIDIA H200
#  Rank 12 Group  0 Pid  13648 on nccl-test-host-2 device  4 [0000:c4:00] NVIDIA H200
#  Rank 13 Group  0 Pid  13685 on nccl-test-host-2 device  5 [0000:c5:00] NVIDIA H200
#  Rank 14 Group  0 Pid  13653 on nccl-test-host-2 device  6 [0000:cb:00] NVIDIA H200
#  Rank 15 Group  0 Pid  13751 on nccl-test-host-2 device  7 [0000:cc:00] NVIDIA H200
NCCL version 2.26.6+cuda12.8
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
        1024            16     float    none      -1    28.70    0.04    0.03      0    27.82    0.04    0.03      0
        2048            32     float    none      -1    28.05    0.07    0.07      0    28.03    0.07    0.07      0
        4096            64     float    none      -1    28.39    0.14    0.14      0    28.35    0.14    0.14      0
        8192           128     float    none      -1    31.44    0.26    0.24      0    31.23    0.26    0.25      0
       16384           256     float    none      -1    31.61    0.52    0.49      0    31.70    0.52    0.48      0
       32768           512     float    none      -1    32.81    1.00    0.94      0    32.76    1.00    0.94      0
       65536          1024     float    none      -1    34.77    1.88    1.77      0    34.58    1.90    1.78      0
      131072          2048     float    none      -1    34.65    3.78    3.55      0    37.37    3.51    3.29      0
      262144          4096     float    none      -1    39.47    6.64    6.23      0    37.89    6.92    6.49      0
      524288          8192     float    none      -1    42.75   12.26   11.50      0    40.59   12.92   12.11      0
     1048576         16384     float    none      -1    54.70   19.17   17.97      0    53.14   19.73   18.50      0
     2097152         32768     float    none      -1    69.52   30.16   28.28      0    66.52   31.52   29.55      0
     4194304         65536     float    none      -1    77.41   54.18   50.79      0    71.70   58.50   54.84      0
     8388608        131072     float    none      -1    87.98   95.35   89.39      0    86.18   97.34   91.25      0
    16777216        262144     float    none      -1    117.6  142.68  133.76      0    123.6  135.69  127.21      0
    33554432        524288     float    none      -1    177.2  189.36  177.53      0    176.9  189.66  177.81      0
    67108864       1048576     float    none      -1    277.8  241.56  226.47      0    271.8  246.94  231.50      0
   134217728       2097152     float    none      -1    493.7  271.86  254.87      0    486.3  276.01  258.76      0
   268435456       4194304     float    none      -1    876.1  306.38  287.23      0    870.5  308.38  289.11      0
   536870912       8388608     float    none      -1   1580.2  339.74  318.51      0   1568.3  342.32  320.93      0
  1073741824      16777216     float    none      -1   3126.8  343.40  321.93      0   3084.1  348.16  326.40      0
  2147483648      33554432     float    none      -1   6218.0  345.36  323.78      0   6097.1  352.22  330.20      0
  4294967296      67108864     float    none      -1    12400  346.38  324.73      0    12135  353.94  331.82      0
  8589934592     134217728     float    none      -1    24739  347.22  325.52      0    24244  354.31  332.16      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 121.902
#

GCP VMs: GPUDirect-TCPX on A3 VMs (H100)#

Note: The following instructions apply only to GCP VMs (a3-highgpu-8g and a3-edgegpu-8g). For GKE clusters, see GKE: Using High-Performance GPU Networking on GKE above.

This example demonstrates how to run NCCL tests on a GCP cluster with A3 VMs, comparing performance with and without GPUDirect-TCPX enabled.

Enable GPUDirect-TCPX on GCP clusters with a3-highgpu-8g or a3-edgegpu-8g VMs by adding a single configuration parameter to your SkyPilot YAML:

config:
  gcp:
    enable_gpu_direct: true

With enable_gpu_direct: true, SkyPilot automatically:

  1. Creates a dedicated network infrastructure:

    • 1 management VPC

    • 4 data VPCs

    • Corresponding subnets for each VPC

  2. Provisions VMs with GPUDirect-TCPX support:

    • Launches VMs with the specified instance type

    • Uses GPUDirect-TCPX-compatible images

    • Installs necessary GPU drivers

    • Deploys the GPUDirect-TCPX Receive Data Path Manager service

    • Configures the NVIDIA Collective Communications Library (NCCL) and the GPUDirect-TCPX plugin
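
Putting this together, a hedged sketch of a task YAML that enables GPUDirect-TCPX (the instance type and node count are illustrative; see nccl_tcpx_gcpvm_h100.yaml for the full, tested configuration):

config:
  gcp:
    enable_gpu_direct: true

resources:
  cloud: gcp
  instance_type: a3-highgpu-8g  # or a3-edgegpu-8g

num_nodes: 2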

Running NCCL Tests with GPUDirect-TCPX#

The complete configuration is available in nccl_tcpx_gcpvm_h100.yaml. It includes:

  • image_id: Pre-configured environment for NCCL testing

  • instance_type: Set to a3-highgpu-8g

To run the NCCL test with GPUDirect-TCPX:

sky launch -c tcpx nccl_tcpx_gcpvm_h100.yaml

SkyPilot will:

  1. Deploy a GCP cluster with GPUDirect-TCPX enabled nodes

  2. Execute NCCL performance tests

  3. Output detailed performance metrics

Successful GPUDirect-TCPX activation is confirmed by these log entries:

NCCL INFO NET/GPUDirectTCPX ver. 3.1.8.
NCCL INFO NET/GPUDirectTCPX : GPUDirectTCPX enable: 1

Note: To run tests without GPUDirect-TCPX, use:

sky launch -c tcpx --env USE_GPU_DIRECT=false nccl_tcpx_gcpvm_h100.yaml
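
This works because the example YAML exposes the flag as an environment variable that sky launch --env can override. A hedged sketch of such a declaration (the default value is an assumption; check the example file):

envs:
  USE_GPU_DIRECT: "true"  # Assumed default; override with --env USE_GPU_DIRECT=false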

Performance Benchmark Results#

We conducted performance comparisons using NCCL tests on a GCP cluster with 2x a3-highgpu-8g (2xH100:8) instances. The speed-up is calculated as:

Speed-up = busbw GPUDirect-TCPX (GB/s) / busbw Non-GPUDirect-TCPX (GB/s)
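
For example, at a 64 MB message size the table below gives 45.31 / 12.48 ≈ 3.6x.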

| Message Size | busbw GPUDirect-TCPX (GB/s) | busbw Non-GPUDirect-TCPX (GB/s) | Speed-up |
|---|---|---|---|
| 8 B | 0 | 0 | - |
| 16 B | 0 | 0 | - |
| 32 B | 0 | 0 | - |
| 64 B | 0 | 0 | - |
| 128 B | 0 | 0 | - |
| 256 B | 0 | 0 | - |
| 512 B | 0 | 0 | - |
| 1 KB | 0.01 | 0.01 | 1 x |
| 2 KB | 0.01 | 0.01 | 1 x |
| 4 KB | 0.01 | 0.02 | 0.5 x |
| 8 KB | 0.02 | 0.04 | 0.5 x |
| 16 KB | 0.04 | 0.09 | 0.4 x |
| 32 KB | 0.09 | 0.12 | 0.7 x |
| 64 KB | 0.11 | 0.17 | 0.6 x |
| 128 KB | 0.19 | 0.15 | 1.2 x |
| 256 KB | 0.35 | 0.23 | 1.5 x |
| 512 KB | 0.65 | 0.47 | 1.4 x |
| 1 MB | 1.33 | 0.95 | 1.4 x |
| 2 MB | 2.43 | 1.87 | 1.3 x |
| 4 MB | 4.8 | 3.64 | 1.3 x |
| 8 MB | 9.21 | 7.1 | 1.3 x |
| 16 MB | 17.16 | 8.83 | 1.9 x |
| 32 MB | 30.08 | 12.07 | 2.5 x |
| 64 MB | 45.31 | 12.48 | 3.6 x |
| 128 MB | 61.58 | 16.27 | 3.8 x |
| 256 MB | 67.82 | 20.93 | 3.2 x |
| 512 MB | 67.09 | 19.93 | 3.3 x |
| 1 GB | 66.2 | 20.09 | 3.3 x |
| 2 GB | 65.72 | 19.39 | 3.4 x |

Key Performance Insights#

| Message Size Range | Performance Characteristics |
|---|---|
| ≤ 128 KB | Minimal benefit - GPUDirect-TCPX may introduce slight overhead for small messages, with comparable or lower bandwidth than non-GPUDirect mode |
| 256 KB – 8 MB | Moderate improvement - Speedup of 1.3–1.5x, with the performance crossover point at 128–256 KB |
| ≥ 16 MB | Significant advantage - 1.9–3.8x speedup, with GPUDirect-TCPX maintaining 65–67 GB/s versus ~20 GB/s without it |

GPUDirect-TCPX’s direct GPU-to-NIC communication path eliminates CPU and system memory bottlenecks, delivering superior throughput for large-scale data transfers. This makes it particularly effective for distributed deep learning workloads and high-performance computing applications.

Included files#