Source: examples/coreweave_infiniband
Using InfiniBand in CoreWeave with SkyPilot#
Set up InfiniBand with a single SkyPilot configuration#
SkyPilot provides the network_tier: best configuration option, which automatically enables InfiniBand support on CoreWeave Kubernetes clusters. This eliminates the need to manually configure rdma/ib resources, security contexts, and environment variables.
InfiniBand on CoreWeave managed Kubernetes clusters#
Simply add network_tier: best to your resources specification:
resources:
  infra: k8s
  accelerators: H200:8
  network_tier: best
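For context, this single line replaces pod-level setup along the lines of the sketch below. This is a rough illustration only: the rdma/ib: 1 resource name is taken from the example YAML later on this page, while the IPC_LOCK capability is an assumption that may vary by cluster.

# ~/.sky/config.yaml -- rough sketch of the manual alternative (fields are illustrative)
kubernetes:
  pod_config:
    spec:
      containers:
        - resources:
            limits:
              rdma/ib: 1           # InfiniBand device resource (see the example YAML below)
          securityContext:
            capabilities:
              add: ["IPC_LOCK"]    # assumption: typically needed to pin memory for RDMA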
End-to-end Example#
See coreweave_nccl_test.yaml (included below) for a complete example using the simplified configuration:
sky launch -c nccl coreweave_nccl_test.yaml
This enables InfiniBand for inter-GPU communication, and SkyPilot automatically sets up the required environment variables for you.
SkyPilot will:

- Schedule the job on a Kubernetes cluster with the required GPU nodes
- Launch pods and execute the NCCL performance test
- Output performance metrics showing the benefits of InfiniBand for distributed training
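After launching, you can follow the test output and tear down the cluster with the usual SkyPilot commands:

sky logs nccl    # stream the NCCL test output
sky down nccl    # tear down the cluster when finished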
NCCL test on 16 H200s (2 nodes with H200:8 each) with InfiniBand:
#                                                             out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)                                (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
   536870912     134217728     float     sum      -1    2387.7  224.85  421.59      0    2325.5  230.86  432.86      0
  1073741824     268435456     float     sum      -1    4416.5  243.12  455.85      0    4425.1  242.65  454.96      0
  2147483648     536870912     float     sum      -1    8562.4  250.80  470.26      0    8581.4  250.25  469.22      0
  4294967296    1073741824     float     sum      -1     16852  254.86  477.86      0     16844  254.98  478.09      0
  8589934592    2147483648     float     sum      -1     33460  256.72  481.35      0     33381  257.33  482.49      0
# Avg bus bandwidth    : 462.453
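The busbw column is nccl-tests' bus bandwidth metric: for all_reduce it rescales the algorithm bandwidth (algbw) by 2(n-1)/n, where n is the number of ranks. With n = 16 ranks here the factor is 2 × 15 / 16 = 1.875, so the first row's 224.85 GB/s algbw corresponds to 224.85 × 1.875 ≈ 421.59 GB/s busbw.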
NOTE: To run the NCCL test without InfiniBand, comment out the network_tier: best line in the YAML file. This falls back to the default network configuration without InfiniBand.
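For example, the resources section from above with the line commented out:

resources:
  infra: k8s
  accelerators: H200:8
  # network_tier: best    # commented out: use the default network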
NCCL test on 16 H200s (2 nodes with H200:8 each) without InfiniBand:
#                                                             out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)                                (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
   536870912     134217728     float     sum      -1    269865    1.99    3.73      0    267004    2.01    3.77      0
  1073741824     268435456     float     sum      -1    536466    2.00    3.75      0    539227    1.99    3.73      0
  2147483648     536870912     float     sum      -1   1071841    2.00    3.76      0   1078080    1.99    3.73      0
  4294967296    1073741824     float     sum      -1   2137076    2.01    3.77      0   2143324    2.00    3.76      0
  8589934592    2147483648     float     sum      -1   4280791    2.01    3.76      0   4339219    1.98    3.71      0
# Avg bus bandwidth    : 3.7478
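Comparing the two runs: 462.453 GB/s average bus bandwidth with InfiniBand versus 3.7478 GB/s without, roughly a 123x difference on the same 16 GPUs.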
Included files#
coreweave_nccl_test.yaml
# This example is used to test the NCCL performance with
# InfiniBand on a CoreWeave Kubernetes cluster.
#
# Usage:
#   sky launch -c nccl coreweave_nccl_test.yaml

name: nccl-network-tier

resources:
  infra: k8s
  accelerators: H200:8
  image_id: ghcr.io/coreweave/nccl-tests:12.8.1-devel-ubuntu22.04-nccl2.26.2-1-0708d2e
  network_tier: best  # Automatically requests rdma/ib: 1 resource and sets env vars

num_nodes: 2

run: |
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    echo "Head node"
    # Total number of processes; NP should be the total number of GPUs in the cluster
    NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))
    # Append :${SKYPILOT_NUM_GPUS_PER_NODE} to each IP as slots
    nodes=""
    for ip in $SKYPILOT_NODE_IPS; do
      nodes="${nodes}${ip}:${SKYPILOT_NUM_GPUS_PER_NODE},"
    done
    nodes=${nodes::-1}  # Strip the trailing comma
    echo "All nodes: ${nodes}"
    mpirun \
      --allow-run-as-root \
      --tag-output \
      -H $nodes \
      -np $NP \
      -N $SKYPILOT_NUM_GPUS_PER_NODE \
      --bind-to none \
      -x PATH \
      -x LD_LIBRARY_PATH \
      -x NCCL_DEBUG=INFO \
      -x NCCL_SOCKET_IFNAME=eth0 \
      -x NCCL_IB_HCA \
      -x UCX_NET_DEVICES \
      -x SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1 \
      -x NCCL_COLLNET_ENABLE=0 \
      /opt/nccl-tests/build/all_reduce_perf \
      -b 512M \
      -e 8G \
      -f 2 \
      -g 1 \
      -c 1 \
      -w 5 \
      -n 10
  else
    echo "Worker nodes"
  fi
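To trace the head-node logic in this file: with num_nodes: 2 and accelerators: H200:8, SKYPILOT_NUM_NODES=2 and SKYPILOT_NUM_GPUS_PER_NODE=8, so NP=16. The loop over SKYPILOT_NODE_IPS builds a host list such as 10.0.0.1:8,10.0.0.2:8 (illustrative IPs), and mpirun on the head node starts 8 ranks on each listed host. The worker nodes' run section only prints a message; their processes are launched by mpirun from the head node.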