Source: examples/nebius_infiniband
Using InfiniBand in Nebius with SkyPilot#
Set up InfiniBand with a single SkyPilot configuration#
SkyPilot provides the network_tier: best
configuration option that automatically enables InfiniBand support on Nebius Kubernetes clusters and Nebius VMs. This eliminates the need for manual configuration of security contexts and environment variables.
InfiniBand on Nebius managed Kubernetes clusters#
Simply add network_tier: best
to your resources specification:
resources:
  infra: k8s
  accelerators: H100:8
  network_tier: best
To create a Nebius Kubernetes cluster with InfiniBand enabled, check the Appendix.
End-to-end Example#
Check the nccl_network_tier.yaml
for a complete example using the simplified configuration:
sky launch -c nccl_network_tier nccl_network_tier.yaml
This enables InfiniBand for inter-GPU communication, and SkyPilot automatically sets up the required environment variables for you.
Equivalent way to turn on InfiniBand manually
On a Nebius managed Kubernetes cluster, you can also turn on InfiniBand manually:
Set the following config in your SkyPilot task YAML to enable InfiniBand:
config:
  kubernetes:
    pod_config:
      spec:
        containers:
          - securityContext:
              capabilities:
                add:
                  - IPC_LOCK
Configure the environment variables in your task:
run: |
  export NCCL_IB_HCA=mlx5
  export UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1
  ... your own run script ...
See nccl.yaml for more details.
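The UCX_NET_DEVICES list above can also be generated rather than hard-coded. A small sketch, assuming the HCAs are named mlx5_0 through mlx5_7 as in the export above:

```shell
# Build the UCX_NET_DEVICES list for 8 HCAs (mlx5_0 .. mlx5_7), port 1 each.
devices=""
for i in 0 1 2 3 4 5 6 7; do
  devices="${devices}mlx5_${i}:1,"
done
export UCX_NET_DEVICES="${devices%,}"   # drop the trailing comma
echo "$UCX_NET_DEVICES"
```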
Running NCCL test using SkyPilot#
Check the nccl_network_tier.yaml
for the complete SkyPilot cluster YAML configuration.
The image_id
provides the environment setup for NCCL (NVIDIA Collective Communications Library).
To run the NCCL test with InfiniBand support:
sky launch -c infiniband nccl_network_tier.yaml
SkyPilot will:
Schedule the job on a Kubernetes cluster with required GPU nodes
Launch Pods and execute the NCCL performance test
Output performance metrics showing the benefits of InfiniBand for distributed training
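The run scripts in the included YAML files build the mpirun host list from environment variables that SkyPilot injects at runtime. A standalone sketch of that logic, with hypothetical IPs substituted for the injected values (here `${nodes%,}` is a POSIX-portable equivalent of the `${nodes::-1}` used in the YAMLs):

```shell
# SkyPilot injects these at runtime; hypothetical example values here.
SKYPILOT_NODE_IPS="10.0.0.1
10.0.0.2"
SKYPILOT_NUM_GPUS_PER_NODE=8

# Append ":<gpus-per-node>" to each node IP as its MPI slot count.
nodes=""
for ip in $SKYPILOT_NODE_IPS; do
  nodes="${nodes}${ip}:${SKYPILOT_NUM_GPUS_PER_NODE},"
done
nodes=${nodes%,}   # drop the trailing comma
echo "$nodes"      # 10.0.0.1:8,10.0.0.2:8
```

The resulting string is passed to mpirun via `-H`, with the per-node slot counts telling MPI how many ranks to place on each host.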
An example result is shown below:
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
536870912 134217728 float sum -1 2432.7 220.69 413.79 0 2382.4 225.35 422.54 0
1073741824 268435456 float sum -1 4523.3 237.38 445.09 0 4518.9 237.61 445.52 0
2147483648 536870912 float sum -1 8785.8 244.43 458.30 0 8787.2 244.39 458.23 0
4294967296 1073741824 float sum -1 17404 246.79 462.73 0 17353 247.50 464.07 0
8589934592 2147483648 float sum -1 34468 249.21 467.28 0 34525 248.80 466.51 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 450.404
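As a sanity check on these numbers: for all_reduce, the NCCL tests derive bus bandwidth from algorithm bandwidth as busbw = algbw × 2(n−1)/n, where n is the total number of ranks (the busbw/algbw ratio of ≈1.875 above implies n = 16, i.e. 16 GPUs). Recomputing the largest-message row:

```shell
# busbw = algbw * 2*(n-1)/n for all_reduce; n = 16 ranks.
busbw=$(awk -v algbw=249.21 -v n=16 'BEGIN { printf "%.2f", algbw * 2 * (n - 1) / n }')
echo "busbw = ${busbw} GB/s"
```

This reproduces the reported 467.28 GB/s up to rounding of the printed algbw.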
NOTE: To run NCCL tests without InfiniBand, you can create the node group without a GPU cluster, then launch with nccl_no_ib.yaml, which has the config field removed:
sky launch -c no_infiniband nccl_no_ib.yaml
InfiniBand on Nebius VMs with SkyPilot#
While the previous section covered InfiniBand setup for the managed Kubernetes service, you can also enable InfiniBand directly on Nebius VMs. This approach gives you more flexibility and control over your infrastructure. For detailed instructions, refer to the Nebius documentation.
Automatic InfiniBand Setup with SkyPilot#
SkyPilot simplifies the process of setting up InfiniBand-enabled GPU clusters on Nebius VMs: when you launch with the appropriate configuration, SkyPilot automatically creates a GPU cluster with InfiniBand support and adds the VMs to it.
To enable automatic InfiniBand setup, simply request the best network tier in your SkyPilot YAML:
resources:
  network_tier: best
SkyPilot will automatically configure InfiniBand with the correct fabric for you. (Note that InfiniBand is only supported on two GPU types, H100:8 and H200:8; refer to the Nebius docs.)
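Putting it together, a minimal two-node VM task might look like the following sketch (region omitted and accelerator choice illustrative; see the included YAML files for complete, tested examples):

```yaml
resources:
  cloud: nebius
  accelerators: H200:8   # or H100:8 -- the two GPU types with InfiniBand support
  network_tier: best

num_nodes: 2
```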
Running Performance Tests#
You can verify your InfiniBand setup by running either of these tests:
NCCL Performance Test (using a prebuilt Docker image):
sky launch -c infiniband nccl_vm_ib.yaml
Result example:
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
536870912 134217728 float sum -1 2399.1 223.78 419.59 0 2354.3 228.04 427.57 0
1073741824 268435456 float sum -1 4469.9 240.22 450.41 0 4463.1 240.58 451.09 0
2147483648 536870912 float sum -1 8678.7 247.44 463.96 0 8667.1 247.77 464.57 0
4294967296 1073741824 float sum -1 17053 251.86 472.24 0 17112 250.99 470.60 0
8589934592 2147483648 float sum -1 33792 254.20 476.62 0 33735 254.63 477.42 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 457.407
NCCL Performance Test (building NCCL from source):
sky launch -c infiniband nccl_no_docker_ib.yaml
Result example:
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
536870912 134217728 float sum -1 2378.7 225.70 423.19 0 2358.1 227.68 426.89 0
1073741824 268435456 float sum -1 4464.5 240.50 450.95 0 4461.2 240.69 451.29 0
2147483648 536870912 float sum -1 8697.7 246.90 462.94 0 8699.8 246.84 462.83 0
4294967296 1073741824 float sum -1 17406 246.75 462.66 0 17185 249.93 468.62 0
8589934592 2147483648 float sum -1 33782 254.28 476.77 0 33732 254.65 477.48 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 456.361
InfiniBand Direct Test:
sky launch -c infiniband infiniband.yaml
Result example:
---------------------------------------------------------------------------------------
Send BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x4624 QPN 0x0127 PSN 0x45bd8e
remote address: LID 0x461b QPN 0x0127 PSN 0x1d3746
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
65536 1000 357.12 353.53 0.674308
---------------------------------------------------------------------------------------
Additional Resources#
The Nebius team maintains a comprehensive collection of example configurations in their ml-cookbook repository. These examples cover various use cases and can help you get started with different ML workloads on Nebius using SkyPilot.
Appendix: creating a Nebius Kubernetes cluster with InfiniBand enabled#
To enable InfiniBand for a Nebius Kubernetes cluster, you need to create a GPU node group with InfiniBand enabled. For more details, refer to the Nebius documentation.
Create a managed Kubernetes cluster, or bring your own.
Create a Nebius Kubernetes cluster:
export PROJECT_ID=your-project-id
export NB_SUBNET_ID=$(nebius vpc subnet list \
  --parent-id $PROJECT_ID \
  --format json \
  | jq -r '.items[0].metadata.id')
export NB_K8S_CLUSTER_ID=$(nebius mk8s cluster create \
  --name infini \
  --control-plane-version 1.30 \
  --control-plane-subnet-id $NB_SUBNET_ID \
  --control-plane-endpoints-public-endpoint=true \
  --parent-id=$PROJECT_ID \
  --format json | jq -r '.metadata.id')
Or, bring your own Kubernetes cluster
Find your Kubernetes cluster ID on the console or with the following command:
export PROJECT_ID=your-project-id
# Use the first cluster in the list
export NB_K8S_CLUSTER_ID=$(nebius mk8s cluster list \
  --parent-id $PROJECT_ID \
  --format json \
  | jq -r '.items[0].metadata.id')
To enable InfiniBand for a node group, you need to create a GPU cluster first, then specify the GPU cluster when creating the node group.
export INFINIBAND_FABRIC=fabric-3
export NB_GPU_CLUSTER_ID=$(nebius compute gpu-cluster create \
  --name gpu-cluster-name \
  --infiniband-fabric $INFINIBAND_FABRIC \
  --parent-id $PROJECT_ID \
  --format json \
  | jq -r ".metadata.id")
nebius mk8s node-group create \
  --parent-id $NB_K8S_CLUSTER_ID \
  --name infini-ib-group \
  --fixed-node-count 2 \
  --template-resources-platform gpu-h100-sxm \
  --template-resources-preset 8gpu-128vcpu-1600gb \
  --template-gpu-cluster-id $NB_GPU_CLUSTER_ID \
  --template-gpu-settings-drivers-preset cuda12
Refer to the Nebius documentation for how to select the fabric according to the type of GPUs you are going to use.
Set up the kubeconfig and verify the NVIDIA GPUs are detected:
nebius mk8s cluster get-credentials --id $NB_K8S_CLUSTER_ID --external
sky check k8s
Note: To create a node group with a GPU cluster, you need to specify a compatible preset (number of GPUs and vCPUs, RAM size). The compatible platforms and presets are as follows:

| Platform | Presets | Regions |
|---|---|---|
| NVIDIA® H100 NVLink with Intel Sapphire Rapids (gpu-h100-sxm) | 8gpu-128vcpu-1600gb | eu-north1 |
| NVIDIA® H200 NVLink with Intel Sapphire Rapids (gpu-h200-sxm) | 8gpu-128vcpu-1600gb | eu-north1, eu-west1, us-central1 |
You now have a Kubernetes cluster whose GPUs are interconnected with InfiniBand.
Included files#
infiniband.yaml
# This example is used to test the InfiniBand
# connection between two VMs.
resources:
  cloud: nebius
  region: eu-north1
  accelerators: H100:8
  network_tier: best

num_nodes: 2

setup: |
  sudo apt install perftest -y

run: |
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    ib_send_bw --report_gbits -n 1000 -F > /dev/null
  elif [ "${SKYPILOT_NODE_RANK}" == "1" ]; then
    echo "MASTER_ADDR: $MASTER_ADDR"
    sleep 2  # wait for the master to start
    ib_send_bw $MASTER_ADDR --report_gbits -n 1000 -F
  fi
nccl.yaml
# This example is used to test the NCCL performance with
# InfiniBand on managed Nebius Kubernetes cluster.
name: nccl-test

resources:
  infra: k8s
  accelerators: H100:8
  image_id: docker:cr.eu-north1.nebius.cloud/nebius-benchmarks/nccl-tests:2.23.4-ubu22.04-cu12.4

num_nodes: 2

run: |
  export NCCL_IB_HCA=mlx5
  export UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    echo "Head node"
    # Total number of processes; NP should be the total number of GPUs in the cluster
    NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))
    # Append :${SKYPILOT_NUM_GPUS_PER_NODE} to each IP as slots
    nodes=""
    for ip in $SKYPILOT_NODE_IPS; do
      nodes="${nodes}${ip}:${SKYPILOT_NUM_GPUS_PER_NODE},"
    done
    nodes=${nodes::-1}
    echo "All nodes: ${nodes}"
    mpirun \
      --allow-run-as-root \
      --tag-output \
      -H $nodes \
      -np $NP \
      -N $SKYPILOT_NUM_GPUS_PER_NODE \
      --bind-to none \
      -x PATH \
      -x LD_LIBRARY_PATH \
      -x NCCL_DEBUG=INFO \
      -x NCCL_SOCKET_IFNAME=eth0 \
      -x NCCL_IB_HCA \
      -x UCX_NET_DEVICES \
      -x SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1 \
      -x NCCL_COLLNET_ENABLE=0 \
      /opt/nccl-tests/build/all_reduce_perf \
      -b 512M \
      -e 8G \
      -f 2 \
      -g 1 \
      -c 1 \
      -w 5 \
      -n 10
  else
    echo "Worker nodes"
  fi

config:
  kubernetes:
    pod_config:
      spec:
        containers:
          - securityContext:
              capabilities:
                add:
                  - IPC_LOCK
nccl_network_tier.yaml
# This example is used to test the NCCL performance with
# InfiniBand on managed Nebius Kubernetes cluster.
name: nccl-network-tier

resources:
  infra: k8s
  accelerators: H100:8
  image_id: docker:cr.eu-north1.nebius.cloud/nebius-benchmarks/nccl-tests:2.23.4-ubu22.04-cu12.4
  network_tier: best

num_nodes: 1

run: |
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    echo "Head node"
    # Total number of processes; NP should be the total number of GPUs in the cluster
    NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))
    # Append :${SKYPILOT_NUM_GPUS_PER_NODE} to each IP as slots
    nodes=""
    for ip in $SKYPILOT_NODE_IPS; do
      nodes="${nodes}${ip}:${SKYPILOT_NUM_GPUS_PER_NODE},"
    done
    nodes=${nodes::-1}
    echo "All nodes: ${nodes}"
    mpirun \
      --allow-run-as-root \
      --tag-output \
      -H $nodes \
      -np $NP \
      -N $SKYPILOT_NUM_GPUS_PER_NODE \
      --bind-to none \
      -x PATH \
      -x LD_LIBRARY_PATH \
      -x NCCL_DEBUG=INFO \
      -x NCCL_SOCKET_IFNAME=eth0 \
      -x NCCL_IB_HCA \
      -x UCX_NET_DEVICES \
      -x SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1 \
      -x NCCL_COLLNET_ENABLE=0 \
      /opt/nccl-tests/build/all_reduce_perf \
      -b 512M \
      -e 8G \
      -f 2 \
      -g 1 \
      -c 1 \
      -w 5 \
      -n 10
  else
    echo "Worker nodes"
  fi
nccl_no_docker_ib.yaml
# This example is used to test the NCCL performance
# with InfiniBand on Nebius VMs.
resources:
  cloud: nebius
  region: us-central1
  accelerators: H200:8
  network_tier: best

num_nodes: 2

setup: |
  sudo apt-get update
  sudo apt-get install -y iproute2 wget curl build-essential git

  # Install OpenMPI first (required for NCCL tests)
  echo "Installing OpenMPI..."
  sudo apt-get install -y openmpi-bin openmpi-common libopenmpi-dev

  # Install CUDA toolkit
  if ! command -v nvcc &> /dev/null; then
    echo "Installing CUDA toolkit..."
    wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
    sudo sh cuda_12.4.0_550.54.14_linux.run --silent --toolkit
    # Add CUDA to PATH and LD_LIBRARY_PATH
    echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
    echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
    export PATH=/usr/local/cuda/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
  else
    echo "CUDA already installed, skipping..."
    export PATH=/usr/local/cuda/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
  fi

  # Install NCCL (use sudo to remove if exists)
  sudo rm -rf /usr/local/nccl
  echo "Installing NCCL..."
  cd /tmp
  wget https://github.com/NVIDIA/nccl/archive/v2.23.4-1.tar.gz
  tar -xzf v2.23.4-1.tar.gz
  cd nccl-2.23.4-1
  make -j src.build CUDA_HOME=/usr/local/cuda
  sudo mkdir -p /usr/local/nccl
  sudo cp -r build/* /usr/local/nccl/
  echo 'export LD_LIBRARY_PATH=/usr/local/nccl/lib:$LD_LIBRARY_PATH' >> ~/.bashrc
  export LD_LIBRARY_PATH=/usr/local/nccl/lib:$LD_LIBRARY_PATH

  # Build NCCL tests
  if [ ! -f $HOME/nccl-tests/build/all_reduce_perf ]; then
    echo "Building NCCL tests..."
    cd $HOME
    git clone https://github.com/NVIDIA/nccl-tests.git
    cd nccl-tests
    # Ensure environment variables are set for compilation
    export PATH=/usr/local/cuda/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/nccl/lib:$LD_LIBRARY_PATH
    # Get MPI compile flags from mpicc
    MPI_CFLAGS=$(mpicc -showme:compile)
    MPI_LDFLAGS=$(mpicc -showme:link)
    MPI_HOME=$(dirname $(dirname $(which mpicc)))
    # Build with MPI flags from mpicc if not already built
    make MPI=1 \
      MPI_HOME="$MPI_HOME" \
      CUDA_HOME=/usr/local/cuda \
      NCCL_HOME=/usr/local/nccl \
      CPPFLAGS="$MPI_CFLAGS" \
      NVCCFLAGS="$MPI_CFLAGS" \
      LDFLAGS="$MPI_LDFLAGS" \
      -j
  fi

run: |
  # Load environment
  export PATH=/usr/local/cuda/bin:$PATH
  export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/nccl/lib:$LD_LIBRARY_PATH
  export SSH_PORT=$(ss -tlnp | grep sshd | awk '{print $4}' | cut -d':' -f2)
  export NCCL_SOCKET_IFNAME=$(ip -o -4 route show to default | awk '{print $5}')
  export NCCL_IB_HCA=mlx5
  export UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1

  # Debug SSH port detection
  echo "=== SSH Port Detection Debug ==="
  echo "ss -tlnp | grep sshd output:"
  ss -tlnp | grep sshd
  echo "awk '{print \$4}' output:"
  ss -tlnp | grep sshd | awk '{print $4}'
  echo "Final SSH_PORT: $SSH_PORT"

  # Fix SSH port detection
  SSH_PORT_RAW=$(ss -tlnp | grep sshd | awk '{print $4}' | head -1)
  export SSH_PORT=$(echo "$SSH_PORT_RAW" | sed 's/.*://')
  if [ -z "$SSH_PORT" ]; then
    export SSH_PORT=22
  fi
  echo "Corrected SSH_PORT: $SSH_PORT"
  echo "================================="

  # Total number of processes; NP should be the total number of GPUs in the cluster
  NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

  # Create nodes list with slots for MPI
  nodes=""
  for ip in $SKYPILOT_NODE_IPS; do
    nodes="${nodes}${ip}:${SKYPILOT_NUM_GPUS_PER_NODE},"
  done
  nodes=${nodes::-1}
  echo "Node list: $nodes"
  echo "Total processes: $NP"

  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    echo "Starting NCCL all-reduce test on ${SKYPILOT_NUM_NODES} nodes with ${NP} total GPUs..."
    mpirun \
      --allow-run-as-root \
      -H $nodes \
      -np $NP \
      -N $SKYPILOT_NUM_GPUS_PER_NODE \
      -bind-to none \
      -x LD_LIBRARY_PATH \
      -x NCCL_SOCKET_IFNAME \
      -x NCCL_IB_HCA \
      -x NCCL_ALGO=NVLSTree \
      -x UCX_NET_DEVICES \
      -x SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1 \
      -x NCCL_COLLNET_ENABLE=0 \
      --mca plm_rsh_args "-p $SSH_PORT" \
      $HOME/nccl-tests/build/all_reduce_perf \
      -b 512M \
      -e 8G \
      -f 2 \
      -g 1
  else
    echo "worker node"
  fi
nccl_no_ib.yaml
# This example is used to test the NCCL performance without
# InfiniBand on managed Nebius Kubernetes cluster.
name: nccl-test

resources:
  infra: k8s
  accelerators: H100:8
  image_id: docker:cr.eu-north1.nebius.cloud/nebius-benchmarks/nccl-tests:2.23.4-ubu22.04-cu12.4

num_nodes: 2

run: |
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    echo "Head node"
    # Total number of processes; NP should be the total number of GPUs in the cluster
    NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))
    # Append :${SKYPILOT_NUM_GPUS_PER_NODE} to each IP as slots
    nodes=""
    for ip in $SKYPILOT_NODE_IPS; do
      nodes="${nodes}${ip}:${SKYPILOT_NUM_GPUS_PER_NODE},"
    done
    nodes=${nodes::-1}
    echo "All nodes: ${nodes}"
    export NCCL_IB_HCA=""
    export UCX_NET_DEVICES="eth0"
    mpirun \
      --allow-run-as-root \
      --tag-output \
      -H $nodes \
      -np $NP \
      -N $SKYPILOT_NUM_GPUS_PER_NODE \
      --bind-to none \
      -x PATH \
      -x LD_LIBRARY_PATH \
      -x NCCL_DEBUG=INFO \
      -x NCCL_SOCKET_IFNAME=eth0 \
      -x NCCL_IB_HCA \
      -x UCX_NET_DEVICES \
      -x SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1 \
      -x NCCL_COLLNET_ENABLE=0 \
      /opt/nccl-tests/build/all_reduce_perf \
      -b 512M \
      -e 8G \
      -f 2 \
      -g 1 \
      -c 1 \
      -w 5 \
      -n 10
  else
    echo "Worker nodes"
  fi
nccl_vm_ib.yaml
# This example is used to test the NCCL performance
# with InfiniBand on Nebius VMs.
resources:
  cloud: nebius
  region: us-central1
  accelerators: H100:8
  image_id: docker:cr.eu-north1.nebius.cloud/nebius-benchmarks/nccl-tests:2.23.4-ubu22.04-cu12.4
  network_tier: best

num_nodes: 2

setup: |
  sudo apt-get install -y iproute2

run: |
  # sshd listens on port 10022 in containers on Nebius VMs
  export SSH_PORT=$(ss -tlnp | grep sshd | awk '{print $4}' | cut -d':' -f2)
  export NCCL_SOCKET_IFNAME=$(ip -o -4 route show to default | awk '{print $5}')

  # Total number of processes; NP should be the total number of GPUs in the cluster
  NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

  # Append :${SKYPILOT_NUM_GPUS_PER_NODE} to each IP as slots
  nodes=""
  for ip in $SKYPILOT_NODE_IPS; do
    nodes="${nodes}${ip}:${SKYPILOT_NUM_GPUS_PER_NODE},"
  done
  nodes=${nodes::-1}

  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    mpirun \
      --allow-run-as-root \
      -H $nodes \
      -np $NP \
      -N $SKYPILOT_NUM_GPUS_PER_NODE \
      -bind-to none \
      -x LD_LIBRARY_PATH \
      -x NCCL_SOCKET_IFNAME \
      -x NCCL_IB_HCA \
      -x NCCL_ALGO=NVLSTree \
      -x UCX_NET_DEVICES \
      -x SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1 \
      -x NCCL_COLLNET_ENABLE=0 \
      --mca plm_rsh_args "-p $SSH_PORT" \
      /opt/nccl-tests/build/all_reduce_perf \
      -b 512M \
      -e 8G \
      -f 2 \
      -g 1
  else
    echo "worker node"
  fi