Source: examples/nebius_infiniband
Using InfiniBand in Nebius with SkyPilot#
Set up InfiniBand with a single SkyPilot configuration#
SkyPilot provides the network_tier: best
configuration option that automatically enables InfiniBand support on Nebius Kubernetes clusters and Nebius VMs. This eliminates the need for manual configuration of security contexts and environment variables.
InfiniBand on Nebius managed Kubernetes clusters#
Simply add network_tier: best
to your resources specification:
resources:
  infra: k8s
  accelerators: H100:8
  network_tier: best
To create a Nebius Kubernetes cluster with InfiniBand enabled, check the Appendix.
End-to-end Example#
Check the nccl_network_tier.yaml
for a complete example using the simplified configuration:
sky launch -c nccl_network_tier nccl_network_tier.yaml
This enables InfiniBand for inter-GPU communication, and SkyPilot automatically sets up the required environment variables for you.
Equivalent way to turn on InfiniBand manually
On a Nebius managed Kubernetes cluster, you can also turn on InfiniBand manually:
Set the following config in your SkyPilot task YAML to enable InfiniBand:
config:
  kubernetes:
    pod_config:
      spec:
        containers:
          - securityContext:
              capabilities:
                add:
                  - IPC_LOCK
Configure the environment variables in your task:
run: |
  export NCCL_IB_HCA=mlx5
  export UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1
  ... your own run script ...
See nccl.yaml for more details.
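The UCX_NET_DEVICES list above can also be generated rather than hard-coded. A small sketch, assuming the HCAs are named mlx5_0 through mlx5_7 as in the export above:

```shell
# Build the UCX_NET_DEVICES list for 8 HCAs (mlx5_0 .. mlx5_7), port 1 each.
devices=""
for i in 0 1 2 3 4 5 6 7; do
  devices="${devices}mlx5_${i}:1,"
done
export UCX_NET_DEVICES="${devices%,}"   # drop the trailing comma
echo "$UCX_NET_DEVICES"
```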
Running NCCL test using SkyPilot#
Check the nccl_network_tier.yaml
for the complete SkyPilot cluster YAML configuration.
The image_id
provides the environment setup for NCCL (NVIDIA Collective Communications Library).
To run the NCCL test with InfiniBand support:
sky launch -c infiniband nccl_network_tier.yaml
SkyPilot will:
Schedule the job on a Kubernetes cluster with required GPU nodes
Launch Pods and execute the NCCL performance test
Output performance metrics showing the benefits of InfiniBand for distributed training
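The run scripts in the included YAML files build the mpirun host list from environment variables that SkyPilot injects at runtime. A standalone sketch of that logic, with hypothetical IPs substituted for the injected values (here `${nodes%,}` is a POSIX-portable equivalent of the `${nodes::-1}` used in the YAMLs):

```shell
# SkyPilot injects these at runtime; hypothetical example values here.
SKYPILOT_NODE_IPS="10.0.0.1
10.0.0.2"
SKYPILOT_NUM_GPUS_PER_NODE=8

# Append ":<gpus-per-node>" to each node IP as its MPI slot count.
nodes=""
for ip in $SKYPILOT_NODE_IPS; do
  nodes="${nodes}${ip}:${SKYPILOT_NUM_GPUS_PER_NODE},"
done
nodes=${nodes%,}   # drop the trailing comma
echo "$nodes"      # 10.0.0.1:8,10.0.0.2:8
```

The resulting string is passed to mpirun via `-H`, with the per-node slot counts telling MPI how many ranks to place on each host.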
An example result is shown below:
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
536870912 134217728 float sum -1 2432.7 220.69 413.79 0 2382.4 225.35 422.54 0
1073741824 268435456 float sum -1 4523.3 237.38 445.09 0 4518.9 237.61 445.52 0
2147483648 536870912 float sum -1 8785.8 244.43 458.30 0 8787.2 244.39 458.23 0
4294967296 1073741824 float sum -1 17404 246.79 462.73 0 17353 247.50 464.07 0
8589934592 2147483648 float sum -1 34468 249.21 467.28 0 34525 248.80 466.51 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 450.404
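As a sanity check on these numbers: for all_reduce, the NCCL tests derive bus bandwidth from algorithm bandwidth as busbw = algbw × 2(n−1)/n, where n is the total number of ranks (the busbw/algbw ratio of ≈1.875 above implies n = 16, i.e. 16 GPUs). Recomputing the largest-message row:

```shell
# busbw = algbw * 2*(n-1)/n for all_reduce; n = 16 ranks.
busbw=$(awk -v algbw=249.21 -v n=16 'BEGIN { printf "%.2f", algbw * 2 * (n - 1) / n }')
echo "busbw = ${busbw} GB/s"
```

This reproduces the reported 467.28 GB/s up to rounding of the printed algbw.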
NOTE: To run NCCL tests without InfiniBand, you can create the node group without a GPU cluster, then launch with nccl_no_ib.yaml, which has the config field removed:
sky launch -c no_infiniband nccl_no_ib.yaml
InfiniBand on Nebius VMs with SkyPilot#
While the previous section covered InfiniBand setup for the managed Kubernetes service, you can also enable InfiniBand directly on Nebius VMs. This approach gives you more flexibility and control over your infrastructure. For detailed instructions, refer to the Nebius documentation.
Automatic InfiniBand Setup with SkyPilot#
SkyPilot simplifies the process of setting up InfiniBand-enabled GPU clusters on Nebius VMs: when you launch with the appropriate configuration, SkyPilot automatically creates a GPU cluster with InfiniBand support and adds the VMs to it.
To enable automatic InfiniBand setup, simply request the best network tier in your SkyPilot YAML:
resources:
  network_tier: best
SkyPilot will automatically configure InfiniBand with the correct fabric for you. (Note that InfiniBand is only supported on two GPU types, H100:8 and H200:8; refer to the Nebius docs.)
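Putting it together, a minimal two-node VM task might look like the following sketch (region omitted and accelerator choice illustrative; see the included YAML files for complete, tested examples):

```yaml
resources:
  cloud: nebius
  accelerators: H200:8   # or H100:8 -- the two GPU types with InfiniBand support
  network_tier: best

num_nodes: 2
```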
Running Performance Tests#
You can verify your InfiniBand setup by running either of these tests:
NCCL Performance Test (using a prebuilt Docker image):
sky launch -c infiniband nccl_vm_ib.yaml
Result example:
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
536870912 134217728 float sum -1 2399.1 223.78 419.59 0 2354.3 228.04 427.57 0
1073741824 268435456 float sum -1 4469.9 240.22 450.41 0 4463.1 240.58 451.09 0
2147483648 536870912 float sum -1 8678.7 247.44 463.96 0 8667.1 247.77 464.57 0
4294967296 1073741824 float sum -1 17053 251.86 472.24 0 17112 250.99 470.60 0
8589934592 2147483648 float sum -1 33792 254.20 476.62 0 33735 254.63 477.42 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 457.407
NCCL Performance Test (building NCCL from source):
sky launch -c infiniband nccl_no_docker_ib.yaml
Result example:
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
536870912 134217728 float sum -1 2378.7 225.70 423.19 0 2358.1 227.68 426.89 0
1073741824 268435456 float sum -1 4464.5 240.50 450.95 0 4461.2 240.69 451.29 0
2147483648 536870912 float sum -1 8697.7 246.90 462.94 0 8699.8 246.84 462.83 0
4294967296 1073741824 float sum -1 17406 246.75 462.66 0 17185 249.93 468.62 0
8589934592 2147483648 float sum -1 33782 254.28 476.77 0 33732 254.65 477.48 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 456.361
InfiniBand Direct Test:
sky launch -c infiniband infiniband.yaml
Result example:
---------------------------------------------------------------------------------------
Send BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x4624 QPN 0x0127 PSN 0x45bd8e
remote address: LID 0x461b QPN 0x0127 PSN 0x1d3746
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
65536 1000 357.12 353.53 0.674308
---------------------------------------------------------------------------------------
Additional Resources#
The Nebius team maintains a comprehensive collection of example configurations in their ml-cookbook repository. These examples cover various use cases and can help you get started with different ML workloads on Nebius using SkyPilot.
Appendix: creating a Nebius Kubernetes cluster with InfiniBand enabled#
To enable InfiniBand for a Nebius Kubernetes cluster, you need to create a GPU node group with InfiniBand enabled. For more details, refer to the Nebius documentation.
Create a managed Kubernetes cluster, or bring your own.
Create a Nebius Kubernetes cluster:
export PROJECT_ID=your-project-id
export NB_SUBNET_ID=$(nebius vpc subnet list \
  --parent-id $PROJECT_ID \
  --format json \
  | jq -r '.items[0].metadata.id')
export NB_K8S_CLUSTER_ID=$(nebius mk8s cluster create \
  --name infini \
  --control-plane-version 1.30 \
  --control-plane-subnet-id $NB_SUBNET_ID \
  --control-plane-endpoints-public-endpoint=true \
  --parent-id=$PROJECT_ID \
  --format json | jq -r '.metadata.id')
Or, bring your own Kubernetes cluster
Find your Kubernetes cluster ID on the console or with the following command:
export PROJECT_ID=your-project-id
# Use the first cluster in the list
export NB_K8S_CLUSTER_ID=$(nebius mk8s cluster list \
  --parent-id $PROJECT_ID \
  --format json \
  | jq -r '.items[0].metadata.id')
To enable InfiniBand for a node group, you need to create a GPU cluster first, then specify the GPU cluster when creating the node group.
export INFINIBAND_FABRIC=fabric-3
export NB_GPU_CLUSTER_ID=$(nebius compute gpu-cluster create \
  --name gpu-cluster-name \
  --infiniband-fabric $INFINIBAND_FABRIC \
  --parent-id $PROJECT_ID \
  --format json \
  | jq -r ".metadata.id")
nebius mk8s node-group create \
  --parent-id $NB_K8S_CLUSTER_ID \
  --name infini-ib-group \
  --fixed-node-count 2 \
  --template-resources-platform gpu-h100-sxm \
  --template-resources-preset 8gpu-128vcpu-1600gb \
  --template-gpu-cluster-id $NB_GPU_CLUSTER_ID \
  --template-gpu-settings-drivers-preset cuda12
Refer to the Nebius documentation for how to select the fabric according to the type of GPUs you are going to use.
Set up the kubeconfig and verify the NVIDIA GPUs are detected:
nebius mk8s cluster get-credentials --id $NB_K8S_CLUSTER_ID --external
sky check k8s
Note: To create a node group with a GPU cluster, you need to specify a compatible preset (number of GPUs and vCPUs, RAM size). The compatible platforms and presets are as follows:

| Platform | Presets | Regions |
|---|---|---|
| NVIDIA® H100 NVLink with Intel Sapphire Rapids (gpu-h100-sxm) | 8gpu-128vcpu-1600gb | eu-north1 |
| NVIDIA® H200 NVLink with Intel Sapphire Rapids (gpu-h200-sxm) | 8gpu-128vcpu-1600gb | eu-north1, eu-west1, us-central1 |
You now have a Kubernetes cluster whose GPUs are interconnected with InfiniBand.
Included files#
infiniband.yaml
# This example is used to test the InfiniBand
# connection between two VMs.
resources:
  cloud: nebius
  region: eu-north1
  accelerators: H100:8
  network_tier: best

num_nodes: 2

setup: |
  sudo apt install perftest -y

run: |
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    ib_send_bw --report_gbits -n 1000 -F > /dev/null
  elif [ "${SKYPILOT_NODE_RANK}" == "1" ]; then
    echo "MASTER_ADDR: $MASTER_ADDR"
    sleep 2  # wait for the master to start
    ib_send_bw $MASTER_ADDR --report_gbits -n 1000 -F
  fi
nccl.yaml
# This example is used to test the NCCL performance with
# InfiniBand on managed Nebius Kubernetes cluster.
name: nccl-test

resources:
  infra: k8s
  accelerators: H100:8
  image_id: docker:cr.eu-north1.nebius.cloud/nebius-benchmarks/nccl-tests:2.23.4-ubu22.04-cu12.4

num_nodes: 2

run: |
  export NCCL_IB_HCA=mlx5
  export UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    echo "Head node"
    # Total number of processes; NP should be the total number of GPUs in the cluster
    NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))
    # Append :${SKYPILOT_NUM_GPUS_PER_NODE} to each IP as slots
    nodes=""
    for ip in $SKYPILOT_NODE_IPS; do
      nodes="${nodes}${ip}:${SKYPILOT_NUM_GPUS_PER_NODE},"
    done
    nodes=${nodes::-1}
    echo "All nodes: ${nodes}"
    mpirun \
      --allow-run-as-root \
      --tag-output \
      -H $nodes \
      -np $NP \
      -N $SKYPILOT_NUM_GPUS_PER_NODE \
      --bind-to none \
      -x PATH \
      -x LD_LIBRARY_PATH \
      -x NCCL_DEBUG=INFO \
      -x NCCL_SOCKET_IFNAME=eth0 \
      -x NCCL_IB_HCA \
      -x UCX_NET_DEVICES \
      -x SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1 \
      -x NCCL_COLLNET_ENABLE=0 \
      /opt/nccl-tests/build/all_reduce_perf \
      -b 512M \
      -e 8G \
      -f 2 \
      -g 1 \
      -c 1 \
      -w 5 \
      -n 10
  else
    echo "Worker nodes"
  fi

config:
  kubernetes:
    pod_config:
      spec:
        containers:
          - securityContext:
              capabilities:
                add:
                  - IPC_LOCK
nccl_network_tier.yaml
# This example is used to test the NCCL performance with
# InfiniBand on managed Nebius Kubernetes cluster.
name: nccl-network-tier

resources:
  infra: k8s
  accelerators: H100:8
  image_id: docker:cr.eu-north1.nebius.cloud/nebius-benchmarks/nccl-tests:2.23.4-ubu22.04-cu12.4
  network_tier: best

num_nodes: 1

run: |
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    echo "Head node"
    # Total number of processes; NP should be the total number of GPUs in the cluster
    NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))
    # Append :${SKYPILOT_NUM_GPUS_PER_NODE} to each IP as slots
    nodes=""
    for ip in $SKYPILOT_NODE_IPS; do
      nodes="${nodes}${ip}:${SKYPILOT_NUM_GPUS_PER_NODE},"
    done
    nodes=${nodes::-1}
    echo "All nodes: ${nodes}"
    mpirun \
      --allow-run-as-root \
      --tag-output \
      -H $nodes \
      -np $NP \
      -N $SKYPILOT_NUM_GPUS_PER_NODE \
      --bind-to none \
      -x PATH \
      -x LD_LIBRARY_PATH \
      -x NCCL_DEBUG=INFO \
      -x NCCL_SOCKET_IFNAME=eth0 \
      -x NCCL_IB_HCA \
      -x UCX_NET_DEVICES \
      -x SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1 \
      -x NCCL_COLLNET_ENABLE=0 \
      /opt/nccl-tests/build/all_reduce_perf \
      -b 512M \
      -e 8G \
      -f 2 \
      -g 1 \
      -c 1 \
      -w 5 \
      -n 10
  else
    echo "Worker nodes"
  fi
nccl_no_docker_ib.yaml
# This example is used to test the NCCL performance
# with InfiniBand on Nebius VMs.
resources:
  cloud: nebius
  region: us-central1
  accelerators: H200:8
  network_tier: best

num_nodes: 2

setup: |
  sudo apt-get update
  sudo apt-get install -y iproute2 wget curl build-essential git

  # Install OpenMPI first (required for NCCL tests)
  echo "Installing OpenMPI..."
  sudo apt-get install -y openmpi-bin openmpi-common libopenmpi-dev

  # Install CUDA toolkit
  if ! command -v nvcc &> /dev/null; then
    echo "Installing CUDA toolkit..."
    wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
    sudo sh cuda_12.4.0_550.54.14_linux.run --silent --toolkit
    # Add CUDA to PATH and LD_LIBRARY_PATH
    echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
    echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
    export PATH=/usr/local/cuda/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
  else
    echo "CUDA already installed, skipping..."
    export PATH=/usr/local/cuda/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
  fi

  # Install NCCL (use sudo to remove if exists)
  sudo rm -rf /usr/local/nccl
  echo "Installing NCCL..."
  cd /tmp
  wget https://github.com/NVIDIA/nccl/archive/v2.23.4-1.tar.gz
  tar -xzf v2.23.4-1.tar.gz
  cd nccl-2.23.4-1
  make -j src.build CUDA_HOME=/usr/local/cuda
  sudo mkdir -p /usr/local/nccl
  sudo cp -r build/* /usr/local/nccl/
  echo 'export LD_LIBRARY_PATH=/usr/local/nccl/lib:$LD_LIBRARY_PATH' >> ~/.bashrc
  export LD_LIBRARY_PATH=/usr/local/nccl/lib:$LD_LIBRARY_PATH

  # Build NCCL tests
  if [ ! -f $HOME/nccl-tests/build/all_reduce_perf ]; then
    echo "Building NCCL tests..."
    cd $HOME
    git clone https://github.com/NVIDIA/nccl-tests.git
    cd nccl-tests
    # Ensure environment variables are set for compilation
    export PATH=/usr/local/cuda/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/nccl/lib:$LD_LIBRARY_PATH
    # Get MPI compile flags from mpicc
    MPI_CFLAGS=$(mpicc -showme:compile)
    MPI_LDFLAGS=$(mpicc -showme:link)
    MPI_HOME=$(dirname $(dirname $(which mpicc)))
    # Build with MPI flags from mpicc if not already built
    make MPI=1 \
      MPI_HOME="$MPI_HOME" \
      CUDA_HOME=/usr/local/cuda \
      NCCL_HOME=/usr/local/nccl \
      CPPFLAGS="$MPI_CFLAGS" \
      NVCCFLAGS="$MPI_CFLAGS" \
      LDFLAGS="$MPI_LDFLAGS" \
      -j
  fi

run: |
  # Load environment
  export PATH=/usr/local/cuda/bin:$PATH
  export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/nccl/lib:$LD_LIBRARY_PATH
  export SSH_PORT=$(ss -tlnp | grep sshd | awk '{print $4}' | cut -d':' -f2)
  export NCCL_SOCKET_IFNAME=$(ip -o -4 route show to default | awk '{print $5}')
  export NCCL_IB_HCA=mlx5
  export UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1

  # Debug SSH port detection
  echo "=== SSH Port Detection Debug ==="
  echo "ss -tlnp | grep sshd output:"
  ss -tlnp | grep sshd
  echo "awk '{print \$4}' output:"
  ss -tlnp | grep sshd | awk '{print $4}'
  echo "Final SSH_PORT: $SSH_PORT"

  # Fix SSH port detection
  SSH_PORT_RAW=$(ss -tlnp | grep sshd | awk '{print $4}' | head -1)
  export SSH_PORT=$(echo "$SSH_PORT_RAW" | sed 's/.*://')
  if [ -z "$SSH_PORT" ]; then
    export SSH_PORT=22
  fi
  echo "Corrected SSH_PORT: $SSH_PORT"
  echo "================================="

  # Total number of processes; NP should be the total number of GPUs in the cluster
  NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

  # Create nodes list with slots for MPI
  nodes=""
  for ip in $SKYPILOT_NODE_IPS; do
    nodes="${nodes}${ip}:${SKYPILOT_NUM_GPUS_PER_NODE},"
  done
  nodes=${nodes::-1}
  echo "Node list: $nodes"
  echo "Total processes: $NP"

  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    echo "Starting NCCL all-reduce test on ${SKYPILOT_NUM_NODES} nodes with ${NP} total GPUs..."
    mpirun \
      --allow-run-as-root \
      -H $nodes \
      -np $NP \
      -N $SKYPILOT_NUM_GPUS_PER_NODE \
      -bind-to none \
      -x LD_LIBRARY_PATH \
      -x NCCL_SOCKET_IFNAME \
      -x NCCL_IB_HCA \
      -x NCCL_ALGO=NVLSTree \
      -x UCX_NET_DEVICES \
      -x SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1 \
      -x NCCL_COLLNET_ENABLE=0 \
      --mca plm_rsh_args "-p $SSH_PORT" \
      $HOME/nccl-tests/build/all_reduce_perf \
      -b 512M \
      -e 8G \
      -f 2 \
      -g 1
  else
    echo "worker node"
  fi
nccl_no_ib.yaml
# This example is used to test the NCCL performance without
# InfiniBand on managed Nebius Kubernetes cluster.
name: nccl-test

resources:
  infra: k8s
  accelerators: H100:8
  image_id: docker:cr.eu-north1.nebius.cloud/nebius-benchmarks/nccl-tests:2.23.4-ubu22.04-cu12.4

num_nodes: 2

run: |
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    echo "Head node"
    # Total number of processes; NP should be the total number of GPUs in the cluster
    NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))
    # Append :${SKYPILOT_NUM_GPUS_PER_NODE} to each IP as slots
    nodes=""
    for ip in $SKYPILOT_NODE_IPS; do
      nodes="${nodes}${ip}:${SKYPILOT_NUM_GPUS_PER_NODE},"
    done
    nodes=${nodes::-1}
    echo "All nodes: ${nodes}"
    export NCCL_IB_HCA=""
    export UCX_NET_DEVICES="eth0"
    mpirun \
      --allow-run-as-root \
      --tag-output \
      -H $nodes \
      -np $NP \
      -N $SKYPILOT_NUM_GPUS_PER_NODE \
      --bind-to none \
      -x PATH \
      -x LD_LIBRARY_PATH \
      -x NCCL_DEBUG=INFO \
      -x NCCL_SOCKET_IFNAME=eth0 \
      -x NCCL_IB_HCA \
      -x UCX_NET_DEVICES \
      -x SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1 \
      -x NCCL_COLLNET_ENABLE=0 \
      /opt/nccl-tests/build/all_reduce_perf \
      -b 512M \
      -e 8G \
      -f 2 \
      -g 1 \
      -c 1 \
      -w 5 \
      -n 10
  else
    echo "Worker nodes"
  fi
nccl_vm_ib.yaml
# This example is used to test the NCCL performance
# with InfiniBand on Nebius VMs.
resources:
  cloud: nebius
  region: us-central1
  accelerators: H100:8
  image_id: docker:cr.eu-north1.nebius.cloud/nebius-benchmarks/nccl-tests:2.23.4-ubu22.04-cu12.4
  network_tier: best

num_nodes: 2

setup: |
  sudo apt-get install -y iproute2

run: |
  # sshd listens on port 10022 in containers on Nebius VMs
  export SSH_PORT=$(ss -tlnp | grep sshd | awk '{print $4}' | cut -d':' -f2)
  export NCCL_SOCKET_IFNAME=$(ip -o -4 route show to default | awk '{print $5}')

  # Total number of processes; NP should be the total number of GPUs in the cluster
  NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

  # Append :${SKYPILOT_NUM_GPUS_PER_NODE} to each IP as slots
  nodes=""
  for ip in $SKYPILOT_NODE_IPS; do
    nodes="${nodes}${ip}:${SKYPILOT_NUM_GPUS_PER_NODE},"
  done
  nodes=${nodes::-1}

  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    mpirun \
      --allow-run-as-root \
      -H $nodes \
      -np $NP \
      -N $SKYPILOT_NUM_GPUS_PER_NODE \
      -bind-to none \
      -x LD_LIBRARY_PATH \
      -x NCCL_SOCKET_IFNAME \
      -x NCCL_IB_HCA \
      -x NCCL_ALGO=NVLSTree \
      -x UCX_NET_DEVICES \
      -x SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1 \
      -x NCCL_COLLNET_ENABLE=0 \
      --mca plm_rsh_args "-p $SSH_PORT" \
      /opt/nccl-tests/build/all_reduce_perf \
      -b 512M \
      -e 8G \
      -f 2 \
      -g 1
  else
    echo "worker node"
  fi