Source: examples/together_infiniband
Using InfiniBand in Together AI with SkyPilot#
SkyPilot provides the network_tier: best configuration option that automatically enables InfiniBand support on Together AI Kubernetes clusters. This eliminates the need for manual configuration of security contexts and environment variables.
InfiniBand on Together AI Kubernetes clusters#
Simply add network_tier: best to your resources specification:
resources:
  infra: k8s
  accelerators: H100:8
  network_tier: best
This enables InfiniBand for inter-GPU communication, and SkyPilot will automatically set up the required environment variables for you.
Running NCCL test using SkyPilot#
See nccl_network_tier.yaml below for the complete SkyPilot cluster YAML configuration.
The image_id provides the environment setup for NCCL (NVIDIA Collective Communications Library).
To run the NCCL test with InfiniBand support:
sky launch -c infiniband nccl_network_tier.yaml
SkyPilot will:
Schedule the job on the Kubernetes cluster with required GPU nodes
Launch Pods and execute the NCCL performance test
Output performance metrics showing the benefits of InfiniBand for distributed training
An example result is shown below:
#                                                             out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
   536870912     134217728     float     sum      -1   2407.5  222.99  418.12      0   2380.3  225.55  422.90      0
  1073741824     268435456     float     sum      -1   4524.3  237.33  444.99      0   4531.6  236.95  444.28      0
  2147483648     536870912     float     sum      -1   8787.5  244.38  458.21      0   8780.7  244.57  458.56      0
  4294967296    1073741824     float     sum      -1    17327   247.88  464.77      0    17328   247.86  464.74      0
  8589934592    2147483648     float     sum      -1    34462   249.26  467.36      0    34482   249.11  467.08      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 451.101
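For all-reduce, the reported bus bandwidth is derived from the algorithm bandwidth as busbw = algbw × 2(n−1)/n, where n is the total number of ranks (here 2 nodes × 8 GPUs = 16). A quick sanity check against the first row above:

```shell
# Verify busbw = algbw * 2*(n-1)/n for the 512 MiB out-of-place row.
n=16            # 2 nodes x 8 GPUs per node
algbw=222.99    # out-of-place algbw (GB/s) from the first row
awk -v a="$algbw" -v n="$n" 'BEGIN { printf "%.2f\n", a * 2 * (n - 1) / n }'
# prints 418.11, agreeing with the reported 418.12 GB/s up to
# rounding of the printed algbw value
```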
NOTE: To run the NCCL test without InfiniBand, you can launch a cluster with nccl_no_ib.yaml:
sky launch -c no_infiniband nccl_no_ib.yaml
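When comparing the two runs, the line to look at is the final "Avg bus bandwidth" summary. A small sketch that extracts it from a saved log (the file name `ib.log` is hypothetical; you could redirect `sky logs` output into it first):

```shell
# Extract the average bus bandwidth from a saved NCCL test log.
# 'ib.log' is a hypothetical file holding the job's output.
printf '# Avg bus bandwidth    : 451.101\n' > ib.log   # sample summary line
awk -F':' '/Avg bus bandwidth/ { gsub(/ /, "", $2); print $2 }' ib.log
# prints 451.101
```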
Included files#
nccl_network_tier.yaml
# This example is used to test the NCCL performance with
# InfiniBand on Together AI Kubernetes cluster.
name: nccl-network-tier
resources:
  infra: k8s
  accelerators: H100:8
  image_id: docker:nvcr.io/nvidia/pytorch:24.07-py3
  network_tier: best

num_nodes: 2
run: |
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    echo "Head node"
    # Total number of processes; NP should be the total number of GPUs in the cluster
    NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))
    # Append :${SKYPILOT_NUM_GPUS_PER_NODE} to each IP as slots
    nodes=""
    for ip in $SKYPILOT_NODE_IPS; do
      nodes="${nodes}${ip}:${SKYPILOT_NUM_GPUS_PER_NODE},"
    done
    nodes=${nodes::-1}
    echo "All nodes: ${nodes}"
    mpirun \
      --allow-run-as-root \
      --tag-output \
      -H $nodes \
      -np $NP \
      -N $SKYPILOT_NUM_GPUS_PER_NODE \
      --bind-to none \
      -x PATH \
      -x LD_LIBRARY_PATH \
      -x NCCL_DEBUG=INFO \
      -x NCCL_IB_HCA \
      -x UCX_NET_DEVICES \
      -x SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1 \
      -x NCCL_COLLNET_ENABLE=0 \
      /usr/local/bin/all_reduce_perf_mpi \
        -b 512M \
        -e 8G \
        -f 2 \
        -g 1 \
        -c 1 \
        -w 5 \
        -n 10
  else
    echo "Worker nodes"
  fi
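The head-node logic in the run section can be exercised locally with sample values; the IPs and counts below are placeholders standing in for the variables SkyPilot injects at runtime:

```shell
# Sketch of the NP and node-list computation from the run section,
# using placeholder values for the SkyPilot-provided variables.
SKYPILOT_NUM_NODES=2
SKYPILOT_NUM_GPUS_PER_NODE=8
SKYPILOT_NODE_IPS=$'10.0.0.1\n10.0.0.2'   # placeholder IPs

NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))
nodes=""
for ip in $SKYPILOT_NODE_IPS; do
  nodes="${nodes}${ip}:${SKYPILOT_NUM_GPUS_PER_NODE},"
done
nodes=${nodes%,}   # drop the trailing comma
echo "NP=${NP} nodes=${nodes}"
# prints: NP=16 nodes=10.0.0.1:8,10.0.0.2:8
```

The `ip:slots` pairs match the `-H` host-list format that mpirun expects.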
nccl_no_ib.yaml
# This example is used to test the NCCL performance without
# InfiniBand on Together AI Kubernetes cluster.
name: nccl-no-ib
resources:
  infra: k8s
  accelerators: H100:8
  image_id: docker:nvcr.io/nvidia/pytorch:24.07-py3

num_nodes: 2
run: |
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    echo "Head node"
    # Total number of processes; NP should be the total number of GPUs in the cluster
    NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))
    # Append :${SKYPILOT_NUM_GPUS_PER_NODE} to each IP as slots
    nodes=""
    for ip in $SKYPILOT_NODE_IPS; do
      nodes="${nodes}${ip}:${SKYPILOT_NUM_GPUS_PER_NODE},"
    done
    nodes=${nodes::-1}
    echo "All nodes: ${nodes}"
    export NCCL_IB_HCA=""
    export UCX_NET_DEVICES="eth0"
    mpirun \
      --allow-run-as-root \
      --tag-output \
      -H $nodes \
      -np $NP \
      -N $SKYPILOT_NUM_GPUS_PER_NODE \
      --bind-to none \
      -x PATH \
      -x LD_LIBRARY_PATH \
      -x NCCL_DEBUG=INFO \
      -x NCCL_IB_HCA \
      -x UCX_NET_DEVICES \
      -x SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1 \
      -x NCCL_COLLNET_ENABLE=0 \
      /usr/local/bin/all_reduce_perf_mpi \
        -b 512M \
        -e 8G \
        -f 2 \
        -g 1 \
        -c 1 \
        -w 5 \
        -n 10
  else
    echo "Worker nodes"
  fi