Source: examples/distributed-pytorch
Distributed Training with PyTorch#
This example demonstrates how to run distributed training with PyTorch using SkyPilot.
The example is based on PyTorch’s official minGPT example.
Overview#
There are two ways to run distributed training with PyTorch:
Using normal torchrun
Using the rdzv backend
For fixed-size distributed training, the main difference between the two is that the rdzv backend automatically handles the rank for each node, while plain torchrun requires the rank to be set manually.
SkyPilot offers convenient built-in environment variables to help you start distributed training easily.
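As a minimal sketch (the values shown match the 2-node, 2-GPU setup below), the run sections in this example derive everything torchrun needs from these variables:
# SKYPILOT_NODE_IPS: newline-separated IPs of all nodes; the head node comes first
# SKYPILOT_NODE_RANK: this node's rank (0 for the head node)
# SKYPILOT_NUM_NODES: total number of nodes, e.g. 2
# SKYPILOT_NUM_GPUS_PER_NODE: number of GPUs on each node, e.g. 2
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
echo "Node rank $SKYPILOT_NODE_RANK of $SKYPILOT_NUM_NODES, head node: $MASTER_ADDR"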
Using normal torchrun#
The following command spawns 2 nodes, each with 2 L4 GPUs:
sky launch -c train train.yaml
In train.yaml, we use torchrun to launch the training and set the arguments for distributed training using environment variables provided by SkyPilot.
run: |
cd examples/mingpt
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
torchrun \
--nnodes=$SKYPILOT_NUM_NODES \
--nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
--master_addr=$MASTER_ADDR \
--master_port=8008 \
--node_rank=${SKYPILOT_NODE_RANK} \
main.py
Or, run the equivalent code using the Python SDK:
python sdk_scripts/train.py
Using rdzv backend#
rdzv is an alternative backend for distributed training:
sky launch -c train-rdzv train-rdzv.yaml
In train-rdzv.yaml, we use torchrun to launch the training and set the arguments for distributed training using environment variables provided by SkyPilot.
run: |
cd examples/mingpt
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
echo "Starting distributed training, head node: $MASTER_ADDR"
torchrun \
--nnodes=$SKYPILOT_NUM_NODES \
--nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
--rdzv_backend=c10d \
--rdzv_endpoint=$MASTER_ADDR:29500 \
--rdzv_id $SKYPILOT_TASK_ID \
main.py
To run the equivalent code using the Python SDK, run:
python sdk_scripts/train_rdzv.py
Scale up#
If you would like to scale up the training, simply change the resource requirements, and SkyPilot's built-in environment variables will be set accordingly.
For example, the following command spawns 4 nodes, each with 4 L4 GPUs:
sky launch -c train train.yaml --num-nodes 4 --gpus L4:4 --cpus 8+
We also increase --cpus to 8+ to avoid the CPU becoming a performance bottleneck.
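No YAML changes are required: with this command, SKYPILOT_NUM_NODES and SKYPILOT_NUM_GPUS_PER_NODE are both set to 4, so the unchanged run section effectively executes:
torchrun \
--nnodes=4 \
--nproc_per_node=4 \
--master_addr=$MASTER_ADDR \
--master_port=8008 \
--node_rank=${SKYPILOT_NODE_RANK} \
main.py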
Included files#
sdk_scripts/train.py
"""Distributed training example with PyTorch.
Usage:
python train.py
"""
import subprocess
import sky
task = sky.Task(
name='minGPT-ddp',
resources=sky.Resources(
cpus='4+',
accelerators='L4:2',
),
num_nodes=2,
setup=[
'git clone --depth 1 https://github.com/pytorch/examples || true',
'cd examples',
('git filter-branch --prune-empty '
'--subdirectory-filter distributed/minGPT-ddp'),
'uv venv --python 3.10',
'source .venv/bin/activate',
('uv pip install -r requirements.txt "numpy<2" "torch==2.7.1+cu118" '
'--extra-index-url https://download.pytorch.org/whl/cu118'),
],
run=[
'cd examples',
'source .venv/bin/activate',
'cd mingpt',
'export LOGLEVEL=INFO',
'MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)',
'echo "Starting distributed training, head node: $MASTER_ADDR"',
# Explicit check for torchrun
'if ! command -v torchrun >/dev/null 2>&1; then',
'echo "ERROR: torchrun command not found" >&2'
'exit 1',
'fi',
('torchrun '
'--nnodes=$SKYPILOT_NUM_NODES '
'--nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE '
'--master_addr=$MASTER_ADDR '
'--master_port=8008 '
'--node_rank=${SKYPILOT_NODE_RANK} '
'main.py'),
],
)
# Alternatively, load in the cluster YAML from a file
# task = sky.Task.from_yaml('../train.yaml')
cluster_name = 'train'
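# sky.launch submits the launch request; sky.stream_and_get streams its logs
# and blocks until it finishes, returning the job ID and cluster handle.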
job_id, _ = sky.stream_and_get(sky.launch(task, cluster_name=cluster_name))
sky.tail_logs(cluster_name, job_id, follow=True)
print('Training completed. Downloading checkpoint...')
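# SkyPilot sets up an SSH host alias for the cluster name, so scp can
# fetch the checkpoint from the cluster directly.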
subprocess.run(
(f'scp {cluster_name}:~/sky_workdir/examples/mingpt/gpt_snapshot.pt '
'gpt_snapshot.pt'),
shell=True,
check=True)
print('Checkpoint downloaded.')
print(f'Tearing down cluster {cluster_name}...')
sky.stream_and_get(sky.down(cluster_name))
print(f'Cluster {cluster_name} torn down.')
sdk_scripts/train_rdzv.py
"""Distributed training example with PyTorch using `rdzv` backend.
Usage:
python train_rdzv.py
"""
import subprocess
import sky
task = sky.Task(
name='minGPT-ddp',
resources=sky.Resources(
cpus='4+',
accelerators='L4:2',
),
num_nodes=2,
setup=[
'git clone --depth 1 https://github.com/pytorch/examples || true',
'cd examples',
('git filter-branch --prune-empty '
'--subdirectory-filter distributed/minGPT-ddp'),
'uv venv --python 3.10',
'source .venv/bin/activate',
('uv pip install -r requirements.txt "numpy<2" "torch==2.7.1+cu118" '
'--extra-index-url https://download.pytorch.org/whl/cu118'),
],
run=[
'cd examples',
'source .venv/bin/activate',
'cd mingpt',
'export LOGLEVEL=INFO',
'MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)',
'echo "Starting distributed training, head node: $MASTER_ADDR"',
('torchrun '
'--nnodes=$SKYPILOT_NUM_NODES '
'--nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE '
'--rdzv_backend=c10d '
'--rdzv_endpoint=$MASTER_ADDR:29500 '
'--rdzv_id $SKYPILOT_TASK_ID '
'main.py'),
],
)
# Alternatively, load in the cluster YAML from a file
# task = sky.Task.from_yaml('../train_rdzv.yaml')
cluster_name = 'train-rdzv'
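# sky.launch submits the launch request; sky.stream_and_get streams its logs
# and blocks until it finishes, returning the job ID and cluster handle.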
job_id, _ = sky.stream_and_get(sky.launch(task, cluster_name=cluster_name))
sky.tail_logs(cluster_name, job_id, follow=True)
print('Training completed. Downloading checkpoint...')
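# SkyPilot sets up an SSH host alias for the cluster name, so scp can
# fetch the checkpoint from the cluster directly.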
subprocess.run(
(f'scp {cluster_name}:~/sky_workdir/examples/mingpt/gpt_snapshot.pt '
'gpt_snapshot.pt'),
shell=True,
check=True)
print('Checkpoint downloaded.')
print(f'Tearing down cluster {cluster_name}...')
sky.stream_and_get(sky.down(cluster_name))
print(f'Cluster {cluster_name} torn down.')
train-rdzv.yaml
name: minGPT-ddp-rdzv
resources:
cpus: 4+
accelerators: L4:2
num_nodes: 2
setup: |
git clone --depth 1 https://github.com/pytorch/examples || true
cd examples
git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
# SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
uv venv --python 3.10
source .venv/bin/activate
uv pip install -r requirements.txt "numpy<2" "torch==2.7.1+cu118" --extra-index-url https://download.pytorch.org/whl/cu118
run: |
cd examples
source .venv/bin/activate
cd mingpt
export LOGLEVEL=INFO
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
echo "Starting distributed training, head node: $MASTER_ADDR"
torchrun \
--nnodes=$SKYPILOT_NUM_NODES \
--nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
--rdzv_backend=c10d \
--rdzv_endpoint=$MASTER_ADDR:29500 \
--rdzv_id $SKYPILOT_TASK_ID \
main.py
train.yaml
name: minGPT-ddp
resources:
cpus: 4+
accelerators: L4:2
num_nodes: 2
setup: |
git clone --depth 1 https://github.com/pytorch/examples || true
cd examples
git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
uv venv --python 3.10
source .venv/bin/activate
uv pip install -r requirements.txt "numpy<2" "torch==2.7.1+cu118" --extra-index-url https://download.pytorch.org/whl/cu118
run: |
cd examples
source .venv/bin/activate
cd mingpt
export LOGLEVEL=INFO
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
echo "Starting distributed training, head node: $MASTER_ADDR"
# Explicit check for torchrun
if ! command -v torchrun >/dev/null 2>&1; then
echo "ERROR: torchrun command not found" >&2
exit 1
fi
torchrun \
--nnodes=$SKYPILOT_NUM_NODES \
--nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
--master_addr=$MASTER_ADDR \
--master_port=8008 \
--node_rank=${SKYPILOT_NODE_RANK} \
main.py