Source: examples/distributed-pytorch
Distributed Training with PyTorch#
This example demonstrates how to run distributed training with PyTorch using SkyPilot.
The example is based on PyTorch’s official minGPT example.
Overview#
There are two ways to run distributed training with PyTorch:
Using normal torchrun
Using the rdzv backend
For fixed-size distributed training, the main difference between the two is that the rdzv backend automatically handles the rank for each node, while plain torchrun requires the rank to be set manually.
SkyPilot offers convenient built-in environment variables to help you start distributed training easily.
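As a minimal sketch (the values shown match the 2-node, 2-GPU setup below), the run sections in this example derive everything torchrun needs from these variables:
# SKYPILOT_NODE_IPS: newline-separated IPs of all nodes; the head node comes first
# SKYPILOT_NODE_RANK: this node's rank (0 for the head node)
# SKYPILOT_NUM_NODES: total number of nodes, e.g. 2
# SKYPILOT_NUM_GPUS_PER_NODE: number of GPUs on each node, e.g. 2
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
echo "Node rank $SKYPILOT_NODE_RANK of $SKYPILOT_NUM_NODES, head node: $MASTER_ADDR"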
Using normal torchrun#
The following command spawns 2 nodes, each with 2 L4 GPUs:
sky launch -c train train.yaml
In train.yaml, we use torchrun to launch the training and set the arguments for distributed training using environment variables provided by SkyPilot.
run: |
cd examples/mingpt
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
torchrun \
--nnodes=$SKYPILOT_NUM_NODES \
--nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
--master_addr=$MASTER_ADDR \
--master_port=8008 \
--node_rank=${SKYPILOT_NODE_RANK} \
main.py
Or, run the equivalent code using the Python SDK:
python sdk_scripts/train.py
Using rdzv backend#
rdzv is an alternative backend for distributed training:
sky launch -c train-rdzv train-rdzv.yaml
In train-rdzv.yaml, we use torchrun to launch the training and set the arguments for distributed training using environment variables provided by SkyPilot.
run: |
cd examples/mingpt
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
echo "Starting distributed training, head node: $MASTER_ADDR"
torchrun \
--nnodes=$SKYPILOT_NUM_NODES \
--nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
--rdzv_backend=c10d \
--rdzv_endpoint=$MASTER_ADDR:29500 \
--rdzv_id $SKYPILOT_TASK_ID \
main.py
To run the equivalent code using the Python SDK, run:
python sdk_scripts/train_rdzv.py
Scale up#
If you would like to scale up the training, simply change the resource requirements, and SkyPilot's built-in environment variables will be set accordingly.
For example, the following command spawns 4 nodes, each with 4 L4 GPUs:
sky launch -c train train.yaml --num-nodes 4 --gpus L4:4 --cpus 8+
We also increase --cpus to 8+ to avoid the CPU becoming a performance bottleneck.
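No YAML changes are required: with this command, SKYPILOT_NUM_NODES and SKYPILOT_NUM_GPUS_PER_NODE are both set to 4, so the unchanged run section effectively executes:
torchrun \
--nnodes=4 \
--nproc_per_node=4 \
--master_addr=$MASTER_ADDR \
--master_port=8008 \
--node_rank=${SKYPILOT_NODE_RANK} \
main.py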
Included files#
sdk_scripts/train.py
"""Distributed training example with PyTorch.
Usage:
python train.py
"""
import subprocess
import sky
task = sky.Task(
name='minGPT-ddp',
resources=sky.Resources(
cpus='4+',
accelerators='L4:2',
),
num_nodes=2,
setup=[
'git clone --depth 1 https://github.com/pytorch/examples || true',
'cd examples',
('git filter-branch --prune-empty '
'--subdirectory-filter distributed/minGPT-ddp'),
'uv venv --python 3.10',
'source .venv/bin/activate',
('uv pip install -r requirements.txt "numpy<2" "torch==2.7.1+cu118" '
'--extra-index-url https://download.pytorch.org/whl/cu118'),
],
run=[
'cd examples',
'source .venv/bin/activate',
'cd mingpt',
'export LOGLEVEL=INFO',
'MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)',
'echo "Starting distributed training, head node: $MASTER_ADDR"',
# Explicit check for torchrun
'if ! command -v torchrun >/dev/null 2>&1; then',
'echo "ERROR: torchrun command not found" >&2'
'exit 1',
'fi',
('torchrun '
'--nnodes=$SKYPILOT_NUM_NODES '
'--nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE '
'--master_addr=$MASTER_ADDR '
'--master_port=8008 '
'--node_rank=${SKYPILOT_NODE_RANK} '
'main.py'),
],
)
# Alternatively, load in the cluster YAML from a file
# task = sky.Task.from_yaml('../train.yaml')
cluster_name = 'train'
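# sky.launch submits the launch request; sky.stream_and_get streams its logs
# and blocks until it finishes, returning the job ID and cluster handle.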
job_id, _ = sky.stream_and_get(sky.launch(task, cluster_name=cluster_name))
sky.tail_logs(cluster_name, job_id, follow=True)
print('Training completed. Downloading checkpoint...')
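# SkyPilot sets up an SSH host alias for the cluster name, so scp can
# fetch the checkpoint from the cluster directly.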
subprocess.run(
(f'scp {cluster_name}:~/sky_workdir/examples/mingpt/gpt_snapshot.pt '
'gpt_snapshot.pt'),
shell=True,
check=True)
print('Checkpoint downloaded.')
print(f'Tearing down cluster {cluster_name}...')
sky.stream_and_get(sky.down(cluster_name))
print(f'Cluster {cluster_name} torn down.')
sdk_scripts/train_rdzv.py
"""Distributed training example with PyTorch using `rdzv` backend.
Usage:
python train_rdzv.py
"""
import subprocess
import sky
task = sky.Task(
name='minGPT-ddp',
resources=sky.Resources(
cpus='4+',
accelerators='L4:2',
),
num_nodes=2,
setup=[
'git clone --depth 1 https://github.com/pytorch/examples || true',
'cd examples',
('git filter-branch --prune-empty '
'--subdirectory-filter distributed/minGPT-ddp'),
'uv venv --python 3.10',
'source .venv/bin/activate',
('uv pip install -r requirements.txt "numpy<2" "torch==2.7.1+cu118" '
'--extra-index-url https://download.pytorch.org/whl/cu118'),
],
run=[
'cd examples',
'source .venv/bin/activate',
'cd mingpt',
'export LOGLEVEL=INFO',
'MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)',
'echo "Starting distributed training, head node: $MASTER_ADDR"',
('torchrun '
'--nnodes=$SKYPILOT_NUM_NODES '
'--nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE '
'--rdzv_backend=c10d '
'--rdzv_endpoint=$MASTER_ADDR:29500 '
'--rdzv_id $SKYPILOT_TASK_ID '
'main.py'),
],
)
# Alternatively, load in the cluster YAML from a file
# task = sky.Task.from_yaml('../train_rdzv.yaml')
cluster_name = 'train-rdzv'
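# sky.launch submits the launch request; sky.stream_and_get streams its logs
# and blocks until it finishes, returning the job ID and cluster handle.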
job_id, _ = sky.stream_and_get(sky.launch(task, cluster_name=cluster_name))
sky.tail_logs(cluster_name, job_id, follow=True)
print('Training completed. Downloading checkpoint...')
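# SkyPilot sets up an SSH host alias for the cluster name, so scp can
# fetch the checkpoint from the cluster directly.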
subprocess.run(
(f'scp {cluster_name}:~/sky_workdir/examples/mingpt/gpt_snapshot.pt '
'gpt_snapshot.pt'),
shell=True,
check=True)
print('Checkpoint downloaded.')
print(f'Tearing down cluster {cluster_name}...')
sky.stream_and_get(sky.down(cluster_name))
print(f'Cluster {cluster_name} torn down.')
train-rdzv.yaml
name: minGPT-ddp-rdzv
resources:
cpus: 4+
accelerators: L4:2
num_nodes: 2
setup: |
git clone --depth 1 https://github.com/pytorch/examples || true
cd examples
git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
# SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
uv venv --python 3.10
source .venv/bin/activate
uv pip install -r requirements.txt "numpy<2" "torch==2.7.1+cu118" --extra-index-url https://download.pytorch.org/whl/cu118
run: |
cd examples
source .venv/bin/activate
cd mingpt
export LOGLEVEL=INFO
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
echo "Starting distributed training, head node: $MASTER_ADDR"
torchrun \
--nnodes=$SKYPILOT_NUM_NODES \
--nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
--rdzv_backend=c10d \
--rdzv_endpoint=$MASTER_ADDR:29500 \
--rdzv_id $SKYPILOT_TASK_ID \
main.py
train.yaml
name: minGPT-ddp
resources:
cpus: 4+
accelerators: L4:2
num_nodes: 2
setup: |
git clone --depth 1 https://github.com/pytorch/examples || true
cd examples
git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
uv venv --python 3.10
source .venv/bin/activate
uv pip install -r requirements.txt "numpy<2" "torch==2.7.1+cu118" --extra-index-url https://download.pytorch.org/whl/cu118
run: |
cd examples
source .venv/bin/activate
cd mingpt
export LOGLEVEL=INFO
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
echo "Starting distributed training, head node: $MASTER_ADDR"
# Explicit check for torchrun
if ! command -v torchrun >/dev/null 2>&1; then
echo "ERROR: torchrun command not found" >&2
exit 1
fi
torchrun \
--nnodes=$SKYPILOT_NUM_NODES \
--nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
--master_addr=$MASTER_ADDR \
--master_port=8008 \
--node_rank=${SKYPILOT_NODE_RANK} \
main.py