Source: examples/training/torchtitan
TorchTitan: Large-Scale LLM Training with SkyPilot#
TorchTitan is a PyTorch native platform for large-scale LLM training, featuring multi-dimensional parallelisms (FSDP2, Tensor/Pipeline/Context Parallel), distributed checkpointing, torch.compile, and Float8 support.
This example demonstrates how to run TorchTitan with SkyPilot on your Kubernetes clusters, or on any hyperscaler or neocloud, complementing TorchTitan's existing instructions for running on Slurm.
Quick start#
Here is how to train Llama 3.1 on 2 nodes, each with 8 H100 (or H200) GPUs:
# Install SkyPilot (if not already installed)
# More cloud setup instructions in: https://docs.skypilot.co/en/latest/getting-started/installation.html
pip install "skypilot[kubernetes,aws]" # or your cloud: [gcp], [azure], etc.
# Launch a cluster and start training
export HF_TOKEN=... # if using a gated model from the HF Hub
sky launch -c torchtitan-multinode torchtitan.yaml --env HF_TOKEN
# Tail logs
sky logs torchtitan-multinode
# Terminate the cluster when done
sky down torchtitan-multinode
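At any point you can check the state of the cluster with sky status (the same command is referenced in the comments of the YAML included below):
# Check cluster state, refreshing status from the underlying infrastructure
sky status --refresh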
Configuration#
The provided torchtitan.yaml configuration:
Sets up a 2-node cluster with 8 H100 (or H200) GPUs per node
Installs PyTorch nightly and TorchTitan requirements
Downloads the Llama 3.1 tokenizer
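If your nodes have a different GPU type or count, you can override the accelerator request on the command line instead of editing the YAML. A minimal sketch, assuming you want 8x H200 per node:
# Override the accelerators requested by torchtitan.yaml at launch time
sky launch -c torchtitan-multinode torchtitan.yaml --env HF_TOKEN --gpus H200:8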
Available model configurations#
TorchTitan includes pre-configured training recipes for:
Llama 3.1 8B: llama3_8b.toml
Llama 3.1 70B: llama3_70b.toml
Llama 3.1 405B: llama3_405b.toml
Each configuration file specifies model architecture, parallelism strategies, and training hyperparameters optimized for different scales.
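To browse the full set of bundled recipes, you can list them from a local clone of the TorchTitan repo (a quick sketch; the /tmp/torchtitan clone location is arbitrary):
# List the Llama 3 training recipes shipped with TorchTitan
git clone https://github.com/pytorch/torchtitan.git /tmp/torchtitan
ls /tmp/torchtitan/torchtitan/models/llama3/train_configs/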
To use a specific training recipe, you can set it through the CONFIG_FILE env var:
sky launch -c torchtitan-multinode torchtitan.yaml \
--env HF_TOKEN \
--env CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_70b.toml  # path relative to the torchtitan repo root
Scaling Up#
To scale up your training, you can increase the number of nodes or try larger models:
# Scale to more nodes
sky launch -c torchtitan-8node torchtitan.yaml --num-nodes 8 --env HF_TOKEN
# Try different model sizes (update CONFIG_FILE in torchtitan.yaml)
sky launch -c torchtitan-llama3-70b torchtitan.yaml --env HF_TOKEN --env CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_70b.toml
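These two knobs compose. For example, a sketch of launching the 405B recipe across 8 nodes, assuming the 405B config sits alongside the 8B/70B ones (the node count here is illustrative; the full 405B model typically needs far more capacity):
# Combine a larger recipe with more nodes (node count is illustrative)
sky launch -c torchtitan-llama3-405b torchtitan.yaml \
  --num-nodes 8 \
  --env HF_TOKEN \
  --env CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_405b.toml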
Why SkyPilot for Distributed Training?#
Simple multi-node setup: SkyPilot automatically provides environment variables (SKYPILOT_NODE_RANK, SKYPILOT_NODE_IPS, etc.) that integrate seamlessly with PyTorch distributed training, so no manual networking configuration is needed.
Auto-recovery: Built-in fault tolerance automatically recovers from node failures and spot preemptions, resuming from checkpoints.
Easily run on Kubernetes or clouds without code changes: SkyPilot offers a simple interface to launch distributed training on any infrastructure with a single command:
sky launch --infra k8s torchtitan.yaml
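One way to get the auto-recovery behavior is to run the same YAML as a SkyPilot managed job, which relaunches the task after node failures or spot preemptions (a sketch; resuming from checkpoints depends on the checkpoint settings in your chosen recipe):
# Run as a managed job so SkyPilot handles retries on failures/preemptions
sky jobs launch torchtitan.yaml --env HF_TOKEN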
Multi-node training details#
The configuration automatically:
Detects the head node IP and sets it as the master address
Configures the correct node rank for each node
Sets up the distributed environment for PyTorch’s torchrun with key settings:
--nnodes: Uses $SKYPILOT_NUM_NODES to specify the total number of nodes
--nproc_per_node: Uses $SKYPILOT_NUM_GPUS_PER_NODE for GPUs per node
--node_rank: Uses $SKYPILOT_NODE_RANK to identify each node's position
--master_addr: Extracts the head node IP from $SKYPILOT_NODE_IPS
--master_port: Sets the communication port to 8008 for distributed coordination
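A minimal sketch of how those variables map onto the torchrun flags (this mirrors the run section of torchtitan.yaml shown below):
# Derive torchrun arguments from SkyPilot-provided environment variables
HEAD_NODE_IP=$(echo "$SKYPILOT_NODE_IPS" | head -n1)  # first IP in the list is the head node
echo "--nnodes=$SKYPILOT_NUM_NODES --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE"
echo "--node_rank=$SKYPILOT_NODE_RANK --master_addr=$HEAD_NODE_IP --master_port=8008"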
Included files#
torchtitan.yaml
# SkyPilot configuration for TorchTitan multi-node training
# This configuration reproduces the functionality of multinode_trainer.slurm
#
# To launch:
# sky launch -c torchtitan-cluster torchtitan.yaml
#
# To stop:
# sky down torchtitan-cluster
#
# To monitor:
# sky status --refresh
name: torchtitan-multinode
resources:
  accelerators: {H100:8, H200:8}
  disk_size: 1024GB

num_nodes: 2

workdir: .

envs:
  CONFIG_FILE: "./torchtitan/models/llama3/train_configs/llama3_8b.toml"
  HF_TOKEN: ""

setup: |
  git clone https://github.com/pytorch/torchtitan.git
  cd torchtitan
  pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126 --force-reinstall
  pip install -r requirements.txt
  python scripts/download_hf_assets.py --repo_id meta-llama/Llama-3.1-8B --assets tokenizer --hf_token=$HF_TOKEN
run: |
  # Get head node IP (first node in the list)
  HEAD_NODE_IP=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  echo "Head node IP: $HEAD_NODE_IP"

  # Run from the repo root so that `-m torchtitan.train` and $CONFIG_FILE resolve
  cd torchtitan

  # SKYPILOT_NODE_RANK is automatically set by SkyPilot
  torchrun \
    --nnodes $SKYPILOT_NUM_NODES \
    --nproc_per_node $SKYPILOT_NUM_GPUS_PER_NODE \
    --node_rank $SKYPILOT_NODE_RANK \
    --master_addr=$HEAD_NODE_IP \
    --master_port=8008 \
    -m torchtitan.train \
    --job.config_file $CONFIG_FILE \
    --training.dataset c4_test