Source: examples/training/torchtitan
TorchTitan: Large-Scale LLM Training with SkyPilot#
TorchTitan is a PyTorch native platform for large-scale LLM training, featuring multi-dimensional parallelisms (FSDP2, Tensor/Pipeline/Context Parallel), distributed checkpointing, torch.compile, and Float8 support.
This example demonstrates how to run TorchTitan with SkyPilot on your Kubernetes clusters, or on any hyperscaler or neocloud, complementing TorchTitan's existing instructions for running on Slurm.
Quick start#
Here is how to train Llama 3.1 on 2 nodes, each with 8 H100 (or H200) GPUs:
# Install SkyPilot (if not already installed)
# More cloud setup instructions in: https://docs.skypilot.co/en/latest/getting-started/installation.html
pip install "skypilot[kubernetes,aws]" # or your cloud: [gcp], [azure], etc.
# Launch a cluster and start training
export HF_TOKEN=... # if using a gated model from the HF Hub
sky launch -c torchtitan-multinode torchtitan.yaml --env HF_TOKEN
# Tail logs
sky logs torchtitan-multinode
# Terminate the cluster when done
sky down torchtitan-multinode
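At any point you can check the state of the cluster with sky status (the same command is referenced in the comments of the YAML included below):
# Check cluster state, refreshing status from the underlying infrastructure
sky status --refresh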
Configuration#
The provided torchtitan.yaml configuration:
Sets up a 2-node cluster with 8 H100 (or H200) GPUs per node
Installs PyTorch nightly and TorchTitan requirements
Downloads the Llama 3.1 tokenizer
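If your nodes have a different GPU type or count, you can override the accelerator request on the command line instead of editing the YAML. A minimal sketch, assuming you want 8x H200 per node:
# Override the accelerators requested by torchtitan.yaml at launch time
sky launch -c torchtitan-multinode torchtitan.yaml --env HF_TOKEN --gpus H200:8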
Available model configurations#
TorchTitan includes pre-configured training recipes for:
Llama 3.1 8B: llama3_8b.toml
Llama 3.1 70B: llama3_70b.toml
Llama 3.1 405B: llama3_405b.toml
Each configuration file specifies model architecture, parallelism strategies, and training hyperparameters optimized for different scales.
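To browse the full set of bundled recipes, you can list them from a local clone of the TorchTitan repo (a quick sketch; the /tmp/torchtitan clone location is arbitrary):
# List the Llama 3 training recipes shipped with TorchTitan
git clone https://github.com/pytorch/torchtitan.git /tmp/torchtitan
ls /tmp/torchtitan/torchtitan/models/llama3/train_configs/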
To use a specific training recipe, you can set it through the CONFIG_FILE env var:
sky launch -c torchtitan-multinode torchtitan.yaml \
--env HF_TOKEN \
--env CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_70b.toml  # path relative to the torchtitan repo root
Scaling Up#
To scale up your training, you can increase the number of nodes or try larger models:
# Scale to more nodes
sky launch -c torchtitan-8node torchtitan.yaml --num-nodes 8 --env HF_TOKEN
# Try different model sizes (update CONFIG_FILE in torchtitan.yaml)
sky launch -c torchtitan-llama3-70b torchtitan.yaml --env HF_TOKEN --env CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_70b.toml
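These two knobs compose. For example, a sketch of launching the 405B recipe across 8 nodes, assuming the 405B config sits alongside the 8B/70B ones (the node count here is illustrative; the full 405B model typically needs far more capacity):
# Combine a larger recipe with more nodes (node count is illustrative)
sky launch -c torchtitan-llama3-405b torchtitan.yaml \
  --num-nodes 8 \
  --env HF_TOKEN \
  --env CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_405b.toml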
Why SkyPilot for Distributed Training?#
Simple multi-node setup: SkyPilot automatically provides environment variables (SKYPILOT_NODE_RANK, SKYPILOT_NODE_IPS, etc.) that integrate seamlessly with PyTorch distributed training, so no manual networking configuration is needed.
Auto-recovery: Built-in fault tolerance automatically recovers from node failures and spot preemptions, resuming from checkpoints.
Easily run on Kubernetes or clouds without code changes: SkyPilot offers a simple interface to launch distributed training on any infrastructure with a single command:
sky launch --infra k8s torchtitan.yaml
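One way to get the auto-recovery behavior is to run the same YAML as a SkyPilot managed job, which relaunches the task after node failures or spot preemptions (a sketch; resuming from checkpoints depends on the checkpoint settings in your chosen recipe):
# Run as a managed job so SkyPilot handles retries on failures/preemptions
sky jobs launch torchtitan.yaml --env HF_TOKEN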
Multi-node training details#
The configuration automatically:
Detects the head node IP and sets it as the master address
Configures the correct node rank for each node
Sets up the distributed environment for PyTorch’s torchrun with key settings:
--nnodes: Uses $SKYPILOT_NUM_NODES to specify the total number of nodes
--nproc_per_node: Uses $SKYPILOT_NUM_GPUS_PER_NODE for GPUs per node
--node_rank: Uses $SKYPILOT_NODE_RANK to identify each node's position
--master_addr: Extracts the head node IP from $SKYPILOT_NODE_IPS
--master_port: Sets the communication port to 8008 for distributed coordination
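A minimal sketch of how those variables map onto the torchrun flags (this mirrors the run section of torchtitan.yaml shown below):
# Derive torchrun arguments from SkyPilot-provided environment variables
HEAD_NODE_IP=$(echo "$SKYPILOT_NODE_IPS" | head -n1)  # first IP in the list is the head node
echo "--nnodes=$SKYPILOT_NUM_NODES --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE"
echo "--node_rank=$SKYPILOT_NODE_RANK --master_addr=$HEAD_NODE_IP --master_port=8008"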
Included files#
torchtitan.yaml
# SkyPilot configuration for TorchTitan multi-node training
# This configuration reproduces the functionality of multinode_trainer.slurm
#
# To launch:
# sky launch -c torchtitan-cluster torchtitan.yaml
#
# To stop:
# sky down torchtitan-cluster
#
# To monitor:
# sky status --refresh
name: torchtitan-multinode
resources:
  accelerators: {H100:8, H200:8}
  disk_size: 1024GB

num_nodes: 2

workdir: .

envs:
  CONFIG_FILE: "./torchtitan/models/llama3/train_configs/llama3_8b.toml"
  HF_TOKEN: ""

setup: |
  git clone https://github.com/pytorch/torchtitan.git
  cd torchtitan
  pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126 --force-reinstall
  pip install -r requirements.txt
  python scripts/download_hf_assets.py --repo_id meta-llama/Llama-3.1-8B --assets tokenizer --hf_token=$HF_TOKEN
run: |
  # Get head node IP (first node in the list)
  HEAD_NODE_IP=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  echo "Head node IP: $HEAD_NODE_IP"

  # Run from the repo root so that `-m torchtitan.train` and $CONFIG_FILE resolve
  cd torchtitan

  # SKYPILOT_NODE_RANK is automatically set by SkyPilot
  torchrun \
    --nnodes $SKYPILOT_NUM_NODES \
    --nproc_per_node $SKYPILOT_NUM_GPUS_PER_NODE \
    --node_rank $SKYPILOT_NODE_RANK \
    --master_addr=$HEAD_NODE_IP \
    --master_port=8008 \
    -m torchtitan.train \
    --job.config_file $CONFIG_FILE \
    --training.dataset c4_test