TorchTitan: Large-Scale LLM Training with SkyPilot#

This example shows how to train TorchTitan models using SkyPilot’s multi-node capabilities.

TorchTitan is a PyTorch native platform designed for rapid experimentation and large-scale training of generative AI models, featuring:

  • Multi-dimensional composable parallelisms (FSDP2, Tensor Parallel, Pipeline Parallel, Context Parallel)

  • Distributed checkpointing

  • torch.compile support

  • Float8 support

  • And many more optimizations for training LLMs at scale

Quick start#

# Install SkyPilot (if not already installed)
pip install "skypilot[kubernetes,aws]"  # or your cloud: [gcp], [azure], etc.

# Launch a cluster and start training
export HF_TOKEN=... # needed to download the gated Llama 3.1 tokenizer from the HF Hub
sky launch -c torchtitan-multinode torchtitan.yaml --env HF_TOKEN

# Tail logs
sky logs torchtitan-multinode

# Stop the cluster when done
sky down torchtitan-multinode

Configuration#

The provided torchtitan.yaml configuration (a structural sketch follows this list):

  • Sets up a 2-node cluster with 8 H100 (or H200) GPUs per node

  • Installs PyTorch nightly and TorchTitan requirements

  • Downloads the Llama 3.1 tokenizer

  • Runs distributed training using torchrun
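
For orientation, here is a minimal sketch of how such a task can be expressed in SkyPilot YAML. This is not the exact torchtitan.yaml shipped with the example: the repository URL, nightly index URL, tokenizer download command, and file paths are assumptions and may differ across TorchTitan versions. The run section is sketched under “Multi-node training details” below.

name: torchtitan

resources:
  accelerators: {H100:8, H200:8}  # 8 GPUs per node, either type

num_nodes: 2

envs:
  HF_TOKEN: ""  # pass at launch time with --env HF_TOKEN
  CONFIG_FILE: ./torchtitan/models/llama3/train_configs/llama3_8b.toml

setup: |
  git clone https://github.com/pytorch/torchtitan.git || true
  cd torchtitan
  # PyTorch nightly + TorchTitan requirements (the CUDA version here is an assumption)
  pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu124
  pip install -r requirements.txt
  # Download the gated Llama 3.1 tokenizer; the exact script path and flags
  # depend on the TorchTitan version, so check its README.
  python scripts/download_tokenizer.py --repo_id meta-llama/Llama-3.1-8B --hf_token $HF_TOKEN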

Customizing the configuration#

You can override parameters such as the node count and the model configuration at launch time, without editing the YAML file:

# Use 4 nodes and train a larger model
sky launch -c torchtitan-multinode torchtitan.yaml \
   --num-nodes 4 \
   --env HF_TOKEN \
   --env CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_70b.toml  # path is relative to the torchtitan repo

Available model configurations#

TorchTitan includes pre-configured training recipes for:

  • Llama 3.1 8B: llama3_8b.toml

  • Llama 3.1 70B: llama3_70b.toml

  • Llama 3.1 405B: llama3_405b.toml

Each configuration file specifies model architecture, parallelism strategies, and training hyperparameters optimized for different scales.
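
To switch between these recipes, point CONFIG_FILE at the desired TOML file, either in the envs section of the task YAML (assuming the YAML exposes it as an environment variable, as the override in the previous section suggests) or with --env at launch time. Paths are relative to the TorchTitan repository root.

envs:
  # Select a training recipe; override at launch with:
  #   sky launch torchtitan.yaml --env CONFIG_FILE=<path>
  CONFIG_FILE: ./torchtitan/models/llama3/train_configs/llama3_70b.toml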

Multi-node training details#

The configuration automatically:

  • Detects the head node IP and sets it as the master address

  • Configures the correct node rank for each node

  • Sets up the distributed environment for PyTorch’s torchrun (see the sketch after this list)
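
A hedged sketch of what the run section can look like. The SKYPILOT_* environment variables are injected by SkyPilot on every node; the training entry point (train.py), the --job.config_file flag, and the port number are assumptions about the TorchTitan CLI and may differ by version.

run: |
  cd torchtitan
  # The first IP in SKYPILOT_NODE_IPS belongs to the head node; use it as the rendezvous master.
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --node_rank=$SKYPILOT_NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --master_port=8008 \
    train.py --job.config_file $CONFIG_FILE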

Cost optimization#

To reduce costs:

  • Use spot instances: Add use_spot: true to the resources section (see the sketch after this list)

  • Use smaller GPU types for experimentation (e.g., A100 instead of H100)

  • Adjust the number of nodes based on your training requirements
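
As a sketch, a resources section combining spot instances with a cheaper GPU type might look like the following (accelerator names and spot availability depend on your cloud):

resources:
  accelerators: A100:8  # cheaper than H100 for experimentation
  use_spot: true        # spot/preemptible instances

TorchTitan’s distributed checkpointing (listed above) pairs well with spot instances, since an interrupted run can resume from the latest checkpoint after the cluster is relaunched.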