Source: llm/llama-4-finetuning

Finetune Llama 4 on your infra#

Meta’s Llama 4 is the next generation of open-source large language models. The Llama-4-Maverick-17B-128E model is a 400B-parameter (17B active) Mixture of Experts (MoE) architecture with 128 experts.

This guide shows how to use SkyPilot with torchtune and Llama Factory to finetune Llama 4 on your own infra. Everything is packaged in simple SkyPilot YAMLs that can be launched with a single command.
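For example, a minimal invocation, mirroring the usage noted in the YAML below:

HF_TOKEN=xxx sky launch llama-4-maverick-sft.yaml --env HF_TOKEN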

πŸ“ Available Recipes#

Choose the right recipe for your needs:

| Recipe | Requirements | Description |
|--------|--------------|-------------|
| 🌟 llama-4-maverick-sft.yaml | 4 nodes, 32x H200 GPUs, 1000+ GB CPU memory per node | Full finetuning of the 400B model using torchtune with CPU offloading. Recommended if you have 32 or more H200s. |
| 🎯 llama-4-maverick-lora.yaml | 2 nodes, 16x H100 GPUs, 1000+ GB CPU memory per node | Memory efficient: LoRA finetuning with lower resource requirements. Great for limited GPU resources. |
| πŸš€ llama-4-scout-sft.yaml | 2 nodes, 16x H100 GPUs, 1000+ GB CPU memory per node | Full finetuning of the 109B model using torchtune. A good starting point for users with H100s. |
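Each recipe launches the same way; for example, the Scout recipe (the cluster name below is arbitrary):

export HF_TOKEN=xxxx
sky launch -c scout llama-4-scout-sft.yaml --env HF_TOKEN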

Full finetuning with CPU offloading#

This approach uses torchtune to run full supervised finetuning with CPU offloading to reduce GPU memory requirements. It requires 32 or more H200s.

SkyPilot YAML: llama-4-maverick-sft.yaml

SkyPilot YAML for finetuning Llama 4: llama-4-maverick-sft.yaml
# Full finetuning of Llama-4 Maverick 17B MoE model with 128 experts.
#
# Usage:
#
#  HF_TOKEN=xxx sky launch llama-4-maverick-sft.yaml -c maverick --env HF_TOKEN
#
# This config requires at least 4 nodes with 8x H200 GPUs each.

envs:
  HF_TOKEN: # Set at launch time via --env HF_TOKEN

resources:
  cpus: 100+
  memory: 1000+
  accelerators: H200:8
  disk_tier: best

num_nodes: 4

# Optional: configure buckets for dataset and checkpoints. You can then use the /outputs directory to write checkpoints.
# file_mounts:
#  /dataset:
#    source: s3://my-dataset-bucket
#    mode: COPY  # COPY mode will prefetch the dataset to the node for faster access
#  /checkpoints:
#    source: s3://my-checkpoint-bucket
#    mode: MOUNT_CACHED  # MOUNT_CACHED mode will intelligently cache the checkpoint for faster writes

setup: |
  # Install torch and torchtune nightly builds
  pip install --pre --upgrade torch==2.8.0.dev20250610+cu126 torchvision==0.23.0.dev20250610+cu126 torchao==0.12.0.dev20250611+cu126 --index-url https://download.pytorch.org/whl/nightly/cu126 # full options are cpu/cu118/cu124/cu126/xpu/rocm6.2/rocm6.3/rocm6.4
  pip install --pre --upgrade torchtune==0.7.0.dev20250610+cpu --extra-index-url https://download.pytorch.org/whl/nightly/cpu

  # Download the model (~700 GB, may take time). By default, tune download
  # saves to /tmp/<model_name>, matching model_dir in the run section below.
  tune download meta-llama/Llama-4-Maverick-17B-128E-Instruct \
    --hf-token $HF_TOKEN

run: |
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  echo "Starting distributed finetuning, head node: $MASTER_ADDR"

  tune run \
  --nnodes $SKYPILOT_NUM_NODES \
  --nproc_per_node $SKYPILOT_NUM_GPUS_PER_NODE \
  --rdzv_id $SKYPILOT_TASK_ID \
  --rdzv_backend c10d \
  --rdzv_endpoint=$MASTER_ADDR:29500 \
  full_finetune_distributed \
  --config llama4/maverick_17B_128E_full \
  model_dir=/tmp/Llama-4-Maverick-17B-128E-Instruct \
  dataset.packed=True tokenizer.max_seq_len=4096 \
  gradient_accumulation_steps=1 \
  enable_activation_offloading=True \
  activation_offloading_use_streams=False \
  optimizer_in_bwd=True \
  optimizer=torch.optim.AdamW \
  optimizer_kwargs.fused=True \
  max_steps_per_epoch=1 \
  epochs=10 \
  enable_dcp=True \
  enable_async_checkpointing=True \
  resume_from_checkpoint=False \
  keep_last_n_checkpoints=1 \
  fsdp_cpu_offload=True

Run the following on your local machine:

# Download the files for Llama 4 finetuning
git clone https://github.com/skypilot-org/skypilot
cd skypilot/llm/llama-4-finetuning

export HF_TOKEN=xxxx
sky launch -c maverick-torchtune llama-4-maverick-sft.yaml \
  --env HF_TOKEN
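Once launched, you can monitor the run and clean up with standard SkyPilot commands:

# Stream the finetuning logs
sky logs maverick-torchtune

# Check cluster status
sky status

# Tear down the cluster when the run is complete
sky down maverick-torchtune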

Alternative Approaches#

LoRA Fine-tuning (Lower Resource Requirements)#

For users with limited GPU resources, LoRA (Low-Rank Adaptation) provides an efficient alternative that can run on 16 H100s:

# LoRA finetuning - requires fewer resources
sky launch -c maverick-lora llama-4-maverick-lora.yaml \
  --env HF_TOKEN

Appendix: Preparation#

  1. Request access to the Llama 4 weights on Hugging Face (click the blue box and follow the steps).

  2. Get your Hugging Face access token from https://huggingface.co/settings/tokens.

  3. Add the Hugging Face token to your environment:

export HF_TOKEN="xxxx"
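Optionally, verify the token before launching (this assumes the huggingface_hub CLI is installed):

pip install -U huggingface_hub
huggingface-cli whoami  # should print your Hugging Face username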
  4. Install SkyPilot for launching the finetuning:

pip install "skypilot-nightly[aws,gcp,kubernetes]"
# or other clouds (16 clouds + Kubernetes supported) you have set up
# See: https://docs.skypilot.co/en/latest/getting-started/installation.html
  5. Check your infra setup:

sky check

πŸŽ‰ Enabled clouds πŸŽ‰
    βœ” AWS
    βœ” GCP
    βœ” Azure
    βœ” OCI
    βœ” Lambda
    βœ” RunPod
    βœ” Paperspace
    βœ” Fluidstack
    βœ” Cudo
    βœ” IBM
    βœ” SCP
    βœ” vSphere
    βœ” Cloudflare (for R2 object store)
    βœ” Kubernetes
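You can also check where the required accelerators are available; sky show-gpus accepts a NAME:COUNT filter:

sky show-gpus H200:8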

Included files#

llama-4-maverick-sft.yaml
llama-4-maverick-lora.yaml
llama-4-scout-sft.yaml