Source: llm/llama-4-finetuning
Finetune Llama 4 on your infra#

Meta's Llama 4 represents the next generation of open-source large language models. Its flagship Llama-4-Maverick-17B-128E model is a 400B-parameter (17B active) Mixture of Experts (MoE) architecture with 128 experts.
This guide shows how to use SkyPilot with torchtune and LLaMA Factory to finetune Llama 4 on your own infra. Everything is packaged in simple SkyPilot YAMLs that can be launched with one command on your infra:
Kubernetes cluster
Cloud accounts (16+ clouds supported)
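For example, a typical lifecycle with any of the recipes below is: launch, stream the logs, then tear the cluster down. A minimal sketch (the YAML file and cluster name depend on the recipe you pick):
# Launch the recipe on whichever infra `sky check` has enabled (Kubernetes or a cloud).
HF_TOKEN=xxx sky launch llama-4-maverick-sft.yaml -c maverick --env HF_TOKEN
# Stream the training logs.
sky logs maverick
# Tear the cluster down once finetuning finishes.
sky down maverick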
Available Recipes#
Choose the right recipe for your needs:
| Recipe | Requirements | Description |
|---|---|---|
| llama-4-maverick-sft.yaml | 4 nodes | Full finetuning of the 400B model using torchtune with CPU offloading. Recommended if you have 32 or more H200s. |
| llama-4-maverick-lora.yaml | 2 nodes | Memory-efficient LoRA fine-tuning with lower resource requirements. Great for limited GPU resources. |
| llama-4-scout-sft.yaml | 2 nodes | Full finetuning of the 109B model using torchtune. A good start for users with H100s. |
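For example, the Scout recipe can be launched the same way as the Maverick commands shown later in this guide; the cluster name scout-torchtune below is just an illustrative choice:
# Full finetuning of the 109B Scout model on 2 nodes of 8x H100s.
HF_TOKEN=xxx sky launch -c scout-torchtune llama-4-scout-sft.yaml --env HF_TOKEN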
Full finetuning with CPU offloading#
This approach uses torchtune to do full supervised fine-tuning with CPU offloading to reduce GPU memory requirements. Requires 32 or more H200s.
SkyPilot YAML: llama-4-maverick-sft.yaml
SkyPilot YAML for finetuning Llama 4: llama-4-maverick-sft.yaml
# Full finetuning of Llama-4 Maverick 17B MoE model with 128 experts.
#
# Usage:
#
# HF_TOKEN=xxx sky launch llama-4-maverick-sft.yaml -c maverick --env HF_TOKEN
#
# This config requires at least 4 nodes with 8x H200 GPUs each.
envs:
HF_TOKEN:
resources:
cpus: 100+
memory: 1000+
accelerators: H200:8
disk_tier: best
num_nodes: 4
# Optional: configure buckets for dataset and checkpoints. You can then use the /checkpoints directory to write checkpoints.
# file_mounts:
# /dataset:
# source: s3://my-dataset-bucket
# mode: COPY # COPY mode will prefetch the dataset to the node for faster access
# /checkpoints:
# source: s3://my-checkpoint-bucket
# mode: MOUNT_CACHED # MOUNT_CACHED mode will intelligently cache the checkpoint for faster writes
setup: |
# Install torch and torchtune nightly builds
pip install --pre --upgrade torch==2.8.0.dev20250610+cu126 torchvision==0.23.0.dev20250610+cu126 torchao==0.12.0.dev20250611+cu126 --index-url https://download.pytorch.org/whl/nightly/cu126 # full options are cpu/cu118/cu124/cu126/xpu/rocm6.2/rocm6.3/rocm6.4
pip install --pre --upgrade torchtune==0.7.0.dev20250610+cpu --extra-index-url https://download.pytorch.org/whl/nightly/cpu
# Download the model (~700 GB, may take time to download)
tune download meta-llama/Llama-4-Maverick-17B-128E-Instruct \
--hf-token $HF_TOKEN
run: |
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
echo "Starting distributed finetuning, head node: $MASTER_ADDR"
tune run \
--nnodes $SKYPILOT_NUM_NODES \
--nproc_per_node $SKYPILOT_NUM_GPUS_PER_NODE \
--rdzv_id $SKYPILOT_TASK_ID \
--rdzv_backend c10d \
--rdzv_endpoint=$MASTER_ADDR:29500 \
full_finetune_distributed \
--config llama4/maverick_17B_128E_full \
model_dir=/tmp/Llama-4-Maverick-17B-128E-Instruct \
dataset.packed=True tokenizer.max_seq_len=4096 \
gradient_accumulation_steps=1 \
enable_activation_offloading=True \
activation_offloading_use_streams=False \
optimizer_in_bwd=True \
optimizer=torch.optim.AdamW \
optimizer_kwargs.fused=True \
max_steps_per_epoch=1 \
epochs=10 \
enable_dcp=True \
enable_async_checkpointing=True \
resume_from_checkpoint=False \
keep_last_n_checkpoints=1 \
fsdp_cpu_offload=True
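The run section above relies on environment variables that SkyPilot exports on every node: SKYPILOT_NODE_IPS (newline-separated IPs of all nodes), SKYPILOT_NUM_NODES, SKYPILOT_NUM_GPUS_PER_NODE, and SKYPILOT_TASK_ID. torchtune's tune run passes the distributed flags through to torchrun, so the same rendezvous setup can be written with plain torchrun; the sketch below only illustrates the mapping (your_training_script.py is a placeholder, not part of this recipe):
# Equivalent c10d rendezvous expressed directly with torchrun.
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
torchrun \
  --nnodes $SKYPILOT_NUM_NODES \
  --nproc_per_node $SKYPILOT_NUM_GPUS_PER_NODE \
  --rdzv_id $SKYPILOT_TASK_ID \
  --rdzv_backend c10d \
  --rdzv_endpoint $MASTER_ADDR:29500 \
  your_training_script.py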
Run the following on your local machine:
# Download the files for Llama 4 finetuning
git clone https://github.com/skypilot-org/skypilot
cd skypilot/llm/llama-4-finetuning
export HF_TOKEN=xxxx
sky launch -c maverick-torchtune llama-4-maverick-sft.yaml \
--env HF_TOKEN
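After launching, the cluster and job can be inspected with standard SkyPilot commands (the cluster name matches the -c flag above):
# Show cluster status and the job queue.
sky status
sky queue maverick-torchtune
# Tail the logs of the finetuning job (job ID 1 is the first job on the cluster).
sky logs maverick-torchtune 1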
Alternative Approaches#
LoRA Fine-tuning (Lower Resource Requirements)#
For users with limited GPU resources, LoRA (Low-Rank Adaptation) provides an efficient alternative that can run on 16 H100s:
# LoRA finetuning - requires fewer resources
sky launch -c maverick-lora llama-4-maverick-lora.yaml \
--env HF_TOKEN
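When the LoRA job finishes, the adapter weights live in the output_dir from configs/llama4_lora_sft.yaml on the head node. Since SkyPilot sets up an SSH alias for the cluster, one way to copy them back is plain rsync; the remote path below assumes the default working directory (~/sky_workdir) and is only a sketch, so adjust it to wherever the job actually wrote its outputs:
# Copy the LoRA adapter from the head node back to the local machine.
rsync -avz maverick-lora:~/sky_workdir/LLaMA-Factory/saves/llama4-8b/lora/sft/ ./llama4-lora-adapter/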
Appendix: Preparation#
Request access to the Llama 4 weights on Hugging Face (click on the blue box and follow the steps).
Get your Hugging Face access token:
Add the Hugging Face token to an environment variable:
export HF_TOKEN="xxxx"
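To sanity-check the token before launching (optional; this assumes the huggingface_hub CLI is installed locally and is not required by the recipes themselves):
# Log in with the token and confirm which account it resolves to.
pip install -U huggingface_hub
huggingface-cli login --token $HF_TOKEN
huggingface-cli whoami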
Install SkyPilot for launching the finetuning:
pip install skypilot-nightly[aws,gcp,kubernetes]
# or other clouds (16 clouds + Kubernetes supported) you have set up
# See: https://docs.skypilot.co/en/latest/getting-started/installation.html
Check your infra setup:
sky check
🎉 Enabled clouds 🎉
✔ AWS
✔ GCP
✔ Azure
✔ OCI
✔ Lambda
✔ RunPod
✔ Paperspace
✔ Fluidstack
✔ Cudo
✔ IBM
✔ SCP
✔ vSphere
✔ Cloudflare (for R2 object store)
✔ Kubernetes
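You can also check which of the enabled infra can actually provide the GPUs a recipe needs, e.g. for the 4-node H200 recipe:
# List availability and pricing for 8x H200 across the enabled clouds and Kubernetes.
sky show-gpus H200:8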
What's next#
Included files#
configs/llama4_lora_sft.yaml
# pip install git+https://github.com/hiyouga/transformers.git@llama4_train
### model
model_name_or_path: meta-llama/Llama-4-Maverick-17B-128E-Instruct
trust_remote_code: true
### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all
deepspeed: examples/deepspeed/ds_z3_offload_config.json # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]
### dataset
dataset: mllm_demo,identity,alpaca_en_demo
template: llama4
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4
### output
output_dir: saves/llama4-8b/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none # choices: [none, wandb, tensorboard, swanlab, mlflow]
### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 1.0e-4
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null
### eval
# eval_dataset: alpaca_en_demo
# val_size: 0.1
# per_device_eval_batch_size: 1
# eval_strategy: steps
# eval_steps: 500
configs/llama4_maverick_full_sft_cpu.yaml
# pip install git+https://github.com/hiyouga/transformers.git@llama4_train
### model
model_name_or_path: meta-llama/Llama-4-Maverick-17B-128E-Instruct
trust_remote_code: true
### method
stage: sft
do_train: true
finetuning_type: full
# deepspeed: examples/deepspeed/ds_z2_offload_config.json # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]
deepspeed: /configs/offload_cpu.yaml
flash_attn: fa2
enable_liger_kernel: True
### dataset
dataset: alpaca_en_demo
template: llama4
cutoff_len: 128
max_samples: 100
overwrite_cache: true
preprocessing_num_workers: 4
dataloader_num_workers: 1
### output
output_dir: saves/llama4-8b/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none # choices: [none, wandb, tensorboard, swanlab, mlflow]
### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 1.0e-4
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.0
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null
### eval
# eval_dataset: alpaca_en_demo
# val_size: 0.1
# per_device_eval_batch_size: 1
# eval_strategy: steps
# eval_steps: 500
configs/offload_cpu.yaml
{
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 1,
"gradient_clipping": "auto",
"zero_allow_untested_optimizer": true,
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1000000000,
"reduce_bucket_size": 1000000000,
"stage3_prefetch_bucket_size": 200000000,
"stage3_param_persistence_threshold": 1000000,
"stage3_max_live_parameters": 2000000000,
"stage3_max_reuse_distance": 2000000000,
"stage3_gather_16bit_weights_on_model_save": true
}
}
llama-4-maverick-lora.yaml
# LoRA finetuning of Llama-4 Maverick 17B MoE model with 128 experts.
#
# Usage:
#
# HF_TOKEN=xxx sky launch llama-4-maverick-lora.yaml -c maverick --env HF_TOKEN
#
# This config requires at least 2 nodes with 8x H100 GPUs each.
envs:
HF_TOKEN:
resources:
infra: k8s
cpus: 100+
memory: 1000+
accelerators: H100:8
disk_tier: best
network_tier: best
num_nodes: 2
# Optional: configure buckets for dataset and checkpoints. You can then use the /checkpoints directory to write checkpoints.
# file_mounts:
# /dataset:
# source: s3://my-dataset-bucket
# mode: COPY # COPY mode will prefetch the dataset to the node for faster access
# /checkpoints:
# source: s3://my-checkpoint-bucket
# mode: MOUNT_CACHED # MOUNT_CACHED mode will intelligently cache the checkpoint for faster writes
file_mounts:
/configs: ./configs
setup: |
conda create -n training python=3.10 -y
conda activate training
# Download the repository configuration package
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
# Install the keyring package
sudo dpkg -i cuda-keyring_1.1-1_all.deb
# Update package list
sudo apt-get update
#sudo apt-get install cuda-minimal-build-12-6 -y
sudo apt-get install cuda-toolkit-12-6 -y
git clone -b v0.9.3 --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics,deepspeed]" --no-build-isolation
pip install "transformers>=4.51.1"
run: |
conda activate training
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
echo "Starting distributed finetuning, head node: $MASTER_ADDR"
cd LLaMA-Factory
HF_TOKEN=$HF_TOKEN FORCE_TORCHRUN=1 NNODES=$SKYPILOT_NUM_NODES NODE_RANK=$SKYPILOT_NODE_RANK MASTER_ADDR=$MASTER_ADDR MASTER_PORT=29500 llamafactory-cli train /configs/llama4_lora_sft.yaml
llama-4-maverick-sft.yaml
# Full finetuning of Llama-4 Maverick 17B MoE model with 128 experts.
#
# Usage:
#
# HF_TOKEN=xxx sky launch llama-4-maverick-sft.yaml -c maverick --env HF_TOKEN
#
# This config requires at least 4 nodes with 8x H200 GPUs each.
envs:
HF_TOKEN:
resources:
cpus: 100+
memory: 1000+
accelerators: H200:8
disk_tier: best
num_nodes: 4
# Optional: configure buckets for dataset and checkpoints. You can then use the /checkpoints directory to write checkpoints.
# file_mounts:
# /dataset:
# source: s3://my-dataset-bucket
# mode: COPY # COPY mode will prefetch the dataset to the node for faster access
# /checkpoints:
# source: s3://my-checkpoint-bucket
# mode: MOUNT_CACHED # MOUNT_CACHED mode will intelligently cache the checkpoint for faster writes
setup: |
conda create -n training python=3.10 -y
conda activate training
# Install torch and torchtune nightly builds
pip install --pre --upgrade torch==2.8.0.dev20250610+cu126 torchvision==0.23.0.dev20250610+cu126 torchao==0.12.0.dev20250611+cu126 --index-url https://download.pytorch.org/whl/nightly/cu126 # full options are cpu/cu118/cu124/cu126/xpu/rocm6.2/rocm6.3/rocm6.4
pip install --pre --upgrade torchtune==0.7.0.dev20250610+cpu --extra-index-url https://download.pytorch.org/whl/nightly/cpu
# Download the model (~700 GB, may take time to download)
tune download meta-llama/Llama-4-Maverick-17B-128E-Instruct \
--hf-token $HF_TOKEN
run: |
conda activate training
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
echo "Starting distributed finetuning, head node: $MASTER_ADDR"
tune run \
--nnodes $SKYPILOT_NUM_NODES \
--nproc_per_node $SKYPILOT_NUM_GPUS_PER_NODE \
--rdzv_id $SKYPILOT_TASK_ID \
--rdzv_backend c10d \
--rdzv_endpoint=$MASTER_ADDR:29500 \
full_finetune_distributed \
--config llama4/maverick_17B_128E_full \
model_dir=/tmp/Llama-4-Maverick-17B-128E-Instruct \
dataset.packed=True tokenizer.max_seq_len=4096 \
gradient_accumulation_steps=1 \
enable_activation_offloading=True \
activation_offloading_use_streams=False \
optimizer_in_bwd=True \
optimizer=torch.optim.AdamW \
optimizer_kwargs.fused=True \
max_steps_per_epoch=1 \
epochs=10 \
enable_dcp=True \
enable_async_checkpointing=True \
resume_from_checkpoint=False \
keep_last_n_checkpoints=1 \
fsdp_cpu_offload=True
llama-4-maverick.yaml
# Full finetuning of Llama-4 Maverick 17B MoE model with 128 experts.
#
# Usage:
#
# HF_TOKEN=xxx sky launch llama-4-maverick.yaml -c maverick --env HF_TOKEN
#
# This config requires at least 2 nodes with 8x H200 GPUs each.
envs:
HF_TOKEN:
resources:
cpus: 100+
memory: 1000+
accelerators: H200:8
disk_tier: best
num_nodes: 2
# Optional: configure buckets for dataset and checkpoints. You can then use the /checkpoints directory to write checkpoints.
# file_mounts:
# /dataset:
# source: s3://my-dataset-bucket
# mode: COPY # COPY mode will prefetch the dataset to the node for faster access
# /checkpoints:
# source: s3://my-checkpoint-bucket
# mode: MOUNT_CACHED # MOUNT_CACHED mode will intelligently cache the checkpoint for faster writes
setup: |
conda create -n training python=3.10 -y
conda activate training
# Install torch and torchtune nightly builds
pip install --pre --upgrade torch==2.8.0.dev20250610+cu126 torchvision==0.23.0.dev20250610+cu126 torchao==0.12.0.dev20250611+cu126 --index-url https://download.pytorch.org/whl/nightly/cu126 # full options are cpu/cu118/cu124/cu126/xpu/rocm6.2/rocm6.3/rocm6.4
pip install --pre --upgrade torchtune==0.7.0.dev20250610+cpu --extra-index-url https://download.pytorch.org/whl/nightly/cpu
# Download the model (~700 GB, may take time to download)
tune download meta-llama/Llama-4-Maverick-17B-128E-Instruct \
--hf-token $HF_TOKEN
run: |
conda activate training
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
echo "Starting distributed finetuning, head node: $MASTER_ADDR"
tune run \
--nnodes $SKYPILOT_NUM_NODES \
--nproc_per_node $SKYPILOT_NUM_GPUS_PER_NODE \
--rdzv_id $SKYPILOT_TASK_ID \
--rdzv_backend c10d \
--rdzv_endpoint=$MASTER_ADDR:29500 \
full_finetune_distributed \
--config llama4/maverick_17B_128E_full \
model_dir=/tmp/Llama-4-Maverick-17B-128E-Instruct
llama-4-scout-sft.yaml
# Full finetuning of Llama-4 Scout 17B MoE model with 16 experts.
#
# Usage:
#
# HF_TOKEN=xxx sky launch llama-4-scout-sft.yaml -c scout --env HF_TOKEN
#
# This config requires at least 2 nodes with 8x H100 GPUs each.
envs:
HF_TOKEN:
resources:
cpus: 100+
memory: 1000+
accelerators: H100:8
disk_tier: best
num_nodes: 2
# Optional: configure buckets for dataset and checkpoints. You can then use the /checkpoints directory to write checkpoints.
# file_mounts:
# /dataset:
# source: s3://my-dataset-bucket
# mode: COPY # COPY mode will prefetch the dataset to the node for faster access
# /checkpoints:
# source: s3://my-checkpoint-bucket
# mode: MOUNT_CACHED # MOUNT_CACHED mode will intelligently cache the checkpoint for faster writes
setup: |
conda create -n training python=3.10 -y
conda activate training
# Install torch and torchtune nightly builds
pip install --pre --upgrade torch==2.8.0.dev20250610+cu126 torchvision==0.23.0.dev20250610+cu126 torchao==0.12.0.dev20250611+cu126 --index-url https://download.pytorch.org/whl/nightly/cu126 # full options are cpu/cu118/cu124/cu126/xpu/rocm6.2/rocm6.3/rocm6.4
pip install --pre --upgrade torchtune==0.7.0.dev20250610+cpu --extra-index-url https://download.pytorch.org/whl/nightly/cpu
# Download the model (~200 GB, may take time to download)
tune download meta-llama/Llama-4-Scout-17B-16E-Instruct \
--hf-token $HF_TOKEN
run: |
conda activate training
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
echo "Starting distributed finetuning, head node: $MASTER_ADDR"
tune run \
--nnodes $SKYPILOT_NUM_NODES \
--nproc_per_node $SKYPILOT_NUM_GPUS_PER_NODE \
--rdzv_id $SKYPILOT_TASK_ID \
--rdzv_backend c10d \
--rdzv_endpoint=$MASTER_ADDR:29500 \
full_finetune_distributed \
--config llama4/scout_17B_16E_full \
model_dir=/tmp/Llama-4-Scout-17B-16E-Instruct \
max_steps_per_epoch=10 \
epochs=1