Source: llm/llama-4-finetuning
Finetune Llama 4 on your infra#

Meta's Llama 4 represents the next generation of open-source large language models. Its flagship Llama-4-Maverick-17B-128E model is a 400B-parameter (17B active) Mixture of Experts (MoE) architecture with 128 experts.
This guide shows how to use SkyPilot with torchtune and LLaMA Factory to finetune Llama 4 on your own infra. Everything is packaged in simple SkyPilot YAMLs that can be launched with one command on your infra:
Kubernetes cluster
Cloud accounts (16+ clouds supported)
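For example, a typical lifecycle with any of the recipes below is: launch, stream the logs, then tear the cluster down. A minimal sketch (the YAML file and cluster name depend on the recipe you pick):
# Launch the recipe on whichever infra `sky check` has enabled (Kubernetes or a cloud).
HF_TOKEN=xxx sky launch llama-4-maverick-sft.yaml -c maverick --env HF_TOKEN
# Stream the training logs.
sky logs maverick
# Tear the cluster down once finetuning finishes.
sky down maverick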
Available Recipes#
Choose the right recipe for your needs:
| Recipe | Requirements | Description |
|---|---|---|
| llama-4-maverick-sft.yaml | 4 nodes | Full finetuning of the 400B model using torchtune with CPU offloading. Recommended if you have 32 or more H200s. |
| llama-4-maverick-lora.yaml | 2 nodes | Memory-efficient LoRA fine-tuning with lower resource requirements. Great for limited GPU resources. |
| llama-4-scout-sft.yaml | 2 nodes | Full finetuning of the 109B model using torchtune. A good start for users with H100s. |
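For example, the Scout recipe can be launched the same way as the Maverick commands shown later in this guide; the cluster name scout-torchtune below is just an illustrative choice:
# Full finetuning of the 109B Scout model on 2 nodes of 8x H100s.
HF_TOKEN=xxx sky launch -c scout-torchtune llama-4-scout-sft.yaml --env HF_TOKEN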
Full finetuning with CPU offloading#
This approach uses torchtune to do full supervised fine-tuning with CPU offloading to reduce GPU memory requirements. Requires 32 or more H200s.
SkyPilot YAML: llama-4-maverick-sft.yaml
SkyPilot YAML for finetuning Llama 4: llama-4-maverick-sft.yaml
# Full finetuning of Llama-4 Maverick 17B MoE model with 128 experts.
#
# Usage:
#
# HF_TOKEN=xxx sky launch llama-4-maverick-sft.yaml -c maverick --env HF_TOKEN
#
# This config requires at least 4 nodes with 8x H200 GPUs each.
envs:
HF_TOKEN:
resources:
cpus: 100+
memory: 1000+
accelerators: H200:8
disk_tier: best
num_nodes: 4
# Optional: configure buckets for dataset and checkpoints. You can then use the /checkpoints directory to write checkpoints.
# file_mounts:
# /dataset:
# source: s3://my-dataset-bucket
# mode: COPY # COPY mode will prefetch the dataset to the node for faster access
# /checkpoints:
# source: s3://my-checkpoint-bucket
# mode: MOUNT_CACHED # MOUNT_CACHED mode will intelligently cache the checkpoint for faster writes
setup: |
# Install torch and torchtune nightly builds
pip install --pre --upgrade torch==2.8.0.dev20250610+cu126 torchvision==0.23.0.dev20250610+cu126 torchao==0.12.0.dev20250611+cu126 --index-url https://download.pytorch.org/whl/nightly/cu126 # full options are cpu/cu118/cu124/cu126/xpu/rocm6.2/rocm6.3/rocm6.4
pip install --pre --upgrade torchtune==0.7.0.dev20250610+cpu --extra-index-url https://download.pytorch.org/whl/nightly/cpu
# Download the model (~700 GB, may take time to download)
tune download meta-llama/Llama-4-Maverick-17B-128E-Instruct \
--hf-token $HF_TOKEN
run: |
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
echo "Starting distributed finetuning, head node: $MASTER_ADDR"
tune run \
--nnodes $SKYPILOT_NUM_NODES \
--nproc_per_node $SKYPILOT_NUM_GPUS_PER_NODE \
--rdzv_id $SKYPILOT_TASK_ID \
--rdzv_backend c10d \
--rdzv_endpoint=$MASTER_ADDR:29500 \
full_finetune_distributed \
--config llama4/maverick_17B_128E_full \
model_dir=/tmp/Llama-4-Maverick-17B-128E-Instruct \
dataset.packed=True tokenizer.max_seq_len=4096 \
gradient_accumulation_steps=1 \
enable_activation_offloading=True \
activation_offloading_use_streams=False \
optimizer_in_bwd=True \
optimizer=torch.optim.AdamW \
optimizer_kwargs.fused=True \
max_steps_per_epoch=1 \
epochs=10 \
enable_dcp=True \
enable_async_checkpointing=True \
resume_from_checkpoint=False \
keep_last_n_checkpoints=1 \
fsdp_cpu_offload=True
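The run section above relies on environment variables that SkyPilot exports on every node: SKYPILOT_NODE_IPS (newline-separated IPs of all nodes), SKYPILOT_NUM_NODES, SKYPILOT_NUM_GPUS_PER_NODE, and SKYPILOT_TASK_ID. torchtune's tune run passes the distributed flags through to torchrun, so the same rendezvous setup can be written with plain torchrun; the sketch below only illustrates the mapping (your_training_script.py is a placeholder, not part of this recipe):
# Equivalent c10d rendezvous expressed directly with torchrun.
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
torchrun \
  --nnodes $SKYPILOT_NUM_NODES \
  --nproc_per_node $SKYPILOT_NUM_GPUS_PER_NODE \
  --rdzv_id $SKYPILOT_TASK_ID \
  --rdzv_backend c10d \
  --rdzv_endpoint $MASTER_ADDR:29500 \
  your_training_script.py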
Run the following on your local machine:
# Download the files for Llama 4 finetuning
git clone https://github.com/skypilot-org/skypilot
cd skypilot/llm/llama-4-finetuning
export HF_TOKEN=xxxx
sky launch -c maverick-torchtune llama-4-maverick-sft.yaml \
--env HF_TOKEN
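After launching, the cluster and job can be inspected with standard SkyPilot commands (the cluster name matches the -c flag above):
# Show cluster status and the job queue.
sky status
sky queue maverick-torchtune
# Tail the logs of the finetuning job (job ID 1 is the first job on the cluster).
sky logs maverick-torchtune 1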
Alternative Approaches#
LoRA Fine-tuning (Lower Resource Requirements)#
For users with limited GPU resources, LoRA (Low-Rank Adaptation) provides an efficient alternative that can run on 16 H100s:
# LoRA finetuning - requires fewer resources
sky launch -c maverick-lora llama-4-maverick-lora.yaml \
--env HF_TOKEN
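When the LoRA job finishes, the adapter weights live in the output_dir from configs/llama4_lora_sft.yaml on the head node. Since SkyPilot sets up an SSH alias for the cluster, one way to copy them back is plain rsync; the remote path below assumes the default working directory (~/sky_workdir) and is only a sketch, so adjust it to wherever the job actually wrote its outputs:
# Copy the LoRA adapter from the head node back to the local machine.
rsync -avz maverick-lora:~/sky_workdir/LLaMA-Factory/saves/llama4-8b/lora/sft/ ./llama4-lora-adapter/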
Appendix: Preparation#
Request access to the Llama 4 weights on Hugging Face (click on the blue box and follow the steps).
Get your Hugging Face access token:
Add the Hugging Face token to an environment variable:
export HF_TOKEN="xxxx"
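To sanity-check the token before launching (optional; this assumes the huggingface_hub CLI is installed locally and is not required by the recipes themselves):
# Log in with the token and confirm which account it resolves to.
pip install -U huggingface_hub
huggingface-cli login --token $HF_TOKEN
huggingface-cli whoami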
Install SkyPilot for launching the finetuning:
pip install skypilot-nightly[aws,gcp,kubernetes]
# or other clouds (16 clouds + Kubernetes supported) you have set up
# See: https://docs.skypilot.co/en/latest/getting-started/installation.html
Check your infra setup:
sky check
🎉 Enabled clouds 🎉
✔ AWS
✔ GCP
✔ Azure
✔ OCI
✔ Lambda
✔ RunPod
✔ Paperspace
✔ Fluidstack
✔ Cudo
✔ IBM
✔ SCP
✔ vSphere
✔ Cloudflare (for R2 object store)
✔ Kubernetes
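You can also check which of the enabled infra can actually provide the GPUs a recipe needs, e.g. for the 4-node H200 recipe:
# List availability and pricing for 8x H200 across the enabled clouds and Kubernetes.
sky show-gpus H200:8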
What's next#
Included files#
configs/llama4_lora_sft.yaml
# pip install git+https://github.com/hiyouga/transformers.git@llama4_train
### model
model_name_or_path: meta-llama/Llama-4-Maverick-17B-128E-Instruct
trust_remote_code: true
### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all
deepspeed: examples/deepspeed/ds_z3_offload_config.json # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]
### dataset
dataset: mllm_demo,identity,alpaca_en_demo
template: llama4
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4
### output
output_dir: saves/llama4-8b/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none # choices: [none, wandb, tensorboard, swanlab, mlflow]
### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 1.0e-4
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null
### eval
# eval_dataset: alpaca_en_demo
# val_size: 0.1
# per_device_eval_batch_size: 1
# eval_strategy: steps
# eval_steps: 500
configs/llama4_maverick_full_sft_cpu.yaml
# pip install git+https://github.com/hiyouga/transformers.git@llama4_train
### model
model_name_or_path: meta-llama/Llama-4-Maverick-17B-128E-Instruct
trust_remote_code: true
### method
stage: sft
do_train: true
finetuning_type: full
# deepspeed: examples/deepspeed/ds_z2_offload_config.json # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]
deepspeed: /configs/offload_cpu.yaml
flash_attn: fa2
enable_liger_kernel: True
### dataset
dataset: alpaca_en_demo
template: llama4
cutoff_len: 128
max_samples: 100
overwrite_cache: true
preprocessing_num_workers: 4
dataloader_num_workers: 1
### output
output_dir: saves/llama4-8b/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none # choices: [none, wandb, tensorboard, swanlab, mlflow]
### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 1.0e-4
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.0
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null
### eval
# eval_dataset: alpaca_en_demo
# val_size: 0.1
# per_device_eval_batch_size: 1
# eval_strategy: steps
# eval_steps: 500
configs/offload_cpu.yaml
{
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 1,
"gradient_clipping": "auto",
"zero_allow_untested_optimizer": true,
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1000000000,
"reduce_bucket_size": 1000000000,
"stage3_prefetch_bucket_size": 200000000,
"stage3_param_persistence_threshold": 1000000,
"stage3_max_live_parameters": 2000000000,
"stage3_max_reuse_distance": 2000000000,
"stage3_gather_16bit_weights_on_model_save": true
}
}
llama-4-maverick-lora.yaml
# LoRA finetuning of Llama-4 Maverick 17B MoE model with 128 experts.
#
# Usage:
#
# HF_TOKEN=xxx sky launch llama-4-maverick-lora.yaml -c maverick --env HF_TOKEN
#
# This config requires at least 2 nodes with 8x H100 GPUs each.
envs:
HF_TOKEN:
resources:
infra: k8s
cpus: 100+
memory: 1000+
accelerators: H100:8
disk_tier: best
network_tier: best
num_nodes: 2
# Optional: configure buckets for dataset and checkpoints. You can then use the /checkpoints directory to write checkpoints.
# file_mounts:
# /dataset:
# source: s3://my-dataset-bucket
# mode: COPY # COPY mode will prefetch the dataset to the node for faster access
# /checkpoints:
# source: s3://my-checkpoint-bucket
# mode: MOUNT_CACHED # MOUNT_CACHED mode will intelligently cache the checkpoint for faster writes
file_mounts:
/configs: ./configs
setup: |
conda create -n training python=3.10 -y
conda activate training
# Download the repository configuration package
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
# Install the keyring package
sudo dpkg -i cuda-keyring_1.1-1_all.deb
# Update package list
sudo apt-get update
#sudo apt-get install cuda-minimal-build-12-6 -y
sudo apt-get install cuda-toolkit-12-6 -y
git clone -b v0.9.3 --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics,deepspeed]" --no-build-isolation
pip install "transformers>=4.51.1"
run: |
conda activate training
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
echo "Starting distributed finetuning, head node: $MASTER_ADDR"
cd LLaMA-Factory
HF_TOKEN=$HF_TOKEN FORCE_TORCHRUN=1 NNODES=$SKYPILOT_NUM_NODES NODE_RANK=$SKYPILOT_NODE_RANK MASTER_ADDR=$MASTER_ADDR MASTER_PORT=29500 llamafactory-cli train /configs/llama4_lora_sft.yaml
llama-4-maverick-sft.yaml
# Full finetuning of Llama-4 Maverick 17B MoE model with 128 experts.
#
# Usage:
#
# HF_TOKEN=xxx sky launch llama-4-maverick-sft.yaml -c maverick --env HF_TOKEN
#
# This config requires at least 4 nodes with 8x H200 GPUs each.
envs:
HF_TOKEN:
resources:
cpus: 100+
memory: 1000+
accelerators: H200:8
disk_tier: best
num_nodes: 4
# Optional: configure buckets for dataset and checkpoints. You can then use the /checkpoints directory to write checkpoints.
# file_mounts:
# /dataset:
# source: s3://my-dataset-bucket
# mode: COPY # COPY mode will prefetch the dataset to the node for faster access
# /checkpoints:
# source: s3://my-checkpoint-bucket
# mode: MOUNT_CACHED # MOUNT_CACHED mode will intelligently cache the checkpoint for faster writes
setup: |
conda create -n training python=3.10 -y
conda activate training
# Install torch and torchtune nightly builds
pip install --pre --upgrade torch==2.8.0.dev20250610+cu126 torchvision==0.23.0.dev20250610+cu126 torchao==0.12.0.dev20250611+cu126 --index-url https://download.pytorch.org/whl/nightly/cu126 # full options are cpu/cu118/cu124/cu126/xpu/rocm6.2/rocm6.3/rocm6.4
pip install --pre --upgrade torchtune==0.7.0.dev20250610+cpu --extra-index-url https://download.pytorch.org/whl/nightly/cpu
# Download the model (~700 GB, may take time to download)
tune download meta-llama/Llama-4-Maverick-17B-128E-Instruct \
--hf-token $HF_TOKEN
run: |
conda activate training
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
echo "Starting distributed finetuning, head node: $MASTER_ADDR"
tune run \
--nnodes $SKYPILOT_NUM_NODES \
--nproc_per_node $SKYPILOT_NUM_GPUS_PER_NODE \
--rdzv_id $SKYPILOT_TASK_ID \
--rdzv_backend c10d \
--rdzv_endpoint=$MASTER_ADDR:29500 \
full_finetune_distributed \
--config llama4/maverick_17B_128E_full \
model_dir=/tmp/Llama-4-Maverick-17B-128E-Instruct \
dataset.packed=True tokenizer.max_seq_len=4096 \
gradient_accumulation_steps=1 \
enable_activation_offloading=True \
activation_offloading_use_streams=False \
optimizer_in_bwd=True \
optimizer=torch.optim.AdamW \
optimizer_kwargs.fused=True \
max_steps_per_epoch=1 \
epochs=10 \
enable_dcp=True \
enable_async_checkpointing=True \
resume_from_checkpoint=False \
keep_last_n_checkpoints=1 \
fsdp_cpu_offload=True
llama-4-maverick.yaml
# Full finetuning of Llama-4 Maverick 17B MoE model with 128 experts.
#
# Usage:
#
# HF_TOKEN=xxx sky launch llama-4-maverick.yaml -c maverick --env HF_TOKEN
#
# This config requires at least 2 nodes with 8x H200 GPUs each.
envs:
HF_TOKEN:
resources:
cpus: 100+
memory: 1000+
accelerators: H200:8
disk_tier: best
num_nodes: 2
# Optional: configure buckets for dataset and checkpoints. You can then use the /checkpoints directory to write checkpoints.
# file_mounts:
# /dataset:
# source: s3://my-dataset-bucket
# mode: COPY # COPY mode will prefetch the dataset to the node for faster access
# /checkpoints:
# source: s3://my-checkpoint-bucket
# mode: MOUNT_CACHED # MOUNT_CACHED mode will intelligently cache the checkpoint for faster writes
setup: |
conda create -n training python=3.10 -y
conda activate training
# Install torch and torchtune nightly builds
pip install --pre --upgrade torch==2.8.0.dev20250610+cu126 torchvision==0.23.0.dev20250610+cu126 torchao==0.12.0.dev20250611+cu126 --index-url https://download.pytorch.org/whl/nightly/cu126 # full options are cpu/cu118/cu124/cu126/xpu/rocm6.2/rocm6.3/rocm6.4
pip install --pre --upgrade torchtune==0.7.0.dev20250610+cpu --extra-index-url https://download.pytorch.org/whl/nightly/cpu
# Download the model (~700 GB, may take time to download)
tune download meta-llama/Llama-4-Maverick-17B-128E-Instruct \
--hf-token $HF_TOKEN
run: |
conda activate training
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
echo "Starting distributed finetuning, head node: $MASTER_ADDR"
tune run \
--nnodes $SKYPILOT_NUM_NODES \
--nproc_per_node $SKYPILOT_NUM_GPUS_PER_NODE \
--rdzv_id $SKYPILOT_TASK_ID \
--rdzv_backend c10d \
--rdzv_endpoint=$MASTER_ADDR:29500 \
full_finetune_distributed \
--config llama4/maverick_17B_128E_full \
model_dir=/tmp/Llama-4-Maverick-17B-128E-Instruct
llama-4-scout-sft.yaml
# Full finetuning of Llama-4 Scout 17B MoE model with 16 experts.
#
# Usage:
#
# HF_TOKEN=xxx sky launch llama-4-scout-sft.yaml -c scout --env HF_TOKEN
#
# This config requires at least 2 nodes with 8x H100 GPUs each.
envs:
HF_TOKEN:
resources:
cpus: 100+
memory: 1000+
accelerators: H100:8
disk_tier: best
num_nodes: 2
# Optional: configure buckets for dataset and checkpoints. You can then use the /checkpoints directory to write checkpoints.
# file_mounts:
# /dataset:
# source: s3://my-dataset-bucket
# mode: COPY # COPY mode will prefetch the dataset to the node for faster access
# /checkpoints:
# source: s3://my-checkpoint-bucket
# mode: MOUNT_CACHED # MOUNT_CACHED mode will intelligently cache the checkpoint for faster writes
setup: |
conda create -n training python=3.10 -y
conda activate training
# Install torch and torchtune nightly builds
pip install --pre --upgrade torch==2.8.0.dev20250610+cu126 torchvision==0.23.0.dev20250610+cu126 torchao==0.12.0.dev20250611+cu126 --index-url https://download.pytorch.org/whl/nightly/cu126 # full options are cpu/cu118/cu124/cu126/xpu/rocm6.2/rocm6.3/rocm6.4
pip install --pre --upgrade torchtune==0.7.0.dev20250610+cpu --extra-index-url https://download.pytorch.org/whl/nightly/cpu
# Download the model (~200 GB, may take time to download)
tune download meta-llama/Llama-4-Scout-17B-16E-Instruct \
--hf-token $HF_TOKEN
run: |
conda activate training
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
echo "Starting distributed finetuning, head node: $MASTER_ADDR"
tune run \
--nnodes $SKYPILOT_NUM_NODES \
--nproc_per_node $SKYPILOT_NUM_GPUS_PER_NODE \
--rdzv_id $SKYPILOT_TASK_ID \
--rdzv_backend c10d \
--rdzv_endpoint=$MASTER_ADDR:29500 \
full_finetune_distributed \
--config llama4/scout_17B_16E_full \
model_dir=/tmp/Llama-4-Scout-17B-16E-Instruct \
max_steps_per_epoch=10 \
epochs=1