Source: examples/cosmos3-finetuning
Fine-tuning NVIDIA Cosmos 3 with SkyPilot#
NVIDIA Cosmos 3
is a family of open omnimodal world foundation models for Physical AI:
robotics, autonomous vehicles, and smart spaces. Built on a Mixture-of-Transformers
architecture, a Cosmos 3 model pairs a reasoner tower (a vision-language model
over text/image/video/audio/action) with a generator tower (a diffusion model
that synthesizes future video/image/action). This example fine-tunes the smallest
member, Cosmos3-Nano (16B), as a SkyPilot managed job on Kubernetes, with
checkpoints on a SkyPilot volume
for auto-recovery.
| Model | Params | Notes |
|---|---|---|
| Cosmos3-Nano | 16B | Workstation/efficient tier, used in this example |
| | 64B | Datacenter / frontier tier |
| | 64B | Image-to-video specialization |
This example runs NVIDIA’s official supervised fine-tuning recipe,
vision_sft_nano,
from cosmos-framework. The recipe
post-trains the Cosmos3-Nano generation pathway (text/image/video → video) with
FSDP across 8 GPUs in bfloat16. It trains on
nvidia/bridge-v2-subset-synthetic-captions,
a ~650 MB subset of BridgeData V2
robot-manipulation videos. To fine-tune on your own data, mount it as a second
volume (or a
cloud bucket), laid out like
train/video_dataset_file.jsonl, and pass --env DATASET_PATH=/path/to/it (see
docs/dataset_jsonl.md).
Run it#
You’ll need a Kubernetes cluster with 8× H100/H200 GPUs. Point SkyPilot at it and verify:
pip install "skypilot-nightly[kubernetes]"
sky check k8s
1. Create the checkpoint volume#
SkyPilot volumes are
Kubernetes PVCs with a lifecycle independent of any cluster — perfect for durable
checkpoints. Create one (mounted at /checkpoints, the recipe’s OUTPUT_ROOT) so
the managed job auto-resumes from the latest checkpoint after a recovery and the
outputs outlive the job’s cluster:
sky volumes apply examples/cosmos3-finetuning/cosmos3_checkpoints_volume.yaml
This creates a 1000 Gi (about 1 TiB) cosmos3-checkpoints PVC. Edit cosmos3_checkpoints_volume.yaml
to change the size, storage class, or access mode for your cluster.
2. Launch the fine-tuning job#
sky jobs launch -n cosmos3 examples/cosmos3-finetuning/cosmos3_nano_finetune.yaml
SkyPilot schedules the job on a Kubernetes node with 8× H100/H200 and mounts the
cosmos3-checkpoints volume at /checkpoints. The model and dataset are public, so
no token is needed; for HF auth, export HF_TOKEN=... and add --secret HF_TOKEN.
The first run downloads ~35 GB during setup (base model, VAE, and dataset). This takes
30+ minutes, and the job can look idle because the downloads run with --quiet. It then
trains and exports.
Multiple Kubernetes clusters? Pin one with
--infra k8s/<context>. To run on a cloud instead, see Using a cloud bucket instead of a volume below.
Smoke test (a few steps to exercise the whole pipeline, still checkpoints + exports):
sky jobs launch -n cosmos3 examples/cosmos3-finetuning/cosmos3_nano_finetune.yaml \
--env MAX_ITER=10 --env SAVE_ITER=5
Monitor and manage it:
sky jobs queue # status of all managed jobs
sky jobs logs -n cosmos3 # stream logs
sky jobs cancel -n cosmos3
Tunable knobs (--env)#
| Env var | Default | Meaning |
|---|---|---|
| DATASET_PATH | bridge subset | Dataset dir the launcher trains on (override for your own data). |
| MAX_ITER | 500 | Number of optimizer steps (set small for a smoke test). |
| SAVE_ITER | 100 | Save a DCP checkpoint every N steps. |
| EXPORT_SAFETENSORS | 1 | Export the trained checkpoint to HF safetensors. |
| COSMOS_FRAMEWORK_REF | pinned commit | cosmos-framework git ref to install. |
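The run step in cosmos3_nano_finetune.yaml applies MAX_ITER and SAVE_ITER by rewriting the recipe TOML in place with sed. The sketch below reproduces that substitution on a scratch file. The file path and its [trainer] contents are illustrative stand-ins, not the real vision_sft_nano.toml, and sed -i is used in its GNU form, as in the job.

```shell
# Scratch stand-in for the recipe TOML (hypothetical contents, for illustration only).
TOML=/tmp/vision_sft_nano_demo.toml
printf '%s\n' '[trainer]' '  max_iter = 500' '  save_iter = 100' > "$TOML"

# Values that would normally arrive via --env.
MAX_ITER=10
SAVE_ITER=5

# Same substitutions as the job's run step: replace the whole assignment line,
# including any leading indentation, with the new value.
sed -i "s/^[[:space:]]*max_iter[[:space:]]*=.*/max_iter = ${MAX_ITER}/" "$TOML"
sed -i "s/^[[:space:]]*save_iter[[:space:]]*=.*/save_iter = ${SAVE_ITER}/" "$TOML"

cat "$TOML"
```

The pattern is anchored to the start of the line, so only lines that begin with the key are rewritten. It rewrites every such line, in any table of the file, and drops the leading indentation, which TOML ignores.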
Outputs#
Checkpoints (checkpoints/iter_<N>/), the resolved config.yaml, and the exported
safetensors (model/) land on the cosmos3-checkpoints volume under
cosmos3/sft/vision_sft_nano/. The volume persists after the job finishes — inspect
it with sky volumes ls, or mount it from another SkyPilot task (e.g. a serving job)
with a volumes: block to read the exported model.
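One way to read the exported model from another task is a small SkyPilot task that mounts the same volume. The sketch below writes such a task file. The file name inspect_cosmos3_model.yaml and the task name are hypothetical, and the volumes: mapping mirrors the one in cosmos3_nano_finetune.yaml. It assumes the volume can be attached to the new task (for example, that it is not held by a running job under ReadWriteOnce).

```shell
# Write a minimal task that mounts the checkpoint volume and lists the export.
# File name and task name are illustrative; the mount mapping matches the fine-tuning YAML.
cat > inspect_cosmos3_model.yaml <<'EOF'
name: inspect-cosmos3-model
resources:
  infra: kubernetes
volumes:
  /checkpoints: cosmos3-checkpoints
run: |
  ls -lh /checkpoints/cosmos3/sft/vision_sft_nano/model
EOF
```

Launch it with sky launch inspect_cosmos3_model.yaml. A serving task would mount the volume the same way and load the safetensors from that directory.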
Bring your own dataset#
Put your data on a second volume and point the recipe at it. Create the volume (laid
out with train/video_dataset_file.jsonl; see the
dataset docs),
then in cosmos3_nano_finetune.yaml uncomment the dataset mount under volumes::
volumes:
  /checkpoints: cosmos3-checkpoints
  /my-dataset: my-dataset-volume
and launch with --env DATASET_PATH=/my-dataset.
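Before launching, it can help to confirm that the dataset directory has the layout the recipe expects. The helper below is a minimal sketch: check_dataset_layout is a hypothetical function name, and it only checks that train/video_dataset_file.jsonl exists and is non-empty. It does not validate the JSONL records, whose schema is defined in the dataset docs.

```shell
# Hypothetical pre-flight check: succeeds if <dir>/train/video_dataset_file.jsonl
# exists and is non-empty. It does not inspect the JSONL records themselves.
check_dataset_layout() {
  dataset_dir="$1"
  manifest="$dataset_dir/train/video_dataset_file.jsonl"
  if [ ! -s "$manifest" ]; then
    echo "missing or empty: $manifest" >&2
    return 1
  fi
  echo "ok: $manifest"
}
```

For example, run check_dataset_layout /my-dataset inside a task that mounts the dataset volume, before passing --env DATASET_PATH=/my-dataset to the fine-tuning job.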
Using a cloud bucket instead of a volume#
Most of this example is Kubernetes- and volume-centric, but nothing requires it. To run
on a cloud (or to keep checkpoints in object storage for cross-region access), drop
the volumes: block in cosmos3_nano_finetune.yaml and mount a bucket at
/checkpoints instead:
file_mounts:
  /checkpoints:
    name: my-cosmos3-checkpoints  # globally-unique bucket name; SkyPilot creates it
    mode: MOUNT
Then remove infra: kubernetes so SkyPilot picks the cheapest cloud/region with
8× H100/H200, or pin a cloud with --infra <cloud>. Add --use-spot for cheaper
preemptible GPUs; the job auto-resumes from the bucket after a preemption. See
Cloud Buckets.
References#
Cosmos 3 blog: https://developer.nvidia.com/blog/develop-physical-ai-reasoning-world-and-action-models-with-nvidia-cosmos-3/
Technical report: https://research.nvidia.com/labs/cosmos-lab/cosmos3/technical-report.pdf
cosmos-framework: NVIDIA/cosmos-framework
NVIDIA Cosmos: NVIDIA/Cosmos
SkyPilot Volumes: https://docs.skypilot.co/en/stable/reference/volumes.html
Included files#
cosmos3_checkpoints_volume.yaml
# A persistent SkyPilot volume (Kubernetes PVC) for Cosmos3 checkpoints + exports.
#
# Create it once, then the fine-tuning job mounts it at /checkpoints so the
# managed job auto-resumes from the latest checkpoint after a recovery and the
# outputs outlive the job's cluster:
#
# sky volumes apply examples/cosmos3-finetuning/cosmos3_checkpoints_volume.yaml
#
# See https://docs.skypilot.co/en/stable/reference/volumes.html
name: cosmos3-checkpoints
type: k8s-pvc
infra: k8s # or k8s/<context>
size: 1000Gi
config:
  # ReadWriteMany lets the volume be reattached across managed-job recoveries
  # (and shared by all nodes if you scale to multi-node). It needs a storage class
  # that supports RWX (e.g. a shared filesystem such as JuiceFS, Nebius shared FS,
  # AWS EFS, or GCP Filestore) -- set storage_class_name to one below. If your
  # cluster only has ReadWriteOnce block storage, change access_mode to
  # ReadWriteOnce (this single-node job works fine with it).
  access_mode: ReadWriteMany
  # storage_class_name: csi-mounted-fs-path-sc  # omit to use the default StorageClass
cosmos3_nano_finetune.yaml
# Fine-tune NVIDIA Cosmos3-Nano (16B omnimodal world model) as a SkyPilot managed job.
#
# Runs NVIDIA's official `vision_sft_nano` SFT recipe from
# github.com/NVIDIA/cosmos-framework: an 8-GPU FSDP fine-tune of the Cosmos3-Nano
# generation pathway on the public nvidia/bridge-v2-subset-synthetic-captions
# robot-video dataset, then exports to Hugging Face safetensors. See README.md.
#
# Checkpoints are written to a mounted SkyPilot volume (a Kubernetes PVC), so the
# managed job auto-resumes from the latest checkpoint after a recovery, and the
# outputs survive job teardown.
#
# Usage (create the checkpoint volume once, then launch):
#   sky volumes apply examples/cosmos3-finetuning/cosmos3_checkpoints_volume.yaml
#   sky jobs launch -n cosmos3 examples/cosmos3-finetuning/cosmos3_nano_finetune.yaml
#   # Quick smoke test (a few steps, still checkpoints + exports):
#   sky jobs launch -n cosmos3 examples/cosmos3-finetuning/cosmos3_nano_finetune.yaml \
#     --env MAX_ITER=10 --env SAVE_ITER=5
name: cosmos3-nano-finetune
resources:
  # Run on Kubernetes. Drop this line to let SkyPilot pick any infra (e.g. a cloud),
  # or set `infra: k8s/<context>` to pin a specific cluster.
  infra: kubernetes
  # Cosmos3 needs Ampere or newer; NVIDIA's recipe is tested on 8x H100 80GB.
  accelerators: {H100:8, H200:8}
  # NVIDIA's recommended CUDA 13 base image; the training env layers on with `uv sync`.
  image_id: docker:nvcr.io/nvidia/pytorch:25.09-py3
  disk_size: 1000
num_nodes: 1
volumes:
  # Durable checkpoint store, mounted at /checkpoints (the recipe's OUTPUT_ROOT).
  # DCP checkpoints, the resolved config, and the exported safetensors all land on
  # this volume, so they survive managed-job recovery (auto-resume) and post-job
  # teardown. Create it first with:
  #   sky volumes apply examples/cosmos3-finetuning/cosmos3_checkpoints_volume.yaml
  /checkpoints: cosmos3-checkpoints
  # Bring your own dataset: mount a second volume holding your data, then point the
  # recipe at it with `--env DATASET_PATH=/my-dataset` (that dir must contain
  # `train/video_dataset_file.jsonl`; see README), so it trains on your data instead
  # of the bridge subset `setup` downloads by default. Uncomment and set your volume:
  # /my-dataset: my-dataset-volume
envs:
  COSMOS_FRAMEWORK_REF: 411d25b2e35bc441126f48c44a4b93e1c0564274  # pinned for reproducibility
  BASE_CHECKPOINT_NAME: Cosmos3-Nano  # cosmos-framework catalog name -> nvidia/Cosmos3-Nano on HF
  # Dataset dir the launcher trains on. Defaults to the bridge subset downloaded in
  # `setup`; override to fine-tune on your own data (see the bring-your-own-dataset
  # note under volumes above).
  DATASET_PATH: examples/data/bridge-v2-subset-synthetic-captions/sft_dataset_bridge
  MAX_ITER: "500"  # optimizer steps (recipe default 500; set small for a smoke test)
  SAVE_ITER: "100"  # save a DCP checkpoint every N steps
  EXPORT_SAFETENSORS: "1"  # export the trained checkpoint to HF safetensors (1=yes, 0=no)
secrets:
  # Base model + dataset are public, so this defaults to empty (no token needed).
  # To authenticate, `export HF_TOKEN=...` and pass `--secret HF_TOKEN`.
  HF_TOKEN: ""
# Prefer a cloud bucket over a volume (e.g. for cross-region access)? Drop the
# `volumes:` block above and mount a bucket at /checkpoints instead:
# file_mounts:
#   /checkpoints:
#     name: my-cosmos3-checkpoints  # globally-unique bucket name; SkyPilot creates it
#     mode: MOUNT
# See https://docs.skypilot.co/en/latest/reference/storage.html
setup: |
  set -e
  # The NGC image auto-activates a conda base env; drop it so uv's venv is the only one.
  conda deactivate 2>/dev/null || true
  export DEBIAN_FRONTEND=noninteractive
  SUDO=""; [ "$(id -u)" -ne 0 ] && SUDO="sudo"
  $SUDO apt-get update -y
  $SUDO apt-get install -y --no-install-recommends curl ffmpeg git git-lfs libx11-dev wget
  if ! command -v uv >/dev/null 2>&1; then
    curl -LsSf https://astral.sh/uv/install.sh | sh
  fi
  source "$HOME/.local/bin/env" 2>/dev/null || export PATH="$HOME/.local/bin:$PATH"

  # Clone cosmos-framework at the pinned commit (skip large LFS assets).
  cd "$HOME"
  if [ ! -d cosmos-framework/.git ]; then
    rm -rf cosmos-framework
    GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/NVIDIA/cosmos-framework.git
  fi
  cd cosmos-framework
  git fetch origin "$COSMOS_FRAMEWORK_REF" 2>/dev/null || git fetch origin
  git checkout "$COSMOS_FRAMEWORK_REF"

  # Install the CUDA 13 training environment.
  uv python install
  uv sync --all-extras --group=cu130-train
  source .venv/bin/activate
  export LD_LIBRARY_PATH=  # keep host CUDA libs out of the venv's torch
  [ -n "${HF_TOKEN:-}" ] && export HF_TOKEN

  # Download the dataset + Wan2.2 VAE into the launcher's default locations.
  uvx hf@latest download --repo-type dataset nvidia/bridge-v2-subset-synthetic-captions \
    --revision 46468e12ac0dd36901e9e3240d4fc7620942b5d7 \
    --local-dir examples/data/bridge-v2-subset-synthetic-captions --quiet
  uvx hf@latest download Wan-AI/Wan2.2-TI2V-5B Wan2.2_VAE.pth \
    --local-dir examples/checkpoints/wan22_vae --quiet

  # Download Cosmos3-Nano and convert it to PyTorch Distributed Checkpoint (DCP) format.
  if [ ! -d "examples/checkpoints/${BASE_CHECKPOINT_NAME}" ]; then
    python -m cosmos_framework.scripts.convert_model_to_dcp \
      -o "examples/checkpoints/${BASE_CHECKPOINT_NAME}" \
      --checkpoint-path "${BASE_CHECKPOINT_NAME}"
  fi
run: |
  set -e
  conda deactivate 2>/dev/null || true
  cd "$HOME/cosmos-framework"
  source .venv/bin/activate
  export LD_LIBRARY_PATH=
  [ -n "${HF_TOKEN:-}" ] && export HF_TOKEN
  export NPROC_PER_NODE="${SKYPILOT_NUM_GPUS_PER_NODE}"  # FSDP shards across every GPU

  # Write all training outputs to the mounted checkpoint volume so the recipe
  # auto-resumes from the latest checkpoint after a managed-job recovery.
  export OUTPUT_ROOT=/checkpoints
  RUN_DIR="$OUTPUT_ROOT/cosmos3/sft/vision_sft_nano"

  # Set step count + checkpoint frequency in the recipe TOML.
  TOML=examples/toml/sft_config/vision_sft_nano.toml
  sed -i "s/^[[:space:]]*max_iter[[:space:]]*=.*/max_iter = ${MAX_ITER}/" "$TOML"
  sed -i "s/^[[:space:]]*save_iter[[:space:]]*=.*/save_iter = ${SAVE_ITER}/" "$TOML"

  # Launch multi-GPU FSDP supervised fine-tuning.
  bash examples/launch_sft_vision_nano.sh

  # Export the fine-tuned DCP checkpoint to Hugging Face safetensors.
  if [ "${EXPORT_SAFETENSORS}" = "1" ] && [ -f "$RUN_DIR/checkpoints/latest_checkpoint.txt" ]; then
    CKPT_ITER="$(cat "$RUN_DIR/checkpoints/latest_checkpoint.txt")"
    python -m cosmos_framework.scripts.export_model \
      --checkpoint-path "$RUN_DIR/checkpoints/$CKPT_ITER" \
      --config-file "$RUN_DIR/config.yaml" \
      -o "$RUN_DIR/model"
    echo ">>> Fine-tuned safetensors written to: $RUN_DIR/model"
  fi
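The export step at the end of run locates the newest checkpoint through latest_checkpoint.txt. The same lookup can be reused outside the job, for example from a task that mounts the volume. The sketch below wraps it in a hypothetical helper, latest_checkpoint_dir, and assumes (as the run step does) that the file holds the name of a directory under checkpoints/.

```shell
# Hypothetical helper: print the path of the latest checkpoint under a run directory,
# using the latest_checkpoint.txt marker that the run step reads before exporting.
latest_checkpoint_dir() {
  run_dir="$1"
  marker="$run_dir/checkpoints/latest_checkpoint.txt"
  # No marker means no checkpoint has been recorded yet.
  [ -f "$marker" ] || return 1
  printf '%s\n' "$run_dir/checkpoints/$(cat "$marker")"
}
```

With the volume mounted at /checkpoints, latest_checkpoint_dir /checkpoints/cosmos3/sft/vision_sft_nano prints the directory that the export step would pass as --checkpoint-path.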