Source: examples/cosmos3-finetuning

Fine-tuning NVIDIA Cosmos 3 with SkyPilot

NVIDIA Cosmos 3 is a family of open omnimodal world foundation models for Physical AI: robotics, autonomous vehicles, and smart spaces. Built on a Mixture-of-Transformers architecture, a Cosmos 3 model pairs a reasoner tower (a vision-language model over text/image/video/audio/action) with a generator tower (a diffusion model that synthesizes future video/image/action). This example fine-tunes the smallest member, Cosmos3-Nano (16B), as a SkyPilot managed job on Kubernetes, with checkpoints on a SkyPilot volume for auto-recovery.

| Model | Params | Notes |
|---|---|---|
| `nvidia/Cosmos3-Nano` | 16B | Workstation/efficient tier, used in this example |
| `nvidia/Cosmos3-Super` | 64B | Datacenter / frontier tier |
| `nvidia/Cosmos3-Super-Image2Video` | 64B | Image-to-video specialization |

This example runs NVIDIA’s official supervised fine-tuning recipe, vision_sft_nano, from cosmos-framework. The recipe post-trains the Cosmos3-Nano generation pathway (text/image/video → video) with FSDP across 8 GPUs in bfloat16. It trains on nvidia/bridge-v2-subset-synthetic-captions, a ~650 MB subset of BridgeData V2 robot-manipulation videos. To fine-tune on your own data, mount it as a second volume (or a cloud bucket), lay it out so that the directory contains train/video_dataset_file.jsonl, and pass --env DATASET_PATH=/path/to/it (see docs/dataset_jsonl.md).
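
For a sense of scale, the weight memory implied by "16B parameters in bfloat16 across 8 GPUs" can be estimated with simple arithmetic. This is a back-of-the-envelope sketch that assumes the parameters are fully and evenly sharded; it ignores gradients, optimizer state, and activations, so actual GPU usage is higher. The helper names weights_gb and per_gpu_gb are hypothetical.

```shell
#!/usr/bin/env bash
# Back-of-the-envelope estimate of bf16 weight memory, ASSUMING parameters are
# fully and evenly sharded across GPUs. Gradients, optimizer state, and
# activations are not counted, so real usage is higher.
weights_gb() {        # total weight size in GB (1 GB = 10^9 bytes)
  local params_billion="$1" bytes_per_param="$2"
  echo $(( params_billion * bytes_per_param ))
}
per_gpu_gb() {        # evenly sharded share per GPU, in GB (integer division)
  local total_gb="$1" num_gpus="$2"
  echo $(( total_gb / num_gpus ))
}

TOTAL="$(weights_gb 16 2)"   # 16B parameters, 2 bytes each in bfloat16
echo "weights: ${TOTAL} GB total, $(per_gpu_gb "$TOTAL" 8) GB per GPU"
```

Under these assumptions the weights come to 32 GB in total, or 4 GB per GPU.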

Run it

You’ll need a Kubernetes cluster that has a node with 8× H100 or H200 GPUs. Install SkyPilot with Kubernetes support and verify that it can reach the cluster:

pip install "skypilot-nightly[kubernetes]"
sky check k8s

1. Create the checkpoint volume

SkyPilot volumes are Kubernetes PVCs whose lifecycle is independent of any cluster, which makes them a good fit for durable checkpoints. Create one for the job to mount at /checkpoints (the recipe’s OUTPUT_ROOT), so that the managed job auto-resumes from the latest checkpoint after a recovery and the outputs outlive the job’s cluster:

sky volumes apply examples/cosmos3-finetuning/cosmos3_checkpoints_volume.yaml

This creates a 1000 Gi (roughly 1 TiB) PVC named cosmos3-checkpoints. Edit cosmos3_checkpoints_volume.yaml to change the size, storage class, or access mode for your cluster.

2. Launch the fine-tuning job

sky jobs launch -n cosmos3 examples/cosmos3-finetuning/cosmos3_nano_finetune.yaml

SkyPilot schedules the job on a Kubernetes node with 8× H100/H200 and mounts the cosmos3-checkpoints volume at /checkpoints. The model and dataset are public, so no token is needed; to authenticate to Hugging Face anyway, export HF_TOKEN=... and add --secret HF_TOKEN. On the first run, setup downloads ~35 GB (base model, VAE, and dataset). This takes 30+ minutes, and the logs can look idle while the downloads that run with --quiet are in progress. The job then trains and exports the model.

Multiple Kubernetes clusters? Pin one with --infra k8s/<context>. To run on a cloud instead, see Using a cloud bucket instead of a volume below.

Smoke test (a few steps that exercise the whole pipeline; the job still checkpoints and exports):

sky jobs launch -n cosmos3 examples/cosmos3-finetuning/cosmos3_nano_finetune.yaml \
    --env MAX_ITER=10 --env SAVE_ITER=5
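
As a rough sanity check on these two values, the sketch below lists the steps at which checkpoints would be written. It assumes the recipe saves at every multiple of SAVE_ITER up to and including MAX_ITER; that cadence is inferred from the knob descriptions, not verified against cosmos-framework. The helper name checkpoint_steps is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the steps at which a checkpoint would be saved,
# ASSUMING one save at every multiple of SAVE_ITER up to and including MAX_ITER.
checkpoint_steps() {
  local max_iter="$1" save_iter="$2"
  # seq FIRST INCREMENT LAST, joined onto one space-separated line
  seq "$save_iter" "$save_iter" "$max_iter" | paste -sd' ' -
}

echo "smoke test: $(checkpoint_steps 10 5)"
echo "defaults:   $(checkpoint_steps 500 100)"
```

Under that assumption, the smoke test saves at steps 5 and 10, and the default run saves at steps 100 through 500.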

Monitor and manage it:

sky jobs queue            # status of all managed jobs
sky jobs logs -n cosmos3  # stream logs
sky jobs cancel -n cosmos3

Tunable knobs (`--env`)

| Env var | Default | Meaning |
|---|---|---|
| `DATASET_PATH` | bridge subset | Dataset dir the launcher trains on (override for your own data). |
| `MAX_ITER` | `500` | Number of optimizer steps (set small for a smoke test). |
| `SAVE_ITER` | `100` | Save a DCP checkpoint every N steps. |
| `EXPORT_SAFETENSORS` | `1` | Export the trained checkpoint to HF safetensors. |
| `COSMOS_FRAMEWORK_REF` | pinned commit | cosmos-framework git ref to install. |
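
MAX_ITER and SAVE_ITER take effect because the run script rewrites the matching keys in the recipe TOML with sed before launching training. The sketch below reproduces that substitution on a throwaway file. The [trainer] table and the file contents are invented for illustration (the real vision_sft_nano.toml may be organized differently), set_toml_key is a hypothetical helper, and sed -i is used in its GNU form, as in the job YAML.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite every `key = ...` line of a TOML file in place,
# using the same sed pattern as the job's run script (GNU sed -i syntax).
set_toml_key() {
  local file="$1" key="$2" value="$3"
  sed -i "s/^[[:space:]]*${key}[[:space:]]*=.*/${key} = ${value}/" "$file"
}

# Throwaway stand-in for the recipe config; table name and contents are made up.
TOML="$(mktemp)"
printf '%s\n' '[trainer]' 'max_iter = 500' '  save_iter=100' 'lr = 1e-4' > "$TOML"

set_toml_key "$TOML" max_iter 10
set_toml_key "$TOML" save_iter 5
cat "$TOML"
```

The pattern matches the key in any TOML table, so a key that appears in more than one table would be rewritten everywhere. Values are substituted unquoted, so they should be plain integers.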

Outputs

Checkpoints (checkpoints/iter_<N>/), the resolved config.yaml, and the exported safetensors (model/) land on the cosmos3-checkpoints volume under cosmos3/sft/vision_sft_nano/. The volume persists after the job finishes — inspect it with sky volumes ls, or mount it from another SkyPilot task (e.g. a serving job) with a volumes: block to read the exported model.
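
To locate the newest checkpoint on the volume, one option is to read latest_checkpoint.txt the same way the export step in cosmos3_nano_finetune.yaml does. The sketch below runs against a temporary directory that imitates the layout described above. latest_checkpoint_dir is a hypothetical helper, and iter_10 is a made-up example of whatever name the recipe writes into that file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the directory of the newest checkpoint under a run
# directory, or return 1 if no checkpoint has been recorded yet.
latest_checkpoint_dir() {
  local run_dir="$1"
  local marker="$run_dir/checkpoints/latest_checkpoint.txt"
  [ -f "$marker" ] || return 1
  printf '%s\n' "$run_dir/checkpoints/$(cat "$marker")"
}

# Fake run directory standing in for /checkpoints/cosmos3/sft/vision_sft_nano.
RUN_DIR="$(mktemp -d)"
mkdir -p "$RUN_DIR/checkpoints/iter_10"   # exact formatting of <N> is assumed
echo "iter_10" > "$RUN_DIR/checkpoints/latest_checkpoint.txt"

latest_checkpoint_dir "$RUN_DIR"
```

On the real volume, the same function could be called from a task that mounts cosmos3-checkpoints, passing /checkpoints/cosmos3/sft/vision_sft_nano as the run directory.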

Bring your own dataset

Put your data on a second volume and point the recipe at it. Create the volume, laid out so that it contains train/video_dataset_file.jsonl (see the dataset docs). Then, in cosmos3_nano_finetune.yaml, uncomment the dataset mount so that the volumes: block reads:

volumes:
  /checkpoints: cosmos3-checkpoints
  /my-dataset: my-dataset-volume

and launch with --env DATASET_PATH=/my-dataset.
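
Before launching, it can help to confirm that the dataset directory has the expected layout. The sketch below only checks that train/video_dataset_file.jsonl exists and is non-empty; it does not validate the JSONL schema, which is defined in the dataset docs. check_dataset_layout is a hypothetical helper, and the sample record is a placeholder rather than the real schema.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: succeed only if the dataset directory contains
# a non-empty train/video_dataset_file.jsonl. The JSONL schema is NOT validated.
check_dataset_layout() {
  local dataset_path="$1"
  local index="$dataset_path/train/video_dataset_file.jsonl"
  if [ ! -s "$index" ]; then
    echo "missing or empty: $index" >&2
    return 1
  fi
  echo "ok: $(grep -c '' "$index") record line(s) in $index"
}

# Demo on a temporary directory; the record is a placeholder, not the real schema.
DATASET="$(mktemp -d)"
mkdir -p "$DATASET/train"
echo '{"placeholder": "see the dataset docs for the real fields"}' \
  > "$DATASET/train/video_dataset_file.jsonl"

check_dataset_layout "$DATASET"
```

Run against the mounted path (for example /my-dataset), a non-zero exit status indicates that DATASET_PATH does not point at a directory with the required file.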

Using a cloud bucket instead of a volume

This example is written around Kubernetes and a volume, but neither is required. To run on a cloud (or to keep checkpoints in object storage for cross-region access), drop the volumes: block in cosmos3_nano_finetune.yaml and mount a bucket at /checkpoints instead:

file_mounts:
  /checkpoints:
    name: my-cosmos3-checkpoints  # globally-unique bucket name; SkyPilot creates it
    mode: MOUNT

Then either remove infra: kubernetes, so that SkyPilot picks the cheapest cloud and region with 8× H100/H200, or pass --infra <cloud> to pin a cloud. Add --use-spot for cheaper preemptible GPUs; the job auto-resumes from the bucket after a preemption. See Cloud Buckets.

References

Cosmos 3 blog: https://developer.nvidia.com/blog/develop-physical-ai-reasoning-world-and-action-models-with-nvidia-cosmos-3/
Technical report: https://research.nvidia.com/labs/cosmos-lab/cosmos3/technical-report.pdf
cosmos-framework: NVIDIA/cosmos-framework
NVIDIA Cosmos: NVIDIA/Cosmos
SkyPilot Volumes: https://docs.skypilot.co/en/stable/reference/volumes.html

Included files

cosmos3_checkpoints_volume.yaml

# A persistent SkyPilot volume (Kubernetes PVC) for Cosmos3 checkpoints + exports.
#
# Create it once, then the fine-tuning job mounts it at /checkpoints so the
# managed job auto-resumes from the latest checkpoint after a recovery and the
# outputs outlive the job's cluster:
#
#   sky volumes apply examples/cosmos3-finetuning/cosmos3_checkpoints_volume.yaml
#
# See https://docs.skypilot.co/en/stable/reference/volumes.html

name: cosmos3-checkpoints
type: k8s-pvc
infra: k8s  # or k8s/<context>
size: 1000Gi

config:
  # ReadWriteMany lets the volume be reattached across managed-job recoveries
  # (and shared by all nodes if you scale to multi-node). It needs a storage class
  # that supports RWX (e.g. a shared filesystem such as JuiceFS, Nebius shared FS,
  # AWS EFS, or GCP Filestore) -- set storage_class_name to one below. If your
  # cluster only has ReadWriteOnce block storage, change access_mode to
  # ReadWriteOnce (this single-node job works fine with it).
  access_mode: ReadWriteMany
  # storage_class_name: csi-mounted-fs-path-sc  # omit to use the default StorageClass

cosmos3_nano_finetune.yaml

# Fine-tune NVIDIA Cosmos3-Nano (16B omnimodal world model) as a SkyPilot managed job.
#
# Runs NVIDIA's official `vision_sft_nano` SFT recipe from
# github.com/NVIDIA/cosmos-framework: an 8-GPU FSDP fine-tune of the Cosmos3-Nano
# generation pathway on the public nvidia/bridge-v2-subset-synthetic-captions
# robot-video dataset, then exports to Hugging Face safetensors. See README.md.
#
# Checkpoints are written to a mounted SkyPilot volume (a Kubernetes PVC), so the
# managed job auto-resumes from the latest checkpoint after a recovery, and the
# outputs survive job teardown.
#
# Usage (create the checkpoint volume once, then launch):
#   sky volumes apply examples/cosmos3-finetuning/cosmos3_checkpoints_volume.yaml
#   sky jobs launch -n cosmos3 examples/cosmos3-finetuning/cosmos3_nano_finetune.yaml
#   # Quick smoke test (a few steps, still checkpoints + exports):
#   sky jobs launch -n cosmos3 examples/cosmos3-finetuning/cosmos3_nano_finetune.yaml \
#       --env MAX_ITER=10 --env SAVE_ITER=5

name: cosmos3-nano-finetune

resources:
  # Run on Kubernetes. Drop this line to let SkyPilot pick any infra (e.g. a cloud),
  # or set `infra: k8s/<context>` to pin a specific cluster.
  infra: kubernetes
  # Cosmos3 needs Ampere or newer; NVIDIA's recipe is tested on 8x H100 80GB.
  accelerators: {H100:8, H200:8}
  # NVIDIA's recommended CUDA 13 base image; the training env layers on with `uv sync`.
  image_id: docker:nvcr.io/nvidia/pytorch:25.09-py3
  disk_size: 1000

num_nodes: 1

volumes:
  # Durable checkpoint store, mounted at /checkpoints (the recipe's OUTPUT_ROOT).
  # DCP checkpoints, the resolved config, and the exported safetensors all land on
  # this volume, so they survive managed-job recovery (auto-resume) and post-job
  # teardown. Create it first with:
  #   sky volumes apply examples/cosmos3-finetuning/cosmos3_checkpoints_volume.yaml
  /checkpoints: cosmos3-checkpoints

  # Bring your own dataset: mount a second volume holding your data, then point the
  # recipe at it with `--env DATASET_PATH=/my-dataset` (that dir must contain
  # `train/video_dataset_file.jsonl`; see README), so it trains on your data instead
  # of the bridge subset `setup` downloads by default. Uncomment and set your volume:
  # /my-dataset: my-dataset-volume

envs:
  COSMOS_FRAMEWORK_REF: 411d25b2e35bc441126f48c44a4b93e1c0564274  # pinned for reproducibility
  BASE_CHECKPOINT_NAME: Cosmos3-Nano  # cosmos-framework catalog name -> nvidia/Cosmos3-Nano on HF
  # Dataset dir the launcher trains on. Defaults to the bridge subset downloaded in
  # `setup`; override to fine-tune on your own data (see the bring-your-own-dataset
  # note under volumes above).
  DATASET_PATH: examples/data/bridge-v2-subset-synthetic-captions/sft_dataset_bridge
  MAX_ITER: "500"          # optimizer steps (recipe default 500; set small for a smoke test)
  SAVE_ITER: "100"         # save a DCP checkpoint every N steps
  EXPORT_SAFETENSORS: "1"  # export the trained checkpoint to HF safetensors (1=yes, 0=no)

secrets:
  # Base model + dataset are public, so this defaults to empty (no token needed).
  # To authenticate, `export HF_TOKEN=...` and pass `--secret HF_TOKEN`.
  HF_TOKEN: ""

# Prefer a cloud bucket over a volume (e.g. for cross-region access)? Drop the
# `volumes:` block above and mount a bucket at /checkpoints instead:
#   file_mounts:
#     /checkpoints:
#       name: my-cosmos3-checkpoints  # globally-unique bucket name; SkyPilot creates it
#       mode: MOUNT
# See https://docs.skypilot.co/en/latest/reference/storage.html

setup: |
  set -e
  # The NGC image auto-activates a conda base env; drop it so uv's venv is the only one.
  conda deactivate 2>/dev/null || true

  export DEBIAN_FRONTEND=noninteractive
  SUDO=""; [ "$(id -u)" -ne 0 ] && SUDO="sudo"
  $SUDO apt-get update -y
  $SUDO apt-get install -y --no-install-recommends curl ffmpeg git git-lfs libx11-dev wget

  if ! command -v uv >/dev/null 2>&1; then
    curl -LsSf https://astral.sh/uv/install.sh | sh
  fi
  source "$HOME/.local/bin/env" 2>/dev/null || export PATH="$HOME/.local/bin:$PATH"

  # Clone cosmos-framework at the pinned commit (skip large LFS assets).
  cd "$HOME"
  if [ ! -d cosmos-framework/.git ]; then
    rm -rf cosmos-framework
    GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/NVIDIA/cosmos-framework.git
  fi
  cd cosmos-framework
  git fetch origin "$COSMOS_FRAMEWORK_REF" 2>/dev/null || git fetch origin
  git checkout "$COSMOS_FRAMEWORK_REF"

  # Install the CUDA 13 training environment.
  uv python install
  uv sync --all-extras --group=cu130-train
  source .venv/bin/activate
  export LD_LIBRARY_PATH=  # keep host CUDA libs out of the venv's torch
  [ -n "${HF_TOKEN:-}" ] && export HF_TOKEN

  # Download the dataset + Wan2.2 VAE into the launcher's default locations.
  uvx hf@latest download --repo-type dataset nvidia/bridge-v2-subset-synthetic-captions \
      --revision 46468e12ac0dd36901e9e3240d4fc7620942b5d7 \
      --local-dir examples/data/bridge-v2-subset-synthetic-captions --quiet
  uvx hf@latest download Wan-AI/Wan2.2-TI2V-5B Wan2.2_VAE.pth \
      --local-dir examples/checkpoints/wan22_vae --quiet

  # Download Cosmos3-Nano and convert it to PyTorch Distributed Checkpoint (DCP) format.
  if [ ! -d "examples/checkpoints/${BASE_CHECKPOINT_NAME}" ]; then
    python -m cosmos_framework.scripts.convert_model_to_dcp \
      -o "examples/checkpoints/${BASE_CHECKPOINT_NAME}" \
      --checkpoint-path "${BASE_CHECKPOINT_NAME}"
  fi

run: |
  set -e
  conda deactivate 2>/dev/null || true
  cd "$HOME/cosmos-framework"
  source .venv/bin/activate
  export LD_LIBRARY_PATH=
  [ -n "${HF_TOKEN:-}" ] && export HF_TOKEN
  export NPROC_PER_NODE="${SKYPILOT_NUM_GPUS_PER_NODE}"  # FSDP shards across every GPU

  # Write all training outputs to the mounted checkpoint volume so the recipe
  # auto-resumes from the latest checkpoint after a managed-job recovery.
  export OUTPUT_ROOT=/checkpoints
  RUN_DIR="$OUTPUT_ROOT/cosmos3/sft/vision_sft_nano"

  # Set step count + checkpoint frequency in the recipe TOML.
  TOML=examples/toml/sft_config/vision_sft_nano.toml
  sed -i "s/^[[:space:]]*max_iter[[:space:]]*=.*/max_iter = ${MAX_ITER}/" "$TOML"
  sed -i "s/^[[:space:]]*save_iter[[:space:]]*=.*/save_iter = ${SAVE_ITER}/" "$TOML"

  # Launch multi-GPU FSDP supervised fine-tuning.
  bash examples/launch_sft_vision_nano.sh

  # Export the fine-tuned DCP checkpoint to Hugging Face safetensors.
  if [ "${EXPORT_SAFETENSORS}" = "1" ] && [ -f "$RUN_DIR/checkpoints/latest_checkpoint.txt" ]; then
    CKPT_ITER="$(cat "$RUN_DIR/checkpoints/latest_checkpoint.txt")"
    python -m cosmos_framework.scripts.export_model \
      --checkpoint-path "$RUN_DIR/checkpoints/$CKPT_ITER" \
      --config-file "$RUN_DIR/config.yaml" \
      -o "$RUN_DIR/model"
    echo ">>> Fine-tuned safetensors written to: $RUN_DIR/model"
  fi