Source: examples/sam3-video-segmentation
Scaling Video Segmentation with SAM3 and SkyPilot Pools#
This example demonstrates how to use SAM3 (Segment Anything 3) with SkyPilot’s pools feature to process a soccer video dataset in parallel across multiple GPU workers.
SAM3 is Meta’s unified foundation model for promptable segmentation in images and videos. It can:
Detect, segment, and track objects using text or visual prompts
Handle open-vocabulary concepts specified by text phrases
Process videos with state-of-the-art accuracy
Prerequisites#
Kaggle API credentials (~/.kaggle/kaggle.json)
S3 bucket for output storage
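The Kaggle credentials file is the standard API token downloaded from your Kaggle account settings ("Create New API Token"). Its contents follow this shape (placeholder values shown):
{
  "username": "your-kaggle-username",
  "key": "your-kaggle-api-key"
}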
Quick start: Single-node testing#
For quick testing on a single node without pools, use sam3-test-single.yaml, which combines setup and run in a single task:
sky launch -c sam3-test sam3-test-single.yaml \
--env OUTPUT_BUCKET_NAME=my-bucket --secret HF_TOKEN
Note: Processing the entire dataset on a single node will be slow. Use pools (below) for production workloads.
Scaling with pools#
A pool is a collection of GPU instances that share an identical setup—dependencies, models, and datasets are installed once and reused across all jobs. Instead of provisioning new machines for each job (with cold-start delays for downloading models and datasets), pools keep workers warm and ready to execute immediately.
Why use pools for video segmentation?
Eliminate cold starts: SAM3 model loading and dataset downloads happen once during pool creation, not per job
Parallel processing: Submit dozens of jobs at once; SkyPilot automatically distributes them across available workers
Dynamic scaling: Scale workers up or down with a single command based on your throughput needs
Efficient resource use: Workers are reused across jobs, avoiding repeated setup overhead
For more details, see the SkyPilot Pools documentation.

Step 1: Create the pool#
sky jobs pool apply -p sam3-pool sam3-pool.yaml --env OUTPUT_BUCKET_NAME=my-bucket
This spins up 3 GPU workers (workers: 3) with SAM3 and the dataset pre-loaded.
Step 2: Check pool status#
sky jobs pool status sam3-pool
Wait for all workers to show READY status.
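If you prefer to poll instead of re-running the command manually, one option is to wrap the status check in watch:
# Refresh the pool status every 10 seconds until all workers show READY
watch -n 10 sky jobs pool status sam3-pool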
Step 3: Submit batch jobs#
sky jobs launch --pool sam3-pool --num-jobs 10 --secret HF_TOKEN sam3-job.yaml
This submits 10 parallel jobs to process the entire dataset. Three will start immediately (one per worker), and the rest will queue up.
Step 4: Monitor progress#
View the dashboard:
sky dashboard
The dashboard shows pool workers and their status:

Check job queue:
sky jobs queue
The jobs queue shows completed, running, and pending jobs:

View logs:
sky jobs logs <job-id>
...
(sam3-segmentation-job, pid=3213) Model loaded!
(sam3-segmentation-job, pid=3213) Processing: 87
(sam3-segmentation-job, pid=3213) 50 frames (sampled at 1 fps from 25.0 fps)
(sam3-segmentation-job, pid=3213) 0%| | 0/50 [00:00<?, ?it/s]kernels library is not installed. NMS post-processing, hole filling, and sprinkle removal will be skipped. Install it with `pip install kernels` for better mask quality.
100%|██████████| 50/50 [00:48<00:00, 1.03it/s]
...
Step 5: Scale as needed#
To process faster, scale up the pool:
sky jobs pool apply --pool sam3-pool --workers 10
sky jobs launch --pool sam3-pool --num-jobs 20 --secret HF_TOKEN sam3-job.yaml
Step 6: Cleanup#
When done, tear down the pool:
sky jobs pool down sam3-pool
How it works#
Pool configuration (sam3-pool.yaml)#
The pool YAML defines the worker infrastructure:
Workers: Number of GPU instances
Resources: L40S GPU per worker
File mounts: Kaggle credentials and S3 output bucket
Setup: Runs once per worker to install dependencies and download the dataset
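A condensed excerpt of sam3-pool.yaml showing these pieces (the full file is listed under Included files below):
pool:
  workers: 3                 # number of GPU workers

resources:
  accelerators: L40S:1       # one L40S GPU per worker

file_mounts:
  ~/.kaggle/kaggle.json: ~/.kaggle/kaggle.json   # Kaggle credentials
  /outputs:
    name: $OUTPUT_BUCKET_NAME                    # S3 bucket for datasets and results
    mode: MOUNT

setup: |
  # runs once per worker: install dependencies, download the dataset
  ...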
Job configuration (sam3-job.yaml)#
The job YAML defines the workload:
Resources: Must match pool resources (L40S GPU)
Run: Each job processes its assigned chunk of videos
Work distribution#
SkyPilot automatically distributes work using environment variables:
$SKYPILOT_JOB_RANK: Current job index (0, 1, 2, …)
$SKYPILOT_NUM_JOBS: Total number of jobs
The bash script in the run section calculates which videos each job should process based on these variables.
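The relevant excerpt from the run section of sam3-job.yaml (the full file appears under Included files below):
# Split the sorted video list evenly across jobs; the first
# (TOTAL_VIDEOS % SKYPILOT_NUM_JOBS) jobs each take one extra video.
CHUNK_SIZE=$((TOTAL_VIDEOS / SKYPILOT_NUM_JOBS))
REMAINDER=$((TOTAL_VIDEOS % SKYPILOT_NUM_JOBS))
START_IDX=$((SKYPILOT_JOB_RANK * CHUNK_SIZE))
if [ ${SKYPILOT_JOB_RANK} -lt ${REMAINDER} ]; then
  START_IDX=$((START_IDX + SKYPILOT_JOB_RANK))
  CHUNK_SIZE=$((CHUNK_SIZE + 1))
else
  START_IDX=$((START_IDX + REMAINDER))
fi
END_IDX=$((START_IDX + CHUNK_SIZE))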
Segmentation process#
The process_segmentation.py script:
Loads SAM3 model from Hugging Face
Processes each video frame-by-frame
Uses text prompts (“soccer player”, “ball”) to detect and segment objects
Overlays colored masks on video frames
Saves segmented videos and metadata to S3
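The core of the script, condensed from the full listing below (it assumes model, processor, frames, and PROMPTS are already defined as in process_segmentation.py):
# Initialize a video session over the sampled frames, add the text
# prompts, then iterate SAM3's video propagation to collect masks.
session = processor.init_video_session(
    video=frames,
    inference_device="cuda",
    processing_device="cpu",
    video_storage_device="cpu",
    dtype=torch.bfloat16,
)
session = processor.add_text_prompt(inference_session=session, text=PROMPTS)

masks_by_frame = {}
with torch.no_grad():
    for out in model.propagate_in_video_iterator(
            inference_session=session, max_frame_num_to_track=len(frames)):
        processed = processor.postprocess_outputs(session, out)
        # One mask per tracked object for this frame
        masks_by_frame[out.frame_idx] = {
            int(obj_id.item()): processed["masks"][i].float().cpu().numpy()
            for i, obj_id in enumerate(processed["object_ids"])
        }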

Output#
Results are synced to the S3 bucket specified via OUTPUT_BUCKET_NAME:
$ aws s3 ls s3://my-bucket/segmentation_results/ --recursive
2025-12-22 08:53:37 0 segmentation_results/
2025-12-22 08:54:22 0 segmentation_results/1/
2025-12-22 08:54:23 231 segmentation_results/1/1_metadata.json
2025-12-22 08:54:23 3041504 segmentation_results/1/1_segmented.mp4
2025-12-22 08:55:13 0 segmentation_results/10/
2025-12-22 08:55:13 234 segmentation_results/10/10_metadata.json
2025-12-22 08:55:13 4291581 segmentation_results/10/10_segmented.mp4
2025-12-22 08:56:12 0 segmentation_results/100/
2025-12-22 08:56:13 237 segmentation_results/100/100_metadata.json
2025-12-22 08:56:13 4232746 segmentation_results/100/100_segmented.mp4
...
Each metadata JSON contains:
Number of frames processed
Objects detected (players, balls)
Output video path
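For example, the metadata for video 1 above would look roughly like this (the field names come from process_segmentation.py; the values are illustrative):
{
  "video": "1",
  "frames_processed": 50,
  "original_fps": 25.0,
  "output_fps": 1.0,
  "objects_detected": 12,
  "players_detected": 11,
  "balls_detected": 1,
  "output_video": "/outputs/segmentation_results/1/1_segmented.mp4"
}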
Customization#
Adjust sample rate#
By default, the script samples 1 frame per second. To change this, use the --sample-fps argument:
# Sample 2 frames per second
python process_segmentation.py video.mp4 --sample-fps 2
# Process all frames (use 0 or a negative value)
python process_segmentation.py video.mp4 --sample-fps 0
Limit frames per video#
By default, all sampled frames are processed. To limit this (useful for long videos or to avoid OOM), use the --max-frames argument:
# Process up to 200 frames per video
python process_segmentation.py video.mp4 --max-frames 200
Change text prompts#
Edit the PROMPTS list in process_segmentation.py:
PROMPTS = ["person", "ball", "goal", "referee"]
Use different GPU#
Update sam3-pool.yaml and sam3-job.yaml to use a different accelerator:
resources:
accelerators: H100:1
References#
Included files#
process_segmentation.py
"""SAM3 video segmentation for soccer players and ball."""
import argparse
import gc
import json
from pathlib import Path
import shutil
import tempfile
import cv2
import numpy as np
from PIL import Image
import torch
from transformers import Sam3VideoModel
from transformers import Sam3VideoProcessor
PROMPTS = ["soccer player", "ball"]
PLAYER_COLOR = (255, 100, 100)
BALL_COLOR = (100, 255, 100)
def load_video_frames(video_path, sample_fps=1, max_frames=0):
"""Extract frames from video at given sample rate."""
cap = cv2.VideoCapture(video_path)
original_fps = cap.get(cv2.CAP_PROP_FPS)
if sample_fps <= 0 or sample_fps >= original_fps:
frame_interval = 1
output_fps = original_fps
else:
frame_interval = int(original_fps / sample_fps)
output_fps = sample_fps
frames = []
frame_count = 0
while cap.isOpened():
ret, frame = cap.read()
if not ret:
break
if frame_count % frame_interval == 0:
frames.append(
Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
if max_frames > 0 and len(frames) >= max_frames:
break
frame_count += 1
cap.release()
return frames, original_fps, output_fps
def overlay_masks(frame, masks, colors, alpha=0.5):
"""Blend segmentation masks onto frame."""
base = np.array(frame, dtype=np.float32) / 255.0
overlay = base.copy()
for obj_id, mask in masks.items():
if mask is None:
continue
mask = np.squeeze(mask).clip(0, 1).astype(np.float32)
color = np.array(colors.get(obj_id,
(255, 0, 0)), dtype=np.float32) / 255.0
m = mask[..., None]
overlay = overlay * (1 - alpha * m) + color * (alpha * m)
return Image.fromarray((overlay * 255).clip(0, 255).astype(np.uint8))
def save_video(frames, output_path, fps):
"""Write frames to video file."""
if not frames:
return
h, w = np.array(frames[0]).shape[:2]
out = cv2.VideoWriter(output_path, cv2.VideoWriter_fourcc(*'mp4v'), fps,
(w, h))
for frame in frames:
out.write(cv2.cvtColor(np.array(frame), cv2.COLOR_RGB2BGR))
out.release()
def process_video(model,
processor,
video_path,
output_dir,
sample_fps=1,
max_frames=0):
"""Run SAM3 segmentation on video and save results."""
video_name = Path(video_path).stem
print(f"Processing: {video_name}")
frames, original_fps, output_fps = load_video_frames(
video_path, sample_fps, max_frames)
if not frames:
return {"video": video_name, "error": "Could not load video frames"}
print(
f" {len(frames)} frames (sampled at {output_fps} fps from {original_fps} fps)"
)
session = processor.init_video_session(
video=frames,
inference_device="cuda",
processing_device="cpu",
video_storage_device="cpu",
dtype=torch.bfloat16,
)
session = processor.add_text_prompt(inference_session=session, text=PROMPTS)
masks_by_frame = {}
obj_to_prompt = {}
with torch.no_grad():
for out in model.propagate_in_video_iterator(
inference_session=session, max_frame_num_to_track=len(frames)):
processed = processor.postprocess_outputs(session, out)
frame_idx = out.frame_idx
for prompt, ids in processed.get("prompt_to_obj_ids", {}).items():
for obj_id in ids:
obj_to_prompt[int(obj_id)] = prompt
frame_masks = {}
for i, obj_id in enumerate(processed["object_ids"]):
mask = processed["masks"][i].float().cpu().numpy()
frame_masks[int(obj_id.item())] = (np.squeeze(mask) > 0).astype(
np.float32)
masks_by_frame[frame_idx] = frame_masks
colors = {}
for obj_id, prompt in obj_to_prompt.items():
if "player" in prompt.lower():
colors[obj_id] = PLAYER_COLOR
elif "ball" in prompt.lower():
colors[obj_id] = BALL_COLOR
output_frames = []
for i, frame in enumerate(frames):
masks = masks_by_frame.get(i, {})
output_frames.append(
overlay_masks(frame, masks, colors) if masks else frame)
# Write to temp file first (cv2.VideoWriter doesn't work well with FUSE mounts)
video_output_dir = output_dir / video_name
video_output_dir.mkdir(parents=True, exist_ok=True)
output_video_path = video_output_dir / f"{video_name}_segmented.mp4"
with tempfile.NamedTemporaryFile(suffix='.mp4', delete=False) as tmp:
tmp_path = tmp.name
try:
save_video(output_frames, tmp_path, output_fps or 1.0)
shutil.copy2(tmp_path, str(output_video_path))
finally:
Path(tmp_path).unlink(missing_ok=True)
prompts_lower = [p.lower() for p in obj_to_prompt.values()]
total_players = sum("player" in p for p in prompts_lower)
total_balls = sum("ball" in p for p in prompts_lower)
result = {
"video": video_name,
"frames_processed": len(frames),
"original_fps": original_fps,
"output_fps": output_fps,
"objects_detected": len(obj_to_prompt),
"players_detected": total_players,
"balls_detected": total_balls,
"output_video": str(output_video_path),
}
with open(video_output_dir / f"{video_name}_metadata.json", 'w') as f:
json.dump(result, f, indent=2)
print(f" Detected {total_players} player(s), {total_balls} ball(s)")
print(f" Saved to {output_video_path}")
return result
def main():
parser = argparse.ArgumentParser(description='SAM3 video segmentation')
parser.add_argument('video_path', help='Input video file')
parser.add_argument('--output-dir', default='/outputs/segmentation_results')
parser.add_argument('--sample-fps',
type=float,
default=1,
help='Sample rate (0=all frames)')
parser.add_argument('--max-frames',
type=int,
default=0,
help='Max frames (0=unlimited)')
args = parser.parse_args()
video_path = Path(args.video_path)
if not video_path.exists():
print(f"Error: Video not found: {video_path}")
return 1
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
print(f"Video: {video_path}")
print(f"Output: {output_dir}")
print(
f"Sample FPS: {args.sample_fps}, Max frames: {args.max_frames or 'unlimited'}"
)
print("\nLoading SAM3 model...")
model = Sam3VideoModel.from_pretrained("facebook/sam3").to(
"cuda", dtype=torch.bfloat16).eval()
processor = Sam3VideoProcessor.from_pretrained("facebook/sam3")
print("Model loaded!")
try:
result = process_video(model, processor, str(video_path), output_dir,
args.sample_fps, args.max_frames)
if "error" in result:
print(f"Error: {result['error']}")
return 1
print("\nDone!")
return 0
except Exception as e:
print(f"Error: {e}")
return 1
finally:
gc.collect()
torch.cuda.empty_cache()
if __name__ == "__main__":
exit(main())
requirements.txt
--extra-index-url https://download.pytorch.org/whl/cu118
kaggle==1.6.3
accelerate==1.12.0
torch==2.6.0+cu118
torchvision==0.21.0+cu118
torchaudio==2.6.0+cu118
git+https://github.com/huggingface/transformers.git@c3fb1b1a6ca1102f62b139c83a088a97e5a55477
opencv-python==4.12.0.88
pillow==12.0.0
numpy==2.2.6
sam3-job.yaml
# Job configuration for SAM3 video segmentation.
# Processes a chunk of videos based on SKYPILOT_JOB_RANK.
#
# Usage (requires a pool to be created first with sam3-pool.yaml):
#
# sky jobs launch --pool sam3-pool --num-jobs 10 --secret HF_TOKEN sam3-job.yaml
#
# Each job processes a subset of videos based on its rank.
name: sam3-segmentation-job
resources:
accelerators: L40S:1
secrets:
HF_TOKEN: null
run: |
source .venv/bin/activate
echo "Job rank: ${SKYPILOT_JOB_RANK}/${SKYPILOT_NUM_JOBS}"
# Get list of all videos
VIDEO_DIR=/outputs/datasets/soccer-videos
mapfile -t VIDEOS < <(find ${VIDEO_DIR} -name "*.mp4" | sort)
TOTAL_VIDEOS=${#VIDEOS[@]}
echo "Total videos: ${TOTAL_VIDEOS}"
# Calculate start and end indices for this job
CHUNK_SIZE=$((TOTAL_VIDEOS / SKYPILOT_NUM_JOBS))
REMAINDER=$((TOTAL_VIDEOS % SKYPILOT_NUM_JOBS))
START_IDX=$((SKYPILOT_JOB_RANK * CHUNK_SIZE))
if [ ${SKYPILOT_JOB_RANK} -lt ${REMAINDER} ]; then
START_IDX=$((START_IDX + SKYPILOT_JOB_RANK))
CHUNK_SIZE=$((CHUNK_SIZE + 1))
else
START_IDX=$((START_IDX + REMAINDER))
fi
END_IDX=$((START_IDX + CHUNK_SIZE))
echo "Processing videos ${START_IDX} to ${END_IDX}"
# Process each video in this job's chunk
for ((i=START_IDX; i<END_IDX; i++)); do
video="${VIDEOS[$i]}"
echo "Processing: $video"
python process_segmentation.py "$video" --max-frames 50 || echo "Failed: $video"
done
echo "Job complete! Results saved to S3 bucket."
sam3-pool.yaml
# Pool configuration for SAM3 video segmentation workers.
# Creates GPU workers with pre-loaded dependencies and datasets.
#
# Usage:
#
# sky jobs pool apply -p sam3-pool sam3-pool.yaml --env OUTPUT_BUCKET_NAME=my-bucket
#
# Then submit jobs with:
#
# sky jobs launch --pool sam3-pool --num-jobs 10 --secret HF_TOKEN sam3-job.yaml
pool:
workers: 3
resources:
accelerators: L40S:1
envs:
OUTPUT_BUCKET_NAME: # S3 bucket for storing datasets and results
file_mounts:
~/.kaggle/kaggle.json: ~/.kaggle/kaggle.json
/outputs:
name: $OUTPUT_BUCKET_NAME
mode: MOUNT
workdir: .
setup: |
# Setup runs once on all workers (must be non-blocking)
sudo apt-get update && sudo apt-get install -y unzip ffmpeg
uv venv .venv --python 3.12
source .venv/bin/activate
uv pip install -r requirements.txt
# Download soccer video dataset from Kaggle (store in S3 to avoid re-downloading)
DATASET_PATH=/outputs/datasets/soccer-videos
if [ ! -d "$DATASET_PATH" ]; then
echo "Downloading dataset from Kaggle to S3..."
mkdir -p /outputs/datasets
kaggle datasets download shreyamainkar/football-soccer-videos-dataset --force
unzip -q football-soccer-videos-dataset.zip -d $DATASET_PATH
rm -f football-soccer-videos-dataset.zip
fi
echo "Setup complete!"
sam3-test-single.yaml
# Single-node SAM3 video segmentation for testing.
# Combines setup and run in a single task without using pools.
#
# Usage:
#
# sky launch -c sam3-test sam3-test-single.yaml \
# --env OUTPUT_BUCKET_NAME=my-bucket --secret HF_TOKEN
#
# For production workloads, use pools (sam3-pool.yaml) instead.
resources:
accelerators: L40S:1
envs:
OUTPUT_BUCKET_NAME: # S3 bucket for storing datasets and results
file_mounts:
~/.kaggle/kaggle.json: ~/.kaggle/kaggle.json
/outputs:
name: $OUTPUT_BUCKET_NAME
mode: MOUNT
secrets:
HF_TOKEN: null
workdir: .
setup: |
# Same setup as sam3-pool.yaml
sudo apt-get update && sudo apt-get install -y unzip ffmpeg
uv venv .venv --python 3.12
source .venv/bin/activate
uv pip install -r requirements.txt
# Download soccer video dataset from Kaggle (store in S3 to avoid re-downloading)
DATASET_PATH=/outputs/datasets/soccer-videos
if [ ! -d "$DATASET_PATH" ]; then
echo "Downloading dataset from Kaggle to S3..."
mkdir -p /outputs/datasets
kaggle datasets download shreyamainkar/football-soccer-videos-dataset --force
unzip -q football-soccer-videos-dataset.zip -d $DATASET_PATH
rm -f football-soccer-videos-dataset.zip
fi
echo "Setup complete!"
run: |
source .venv/bin/activate
# Process all videos on a single node
for video in /outputs/datasets/soccer-videos/*.mp4; do
echo "Processing: $video"
python process_segmentation.py "$video" --max-frames 50 || echo "Failed: $video"
done
echo "All videos processed!"