Source: examples/sam3-video-segmentation
Scaling Video Segmentation with SAM3 and SkyPilot Pools#
This example demonstrates how to use SAM3 (Segment Anything 3) with SkyPilot’s pools feature to process a soccer video dataset in parallel across multiple GPU workers.
SAM3 is Meta’s unified foundation model for promptable segmentation in images and videos. It can:
Detect, segment, and track objects using text or visual prompts
Handle open-vocabulary concepts specified by text phrases
Process videos with state-of-the-art accuracy
Prerequisites#
Kaggle API credentials (~/.kaggle/kaggle.json)
S3 bucket for output storage
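The Kaggle credentials file is the standard API token downloaded from your Kaggle account settings ("Create New API Token"). Its contents follow this shape (placeholder values shown):
{
  "username": "your-kaggle-username",
  "key": "your-kaggle-api-key"
}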
Quick start: Single-node testing#
For quick testing on a single node without pools, use sam3-test-single.yaml, which combines setup and run in a single task:
sky launch -c sam3-test sam3-test-single.yaml \
--env OUTPUT_BUCKET_NAME=my-bucket --secret HF_TOKEN
Note: Processing the entire dataset on a single node will be slow. Use pools (below) for production workloads.
Scaling with pools#
A pool is a collection of GPU instances that share an identical setup—dependencies, models, and datasets are installed once and reused across all jobs. Instead of provisioning new machines for each job (with cold-start delays for downloading models and datasets), pools keep workers warm and ready to execute immediately.
Why use pools for video segmentation?
Eliminate cold starts: SAM3 model loading and dataset downloads happen once during pool creation, not per job
Parallel processing: Submit dozens of jobs at once; SkyPilot automatically distributes them across available workers
Dynamic scaling: Scale workers up or down with a single command based on your throughput needs
Efficient resource use: Workers are reused across jobs, avoiding repeated setup overhead
For more details, see the SkyPilot Pools documentation.

Step 1: Create the pool#
sky jobs pool apply -p sam3-pool sam3-pool.yaml --env OUTPUT_BUCKET_NAME=my-bucket
This spins up 3 GPU workers (workers: 3) with SAM3 and the dataset pre-loaded.
Step 2: Check pool status#
sky jobs pool status sam3-pool
Wait for all workers to show READY status.
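If you prefer to poll instead of re-running the command manually, one option is to wrap the status check in watch:
# Refresh the pool status every 10 seconds until all workers show READY
watch -n 10 sky jobs pool status sam3-pool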
Step 3: Submit batch jobs#
sky jobs launch --pool sam3-pool --num-jobs 10 --secret HF_TOKEN sam3-job.yaml
This submits 10 parallel jobs to process the entire dataset. Three will start immediately (one per worker), and the rest will queue up.
Step 4: Monitor progress#
View the dashboard:
sky dashboard
The dashboard shows pool workers and their status:

Check job queue:
sky jobs queue
The jobs queue shows completed, running, and pending jobs:

View logs:
sky jobs logs <job-id>
...
(sam3-segmentation-job, pid=3213) Model loaded!
(sam3-segmentation-job, pid=3213) Processing: 87
(sam3-segmentation-job, pid=3213) 50 frames (sampled at 1 fps from 25.0 fps)
(sam3-segmentation-job, pid=3213) 0%| | 0/50 [00:00<?, ?it/s]kernels library is not installed. NMS post-processing, hole filling, and sprinkle removal will be skipped. Install it with `pip install kernels` for better mask quality.
100%|██████████| 50/50 [00:48<00:00, 1.03it/s]
...
Step 5: Scale as needed#
To process faster, scale up the pool:
sky jobs pool apply --pool sam3-pool --workers 10
sky jobs launch --pool sam3-pool --num-jobs 20 --secret HF_TOKEN sam3-job.yaml
Step 6: Cleanup#
When done, tear down the pool:
sky jobs pool down sam3-pool
How it works#
Pool configuration (sam3-pool.yaml)#
The pool YAML defines the worker infrastructure:
Workers: Number of GPU instances
Resources: L40S GPU per worker
File mounts: Kaggle credentials and S3 output bucket
Setup: Runs once per worker to install dependencies and download the dataset
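A condensed excerpt of sam3-pool.yaml showing these pieces (the full file is listed under Included files below):
pool:
  workers: 3                 # number of GPU workers

resources:
  accelerators: L40S:1       # one L40S GPU per worker

file_mounts:
  ~/.kaggle/kaggle.json: ~/.kaggle/kaggle.json   # Kaggle credentials
  /outputs:
    name: $OUTPUT_BUCKET_NAME                    # S3 bucket for datasets and results
    mode: MOUNT

setup: |
  # runs once per worker: install dependencies, download the dataset
  ...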
Job configuration (sam3-job.yaml)#
The job YAML defines the workload:
Resources: Must match pool resources (L40S GPU)
Run: Each job processes its assigned chunk of videos
Work distribution#
SkyPilot automatically distributes work using environment variables:
$SKYPILOT_JOB_RANK: Current job index (0, 1, 2, …)
$SKYPILOT_NUM_JOBS: Total number of jobs
The bash script in the run section calculates which videos each job should process based on these variables.
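The relevant excerpt from the run section of sam3-job.yaml (the full file appears under Included files below):
# Split the sorted video list evenly across jobs; the first
# (TOTAL_VIDEOS % SKYPILOT_NUM_JOBS) jobs each take one extra video.
CHUNK_SIZE=$((TOTAL_VIDEOS / SKYPILOT_NUM_JOBS))
REMAINDER=$((TOTAL_VIDEOS % SKYPILOT_NUM_JOBS))
START_IDX=$((SKYPILOT_JOB_RANK * CHUNK_SIZE))
if [ ${SKYPILOT_JOB_RANK} -lt ${REMAINDER} ]; then
  START_IDX=$((START_IDX + SKYPILOT_JOB_RANK))
  CHUNK_SIZE=$((CHUNK_SIZE + 1))
else
  START_IDX=$((START_IDX + REMAINDER))
fi
END_IDX=$((START_IDX + CHUNK_SIZE))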
Segmentation process#
The process_segmentation.py script:
Loads SAM3 model from Hugging Face
Processes each video frame-by-frame
Uses text prompts (“soccer player”, “ball”) to detect and segment objects
Overlays colored masks on video frames
Saves segmented videos and metadata to S3
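The core of the script, condensed from the full listing below (it assumes model, processor, frames, and PROMPTS are already defined as in process_segmentation.py):
# Initialize a video session over the sampled frames, add the text
# prompts, then iterate SAM3's video propagation to collect masks.
session = processor.init_video_session(
    video=frames,
    inference_device="cuda",
    processing_device="cpu",
    video_storage_device="cpu",
    dtype=torch.bfloat16,
)
session = processor.add_text_prompt(inference_session=session, text=PROMPTS)

masks_by_frame = {}
with torch.no_grad():
    for out in model.propagate_in_video_iterator(
            inference_session=session, max_frame_num_to_track=len(frames)):
        processed = processor.postprocess_outputs(session, out)
        # One mask per tracked object for this frame
        masks_by_frame[out.frame_idx] = {
            int(obj_id.item()): processed["masks"][i].float().cpu().numpy()
            for i, obj_id in enumerate(processed["object_ids"])
        }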

Output#
Results are synced to the S3 bucket specified via OUTPUT_BUCKET_NAME:
$ aws s3 ls s3://my-bucket/segmentation_results/ --recursive
2025-12-22 08:53:37 0 segmentation_results/
2025-12-22 08:54:22 0 segmentation_results/1/
2025-12-22 08:54:23 231 segmentation_results/1/1_metadata.json
2025-12-22 08:54:23 3041504 segmentation_results/1/1_segmented.mp4
2025-12-22 08:55:13 0 segmentation_results/10/
2025-12-22 08:55:13 234 segmentation_results/10/10_metadata.json
2025-12-22 08:55:13 4291581 segmentation_results/10/10_segmented.mp4
2025-12-22 08:56:12 0 segmentation_results/100/
2025-12-22 08:56:13 237 segmentation_results/100/100_metadata.json
2025-12-22 08:56:13 4232746 segmentation_results/100/100_segmented.mp4
...
Each metadata JSON contains:
Number of frames processed
Objects detected (players, balls)
Output video path
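For example, the metadata for video 1 above would look roughly like this (the field names come from process_segmentation.py; the values are illustrative):
{
  "video": "1",
  "frames_processed": 50,
  "original_fps": 25.0,
  "output_fps": 1.0,
  "objects_detected": 12,
  "players_detected": 11,
  "balls_detected": 1,
  "output_video": "/outputs/segmentation_results/1/1_segmented.mp4"
}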
Customization#
Adjust sample rate#
By default, the script samples 1 frame per second. To change this, use the --sample-fps argument:
# Sample 2 frames per second
python process_segmentation.py video.mp4 --sample-fps 2
# Process all frames (use 0 or a negative value)
python process_segmentation.py video.mp4 --sample-fps 0
Limit frames per video#
By default, all sampled frames are processed. To limit this (useful for long videos or to avoid OOM), use the --max-frames argument:
# Process up to 200 frames per video
python process_segmentation.py video.mp4 --max-frames 200
Change text prompts#
Edit the PROMPTS list in process_segmentation.py:
PROMPTS = ["person", "ball", "goal", "referee"]
Use different GPU#
Update sam3-pool.yaml and sam3-job.yaml to use a different accelerator:
resources:
accelerators: H100:1
References#
Included files#
process_segmentation.py
"""SAM3 video segmentation for soccer players and ball."""
import argparse
import gc
import json
from pathlib import Path
import shutil
import tempfile
import cv2
import numpy as np
from PIL import Image
import torch
from transformers import Sam3VideoModel
from transformers import Sam3VideoProcessor
PROMPTS = ["soccer player", "ball"]
PLAYER_COLOR = (255, 100, 100)
BALL_COLOR = (100, 255, 100)
def load_video_frames(video_path, sample_fps=1, max_frames=0):
"""Extract frames from video at given sample rate."""
cap = cv2.VideoCapture(video_path)
original_fps = cap.get(cv2.CAP_PROP_FPS)
if sample_fps <= 0 or sample_fps >= original_fps:
frame_interval = 1
output_fps = original_fps
else:
frame_interval = int(original_fps / sample_fps)
output_fps = sample_fps
frames = []
frame_count = 0
while cap.isOpened():
ret, frame = cap.read()
if not ret:
break
if frame_count % frame_interval == 0:
frames.append(
Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
if max_frames > 0 and len(frames) >= max_frames:
break
frame_count += 1
cap.release()
return frames, original_fps, output_fps
def overlay_masks(frame, masks, colors, alpha=0.5):
"""Blend segmentation masks onto frame."""
base = np.array(frame, dtype=np.float32) / 255.0
overlay = base.copy()
for obj_id, mask in masks.items():
if mask is None:
continue
mask = np.squeeze(mask).clip(0, 1).astype(np.float32)
color = np.array(colors.get(obj_id,
(255, 0, 0)), dtype=np.float32) / 255.0
m = mask[..., None]
overlay = overlay * (1 - alpha * m) + color * (alpha * m)
return Image.fromarray((overlay * 255).clip(0, 255).astype(np.uint8))
def save_video(frames, output_path, fps):
"""Write frames to video file."""
if not frames:
return
h, w = np.array(frames[0]).shape[:2]
out = cv2.VideoWriter(output_path, cv2.VideoWriter_fourcc(*'mp4v'), fps,
(w, h))
for frame in frames:
out.write(cv2.cvtColor(np.array(frame), cv2.COLOR_RGB2BGR))
out.release()
def process_video(model,
processor,
video_path,
output_dir,
sample_fps=1,
max_frames=0):
"""Run SAM3 segmentation on video and save results."""
video_name = Path(video_path).stem
print(f"Processing: {video_name}")
frames, original_fps, output_fps = load_video_frames(
video_path, sample_fps, max_frames)
if not frames:
return {"video": video_name, "error": "Could not load video frames"}
print(
f" {len(frames)} frames (sampled at {output_fps} fps from {original_fps} fps)"
)
session = processor.init_video_session(
video=frames,
inference_device="cuda",
processing_device="cpu",
video_storage_device="cpu",
dtype=torch.bfloat16,
)
session = processor.add_text_prompt(inference_session=session, text=PROMPTS)
masks_by_frame = {}
obj_to_prompt = {}
with torch.no_grad():
for out in model.propagate_in_video_iterator(
inference_session=session, max_frame_num_to_track=len(frames)):
processed = processor.postprocess_outputs(session, out)
frame_idx = out.frame_idx
for prompt, ids in processed.get("prompt_to_obj_ids", {}).items():
for obj_id in ids:
obj_to_prompt[int(obj_id)] = prompt
frame_masks = {}
for i, obj_id in enumerate(processed["object_ids"]):
mask = processed["masks"][i].float().cpu().numpy()
frame_masks[int(obj_id.item())] = (np.squeeze(mask) > 0).astype(
np.float32)
masks_by_frame[frame_idx] = frame_masks
colors = {}
for obj_id, prompt in obj_to_prompt.items():
if "player" in prompt.lower():
colors[obj_id] = PLAYER_COLOR
elif "ball" in prompt.lower():
colors[obj_id] = BALL_COLOR
output_frames = []
for i, frame in enumerate(frames):
masks = masks_by_frame.get(i, {})
output_frames.append(
overlay_masks(frame, masks, colors) if masks else frame)
# Write to temp file first (cv2.VideoWriter doesn't work well with FUSE mounts)
video_output_dir = output_dir / video_name
video_output_dir.mkdir(parents=True, exist_ok=True)
output_video_path = video_output_dir / f"{video_name}_segmented.mp4"
with tempfile.NamedTemporaryFile(suffix='.mp4', delete=False) as tmp:
tmp_path = tmp.name
try:
save_video(output_frames, tmp_path, output_fps or 1.0)
shutil.copy2(tmp_path, str(output_video_path))
finally:
Path(tmp_path).unlink(missing_ok=True)
prompts_lower = [p.lower() for p in obj_to_prompt.values()]
total_players = sum("player" in p for p in prompts_lower)
total_balls = sum("ball" in p for p in prompts_lower)
result = {
"video": video_name,
"frames_processed": len(frames),
"original_fps": original_fps,
"output_fps": output_fps,
"objects_detected": len(obj_to_prompt),
"players_detected": total_players,
"balls_detected": total_balls,
"output_video": str(output_video_path),
}
with open(video_output_dir / f"{video_name}_metadata.json", 'w') as f:
json.dump(result, f, indent=2)
print(f" Detected {total_players} player(s), {total_balls} ball(s)")
print(f" Saved to {output_video_path}")
return result
def main():
parser = argparse.ArgumentParser(description='SAM3 video segmentation')
parser.add_argument('video_path', help='Input video file')
parser.add_argument('--output-dir', default='/outputs/segmentation_results')
parser.add_argument('--sample-fps',
type=float,
default=1,
help='Sample rate (0=all frames)')
parser.add_argument('--max-frames',
type=int,
default=0,
help='Max frames (0=unlimited)')
args = parser.parse_args()
video_path = Path(args.video_path)
if not video_path.exists():
print(f"Error: Video not found: {video_path}")
return 1
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
print(f"Video: {video_path}")
print(f"Output: {output_dir}")
print(
f"Sample FPS: {args.sample_fps}, Max frames: {args.max_frames or 'unlimited'}"
)
print("\nLoading SAM3 model...")
model = Sam3VideoModel.from_pretrained("facebook/sam3").to(
"cuda", dtype=torch.bfloat16).eval()
processor = Sam3VideoProcessor.from_pretrained("facebook/sam3")
print("Model loaded!")
try:
result = process_video(model, processor, str(video_path), output_dir,
args.sample_fps, args.max_frames)
if "error" in result:
print(f"Error: {result['error']}")
return 1
print("\nDone!")
return 0
except Exception as e:
print(f"Error: {e}")
return 1
finally:
gc.collect()
torch.cuda.empty_cache()
if __name__ == "__main__":
exit(main())
requirements.txt
--extra-index-url https://download.pytorch.org/whl/cu118
kaggle==1.6.3
accelerate==1.12.0
torch==2.6.0+cu118
torchvision==0.21.0+cu118
torchaudio==2.6.0+cu118
git+https://github.com/huggingface/transformers.git@c3fb1b1a6ca1102f62b139c83a088a97e5a55477
opencv-python==4.12.0.88
pillow==12.0.0
numpy==2.2.6
sam3-job.yaml
# Job configuration for SAM3 video segmentation.
# Processes a chunk of videos based on SKYPILOT_JOB_RANK.
#
# Usage (requires a pool to be created first with sam3-pool.yaml):
#
# sky jobs launch --pool sam3-pool --num-jobs 10 --secret HF_TOKEN sam3-job.yaml
#
# Each job processes a subset of videos based on its rank.
name: sam3-segmentation-job
resources:
accelerators: L40S:1
secrets:
HF_TOKEN: null
run: |
source .venv/bin/activate
echo "Job rank: ${SKYPILOT_JOB_RANK}/${SKYPILOT_NUM_JOBS}"
# Get list of all videos
VIDEO_DIR=/outputs/datasets/soccer-videos
mapfile -t VIDEOS < <(find ${VIDEO_DIR} -name "*.mp4" | sort)
TOTAL_VIDEOS=${#VIDEOS[@]}
echo "Total videos: ${TOTAL_VIDEOS}"
# Calculate start and end indices for this job
CHUNK_SIZE=$((TOTAL_VIDEOS / SKYPILOT_NUM_JOBS))
REMAINDER=$((TOTAL_VIDEOS % SKYPILOT_NUM_JOBS))
START_IDX=$((SKYPILOT_JOB_RANK * CHUNK_SIZE))
if [ ${SKYPILOT_JOB_RANK} -lt ${REMAINDER} ]; then
START_IDX=$((START_IDX + SKYPILOT_JOB_RANK))
CHUNK_SIZE=$((CHUNK_SIZE + 1))
else
START_IDX=$((START_IDX + REMAINDER))
fi
END_IDX=$((START_IDX + CHUNK_SIZE))
echo "Processing videos ${START_IDX} to ${END_IDX}"
# Process each video in this job's chunk
for ((i=START_IDX; i<END_IDX; i++)); do
video="${VIDEOS[$i]}"
echo "Processing: $video"
python process_segmentation.py "$video" --max-frames 50 || echo "Failed: $video"
done
echo "Job complete! Results saved to S3 bucket."
sam3-pool.yaml
# Pool configuration for SAM3 video segmentation workers.
# Creates GPU workers with pre-loaded dependencies and datasets.
#
# Usage:
#
# sky jobs pool apply -p sam3-pool sam3-pool.yaml --env OUTPUT_BUCKET_NAME=my-bucket
#
# Then submit jobs with:
#
# sky jobs launch --pool sam3-pool --num-jobs 10 --secret HF_TOKEN sam3-job.yaml
pool:
workers: 3
resources:
accelerators: L40S:1
envs:
OUTPUT_BUCKET_NAME: # S3 bucket for storing datasets and results
file_mounts:
~/.kaggle/kaggle.json: ~/.kaggle/kaggle.json
/outputs:
name: $OUTPUT_BUCKET_NAME
mode: MOUNT
workdir: .
setup: |
# Setup runs once on all workers (must be non-blocking)
sudo apt-get update && sudo apt-get install -y unzip ffmpeg
uv venv .venv --python 3.12
source .venv/bin/activate
uv pip install -r requirements.txt
# Download soccer video dataset from Kaggle (store in S3 to avoid re-downloading)
DATASET_PATH=/outputs/datasets/soccer-videos
if [ ! -d "$DATASET_PATH" ]; then
echo "Downloading dataset from Kaggle to S3..."
mkdir -p /outputs/datasets
kaggle datasets download shreyamainkar/football-soccer-videos-dataset --force
unzip -q football-soccer-videos-dataset.zip -d $DATASET_PATH
rm -f football-soccer-videos-dataset.zip
fi
echo "Setup complete!"
sam3-test-single.yaml
# Single-node SAM3 video segmentation for testing.
# Combines setup and run in a single task without using pools.
#
# Usage:
#
# sky launch -c sam3-test sam3-test-single.yaml \
# --env OUTPUT_BUCKET_NAME=my-bucket --secret HF_TOKEN
#
# For production workloads, use pools (sam3-pool.yaml) instead.
resources:
accelerators: L40S:1
envs:
OUTPUT_BUCKET_NAME: # S3 bucket for storing datasets and results
file_mounts:
~/.kaggle/kaggle.json: ~/.kaggle/kaggle.json
/outputs:
name: $OUTPUT_BUCKET_NAME
mode: MOUNT
secrets:
HF_TOKEN: null
workdir: .
setup: |
# Same setup as sam3-pool.yaml
sudo apt-get update && sudo apt-get install -y unzip ffmpeg
uv venv .venv --python 3.12
source .venv/bin/activate
uv pip install -r requirements.txt
# Download soccer video dataset from Kaggle (store in S3 to avoid re-downloading)
DATASET_PATH=/outputs/datasets/soccer-videos
if [ ! -d "$DATASET_PATH" ]; then
echo "Downloading dataset from Kaggle to S3..."
mkdir -p /outputs/datasets
kaggle datasets download shreyamainkar/football-soccer-videos-dataset --force
unzip -q football-soccer-videos-dataset.zip -d $DATASET_PATH
rm -f football-soccer-videos-dataset.zip
fi
echo "Setup complete!"
run: |
source .venv/bin/activate
# Process all videos on a single node
for video in /outputs/datasets/soccer-videos/*.mp4; do
echo "Processing: $video"
python process_segmentation.py "$video" --max-frames 50 || echo "Failed: $video"
done
echo "All videos processed!"