Source: examples/autoresearch
# Parallel Autoresearch with SkyPilot
Run karpathy/autoresearch experiments in parallel on cloud GPUs using the SkyPilot skill. A local coding agent uses the skill to spin up GPU clusters, submit experiments, and parallelize work across multiple clusters.
For a deep dive into methodology and results, check out the blog post.
## Architecture
```
Local Agent (Claude Code, Codex, etc.)
 |
 |-- uses the SkyPilot skill to:
 |     - launch GPU clusters
 |     - submit experiment jobs (from experiment.yaml)
 |     - check job status and stream logs
 |     - SSH into clusters to inspect results
 |     - tear down idle clusters
 |
 |-- edits train.py with hypotheses
 |-- tracks results in a local results.tsv
```
- **Agent runs locally**: it generates hypotheses and edits `train.py`.
- **SkyPilot skill handles all infrastructure**: the agent tells it what to do and the skill translates that into SkyPilot commands.
- **SSH access**: the agent SSHes into clusters to check experiment output directly.
## Files
| File | Purpose |
|---|---|
| `experiment.yaml` | SkyPilot task template: runs one experiment on a GPU cluster |
| `instructions.md` | Agent instructions for using SkyPilot (give this to your coding agent) |
The original program.md from the autoresearch repo is fetched automatically during setup. It contains the experiment rules (what to modify, goals, constraints).
## Prerequisites
- **SkyPilot**: install it following the SkyPilot installation guide.
- **Cloud credentials**: configure at least one cloud provider, then verify with `sky check`. See the cloud setup docs for details.
## Quick Start
```bash
# 1. Clone autoresearch
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch

# 2. Copy SkyPilot files into the repo
cp /path/to/this/example/experiment.yaml .
cp /path/to/this/example/instructions.md .

# 3. Prepare data
pip install uv && uv sync && uv run prepare.py

# 4. Install the SkyPilot skill for your agent
# See: https://docs.skypilot.co/en/latest/getting-started/skill.html

# 5. Tell your agent to start
# "Read instructions.md and start running parallel experiments"
```
The SkyPilot skill handles installation, credential setup, and all cluster operations. If SkyPilot isn't installed yet, the skill will guide you through that too.
## Results
Parallel agent (16 GPUs) reaches the same best validation loss 9x faster than sequential (1 GPU).
In an 8-hour run across 16 GPUs, the parallel agent:

- Submitted ~910 experiments (~90/hour vs ~10/hour sequential)
- Achieved a 9x speedup to reach the same best validation loss
- Improved val_bpb from 1.003 → 0.974 (2.87% improvement)
- Cost: $9 in Claude API calls + $300 in GPU compute
Parallelism changes the agent’s search strategy. Instead of greedy sequential hill-climbing, it explores a grid — testing multiple hyperparameters simultaneously and cross-referencing results. The agent also independently discovered a two-tier strategy: screening hypotheses on cheaper H100s, then promoting winners to faster H200s, without being told to do so.
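The grid-style search is easy to sketch as a launcher loop. Everything here is illustrative — the learning-rate values, cluster names, and experiment IDs are not from the original runs — and `DRY_RUN=1` only prints the commands so nothing launches by accident:

```shell
#!/usr/bin/env bash
# Sketch: fan out a small learning-rate grid, one experiment per cluster.
# Set DRY_RUN=0 to actually launch instead of printing the commands.
set -euo pipefail
DRY_RUN=${DRY_RUN:-1}

i=0
for lr in 3e-4 6e-4 1e-3 2e-3; do
  i=$((i + 1))
  # Build the sky launch command for this grid point.
  cmd=(sky launch "$(printf 'gpu-%02d' "$i")" experiment.yaml
       --env "EXPERIMENT_ID=exp-lr-$lr" --env "EXPERIMENT_DESC=lr=$lr" -d -y)
  if [ "$DRY_RUN" = 1 ]; then echo "${cmd[*]}"; else "${cmd[@]}"; fi
done
```

Each grid point gets its own detached (`-d`) launch, so the loop returns immediately and the agent can keep generating hypotheses while results stream in.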
## Included files
### `experiment.yaml`
```yaml
resources:
  accelerators: {H100:1, H200:1}
  image_id: docker:nvcr.io/nvidia/pytorch:24.07-py3

workdir: .

envs:
  EXPERIMENT_ID: baseline
  EXPERIMENT_DESC: "baseline run"

setup: |
  pip install uv
  uv sync
  uv run prepare.py

run: |
  # Run the experiment (5-min fixed budget)
  uv run train.py 2>&1 | tee run.log
  EXIT_CODE=${PIPESTATUS[0]}
  if [ $EXIT_CODE -ne 0 ]; then
    echo "EXPERIMENT_STATUS: crash"
    echo "Experiment ${EXPERIMENT_ID} crashed. Check logs above for details."
  else
    VAL_BPB=$(grep "^val_bpb:" run.log | awk '{print $2}')
    PEAK_VRAM=$(grep "^peak_vram_mb:" run.log | awk '{print $2}')
    if [ -z "$VAL_BPB" ] || [ -z "$PEAK_VRAM" ]; then
      echo "EXPERIMENT_STATUS: crash"
      echo "Experiment ${EXPERIMENT_ID} finished but failed to parse metrics from run.log."
    else
      MEMORY_GB=$(echo "scale=1; ${PEAK_VRAM} / 1024" | bc)
      echo "EXPERIMENT_STATUS: done"
      echo "EXPERIMENT_RESULT: ${EXPERIMENT_ID} val_bpb=${VAL_BPB} memory_gb=${MEMORY_GB}"
    fi
  fi
  echo "EXPERIMENT_DESC: ${EXPERIMENT_DESC}"
```
### `instructions.md`
You are an autonomous research agent running parallel GPU experiments via SkyPilot.
#### Setup

1. **Read the autoresearch rules**: fetch and read `program.md` from the original repo. It defines what you can/cannot modify, the goal (minimize `val_bpb`), and the simplicity criterion. Follow those rules.
2. **Load the SkyPilot skill**: fetch and follow the SkyPilot skill; run its "Before You Start" bootstrap to confirm SkyPilot is installed and credentials are configured.
3. **Read the autoresearch codebase**: read `README.md`, `prepare.py`, and `train.py` for full context.
4. **Ask about infra preference**: ask if the user prefers a specific cloud (e.g. `--infra nebius`, `--infra kubernetes`). If so, set `infra:` in the YAML. Otherwise SkyPilot picks the cheapest option.
#### Launching Experiments
Use the SkyPilot skill for all infrastructure operations. The template `experiment.yaml` defines a single experiment run. Name clusters `gpu-01`, `gpu-02`, etc.; each cluster can run multiple experiments over time.
Launch a cluster:

```bash
sky launch gpu-01 experiment.yaml --env EXPERIMENT_ID=exp-01 --env EXPERIMENT_DESC="baseline" -d -y
```

Pipeline experiments on the same cluster (back-to-back via the job queue):

```bash
sky exec gpu-01 experiment.yaml --env EXPERIMENT_ID=exp-02 --env EXPERIMENT_DESC="increase LR" -d
```
**Workdir isolation**: SkyPilot snapshots the working directory at submission time. To run different `train.py` variants in parallel, copy files to a per-experiment folder and use `--workdir`:

```bash
mkdir -p /tmp/autoresearch/exp-03
cp train.py prepare.py pyproject.toml experiment.yaml /tmp/autoresearch/exp-03/
# edit /tmp/autoresearch/exp-03/train.py
sky launch gpu-03 experiment.yaml --workdir /tmp/autoresearch/exp-03 --env EXPERIMENT_ID=exp-03 --env EXPERIMENT_DESC="wider model" -d -y
```
Keep at most 4 clusters running at a time.
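One way to honor that cap is a small guard before each `sky launch`. This is a sketch: it assumes the `sky status` table prints one row per live cluster containing an `UP` status column, so adjust the pattern if your output differs:

```shell
#!/usr/bin/env bash
# Sketch: a capacity guard before launching a new cluster.
# Assumes `sky status` prints one row per cluster with an " UP " column.
set -euo pipefail
MAX_CLUSTERS=4

count_up() {  # count_up <sky-status-text> -> number of UP clusters
  printf '%s\n' "$1" | grep -c ' UP ' || true
}

# Tolerate sky being absent or erroring; an empty table counts as zero.
status=$(sky status 2>/dev/null || true)
if [ "$(count_up "$status")" -ge "$MAX_CLUSTERS" ]; then
  echo "At capacity: reuse an existing cluster with sky exec instead."
fi
```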
#### Checking Results

Use `sky logs` to stream job output:

```bash
sky logs gpu-01      # latest job
sky logs gpu-01 2    # specific job ID
```

Or SSH in and inspect directly (workdir syncs to `~/sky_workdir`):

```bash
ssh gpu-01
cd ~/sky_workdir
tail -20 run.log
```

Check status:

```bash
sky status           # all clusters
sky queue gpu-01     # jobs on a specific cluster
```
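The `EXPERIMENT_RESULT` lines that `experiment.yaml` prints are easy to parse once a log has been streamed or copied locally. A minimal sketch (the log line below is illustrative, not from a real run):

```shell
#!/usr/bin/env bash
# Sketch: pull val_bpb out of the EXPERIMENT_RESULT line emitted by
# experiment.yaml's run section.
set -euo pipefail

val_bpb_of() {  # val_bpb_of <log-line> -> the val_bpb value
  printf '%s\n' "$1" | sed -n 's/.*val_bpb=\([0-9.]*\).*/\1/p'
}

line='EXPERIMENT_RESULT: exp-07 val_bpb=0.981200 memory_gb=44.0'
val_bpb_of "$line"  # prints 0.981200
```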
#### Tracking Results

Maintain a local `results.tsv` (tab-separated):

```
experiment_id  status   val_bpb   memory_gb  description
exp-01         keep     0.997900  44.0       baseline
exp-02         discard  1.005000  44.0       switch to GeLU
exp-03         crash    0.000000  0.0        double width (OOM)
```

Status: `keep` (improvement), `discard` (no improvement), `crash` (failed).
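Ranking the runs in that file takes only standard tools. A self-contained sketch with illustrative rows (it writes to a temp file rather than your real `results.tsv`):

```shell
#!/usr/bin/env bash
# Sketch: seed a results.tsv and rank completed runs by val_bpb.
set -euo pipefail
TSV=$(mktemp)

# Header plus a few illustrative rows (fields are tab-separated).
printf 'experiment_id\tstatus\tval_bpb\tmemory_gb\tdescription\n'  > "$TSV"
printf 'exp-01\tkeep\t0.997900\t44.0\tbaseline\n'                 >> "$TSV"
printf 'exp-02\tdiscard\t1.005000\t44.0\tswitch to GeLU\n'        >> "$TSV"
printf 'exp-04\tkeep\t0.989000\t45.0\trotary tweak\n'             >> "$TSV"

# Best first: skip the header and crashed rows, sort numerically on column 3.
best=$(awk -F'\t' 'NR > 1 && $2 != "crash"' "$TSV" | sort -t$'\t' -k3,3g)
echo "$best" | head -1  # prints the exp-04 row (lowest val_bpb)
```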
#### The Experiment Loop

LOOP FOREVER:

1. **Check state**: review `results.tsv`, `sky status`, `sky queue`.
2. **Pick an untried idea.**
3. **Prepare**: copy code to a per-job folder, edit `train.py`.
4. **Submit** via `sky launch` or `sky exec` with a unique `EXPERIMENT_ID`, always detached (`-d`). Don't wait; move on to the next idea.
5. **Periodically check results** via `sky logs` or SSH. If `val_bpb` improved, copy the winning `train.py` back and commit. Otherwise, log as `discard`.
6. **Tear down idle clusters**: `sky down gpu-01 -y`.
7. **Repeat.**
**Timeout**: if a run exceeds 10 minutes, treat it as a failure. **Crashes**: check logs, fix trivial issues and resubmit, or log as `crash`.
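The 10-minute rule can also be enforced on the cluster side by wrapping training with GNU `timeout` (e.g. `timeout 600 uv run train.py` in the `run` section; note the stock `experiment.yaml` above does not do this). `timeout` exits with code 124 on expiry, demonstrated here with a short `sleep` stand-in:

```shell
#!/usr/bin/env bash
# Sketch: a runaway command is killed after its wall-clock budget,
# and the 124 exit code lets the run section mark it as a crash.
set -euo pipefail

set +e
timeout 1 sleep 5   # stand-in for a training run that overshoots
code=$?
set -e

if [ "$code" -eq 124 ]; then
  echo "run exceeded its budget: mark as crash"
fi
```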
NEVER STOP: Do NOT pause to ask the human if you should continue. Work indefinitely until manually stopped. If stuck, re-read the code, combine near-misses, try radical changes.
#### Cleanup

```bash
sky down -a -y
```
### Setup script

```bash
#!/usr/bin/env bash
set -euo pipefail

AUTORESEARCH_DIR="autoresearch"
EXAMPLES_BASE="https://raw.githubusercontent.com/skypilot-org/skypilot/master/examples/autoresearch"

echo "=== Autoresearch + SkyPilot setup ==="
echo ""

# 1. Install uv if missing
if ! command -v uv &>/dev/null; then
  echo "[1/4] Installing uv..."
  curl -LsSf https://astral.sh/uv/install.sh | sh
  # uv's installer drops the binary here; make it available for the rest of this script
  export PATH="${UV_TOOL_BIN_DIR:-$HOME/.local/bin}:$HOME/.cargo/bin:$PATH"
else
  echo "[1/4] uv already installed ($(uv --version))"
fi

# 2. Install SkyPilot if missing
if ! command -v sky &>/dev/null; then
  echo "[2/4] Installing SkyPilot via uv..."
  uv tool install skypilot
  export PATH="${UV_TOOL_BIN_DIR:-$HOME/.local/bin}:$HOME/.cargo/bin:$PATH"
else
  echo "[2/4] SkyPilot already installed ($(sky --version 2>/dev/null || echo 'unknown version'))"
fi

# 3. Clone autoresearch
echo "[3/4] Cloning karpathy/autoresearch..."
if [ -d "$AUTORESEARCH_DIR" ]; then
  echo "  Directory '$AUTORESEARCH_DIR' already exists, skipping clone."
else
  git clone https://github.com/karpathy/autoresearch.git "$AUTORESEARCH_DIR"
fi

# 4. Download SkyPilot files into the cloned repo
# (dependency installation and data prep are handled by experiment.yaml on the remote cluster)
echo "[4/4] Downloading experiment.yaml and instructions.md..."
curl -fsSL "$EXAMPLES_BASE/experiment.yaml" -o "$AUTORESEARCH_DIR/experiment.yaml"
curl -fsSL "$EXAMPLES_BASE/instructions.md" -o "$AUTORESEARCH_DIR/instructions.md"

echo ""
echo "=== Credential check ==="
# Capture sky check output without letting its exit code kill the script,
# then inspect the text. (sky check exits non-zero when some clouds are absent,
# which would otherwise trip set -e / pipefail on the pipe.)
SKY_CHECK=$(sky check 2>&1 || true)
if echo "$SKY_CHECK" | grep -q ": enabled"; then
  echo "Cloud credentials OK."
else
  echo ""
  echo "No cloud credentials detected. Run 'sky check' to configure access."
  echo "Docs: https://docs.skypilot.co/en/latest/getting-started/installation.html#cloud-account-setup"
  echo ""
  echo "You can still continue - SkyPilot will prompt you when you launch the first job."
fi

echo ""
echo "================================================================"
echo "  Setup complete!"
echo "================================================================"
echo ""
echo "  Change into the working directory first:"
echo ""
echo "    cd $AUTORESEARCH_DIR"
echo ""
echo "  Open Claude Code/Codex/any agent in this directory, then paste this prompt:"
echo ""
echo "    Read instructions.md and start running parallel experiments."
echo ""
echo "  To target a specific cloud, tell your agent:"
echo "    '...set infra to aws' (or gcp, azure, kubernetes, etc.)"
echo ""
echo "  Monitor clusters:  sky status"
echo "  Stream logs:       sky logs <cluster-name>"
echo "  Tear down all:     sky down -a -y"
echo "================================================================"
```