Source: examples/autoresearch
Parallel Autoresearch with SkyPilot#
Run karpathy/autoresearch experiments in parallel on cloud GPUs using the SkyPilot skill. A local coding agent uses the skill to spin up GPU clusters, submit experiments, and parallelize work across multiple clusters.
SkyPilot Dashboard#
(Dashboard screenshot omitted in this text rendering.)
Architecture#
```text
Local Agent (Claude Code, Codex, etc.)
 |
 |-- uses the SkyPilot skill to:
 |     - launch GPU clusters
 |     - submit experiment jobs (from experiment.yaml)
 |     - check job status and stream logs
 |     - SSH into clusters to inspect results
 |     - tear down idle clusters
 |
 |-- edits train.py with hypotheses
 |-- tracks results in a local results.tsv
```
- Agent runs locally — generates hypotheses and edits `train.py`
- SkyPilot skill handles all infrastructure — the agent tells it what to do and the skill translates that into SkyPilot commands
- SSH access — the agent SSHes into clusters to check experiment output directly
Files#
| File | Purpose |
|---|---|
| `experiment.yaml` | SkyPilot task template: runs one experiment on a GPU cluster |
| `instructions.md` | Agent instructions for using SkyPilot (give this to your coding agent) |
The original program.md from the autoresearch repo is fetched automatically during setup. It contains the experiment rules (what to modify, goals, constraints).
Prerequisites#
- SkyPilot — install by following the SkyPilot installation guide.
- Cloud credentials — configure at least one cloud provider, then verify it with `sky check`. See the cloud setup docs for details.
Quick Start#
```shell
# 1. Clone autoresearch
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch

# 2. Copy SkyPilot files into the repo
cp /path/to/this/example/experiment.yaml .
cp /path/to/this/example/instructions.md .

# 3. Prepare data
pip install uv && uv sync && uv run prepare.py

# 4. Install the SkyPilot skill for your agent
# See: https://docs.skypilot.co/en/latest/getting-started/skill.html

# 5. Tell your agent to start
# "Read instructions.md and start running parallel experiments"
```
The SkyPilot skill handles installation, credential setup, and all cluster operations. If SkyPilot isn't installed yet, the skill will guide you through that too.
Included files#
experiment.yaml

```yaml
resources:
  accelerators: {H100:1, H200:1}
  image_id: docker:nvcr.io/nvidia/pytorch:24.07-py3

workdir: .

envs:
  EXPERIMENT_ID: baseline
  EXPERIMENT_DESC: "baseline run"

setup: |
  pip install uv
  uv sync
  uv run prepare.py

run: |
  # Run the experiment (5-min fixed budget)
  uv run train.py 2>&1 | tee run.log
  EXIT_CODE=${PIPESTATUS[0]}
  if [ $EXIT_CODE -ne 0 ]; then
    echo "EXPERIMENT_STATUS: crash"
    echo "Experiment ${EXPERIMENT_ID} crashed. Check logs above for details."
  else
    VAL_BPB=$(grep "^val_bpb:" run.log | awk '{print $2}')
    PEAK_VRAM=$(grep "^peak_vram_mb:" run.log | awk '{print $2}')
    if [ -z "$VAL_BPB" ] || [ -z "$PEAK_VRAM" ]; then
      echo "EXPERIMENT_STATUS: crash"
      echo "Experiment ${EXPERIMENT_ID} finished but failed to parse metrics from run.log."
    else
      MEMORY_GB=$(echo "scale=1; ${PEAK_VRAM} / 1024" | bc)
      echo "EXPERIMENT_STATUS: done"
      echo "EXPERIMENT_RESULT: ${EXPERIMENT_ID} val_bpb=${VAL_BPB} memory_gb=${MEMORY_GB}"
    fi
  fi
  echo "EXPERIMENT_DESC: ${EXPERIMENT_DESC}"
```
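The metric-parsing logic in the `run` section can be exercised locally against a synthetic `run.log` before any cluster is launched. A minimal sketch (awk does the division here instead of `bc`, which may not be installed everywhere):

```shell
#!/usr/bin/env bash
# Sketch: reproduce the run-section metric parsing with a fake run.log.
set -euo pipefail

cat > /tmp/run.log <<'EOF'
step 100 loss 1.23
val_bpb: 0.9979
peak_vram_mb: 45056
EOF

VAL_BPB=$(grep "^val_bpb:" /tmp/run.log | awk '{print $2}')
PEAK_VRAM=$(grep "^peak_vram_mb:" /tmp/run.log | awk '{print $2}')

# awk division to one decimal place (stands in for bc's scale=1)
MEMORY_GB=$(awk -v mb="$PEAK_VRAM" 'BEGIN {printf "%.1f", mb / 1024}')

echo "EXPERIMENT_RESULT: exp-01 val_bpb=${VAL_BPB} memory_gb=${MEMORY_GB}"
# prints: EXPERIMENT_RESULT: exp-01 val_bpb=0.9979 memory_gb=44.0
```

The `^`-anchored greps are why `train.py` must print `val_bpb:` and `peak_vram_mb:` at the start of a line; anything else fails the parse and the job is reported as a crash.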
instructions.md

Parallel Autoresearch with SkyPilot
You are an autonomous research agent running parallel GPU experiments via SkyPilot.
Setup
1. Read the autoresearch rules: fetch and read program.md from the original repo. It defines what you can and cannot modify, the goal (minimize `val_bpb`), and the simplicity criterion. Follow those rules.
2. Load the SkyPilot skill: fetch and follow the SkyPilot skill — run its “Before You Start” bootstrap to confirm SkyPilot is installed and credentials are configured.
3. Read the autoresearch codebase: read `README.md`, `prepare.py`, and `train.py` for full context.
4. Ask about infra preference: ask if the user prefers a specific cloud (e.g. `--infra nebius`, `--infra kubernetes`). If so, set `infra:` in the YAML. Otherwise SkyPilot picks the cheapest option.
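When the user does name a cloud, the pin goes into the task YAML. A minimal sketch, assuming a recent SkyPilot where `infra` lives under `resources` (older releases use `cloud:` and `region:` instead):

```yaml
resources:
  infra: kubernetes               # or e.g. nebius; omit to let SkyPilot pick the cheapest offer
  accelerators: {H100:1, H200:1}
```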
Launching Experiments
Use the SkyPilot skill for all infrastructure operations. The template experiment.yaml defines a single experiment run. Name clusters gpu-01, gpu-02, etc. — each cluster can run multiple experiments over time.
Launch a cluster:
```shell
sky launch -c gpu-01 experiment.yaml --env EXPERIMENT_ID=exp-01 --env EXPERIMENT_DESC="baseline" -d -y
```
Pipeline experiments on the same cluster (back-to-back via the job queue):
```shell
sky exec gpu-01 experiment.yaml --env EXPERIMENT_ID=exp-02 --env EXPERIMENT_DESC="increase LR" -d
```
Workdir isolation: SkyPilot snapshots the working directory at submission time. To run different train.py variants in parallel, copy files to a per-experiment folder and use --workdir:
```shell
mkdir -p /tmp/autoresearch/exp-03
cp train.py prepare.py pyproject.toml experiment.yaml /tmp/autoresearch/exp-03/
# edit /tmp/autoresearch/exp-03/train.py
sky launch -c gpu-03 experiment.yaml --workdir /tmp/autoresearch/exp-03 --env EXPERIMENT_ID=exp-03 --env EXPERIMENT_DESC="wider model" -d -y
```
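The copy-and-edit steps repeat for every experiment, so they can be wrapped in a small helper. A sketch under assumptions: `prep_exp` is a hypothetical helper (not part of the example), and the demo uses empty stand-in files instead of the real repo:

```shell
#!/usr/bin/env bash
# Sketch: prepare an isolated per-experiment workdir before sky launch.
set -euo pipefail

prep_exp() {
  local exp_id="$1"
  local dst="/tmp/autoresearch/${exp_id}"
  mkdir -p "$dst"
  # Copy only what the task needs; experiment.yaml rides along with the code.
  cp train.py prepare.py pyproject.toml experiment.yaml "$dst/"
  echo "$dst"
}

# Demo with dummy files standing in for the real repo:
cd "$(mktemp -d)"
touch train.py prepare.py pyproject.toml experiment.yaml
WORKDIR=$(prep_exp exp-03)
ls "$WORKDIR"
```

Because SkyPilot snapshots `--workdir` at submission time, the agent is free to edit the repo copy of `train.py` for the next idea as soon as the previous launch returns.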
Keep at most 4 clusters running at a time.
Checking Results
Use sky logs to stream job output:
```shell
sky logs gpu-01      # latest job
sky logs gpu-01 2    # specific job ID
```
Or SSH in and inspect directly (workdir syncs to ~/sky_workdir):
```shell
ssh gpu-01
cd ~/sky_workdir
tail -20 run.log
```
Check status:
```shell
sky status           # all clusters
sky queue gpu-01     # jobs on a specific cluster
```
Tracking Results
Maintain a local results.tsv (tab-separated):
```text
experiment_id  status   val_bpb   memory_gb  description
exp-01         keep     0.997900  44.0       baseline
exp-02         discard  1.005000  44.0       switch to GeLU
exp-03         crash    0.000000  0.0        double width (OOM)
```
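Because the file is plain TSV, the current best run falls out of a one-line awk query. A sketch against the sample rows above (the file path and contents here are illustrative):

```shell
#!/usr/bin/env bash
# Sketch: find the kept experiment with the lowest val_bpb in results.tsv.
set -euo pipefail

printf 'experiment_id\tstatus\tval_bpb\tmemory_gb\tdescription\n'   > /tmp/results.tsv
printf 'exp-01\tkeep\t0.997900\t44.0\tbaseline\n'                  >> /tmp/results.tsv
printf 'exp-02\tdiscard\t1.005000\t44.0\tswitch to GeLU\n'         >> /tmp/results.tsv
printf 'exp-03\tcrash\t0.000000\t0.0\tdouble width (OOM)\n'        >> /tmp/results.tsv

# Skip the header, keep only status == "keep", track the numeric minimum.
BEST=$(awk -F'\t' 'NR > 1 && $2 == "keep" && (best == "" || $3 + 0 < best + 0) \
                     {best = $3; id = $1}
                   END {print id, best}' /tmp/results.tsv)
echo "best: $BEST"
# prints: best: exp-01 0.997900
```

Filtering on `keep` matters: crashed rows log `val_bpb` as `0.000000`, which would otherwise win the minimum.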
Status: keep (improvement), discard (no improvement), crash (failed).
The Experiment Loop
LOOP FOREVER:

1. Check state: review `results.tsv`, `sky status`, `sky queue`.
2. Pick an untried idea.
3. Prepare: copy code to a per-job folder, edit `train.py`.
4. Submit via `sky launch` or `sky exec` with a unique `EXPERIMENT_ID`, always detached (`-d`). Don't wait — move on to the next idea.
5. Periodically check results via `sky logs` or SSH. If `val_bpb` improved, copy the winning `train.py` back and commit. Otherwise, log as `discard`.
6. Tear down idle clusters: `sky down gpu-01 -y`.
7. Repeat.
Timeout: if a run exceeds 10 minutes, treat it as a failure. Crashes: check the logs, fix trivial issues and resubmit, or log the run as crash.
NEVER STOP: Do NOT pause to ask the human if you should continue. Work indefinitely until manually stopped. If stuck, re-read the code, combine near-misses, try radical changes.
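The 10-minute budget can also be enforced mechanically with coreutils `timeout`, which kills the command and exits with status 124 on expiry. A sketch with a deliberately short budget; the commented-out line shows how it would wrap the real run:

```shell
#!/usr/bin/env bash
# Sketch: treat any run that exceeds its wall-clock budget as a failure.
set -uo pipefail

# Real usage would be something like:
#   timeout 600 uv run train.py 2>&1 | tee run.log
STATUS=0
timeout 2 sleep 5 || STATUS=$?   # sleep 5 stands in for a run that overshoots

if [ "$STATUS" -eq 124 ]; then
  echo "EXPERIMENT_STATUS: crash (timed out)"
fi
```

Checking for exit code 124 specifically distinguishes a timeout from an ordinary crash, so the two can be logged differently in `results.tsv`.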
Cleanup
```shell
sky down -a -y
```