Source: examples/autoresearch
Parallel Autoresearch with SkyPilot#
Run karpathy/autoresearch experiments in parallel on cloud GPUs using the SkyPilot skill. A local coding agent uses the skill to spin up GPU clusters, submit experiments, and parallelize work across multiple clusters.
SkyPilot Dashboard#
(Dashboard screenshot omitted in this text rendering.)
Architecture#
```text
Local Agent (Claude Code, Codex, etc.)
 |
 |-- uses the SkyPilot skill to:
 |     - launch GPU clusters
 |     - submit experiment jobs (from experiment.yaml)
 |     - check job status and stream logs
 |     - SSH into clusters to inspect results
 |     - tear down idle clusters
 |
 |-- edits train.py with hypotheses
 |-- tracks results in a local results.tsv
```
- Agent runs locally — generates hypotheses and edits `train.py`
- SkyPilot skill handles all infrastructure — the agent tells it what to do and the skill translates that into SkyPilot commands
- SSH access — the agent SSHes into clusters to check experiment output directly
Files#
| File | Purpose |
|---|---|
| `experiment.yaml` | SkyPilot task template: runs one experiment on a GPU cluster |
| `instructions.md` | Agent instructions for using SkyPilot (give this to your coding agent) |
The original program.md from the autoresearch repo is fetched automatically during setup. It contains the experiment rules (what to modify, goals, constraints).
Prerequisites#
- SkyPilot — install by following the SkyPilot installation guide.
- Cloud credentials — configure at least one cloud provider, then verify it with `sky check`. See the cloud setup docs for details.
Quick Start#
```shell
# 1. Clone autoresearch
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch

# 2. Copy SkyPilot files into the repo
cp /path/to/this/example/experiment.yaml .
cp /path/to/this/example/instructions.md .

# 3. Prepare data
pip install uv && uv sync && uv run prepare.py

# 4. Install the SkyPilot skill for your agent
# See: https://docs.skypilot.co/en/latest/getting-started/skill.html

# 5. Tell your agent to start
# "Read instructions.md and start running parallel experiments"
```
The SkyPilot skill handles installation, credential setup, and all cluster operations. If SkyPilot isn't installed yet, the skill will guide you through that too.
Included files#
experiment.yaml

```yaml
resources:
  accelerators: {H100:1, H200:1}
  image_id: docker:nvcr.io/nvidia/pytorch:24.07-py3

workdir: .

envs:
  EXPERIMENT_ID: baseline
  EXPERIMENT_DESC: "baseline run"

setup: |
  pip install uv
  uv sync
  uv run prepare.py

run: |
  # Run the experiment (5-min fixed budget)
  uv run train.py 2>&1 | tee run.log
  EXIT_CODE=${PIPESTATUS[0]}
  if [ $EXIT_CODE -ne 0 ]; then
    echo "EXPERIMENT_STATUS: crash"
    echo "Experiment ${EXPERIMENT_ID} crashed. Check logs above for details."
  else
    VAL_BPB=$(grep "^val_bpb:" run.log | awk '{print $2}')
    PEAK_VRAM=$(grep "^peak_vram_mb:" run.log | awk '{print $2}')
    if [ -z "$VAL_BPB" ] || [ -z "$PEAK_VRAM" ]; then
      echo "EXPERIMENT_STATUS: crash"
      echo "Experiment ${EXPERIMENT_ID} finished but failed to parse metrics from run.log."
    else
      MEMORY_GB=$(echo "scale=1; ${PEAK_VRAM} / 1024" | bc)
      echo "EXPERIMENT_STATUS: done"
      echo "EXPERIMENT_RESULT: ${EXPERIMENT_ID} val_bpb=${VAL_BPB} memory_gb=${MEMORY_GB}"
    fi
  fi
  echo "EXPERIMENT_DESC: ${EXPERIMENT_DESC}"
```
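The metric-parsing logic in the `run` section can be exercised locally against a synthetic `run.log` before any cluster is launched. A minimal sketch (awk does the division here instead of `bc`, which may not be installed everywhere):

```shell
#!/usr/bin/env bash
# Sketch: reproduce the run-section metric parsing with a fake run.log.
set -euo pipefail

cat > /tmp/run.log <<'EOF'
step 100 loss 1.23
val_bpb: 0.9979
peak_vram_mb: 45056
EOF

VAL_BPB=$(grep "^val_bpb:" /tmp/run.log | awk '{print $2}')
PEAK_VRAM=$(grep "^peak_vram_mb:" /tmp/run.log | awk '{print $2}')

# awk division to one decimal place (stands in for bc's scale=1)
MEMORY_GB=$(awk -v mb="$PEAK_VRAM" 'BEGIN {printf "%.1f", mb / 1024}')

echo "EXPERIMENT_RESULT: exp-01 val_bpb=${VAL_BPB} memory_gb=${MEMORY_GB}"
# prints: EXPERIMENT_RESULT: exp-01 val_bpb=0.9979 memory_gb=44.0
```

The `^`-anchored greps are why `train.py` must print `val_bpb:` and `peak_vram_mb:` at the start of a line; anything else fails the parse and the job is reported as a crash.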
instructions.md

Parallel Autoresearch with SkyPilot
You are an autonomous research agent running parallel GPU experiments via SkyPilot.
Setup
1. Read the autoresearch rules: fetch and read program.md from the original repo. It defines what you can and cannot modify, the goal (minimize `val_bpb`), and the simplicity criterion. Follow those rules.
2. Load the SkyPilot skill: fetch and follow the SkyPilot skill — run its “Before You Start” bootstrap to confirm SkyPilot is installed and credentials are configured.
3. Read the autoresearch codebase: read `README.md`, `prepare.py`, and `train.py` for full context.
4. Ask about infra preference: ask if the user prefers a specific cloud (e.g. `--infra nebius`, `--infra kubernetes`). If so, set `infra:` in the YAML. Otherwise SkyPilot picks the cheapest option.
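When the user does name a cloud, the pin goes into the task YAML. A minimal sketch, assuming a recent SkyPilot where `infra` lives under `resources` (older releases use `cloud:` and `region:` instead):

```yaml
resources:
  infra: kubernetes               # or e.g. nebius; omit to let SkyPilot pick the cheapest offer
  accelerators: {H100:1, H200:1}
```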
Launching Experiments
Use the SkyPilot skill for all infrastructure operations. The template experiment.yaml defines a single experiment run. Name clusters gpu-01, gpu-02, etc. — each cluster can run multiple experiments over time.
Launch a cluster:
```shell
sky launch -c gpu-01 experiment.yaml --env EXPERIMENT_ID=exp-01 --env EXPERIMENT_DESC="baseline" -d -y
```
Pipeline experiments on the same cluster (back-to-back via the job queue):
```shell
sky exec gpu-01 experiment.yaml --env EXPERIMENT_ID=exp-02 --env EXPERIMENT_DESC="increase LR" -d
```
Workdir isolation: SkyPilot snapshots the working directory at submission time. To run different train.py variants in parallel, copy files to a per-experiment folder and use --workdir:
```shell
mkdir -p /tmp/autoresearch/exp-03
cp train.py prepare.py pyproject.toml experiment.yaml /tmp/autoresearch/exp-03/
# edit /tmp/autoresearch/exp-03/train.py
sky launch -c gpu-03 experiment.yaml --workdir /tmp/autoresearch/exp-03 --env EXPERIMENT_ID=exp-03 --env EXPERIMENT_DESC="wider model" -d -y
```
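The copy-and-edit steps repeat for every experiment, so they can be wrapped in a small helper. A sketch under assumptions: `prep_exp` is a hypothetical helper (not part of the example), and the demo uses empty stand-in files instead of the real repo:

```shell
#!/usr/bin/env bash
# Sketch: prepare an isolated per-experiment workdir before sky launch.
set -euo pipefail

prep_exp() {
  local exp_id="$1"
  local dst="/tmp/autoresearch/${exp_id}"
  mkdir -p "$dst"
  # Copy only what the task needs; experiment.yaml rides along with the code.
  cp train.py prepare.py pyproject.toml experiment.yaml "$dst/"
  echo "$dst"
}

# Demo with dummy files standing in for the real repo:
cd "$(mktemp -d)"
touch train.py prepare.py pyproject.toml experiment.yaml
WORKDIR=$(prep_exp exp-03)
ls "$WORKDIR"
```

Because SkyPilot snapshots `--workdir` at submission time, the agent is free to edit the repo copy of `train.py` for the next idea as soon as the previous launch returns.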
Keep at most 4 clusters running at a time.
Checking Results
Use sky logs to stream job output:
```shell
sky logs gpu-01      # latest job
sky logs gpu-01 2    # specific job ID
```
Or SSH in and inspect directly (workdir syncs to ~/sky_workdir):
```shell
ssh gpu-01
cd ~/sky_workdir
tail -20 run.log
```
Check status:
```shell
sky status           # all clusters
sky queue gpu-01     # jobs on a specific cluster
```
Tracking Results
Maintain a local results.tsv (tab-separated):
```text
experiment_id  status   val_bpb   memory_gb  description
exp-01         keep     0.997900  44.0       baseline
exp-02         discard  1.005000  44.0       switch to GeLU
exp-03         crash    0.000000  0.0        double width (OOM)
```
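Because the file is plain TSV, the current best run falls out of a one-line awk query. A sketch against the sample rows above (the file path and contents here are illustrative):

```shell
#!/usr/bin/env bash
# Sketch: find the kept experiment with the lowest val_bpb in results.tsv.
set -euo pipefail

printf 'experiment_id\tstatus\tval_bpb\tmemory_gb\tdescription\n'   > /tmp/results.tsv
printf 'exp-01\tkeep\t0.997900\t44.0\tbaseline\n'                  >> /tmp/results.tsv
printf 'exp-02\tdiscard\t1.005000\t44.0\tswitch to GeLU\n'         >> /tmp/results.tsv
printf 'exp-03\tcrash\t0.000000\t0.0\tdouble width (OOM)\n'        >> /tmp/results.tsv

# Skip the header, keep only status == "keep", track the numeric minimum.
BEST=$(awk -F'\t' 'NR > 1 && $2 == "keep" && (best == "" || $3 + 0 < best + 0) \
                     {best = $3; id = $1}
                   END {print id, best}' /tmp/results.tsv)
echo "best: $BEST"
# prints: best: exp-01 0.997900
```

Filtering on `keep` matters: crashed rows log `val_bpb` as `0.000000`, which would otherwise win the minimum.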
Status: keep (improvement), discard (no improvement), crash (failed).
The Experiment Loop
LOOP FOREVER:

1. Check state: review `results.tsv`, `sky status`, `sky queue`.
2. Pick an untried idea.
3. Prepare: copy code to a per-job folder, edit `train.py`.
4. Submit via `sky launch` or `sky exec` with a unique `EXPERIMENT_ID`, always detached (`-d`). Don't wait — move on to the next idea.
5. Periodically check results via `sky logs` or SSH. If `val_bpb` improved, copy the winning `train.py` back and commit. Otherwise, log as `discard`.
6. Tear down idle clusters: `sky down gpu-01 -y`.
7. Repeat.
Timeout: if a run exceeds 10 minutes, treat it as a failure. Crashes: check the logs, fix trivial issues and resubmit, or log the run as crash.
NEVER STOP: Do NOT pause to ask the human if you should continue. Work indefinitely until manually stopped. If stuck, re-read the code, combine near-misses, try radical changes.
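The 10-minute budget can also be enforced mechanically with coreutils `timeout`, which kills the command and exits with status 124 on expiry. A sketch with a deliberately short budget; the commented-out line shows how it would wrap the real run:

```shell
#!/usr/bin/env bash
# Sketch: treat any run that exceeds its wall-clock budget as a failure.
set -uo pipefail

# Real usage would be something like:
#   timeout 600 uv run train.py 2>&1 | tee run.log
STATUS=0
timeout 2 sleep 5 || STATUS=$?   # sleep 5 stands in for a run that overshoots

if [ "$STATUS" -eq 124 ]; then
  echo "EXPERIMENT_STATUS: crash (timed out)"
fi
```

Checking for exit code 124 specifically distinguishes a timeout from an ordinary crash, so the two can be logged differently in `results.tsv`.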
Cleanup
```shell
sky down -a -y
```