Source: examples/autonomous-code-optimization
Autonomous Code Optimization with SkyPilot#
Run autoresearch / pi-autoresearch-style optimization loops against any open-source project, enhanced with academic literature search and parallelized across cloud VMs via SkyPilot. A local coding agent uses the SkyPilot skill to spin up VMs, submit experiments, and fan work out across them.
For a deep dive into methodology and results (llama.cpp CPU inference: +15% text generation throughput), check out the blog post.
How It Works#
Local Agent (Claude Code, Codex, etc.)
|
|-- uses the SkyPilot skill to:
| - launch cloud VMs (sky launch)
| - submit experiments (sky exec + experiment.yaml)
| - check results (sky logs / ssh)
|
|-- searches literature (arxiv, forks, PRs) for optimization ideas
|-- profiles the codebase to identify bottlenecks
|-- edits source code, fans out experiments in parallel
|-- keeps winners, discards losers, re-profiles, repeats
The agent writes its own benchmark script (autoresearch.sh) and correctness checks (autoresearch.checks.sh), then runs the loop: profile -> search literature -> experiment -> commit winners -> repeat.
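The glue between the agent's benchmark script and the experiment template is a plain-text contract: autoresearch.sh prints `METRIC name=value` lines, and experiment.yaml extracts them with a grep pattern. A minimal sketch of that contract (the sample output and metric values are made up for illustration):

```shell
# autoresearch.sh prints one METRIC line per measurement; experiment.yaml
# later pulls them out with this exact pattern. Sample values are fabricated.
bench_output=$'building...\nMETRIC tokens_per_s=52.3\nMETRIC time_s=16.2'
metrics=$(grep -E '^METRIC\s+\S+=\S+' <<<"$bench_output")
echo "$metrics"
```

Anything the benchmark prints that does not match the pattern is ignored, so build noise and progress output are harmless.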
Files#
| File | Purpose |
|---|---|
| experiment.yaml | SkyPilot task template: builds, benchmarks, and checks one experiment on a cloud VM |
| instructions.md | Full agent instructions (give this to your coding agent) |
| setup.sh | One-command setup: installs SkyPilot, clones the target repo, downloads these files |
Quick Start#
# 1. Clone your target project
git clone https://github.com/<org>/<project>.git
cd <project>
# 2. Download the experiment template and agent instructions
curl -fsSL https://raw.githubusercontent.com/skypilot-org/skypilot/master/examples/autonomous-code-optimization/experiment.yaml -o experiment.yaml
curl -fsSL https://raw.githubusercontent.com/skypilot-org/skypilot/master/examples/autonomous-code-optimization/instructions.md -o instructions.md
# 3. Tell your agent to start
# "Read instructions.md and optimize <project> for <metric>.
# Use 4 SkyPilot VMs. Use AWS infra."
Or use the one-line setup:
export TARGET_REPO="https://github.com/<org>/<project>.git"
curl -fsSL https://raw.githubusercontent.com/skypilot-org/skypilot/master/examples/autonomous-code-optimization/setup.sh | bash
The SkyPilot skill handles installation, credential setup, and all VM operations.
How It Differs from the GPU Autoresearch Example#
| | GPU autoresearch example | This example |
|---|---|---|
| Target | ML training (karpathy/autoresearch) | Any OSS project |
| Compute | GPU clusters (H100/H200) | CPU VMs (cheap) |
| Search strategy | Agent brainstorms from code context | Agent reads papers + profiles bottlenecks |
| Cost | ~$300/8hr (GPU) | ~$20 (CPU VMs) + ~$9 (API) |
Included files#
experiment.yaml
# SkyPilot task template for autonomous code optimization experiments.
#
# Each experiment:
# 1. Installs dependencies (setup, runs once per VM)
# 2. Builds the project
# 3. Runs the benchmark (autoresearch.sh)
# 4. Runs correctness checks (autoresearch.checks.sh)
# 5. Reports METRIC lines and check results
#
# Usage:
# sky launch -c vm-01 experiment.yaml \
# --env EXPERIMENT_ID=exp-01 \
# --env EXPERIMENT_DESC="add SIMD to hash lookup" \
# -d -y
#
# The workdir is snapshotted at submission time, so parallel experiments
# with different code edits should use --workdir to point at separate copies.
resources:
cpus: 4+
memory: 8+
# No GPU needed -- this is for CPU-bound code optimization.
# Override with --gpus if your target needs GPU benchmarking.
workdir: .
envs:
EXPERIMENT_ID: baseline
EXPERIMENT_DESC: "baseline measurement"
# Override these per-project via --env:
BUILD_CMD: "make -j$(nproc)"
BENCH_TIMEOUT: 300
CHECK_TIMEOUT: 300
setup: |
# Install project-specific dependencies.
# Runs once when the VM is first provisioned.
cd ~/sky_workdir
if [ -f setup_deps.sh ]; then
echo ">>> Running setup_deps.sh..."
bash setup_deps.sh
else
echo ">>> No setup_deps.sh found, running BUILD_CMD: ${BUILD_CMD}"
bash -c "${BUILD_CMD}"
fi
run: |
cd ~/sky_workdir
echo "========================================="
echo "EXPERIMENT: ${EXPERIMENT_ID}"
echo "DESC: ${EXPERIMENT_DESC}"
echo "========================================="
# --- Build ---
echo ">>> Building..."
set -o pipefail
if ! bash -c "${BUILD_CMD}" 2>&1 | tail -30; then
echo "EXPERIMENT_STATUS: crash"
echo "EXPERIMENT_ERROR: build failed"
exit 0 # exit 0 so SkyPilot marks the job as SUCCEEDED for log retrieval
fi
# --- Benchmark ---
if [ ! -f autoresearch.sh ]; then
echo "EXPERIMENT_STATUS: crash"
echo "EXPERIMENT_ERROR: autoresearch.sh not found"
exit 0
fi
echo ">>> Running benchmark (timeout: ${BENCH_TIMEOUT}s)..."
BENCH_OUTPUT=$(timeout "${BENCH_TIMEOUT}" bash autoresearch.sh 2>&1) || {
EXIT_CODE=$?
echo "$BENCH_OUTPUT"
if [ $EXIT_CODE -eq 124 ]; then
echo "EXPERIMENT_STATUS: crash"
echo "EXPERIMENT_ERROR: benchmark timed out after ${BENCH_TIMEOUT}s"
else
echo "EXPERIMENT_STATUS: crash"
echo "EXPERIMENT_ERROR: benchmark failed (exit $EXIT_CODE)"
fi
exit 0
}
echo "$BENCH_OUTPUT"
# Extract METRIC lines
METRICS=$(echo "$BENCH_OUTPUT" | grep -E '^METRIC\s+\S+=\S+' || true)
if [ -z "$METRICS" ]; then
echo "EXPERIMENT_STATUS: crash"
echo "EXPERIMENT_ERROR: no METRIC lines found in benchmark output"
exit 0
fi
# --- Correctness checks ---
CHECKS_PASSED=true
if [ -f autoresearch.checks.sh ]; then
echo ">>> Running correctness checks (timeout: ${CHECK_TIMEOUT}s)..."
CHECK_OUTPUT=$(timeout "${CHECK_TIMEOUT}" bash autoresearch.checks.sh 2>&1) || {
EXIT_CODE=$?
echo "$CHECK_OUTPUT"
if [ $EXIT_CODE -eq 124 ]; then
echo "CHECKS_STATUS: timeout"
else
echo "CHECKS_STATUS: failed (exit $EXIT_CODE)"
fi
CHECKS_PASSED=false
}
if [ "$CHECKS_PASSED" = true ]; then
echo "CHECKS_STATUS: passed"
fi
else
echo "CHECKS_STATUS: skipped (no autoresearch.checks.sh)"
fi
# --- Report ---
echo "========================================="
echo "EXPERIMENT_ID: ${EXPERIMENT_ID}"
echo "EXPERIMENT_DESC: ${EXPERIMENT_DESC}"
echo "$METRICS"
if [ "$CHECKS_PASSED" = true ]; then
echo "EXPERIMENT_STATUS: done"
else
echo "EXPERIMENT_STATUS: checks_failed"
fi
echo "========================================="
instructions.md
Literature-Guided Autoresearch with SkyPilot
You are an autonomous research agent that optimizes open-source software by combining academic literature search with parallel cloud-based experiments via SkyPilot.
CRITICAL CONSTRAINTS
You must only improve performance by changing the project’s source code. The goal is to make the software genuinely faster — not to make the benchmark report better numbers.
autoresearch.sh and autoresearch.checks.sh are PROTECTED FILES — never modify them after initial creation and validation. Only modify the project's source code. Do not modify build scripts, compiler flags, benchmark parameters, server configuration, or OS settings. The user may specify which directories are in scope.
Do not change the memory allocator, build flags (optimization level, LTO, PGO, -march), or any runtime configuration. These are not source code improvements.
Phase 0: Setup
Load the SkyPilot skill: Fetch and follow the SkyPilot skill — run its “Before You Start” bootstrap.
Understand the target project: Read the README, build instructions, test suite, and existing benchmarks.
Ask the user: optimization target and metric, direction (lower/higher better), constraints (e.g., "don't change the public API"), cloud preference (e.g., --infra aws).
Create a branch:
git checkout -b autoresearch/<target>-<date>
Phase 1: Baseline and Profiling
Write benchmark and checks scripts
autoresearch.sh — outputs METRIC name=value lines. Must use set -euo pipefail, run the benchmark multiple times (typically 5), report the median, and complete in under 5 minutes.
#!/bin/bash
set -euo pipefail
TIMES=()
for i in $(seq 1 5); do
T=$({ TIMEFORMAT='%R'; time <benchmark_command> >/dev/null 2>&1; } 2>&1)
TIMES+=("$T")
done
IFS=$'\n' sorted=($(sort -n <<<"${TIMES[*]}")); unset IFS
echo "METRIC time_s=${sorted[2]}"
autoresearch.checks.sh — runs the test suite (or a fast subset). Must use set -euo pipefail and exit 0 on success. Use non-interactive flags — interactive tools hang silently.
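A minimal sketch of what autoresearch.checks.sh could look like. The check commands below are stand-ins invented for illustration; in practice you would wire in the target project's real, fast test targets:

```shell
#!/usr/bin/env bash
# Hypothetical autoresearch.checks.sh: run each named check, report pass/fail,
# and exit non-zero on the first failure (via set -e). The checks shown here
# are placeholders, not a real project's tests.
set -euo pipefail

run_check() {
  local name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "CHECK ${name}: pass"
  else
    echo "CHECK ${name}: FAIL"
    return 1
  fi
}

run_check build_artifact test -e /bin/sh        # stand-in: binary exists
run_check smoke bash -c 'echo ok | grep -q ok'  # stand-in: smoke test
# run_check unit make test                      # plug in a real test target
echo "All checks passed"
```

Because the script uses set -e and run_check returns non-zero on failure, any failing check aborts the script with a non-zero exit, which experiment.yaml reports as CHECKS_STATUS: failed.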
setup_deps.sh (optional) — installs project dependencies. Runs once during VM provisioning.
Validate scripts immediately
Do not skip this. SSH into a VM and run both scripts manually. Cross-check METRIC output against the tool’s native output. Verify the numbers are in a reasonable ballpark — a parsing bug can report 14 t/s when the real number is 52 t/s, and you’ll run entire waves of experiments against a wrong baseline before noticing.
Establish baseline
sky launch -c vm-01 experiment.yaml --env EXPERIMENT_ID=baseline --env EXPERIMENT_DESC="baseline measurement" -d -y
Record the baseline in autoresearch.md and autoresearch.jsonl.
Profile — determine the bottleneck type
Run a profiler (perf, gprof, py-spy, etc.) to identify hot functions. Determine whether the workload is compute-bound or memory-bandwidth-bound — this determines the entire optimization strategy:
Compute-bound (high CPU utilization, high IPC): optimize hot kernels — better algorithms, SIMD, loop unrolling, operator fusion.
Memory-bandwidth-bound (CPU stalls on data, low IPC): compute path changes will show near-zero improvement. Focus on reducing data movement — fuse operations, improve cache locality, reduce working set size.
Quick diagnostic: if optimizing the hottest function shows 0% improvement, the workload is likely memory-bound. Pivot immediately.
Create tracking documents
autoresearch.md: living session document (objective, metric, baseline, literature findings, experiment log)
autoresearch.ideas.md: ranked queue of experiment ideas
IMPROVEMENTS.md: final deliverable — each committed optimization with problem, solution, measured impact, and paper references
autoresearch.jsonl: append-only experiment log, one JSON line per experiment:
{"type":"config","name":"Optimize X","metricName":"time_s","metricUnit":"s","bestDirection":"lower"}
{"run":1,"metric":18.5,"metrics":{},"status":"keep","description":"baseline","timestamp":1700000000,"literature_source":null}
Phase 2: Literature Search
Search for academic papers, blog posts, and prior work relevant to the target. This is what distinguishes this approach from simple autotune — you are mining human knowledge before running experiments.
What to search for (roughly in order of value):
Forks and competitors — study forks that claim better performance. Their commit history contains proven optimizations you can adapt.
Project PRs and issues — merged PRs labeled “performance”, known bottlenecks, prior optimization attempts.
arXiv / Google Scholar — papers about optimizing the project or its domain. Download relevant PDFs to papers/ and read them with the PDF skill.
Technique papers — general optimization techniques applicable to the domain (operator fusion, SIMD, cache-oblivious algorithms, lock-free data structures).
Rank ideas by expected impact in autoresearch.ideas.md and record findings in autoresearch.md.
Phase 3: Parallel Experiments
Launching experiments
For each experiment, prepare an isolated copy and submit:
mkdir -p /tmp/autoresearch/exp-03
rsync -a --exclude='.git' --exclude='build/' . /tmp/autoresearch/exp-03/
# ... edit source files in the copy ...
sky launch -c vm-03 experiment.yaml \
--workdir /tmp/autoresearch/exp-03 \
--env EXPERIMENT_ID=exp-03 \
--env EXPERIMENT_DESC="description (paper: Author2024)" \
-d -y
Pipeline experiments on existing VMs with sky exec:
sky exec vm-01 experiment.yaml \
--workdir /tmp/autoresearch/exp-04 \
--env EXPERIMENT_ID=exp-04 \
--env EXPERIMENT_DESC="description" \
-d
Never tear down VMs. Reuse them via sky exec. Launch once, keep for the entire session.
Incremental build warning: sky exec with different workdirs can leave stale object files from previous experiments, producing binaries that mix old and new code. Use sky exec for rapid screening, but verify any promising result with a clean A/B build (see below).
Noisy VMs
Cloud VMs on shared hardware can show 5-30% variance. If stddev > 5% of the mean, the VM has noisy neighbors — stop it (sky stop vm-XX) and launch a fresh one. Only trust results where stddev < 2%.
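The noise check can be scripted. A sketch with hypothetical samples, where one outlier inflates the stddev well past the 5% threshold:

```shell
# Compute the mean and population stddev of benchmark samples with awk, and
# flag the VM as noisy when stddev exceeds 5% of the mean. Sample values
# are made up for illustration.
samples="18.1 18.3 24.9 18.2 18.4"
verdict=$(printf '%s\n' $samples | awk '
  { sum += $1; sumsq += $1 * $1; n++ }
  END {
    mean = sum / n
    sd = sqrt(sumsq / n - mean * mean)
    pct = 100 * sd / mean
    printf "mean=%.2f stddev=%.2f (%.1f%% of mean): %s\n",
           mean, sd, pct, (pct > 5 ? "noisy, replace VM" : "ok")
  }')
echo "$verdict"
```

Running the same arithmetic over each experiment's repeated benchmark runs makes the "replace noisy VMs" step mechanical rather than a judgment call.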
Clean A/B verification
Required before committing any improvement. On a single quiet VM:
ssh vm-01
cd ~/sky_workdir
# Clean build + benchmark OPTIMIZED code
rm -rf build && <full-build-command> && <benchmark-command>
# Revert, clean build + benchmark BASELINE
git checkout <baseline-commit> -- <changed-files>
rm -rf build && <full-build-command> && <benchmark-command>
# Restore optimized code
git checkout HEAD -- <changed-files>
Same compiler, hardware, thermal state, clean object files. This is the gold standard.
Decision logic
Improved + checks pass → clean A/B verify → commit winner, update baseline
Improved + checks fail → investigate, fix and retry, or discard
No improvement → discard, log, move on
Crash/timeout → check logs, fix trivial issues and resubmit, or discard
When committing a winner, include the metric delta and literature source in the commit message. Update autoresearch.md, autoresearch.ideas.md, and IMPROVEMENTS.md.
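Because experiment.yaml always exits 0 and encodes the outcome in an EXPERIMENT_STATUS line, the decision logic reduces to parsing that line out of the job's log. A sketch (the log text and action strings here are fabricated; in practice the input would come from sky logs):

```shell
# Map the last EXPERIMENT_STATUS line in a job's log to the next action.
decide() {
  local status
  status=$(sed -n 's/^EXPERIMENT_STATUS: //p' <<<"$1" | tail -1)
  case "$status" in
    done)          echo "clean A/B verify, then commit" ;;
    checks_failed) echo "investigate; fix and retry, or discard" ;;
    crash)         echo "check logs; fix trivial issues or discard" ;;
    *)             echo "no status found; inspect the job manually" ;;
  esac
}

decide "$(printf 'METRIC time_s=16.2\nEXPERIMENT_STATUS: done\n')"
```

Keeping the status vocabulary small (done / checks_failed / crash) is what lets the agent triage dozens of parallel jobs without reading full logs.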
Pivoting
If several experiments in a row show no improvement, pivot to new strategies.
The Loop
LOOP FOREVER:
1. Check state: autoresearch.ideas.md, sky status, sky queue
2. If ideas queue is empty → re-profile, search literature, study forks, add ideas
3. Pick highest-ranked untried idea
4. Prepare isolated workdir, edit source files
5. Submit via sky exec (or sky launch for new VMs), always detached (-d)
6. Don't wait — prepare and submit the next idea immediately
7. Poll results individually (sky queue <vm>, then sky logs <vm> <job-id>)
As soon as a VM finishes, submit the next experiment to it
8. Apply decision logic: keep/discard/crash/checks_failed
9. Replace noisy VMs (stddev > 5%)
10. Repeat
NEVER STOP. Do not pause to ask the human if you should continue. Work indefinitely until manually stopped. If stuck, re-read the literature, combine near-misses, try radical changes, or search for new papers.
Only tear down VMs when the user explicitly asks: sky down -a -y
setup.sh
#!/usr/bin/env bash
set -euo pipefail
# Autonomous Code Optimization with SkyPilot -- one-command setup.
#
# Usage:
# TARGET_REPO=https://github.com/valkey-io/valkey.git bash setup.sh
# TARGET_REPO=https://github.com/postgres/postgres.git bash setup.sh
#
# Or without TARGET_REPO to set up in an existing project directory:
# cd /path/to/your/project && bash setup.sh
EXAMPLES_BASE="https://raw.githubusercontent.com/skypilot-org/skypilot/master/examples/autonomous-code-optimization"
echo "=== Autonomous Code Optimization + SkyPilot setup ==="
echo ""
# 1. Install uv if missing
if ! command -v uv &>/dev/null; then
echo "[1/4] Installing uv..."
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="${UV_TOOL_BIN_DIR:-$HOME/.local/bin}:$HOME/.cargo/bin:$PATH"
else
echo "[1/4] uv already installed ($(uv --version))"
fi
# 2. Install SkyPilot if missing
if ! command -v sky &>/dev/null; then
echo "[2/4] Installing SkyPilot via uv..."
uv tool install skypilot
export PATH="${UV_TOOL_BIN_DIR:-$HOME/.local/bin}:$HOME/.cargo/bin:$PATH"
else
echo "[2/4] SkyPilot already installed ($(sky --version 2>/dev/null || echo 'unknown version'))"
fi
# 3. Clone target repo (if TARGET_REPO is set)
if [ -n "${TARGET_REPO:-}" ]; then
PROJECT_DIR=$(basename "$TARGET_REPO" .git)
echo "[3/4] Cloning $TARGET_REPO..."
if [ -d "$PROJECT_DIR" ]; then
echo " Directory '$PROJECT_DIR' already exists, skipping clone."
else
git clone --depth 1 "$TARGET_REPO" "$PROJECT_DIR"
fi
cd "$PROJECT_DIR"
else
PROJECT_DIR="$(basename "$(pwd)")"
echo "[3/4] No TARGET_REPO set, using current directory: $(pwd)"
if [ ! -d .git ]; then
echo " WARNING: Not a git repository. Run 'git init' first."
fi
fi
# 4. Download SkyPilot autoresearch files
echo "[4/4] Downloading experiment.yaml and instructions.md..."
curl -fsSL "$EXAMPLES_BASE/experiment.yaml" -o experiment.yaml
curl -fsSL "$EXAMPLES_BASE/instructions.md" -o instructions.md
echo ""
echo "=== Credential check ==="
SKY_CHECK=$(sky check 2>&1 || true)
if echo "$SKY_CHECK" | grep -q ": enabled"; then
echo "Cloud credentials OK."
else
echo ""
echo "No cloud credentials detected. Run 'sky check' to configure access."
echo "Docs: https://docs.skypilot.co/en/latest/getting-started/installation.html#cloud-account-setup"
echo ""
echo "You can still continue -- SkyPilot will prompt you when you launch the first job."
fi
echo ""
echo "================================================================"
echo " Setup complete! Project: $PROJECT_DIR"
echo "================================================================"
echo ""
echo " Open Claude Code / Codex / any coding agent in this directory,"
echo " then paste this prompt:"
echo ""
echo " Follow instructions.md to identify significant optimizations in"
echo " $PROJECT_DIR. Feel free to use latest research literature to find"
echo " creative and new techniques. Use 4 SkyPilot VMs simultaneously."
echo ""
echo " Example prompts:"
echo " 'Follow instructions.md to optimize llama.cpp for CPU inference tokens/sec."
echo " Use latest research literature. Use 4 SkyPilot VMs simultaneously. Use AWS infra.'"
echo " 'Follow instructions.md to optimize valkey for SET ops/sec."
echo " Use latest research literature. Use 4 SkyPilot VMs simultaneously. Use GCP infra.'"
echo " 'Follow instructions.md to optimize PostgreSQL for pgbench TPS."
echo " Use latest research literature. Use 4 SkyPilot VMs simultaneously. Use AWS infra.'"
echo ""
echo " Monitor VMs: sky status"
echo " Stream logs: sky logs <vm-name>"
echo " Tear down all: sky down -a -y"
echo "================================================================"