Source: examples/autonomous-code-optimization
Autonomous Code Optimization with SkyPilot#
Run autoresearch / pi-autoresearch-style optimization loops against any open-source project, enhanced with academic literature search and parallelized across cloud VMs via SkyPilot. A local coding agent uses the SkyPilot skill to spin up VMs, submit experiments, and fan work out across them.
For a deep dive into methodology and results (llama.cpp CPU inference: +15% text generation throughput), check out the blog post.
How It Works#
Local Agent (Claude Code, Codex, etc.)
|
|-- uses the SkyPilot skill to:
| - launch cloud VMs (sky launch)
| - submit experiments (sky exec + experiment.yaml)
| - check results (sky logs / ssh)
|
|-- searches literature (arxiv, forks, PRs) for optimization ideas
|-- profiles the codebase to identify bottlenecks
|-- edits source code, fans out experiments in parallel
|-- keeps winners, discards losers, re-profiles, repeats
The agent writes its own benchmark script (autoresearch.sh) and correctness checks (autoresearch.checks.sh), then runs the loop: profile -> search literature -> experiment -> commit winners -> repeat.
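The glue between the agent's benchmark script and the experiment template is a plain-text contract: autoresearch.sh prints `METRIC name=value` lines, and experiment.yaml extracts them with a grep pattern. A minimal sketch of that contract (the sample output and metric values are made up for illustration):

```shell
# autoresearch.sh prints one METRIC line per measurement; experiment.yaml
# later pulls them out with this exact pattern. Sample values are fabricated.
bench_output=$'building...\nMETRIC tokens_per_s=52.3\nMETRIC time_s=16.2'
metrics=$(grep -E '^METRIC\s+\S+=\S+' <<<"$bench_output")
echo "$metrics"
```

Anything the benchmark prints that does not match the pattern is ignored, so build noise and progress output are harmless.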
Files#
| File | Purpose |
|---|---|
| experiment.yaml | SkyPilot task template: builds, benchmarks, and checks one experiment on a cloud VM |
| instructions.md | Full agent instructions (give this to your coding agent) |
| setup.sh | One-command setup: installs SkyPilot, clones the target repo, downloads these files |
Quick Start#
# 1. Clone your target project
git clone https://github.com/<org>/<project>.git
cd <project>
# 2. Download the experiment template and agent instructions
curl -fsSL https://raw.githubusercontent.com/skypilot-org/skypilot/master/examples/autonomous-code-optimization/experiment.yaml -o experiment.yaml
curl -fsSL https://raw.githubusercontent.com/skypilot-org/skypilot/master/examples/autonomous-code-optimization/instructions.md -o instructions.md
# 3. Tell your agent to start
# "Read instructions.md and optimize <project> for <metric>.
# Use 4 SkyPilot VMs. Use AWS infra."
Or use the one-line setup:
export TARGET_REPO="https://github.com/<org>/<project>.git"
curl -fsSL https://raw.githubusercontent.com/skypilot-org/skypilot/master/examples/autonomous-code-optimization/setup.sh | bash
The SkyPilot skill handles installation, credential setup, and all VM operations.
How It Differs from the GPU Autoresearch Example#
| | GPU autoresearch example | This example |
|---|---|---|
| Target | ML training (karpathy/autoresearch) | Any OSS project |
| Compute | GPU clusters (H100/H200) | CPU VMs (cheap) |
| Search strategy | Agent brainstorms from code context | Agent reads papers + profiles bottlenecks |
| Cost | ~$300/8hr (GPU) | ~$20 (CPU VMs) + ~$9 (API) |
Included files#
experiment.yaml
# SkyPilot task template for autonomous code optimization experiments.
#
# Each experiment:
# 1. Installs dependencies (setup, runs once per VM)
# 2. Builds the project
# 3. Runs the benchmark (autoresearch.sh)
# 4. Runs correctness checks (autoresearch.checks.sh)
# 5. Reports METRIC lines and check results
#
# Usage:
# sky launch -c vm-01 experiment.yaml \
# --env EXPERIMENT_ID=exp-01 \
# --env EXPERIMENT_DESC="add SIMD to hash lookup" \
# -d -y
#
# The workdir is snapshotted at submission time, so parallel experiments
# with different code edits should use --workdir to point at separate copies.
resources:
cpus: 4+
memory: 8+
# No GPU needed -- this is for CPU-bound code optimization.
# Override with --gpus if your target needs GPU benchmarking.
workdir: .
envs:
EXPERIMENT_ID: baseline
EXPERIMENT_DESC: "baseline measurement"
# Override these per-project via --env:
BUILD_CMD: "make -j$(nproc)"
BENCH_TIMEOUT: 300
CHECK_TIMEOUT: 300
setup: |
# Install project-specific dependencies.
# Runs once when the VM is first provisioned.
cd ~/sky_workdir
if [ -f setup_deps.sh ]; then
echo ">>> Running setup_deps.sh..."
bash setup_deps.sh
else
echo ">>> No setup_deps.sh found, running BUILD_CMD: ${BUILD_CMD}"
bash -c "${BUILD_CMD}"
fi
run: |
cd ~/sky_workdir
echo "========================================="
echo "EXPERIMENT: ${EXPERIMENT_ID}"
echo "DESC: ${EXPERIMENT_DESC}"
echo "========================================="
# --- Build ---
echo ">>> Building..."
set -o pipefail
if ! bash -c "${BUILD_CMD}" 2>&1 | tail -30; then
echo "EXPERIMENT_STATUS: crash"
echo "EXPERIMENT_ERROR: build failed"
exit 0 # exit 0 so SkyPilot marks the job as SUCCEEDED for log retrieval
fi
# --- Benchmark ---
if [ ! -f autoresearch.sh ]; then
echo "EXPERIMENT_STATUS: crash"
echo "EXPERIMENT_ERROR: autoresearch.sh not found"
exit 0
fi
echo ">>> Running benchmark (timeout: ${BENCH_TIMEOUT}s)..."
BENCH_OUTPUT=$(timeout "${BENCH_TIMEOUT}" bash autoresearch.sh 2>&1) || {
EXIT_CODE=$?
echo "$BENCH_OUTPUT"
if [ $EXIT_CODE -eq 124 ]; then
echo "EXPERIMENT_STATUS: crash"
echo "EXPERIMENT_ERROR: benchmark timed out after ${BENCH_TIMEOUT}s"
else
echo "EXPERIMENT_STATUS: crash"
echo "EXPERIMENT_ERROR: benchmark failed (exit $EXIT_CODE)"
fi
exit 0
}
echo "$BENCH_OUTPUT"
# Extract METRIC lines
METRICS=$(echo "$BENCH_OUTPUT" | grep -E '^METRIC\s+\S+=\S+' || true)
if [ -z "$METRICS" ]; then
echo "EXPERIMENT_STATUS: crash"
echo "EXPERIMENT_ERROR: no METRIC lines found in benchmark output"
exit 0
fi
# --- Correctness checks ---
CHECKS_PASSED=true
if [ -f autoresearch.checks.sh ]; then
echo ">>> Running correctness checks (timeout: ${CHECK_TIMEOUT}s)..."
CHECK_OUTPUT=$(timeout "${CHECK_TIMEOUT}" bash autoresearch.checks.sh 2>&1) || {
EXIT_CODE=$?
echo "$CHECK_OUTPUT"
if [ $EXIT_CODE -eq 124 ]; then
echo "CHECKS_STATUS: timeout"
else
echo "CHECKS_STATUS: failed (exit $EXIT_CODE)"
fi
CHECKS_PASSED=false
}
if [ "$CHECKS_PASSED" = true ]; then
echo "CHECKS_STATUS: passed"
fi
else
echo "CHECKS_STATUS: skipped (no autoresearch.checks.sh)"
fi
# --- Report ---
echo "========================================="
echo "EXPERIMENT_ID: ${EXPERIMENT_ID}"
echo "EXPERIMENT_DESC: ${EXPERIMENT_DESC}"
echo "$METRICS"
if [ "$CHECKS_PASSED" = true ]; then
echo "EXPERIMENT_STATUS: done"
else
echo "EXPERIMENT_STATUS: checks_failed"
fi
echo "========================================="
instructions.md
Literature-Guided Autoresearch with SkyPilot
You are an autonomous research agent that optimizes open-source software by combining academic literature search with parallel cloud-based experiments via SkyPilot.
CRITICAL CONSTRAINTS
You must only improve performance by changing the project’s source code. The goal is to make the software genuinely faster — not to make the benchmark report better numbers.
autoresearch.sh and autoresearch.checks.sh are PROTECTED FILES — never modify them after initial creation and validation. Only modify the project's source code. Do not modify build scripts, compiler flags, benchmark parameters, server configuration, or OS settings. The user may specify which directories are in scope.
Do not change the memory allocator, build flags (optimization level, LTO, PGO, -march), or any runtime configuration. These are not source code improvements.
Phase 0: Setup
Load the SkyPilot skill: Fetch and follow the SkyPilot skill — run its “Before You Start” bootstrap.
Understand the target project: Read the README, build instructions, test suite, and existing benchmarks.
Ask the user: optimization target and metric, direction (lower/higher better), constraints (e.g., "don't change the public API"), cloud preference (e.g., --infra aws).
Create a branch:
git checkout -b autoresearch/<target>-<date>
Phase 1: Baseline and Profiling
Write benchmark and checks scripts
autoresearch.sh — outputs METRIC name=value lines. Must use set -euo pipefail, run the benchmark multiple times (typically 5), report the median, and complete in under 5 minutes.
#!/bin/bash
set -euo pipefail
TIMES=()
for i in $(seq 1 5); do
T=$({ TIMEFORMAT='%R'; time <benchmark_command> >/dev/null 2>&1; } 2>&1)
TIMES+=("$T")
done
IFS=$'\n' sorted=($(sort -n <<<"${TIMES[*]}")); unset IFS
echo "METRIC time_s=${sorted[2]}"
autoresearch.checks.sh — runs the test suite (or a fast subset). Must use set -euo pipefail and exit 0 on success. Use non-interactive flags — interactive tools hang silently.
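A minimal sketch of what autoresearch.checks.sh could look like. The check commands below are stand-ins invented for illustration; in practice you would wire in the target project's real, fast test targets:

```shell
#!/usr/bin/env bash
# Hypothetical autoresearch.checks.sh: run each named check, report pass/fail,
# and exit non-zero on the first failure (via set -e). The checks shown here
# are placeholders, not a real project's tests.
set -euo pipefail

run_check() {
  local name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "CHECK ${name}: pass"
  else
    echo "CHECK ${name}: FAIL"
    return 1
  fi
}

run_check build_artifact test -e /bin/sh        # stand-in: binary exists
run_check smoke bash -c 'echo ok | grep -q ok'  # stand-in: smoke test
# run_check unit make test                      # plug in a real test target
echo "All checks passed"
```

Because the script uses set -e and run_check returns non-zero on failure, any failing check aborts the script with a non-zero exit, which experiment.yaml reports as CHECKS_STATUS: failed.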
setup_deps.sh (optional) — installs project dependencies. Runs once during VM provisioning.
Validate scripts immediately
Do not skip this. SSH into a VM and run both scripts manually. Cross-check METRIC output against the tool’s native output. Verify the numbers are in a reasonable ballpark — a parsing bug can report 14 t/s when the real number is 52 t/s, and you’ll run entire waves of experiments against a wrong baseline before noticing.
Establish baseline
sky launch -c vm-01 experiment.yaml --env EXPERIMENT_ID=baseline --env EXPERIMENT_DESC="baseline measurement" -d -y
Record the baseline in autoresearch.md and autoresearch.jsonl.
Profile — determine the bottleneck type
Run a profiler (perf, gprof, py-spy, etc.) to identify hot functions. Determine whether the workload is compute-bound or memory-bandwidth-bound — this determines the entire optimization strategy:
Compute-bound (high CPU utilization, high IPC): optimize hot kernels — better algorithms, SIMD, loop unrolling, operator fusion.
Memory-bandwidth-bound (CPU stalls on data, low IPC): compute path changes will show near-zero improvement. Focus on reducing data movement — fuse operations, improve cache locality, reduce working set size.
Quick diagnostic: if optimizing the hottest function shows 0% improvement, the workload is likely memory-bound. Pivot immediately.
Create tracking documents
autoresearch.md: living session document (objective, metric, baseline, literature findings, experiment log)
autoresearch.ideas.md: ranked queue of experiment ideas
IMPROVEMENTS.md: final deliverable — each committed optimization with problem, solution, measured impact, and paper references
autoresearch.jsonl: append-only experiment log, one JSON line per experiment:
{"type":"config","name":"Optimize X","metricName":"time_s","metricUnit":"s","bestDirection":"lower"}
{"run":1,"metric":18.5,"metrics":{},"status":"keep","description":"baseline","timestamp":1700000000,"literature_source":null}
Phase 2: Literature Search
Search for academic papers, blog posts, and prior work relevant to the target. This is what distinguishes this approach from simple autotune — you are mining human knowledge before running experiments.
What to search for (roughly in order of value):
Forks and competitors — study forks that claim better performance. Their commit history contains proven optimizations you can adapt.
Project PRs and issues — merged PRs labeled “performance”, known bottlenecks, prior optimization attempts.
arXiv / Google Scholar — papers about optimizing the project or its domain. Download relevant PDFs to papers/ and read them with the PDF skill.
Technique papers — general optimization techniques applicable to the domain (operator fusion, SIMD, cache-oblivious algorithms, lock-free data structures).
Rank ideas by expected impact in autoresearch.ideas.md and record findings in autoresearch.md.
Phase 3: Parallel Experiments
Launching experiments
For each experiment, prepare an isolated copy and submit:
mkdir -p /tmp/autoresearch/exp-03
rsync -a --exclude='.git' --exclude='build/' . /tmp/autoresearch/exp-03/
# ... edit source files in the copy ...
sky launch -c vm-03 experiment.yaml \
--workdir /tmp/autoresearch/exp-03 \
--env EXPERIMENT_ID=exp-03 \
--env EXPERIMENT_DESC="description (paper: Author2024)" \
-d -y
Pipeline experiments on existing VMs with sky exec:
sky exec vm-01 experiment.yaml \
--workdir /tmp/autoresearch/exp-04 \
--env EXPERIMENT_ID=exp-04 \
--env EXPERIMENT_DESC="description" \
-d
Never tear down VMs. Reuse them via sky exec. Launch once, keep for the entire session.
Incremental build warning: sky exec with different workdirs can leave stale object files from previous experiments, producing binaries that mix old and new code. Use sky exec for rapid screening, but verify any promising result with a clean A/B build (see below).
Noisy VMs
Cloud VMs on shared hardware can show 5-30% variance. If stddev > 5% of the mean, the VM has noisy neighbors — stop it (sky stop vm-XX) and launch a fresh one. Only trust results where stddev < 2%.
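The noise check can be scripted. A sketch with hypothetical samples, where one outlier inflates the stddev well past the 5% threshold:

```shell
# Compute the mean and population stddev of benchmark samples with awk, and
# flag the VM as noisy when stddev exceeds 5% of the mean. Sample values
# are made up for illustration.
samples="18.1 18.3 24.9 18.2 18.4"
verdict=$(printf '%s\n' $samples | awk '
  { sum += $1; sumsq += $1 * $1; n++ }
  END {
    mean = sum / n
    sd = sqrt(sumsq / n - mean * mean)
    pct = 100 * sd / mean
    printf "mean=%.2f stddev=%.2f (%.1f%% of mean): %s\n",
           mean, sd, pct, (pct > 5 ? "noisy, replace VM" : "ok")
  }')
echo "$verdict"
```

Running the same arithmetic over each experiment's repeated benchmark runs makes the "replace noisy VMs" step mechanical rather than a judgment call.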
Clean A/B verification
Required before committing any improvement. On a single quiet VM:
ssh vm-01
cd ~/sky_workdir
# Clean build + benchmark OPTIMIZED code
rm -rf build && <full-build-command> && <benchmark-command>
# Revert, clean build + benchmark BASELINE
git checkout <baseline-commit> -- <changed-files>
rm -rf build && <full-build-command> && <benchmark-command>
# Restore optimized code
git checkout HEAD -- <changed-files>
Same compiler, hardware, thermal state, clean object files. This is the gold standard.
Decision logic
Improved + checks pass → clean A/B verify → commit winner, update baseline
Improved + checks fail → investigate, fix and retry, or discard
No improvement → discard, log, move on
Crash/timeout → check logs, fix trivial issues and resubmit, or discard
When committing a winner, include the metric delta and literature source in the commit message. Update autoresearch.md, autoresearch.ideas.md, and IMPROVEMENTS.md.
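Because experiment.yaml always exits 0 and encodes the outcome in an EXPERIMENT_STATUS line, the decision logic reduces to parsing that line out of the job's log. A sketch (the log text and action strings here are fabricated; in practice the input would come from sky logs):

```shell
# Map the last EXPERIMENT_STATUS line in a job's log to the next action.
decide() {
  local status
  status=$(sed -n 's/^EXPERIMENT_STATUS: //p' <<<"$1" | tail -1)
  case "$status" in
    done)          echo "clean A/B verify, then commit" ;;
    checks_failed) echo "investigate; fix and retry, or discard" ;;
    crash)         echo "check logs; fix trivial issues or discard" ;;
    *)             echo "no status found; inspect the job manually" ;;
  esac
}

decide "$(printf 'METRIC time_s=16.2\nEXPERIMENT_STATUS: done\n')"
```

Keeping the status vocabulary small (done / checks_failed / crash) is what lets the agent triage dozens of parallel jobs without reading full logs.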
Pivoting
If several experiments in a row show no improvement, pivot to new strategies.
The Loop
LOOP FOREVER:
1. Check state: autoresearch.ideas.md, sky status, sky queue
2. If ideas queue is empty → re-profile, search literature, study forks, add ideas
3. Pick highest-ranked untried idea
4. Prepare isolated workdir, edit source files
5. Submit via sky exec (or sky launch for new VMs), always detached (-d)
6. Don't wait — prepare and submit the next idea immediately
7. Poll results individually (sky queue <vm>, then sky logs <vm> <job-id>)
As soon as a VM finishes, submit the next experiment to it
8. Apply decision logic: keep/discard/crash/checks_failed
9. Replace noisy VMs (stddev > 5%)
10. Repeat
NEVER STOP. Do not pause to ask the human if you should continue. Work indefinitely until manually stopped. If stuck, re-read the literature, combine near-misses, try radical changes, or search for new papers.
Only tear down VMs when the user explicitly asks: sky down -a -y
setup.sh
#!/usr/bin/env bash
set -euo pipefail
# Autonomous Code Optimization with SkyPilot -- one-command setup.
#
# Usage:
# TARGET_REPO=https://github.com/valkey-io/valkey.git bash setup.sh
# TARGET_REPO=https://github.com/postgres/postgres.git bash setup.sh
#
# Or without TARGET_REPO to set up in an existing project directory:
# cd /path/to/your/project && bash setup.sh
EXAMPLES_BASE="https://raw.githubusercontent.com/skypilot-org/skypilot/master/examples/autonomous-code-optimization"
echo "=== Autonomous Code Optimization + SkyPilot setup ==="
echo ""
# 1. Install uv if missing
if ! command -v uv &>/dev/null; then
echo "[1/4] Installing uv..."
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="${UV_TOOL_BIN_DIR:-$HOME/.local/bin}:$HOME/.cargo/bin:$PATH"
else
echo "[1/4] uv already installed ($(uv --version))"
fi
# 2. Install SkyPilot if missing
if ! command -v sky &>/dev/null; then
echo "[2/4] Installing SkyPilot via uv..."
uv tool install skypilot
export PATH="${UV_TOOL_BIN_DIR:-$HOME/.local/bin}:$HOME/.cargo/bin:$PATH"
else
echo "[2/4] SkyPilot already installed ($(sky --version 2>/dev/null || echo 'unknown version'))"
fi
# 3. Clone target repo (if TARGET_REPO is set)
if [ -n "${TARGET_REPO:-}" ]; then
PROJECT_DIR=$(basename "$TARGET_REPO" .git)
echo "[3/4] Cloning $TARGET_REPO..."
if [ -d "$PROJECT_DIR" ]; then
echo " Directory '$PROJECT_DIR' already exists, skipping clone."
else
git clone --depth 1 "$TARGET_REPO" "$PROJECT_DIR"
fi
cd "$PROJECT_DIR"
else
PROJECT_DIR="$(basename "$(pwd)")"
echo "[3/4] No TARGET_REPO set, using current directory: $(pwd)"
if [ ! -d .git ]; then
echo " WARNING: Not a git repository. Run 'git init' first."
fi
fi
# 4. Download SkyPilot autoresearch files
echo "[4/4] Downloading experiment.yaml and instructions.md..."
curl -fsSL "$EXAMPLES_BASE/experiment.yaml" -o experiment.yaml
curl -fsSL "$EXAMPLES_BASE/instructions.md" -o instructions.md
echo ""
echo "=== Credential check ==="
SKY_CHECK=$(sky check 2>&1 || true)
if echo "$SKY_CHECK" | grep -q ": enabled"; then
echo "Cloud credentials OK."
else
echo ""
echo "No cloud credentials detected. Run 'sky check' to configure access."
echo "Docs: https://docs.skypilot.co/en/latest/getting-started/installation.html#cloud-account-setup"
echo ""
echo "You can still continue -- SkyPilot will prompt you when you launch the first job."
fi
echo ""
echo "================================================================"
echo " Setup complete! Project: $PROJECT_DIR"
echo "================================================================"
echo ""
echo " Open Claude Code / Codex / any coding agent in this directory,"
echo " then paste this prompt:"
echo ""
echo " Follow instructions.md to identify significant optimizations in"
echo " $PROJECT_DIR. Feel free to use latest research literature to find"
echo " creative and new techniques. Use 4 SkyPilot VMs simultaneously."
echo ""
echo " Example prompts:"
echo " 'Follow instructions.md to optimize llama.cpp for CPU inference tokens/sec."
echo " Use latest research literature. Use 4 SkyPilot VMs simultaneously. Use AWS infra.'"
echo " 'Follow instructions.md to optimize valkey for SET ops/sec."
echo " Use latest research literature. Use 4 SkyPilot VMs simultaneously. Use GCP infra.'"
echo " 'Follow instructions.md to optimize PostgreSQL for pgbench TPS."
echo " Use latest research literature. Use 4 SkyPilot VMs simultaneously. Use AWS infra.'"
echo ""
echo " Monitor VMs: sky status"
echo " Stream logs: sky logs <vm-name>"
echo " Tear down all: sky down -a -y"
echo "================================================================"