What Is Autoresearch and How to Use It

Autoresearch is a hill-climbing optimization loop originated by Andrej Karpathy: make one small change to the artifact you're optimizing, test it against a binary evaluation checklist, keep the change if the score improves, revert if not, repeat. The core insight is that this simple loop — when automated and run continuously — can drive dramatic quality improvements without human intervention. A landing page skill went from a 56% to a 92% pass rate in just 4 rounds (from Autoresearch).

What makes autoresearch powerful isn't the algorithm (hill-climbing is textbook optimization). It's that LLMs make the "generate a meaningful mutation" step trivial. The agent can read the checklist, read the current output, reason about what's failing, and propose a targeted fix — all without human involvement. That closes the loop and makes continuous automated improvement possible for the first time on creative, subjective outputs like copy, prompts, and skill definitions (from ole lehmann 10x claude skills autoresearch).

The Core Method

Step 1: Define Quality with a Binary Checklist

This is the most important step. Your checklist must use yes/no questions, not 1-10 ratings. Vague scales produce inconsistent scores that the optimization loop can't act on. Binary criteria like "Does the headline include a specific number?" or "Is there a clear call-to-action in the first paragraph?" are consistent and automatable (from ole lehmann 10x claude skills autoresearch).
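A checklist like this can be expressed directly as yes/no predicates over the output text. A hypothetical sketch for landing-page copy, with item wordings echoing the examples above:

```python
import re

# Hypothetical checklist: each item is a yes/no predicate over the text,
# mirroring the binary criteria described above.
CHECKLIST = [
    ("headline includes a specific number",
     lambda t: bool(re.search(r"\d", t.splitlines()[0]))),
    ("clear call-to-action in the first paragraph",
     lambda t: "sign up" in t.split("\n\n")[0].lower()),
    ("body is under 120 words",
     lambda t: len(t.split()) <= 120),
]

def score(text):
    """Return the fraction of checklist items that pass (0.0 to 1.0)."""
    return sum(check(text) for _, check in CHECKLIST) / len(CHECKLIST)
```

Each predicate is cheap, deterministic, and automatable — which is exactly what lets the loop run unattended.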

The sweet spot is 3-6 questions. Fewer than 3 and you don't capture enough dimensions of quality. More than 6 and the skill starts gaming the checklist — optimizing for the letter of each question while missing the spirit of good output, like a student who memorizes answers without understanding the material (from ole lehmann 10x claude skills autoresearch).

Step 2: Make One Small, Concrete Change

Effective mutations are specific and concrete, not abstract. Examples that work: "add a specific number to the headline," "move the call-to-action into the first paragraph," "cut the body to under 120 words." Each names a single change the agent can apply and the checklist can verify.

Examples that don't work: "write better copy," "be more engaging," "improve the tone." These are too vague for the agent to act on consistently (from ole lehmann 10x claude skills autoresearch).

Step 3: Score and Decide

Run the updated skill against the checklist. If the overall score improves, keep the change. If not, revert. The system also catches trade-off regressions — a tighter word count might improve conciseness (checklist item 3 passes) but degrade CTA quality (checklist item 5 fails). If the net score drops, the change gets reverted even though it improved one dimension (from ole lehmann 10x claude skills autoresearch).
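The keep-or-revert decision, including the trade-off case, reduces to comparing net scores. A sketch assuming per-item pass/fail results:

```python
def decide(old_results, new_results):
    """Keep a mutation only if the *net* checklist score improves.

    `old_results`/`new_results` map checklist item -> bool. A change that
    fixes one item but breaks another nets to zero or worse and is reverted.
    """
    old_score = sum(old_results.values())
    new_score = sum(new_results.values())
    return "keep" if new_score > old_score else "revert"
```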

Step 4: Run Autonomously

Set up a live dashboard that auto-refreshes every 10 seconds showing the score chart, pass/fail breakdown per checklist item, and a changelog of every mutation attempted. Walk away. The agent iterates continuously and stops automatically when it hits 95%+ three times in a row (from ole lehmann 10x claude skills autoresearch).
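The stopping rule ("95%+ three times in a row") is easy to make precise. A minimal sketch over the score history:

```python
def should_stop(score_history, threshold=0.95, streak=3):
    """Stop once the last `streak` scores all meet the threshold."""
    recent = score_history[-streak:]
    return len(recent) == streak and all(s >= threshold for s in recent)
```

Requiring a consecutive streak rather than a single pass guards against one lucky evaluation ending the run early.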

Step 5: Keep the Changelog

This is the most valuable artifact the system produces — more valuable than the optimized skill itself. The changelog documents every attempted change and its result: what was tried, whether it helped or hurt, and by how much. This serves as institutional knowledge about what works for that specific skill. Future models (or future you) can read the changelog and pick up exactly where the last optimization round left off, avoiding re-exploring dead ends (from ole lehmann 10x claude skills autoresearch).

Practical Results

Ole Lehmann's landing page copy skill: 56% → 92% in 4 rounds. Each round took minutes. The total optimization time was under an hour for a result that would have taken days of manual A/B testing (from ole lehmann 10x claude skills autoresearch).

The pattern generalizes beyond prompts. One person applied the same loop to web performance and optimized page load from 1,100ms to 67ms in 67 rounds — a 16x improvement through pure automated iteration (from ole lehmann 10x claude skills autoresearch).

Scaling Autoresearch

At Skill-Library Scale

When you have 10 skills, manual tuning works. When you have 190+, it doesn't. George Nuri combined autoresearch with Hamel Husain's evals-skills framework to create continuous automated skill improvement pipelines — running "all day in the background" across an entire "AI PM OS" library. This is the pattern: AI agents improving other AI agents' prompts and skills, with humans only setting the quality criteria (from autoresearch skill tuning evals).

With Parallel GPU Clusters

Karpathy's original autoresearch runs one experiment at a time. Sequential search on a single GPU yields ~10 experiments per hour. Human researchers grab a cluster and parallelize. Agents couldn't — until SkyPilot gave them GPU management skills (from parallel gpu autoresearch skypilot).

The SkyPilot agent skill teaches Claude Code to manage GPU clusters autonomously — provisioning 16 GPUs on Kubernetes, submitting jobs, checking logs, and pipelining experiments. Results: 910 experiments in 8 hours, a 9x speedup to the same best result. Total cost: ~$300 compute + $9 Claude API (from parallel gpu autoresearch skypilot).

The most remarkable finding was emergent optimization behavior. Given access to both H100 and H200 GPUs, the agent noticed that H200s scored better (more training steps completed in the same 5-minute budget) and spontaneously started screening candidate configs on cheaper H100s, then promoting winners to H200s for full evaluation. No human instructed this — the agent invented a two-tier evaluation strategy on its own (from parallel gpu autoresearch skypilot).
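Stripped of the GPU specifics, the strategy is: screen every candidate on the cheap tier, then fully evaluate only the top few. A generic sketch, with `cheap_score` standing in for a short H100 run and `full_score` for a full H200 evaluation (both names are placeholders):

```python
def two_tier_search(configs, cheap_score, full_score, promote_k=3):
    """Screen all configs cheaply, then fully evaluate only the top-k."""
    finalists = sorted(configs, key=cheap_score, reverse=True)[:promote_k]
    return max(finalists, key=full_score)
```

This works whenever the cheap proxy correlates with the full evaluation: the proxy doesn't have to pick the winner, only to rank it into the top-k.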

The key ML finding: scaling model width mattered more than every hyperparameter trick combined. The agent tested 6 width configurations in a single parallel wave and found the winner immediately. Sequential search, exploring one hyperparameter at a time, might have spent all its budget on learning rate schedules and never tested the dimension that actually mattered (from parallel gpu autoresearch skypilot).

With Distributed Agent Swarms

Hyperspace generalized autoresearch into a platform where you describe an optimization problem in plain English and a distributed agent swarm solves it with zero code (from hyperspace agi autoswarms).

How it works: An LLM generates sandboxed experiment code, validates it locally with dry runs, publishes to a P2P network, peers opt in, and agents cycle through mutate-evaluate-share in WASM sandboxes. The best strategies propagate via gossip protocol. A "playbook curator" distills why winning mutations work into reusable patterns so new agents can bootstrap from accumulated wisdom — solving the cold-start problem that plagues isolated optimization (from hyperspace agi autoswarms).
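The propagation step can be illustrated with a toy gossip round: each agent polls a random peer and adopts the peer's strategy if it scores higher, so winners spread through the swarm. This is a sketch only — no real networking, sandboxing, or P2P protocol:

```python
import random

def gossip_round(agents, rng):
    """One round: each agent compares notes with a random peer and adopts
    the peer's strategy when it scores higher."""
    for agent in agents:
        peer = rng.choice(agents)
        if peer["score"] > agent["score"]:
            agent["strategy"], agent["score"] = peer["strategy"], peer["score"]

agents = [{"strategy": f"s{i}", "score": i / 10} for i in range(10)]
rng = random.Random(0)
for _ in range(10):
    gossip_round(agents, rng)
```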

The results at scale: 237 agents, zero human intervention, and 14,832 experiments across 5 domains, including finance and search (from hyperspace agi autoswarms).

Cross-domain knowledge transfer emerged naturally. When a finance agent discovered that factor pruning improves Sharpe ratio, the system's Research DAG automatically generated a hypothesis for search agents: pruning low-signal ranking features might improve NDCG. The DAG reached 8+ levels deep with hundreds of nodes — genuine cross-domain insight generation that no one programmed (from hyperspace agi autoswarms).

Beyond Optimization: Full-Stack Research Agents

Autoresearch is part of a broader trend: AI agents taking over the full research workflow, not just the optimization step.

Feynman, a Claude-based research agent, takes a question and returns a cited meta-analysis in 30 minutes. It can replicate experiments on Runpod, audit claims against source code, and simulate peer review — the entire research pipeline from question to validated conclusion, automated (from feynman claude code for research).

Power users are already investing 1,200+ hours into Claude research workflows spanning AI papers, market analysis, and competitive intelligence. Specialized prompt libraries for domain-specific research are becoming the key productivity unlock — the equivalent of a researcher's personal methodology, codified and shareable (from claude research assistant prompts).

When to Use Autoresearch

Use autoresearch when you have:

  1. A measurable quality signal — binary checklist, test suite, performance metric, or any score you can compute automatically
  2. A space of small mutations — the thing you're optimizing should respond to incremental changes (prompts, config, hyperparameters, code)
  3. Tolerance for compute — each round costs API calls or GPU time; the ROI must justify the spend
  4. A changelog discipline — if you don't record what was tried, you lose the most valuable artifact

Don't use autoresearch for problems that require structural redesign (changing the architecture, not tuning parameters), problems where quality can't be measured automatically, or one-off tasks where the optimization loop overhead exceeds the value of improvement.

Sources cited: