Autoresearch
7 sources · Updated March 27, 2026
Autoresearch is a hill-climbing optimization loop originated by Andrej Karpathy: make one small change, test against a binary checklist, keep if improved, revert if not, repeat. The method has been adapted from ML research to Claude skill tuning (56% → 92% pass rate in 4 rounds), parallel GPU experiments (910 experiments in 8 hours, 9x speedup), and even distributed swarm optimization (Hyperspace). The key insight: quality scoring must use binary yes/no checklists (3-6 questions), not vague ratings. The most valuable artifact is the changelog — institutional knowledge about what works. The pattern generalizes to anything measurable: prompts, page load times, ML hyperparameters, trading strategies.
Insights
Core Method
- Karpathy's autoresearch method adapted for Claude skills: make one small change to a skill prompt, test against a binary checklist, keep if score improves, revert if not — this hill-climbing loop took a landing page skill from 56% to 92% pass rate in 4 rounds (from ole lehmann 10x claude skills autoresearch)
- Quality scoring for AI outputs should use specific yes/no checklist questions, not vague 1-10 ratings — binary criteria like "Does the headline include a specific number?" produce consistent, automatable evaluation (from ole lehmann 10x claude skills autoresearch)
- The sweet spot for skill evaluation checklists is 3-6 questions; more than that and the skill starts gaming the checklist (from ole lehmann 10x claude skills autoresearch)
- Effective skill prompt improvements are specific and concrete: banned buzzword lists, requiring specific numbers in headlines, worked examples of good output — not abstract instructions like "write better copy" (from ole lehmann 10x claude skills autoresearch)
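The binary-checklist scoring above can be sketched in a few lines. The check functions here are hypothetical stand-ins for real criteria; in practice each yes/no answer might come from a grader model rather than a string test.

```python
# Binary checklist scoring: a 3-6 question yes/no checklist, scored as a
# pass rate. Each check is a hypothetical, illustrative criterion.

def score(output: str, checklist) -> float:
    """Return the fraction of yes/no checks the output passes."""
    results = [check(output) for check in checklist]
    return sum(results) / len(results)

# Example checklist for a landing-page skill (all criteria are made up,
# but mirror the kinds of concrete checks described above).
checklist = [
    lambda t: any(c.isdigit() for c in t.splitlines()[0]),  # headline has a specific number
    lambda t: "synergy" not in t.lower(),                   # no banned buzzwords
    lambda t: len(t.split()) <= 120,                        # within word budget
    lambda t: "sign up" in t.lower(),                       # has a call to action
]

draft = "Cut onboarding time 40%\nOur tool does the work. Sign up free."
print(score(draft, checklist))
```

Because every criterion is binary, two evaluation runs on the same output always agree, which is what makes the loop automatable.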
Artifacts and Feedback Loops
- The most valuable output of autoresearch is the changelog documenting every attempted change and its result — institutional knowledge that lets future models pick up where the last left off (from ole lehmann 10x claude skills autoresearch)
- The system catches changes that improve individual checklist items but hurt overall output quality — a tighter word count improved conciseness but degraded CTA quality, so the system reverted it (from ole lehmann 10x claude skills autoresearch)
- Running autoresearch autonomously with a live dashboard that auto-refreshes every 10 seconds lets you walk away while the agent iterates — stops when hitting 95%+ three times in a row (from ole lehmann 10x claude skills autoresearch)
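Putting the pieces together, the loop, the changelog, and the stop rule (95%+ three times in a row) fit in one function. This is a minimal sketch, assuming `mutate` and `evaluate` as stand-ins for "make one small prompt change" and "score against the binary checklist"; the real system is an agent, not a Python loop.

```python
# Hill-climbing loop with a changelog and the three-in-a-row stop rule.
# `mutate` and `evaluate` are hypothetical callables supplied by the caller.

def autoresearch(prompt, mutate, evaluate,
                 target=0.95, streak_needed=3, max_rounds=100):
    best_score = evaluate(prompt)
    changelog = []   # institutional knowledge: every attempt and its outcome
    streak = 0
    for round_no in range(1, max_rounds + 1):
        candidate = mutate(prompt)           # one small change
        score = evaluate(candidate)          # binary-checklist pass rate
        kept = score > best_score
        changelog.append({"round": round_no, "change": candidate,
                          "score": score, "kept": kept})
        if kept:
            prompt, best_score = candidate, score   # keep the improvement
        # else: revert by simply discarding the candidate
        streak = streak + 1 if score >= target else 0
        if streak >= streak_needed:          # e.g. 95%+ three times in a row
            break
    return prompt, best_score, changelog
```

The changelog is returned alongside the tuned prompt because, as noted above, it is the most valuable artifact: a future run can read it and skip changes that already failed.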
Scaling: Automated Skill Tuning
- Combining autoresearch with Hamel Husain's evals-skills framework creates an automated pipeline where skills are continuously evaluated and improved — "auto-evals" for agent skill quality (from autoresearch skill tuning evals)
- Tuning 190+ skills requires automated pipelines rather than manual prompt engineering — running autoresearch "all day in the background" as a continuous improvement loop (from autoresearch skill tuning evals)
Scaling: Parallel Execution
- Parallel experiment execution (910 experiments in 8 hours) achieved 9x speedup over sequential autoresearch, at ~$300 compute + $9 Claude API (from parallel gpu autoresearch skypilot)
- Without being instructed to, the Claude Code agent developed an emergent resource-allocation strategy: screening candidates on cheaper H100s and promoting winners to H200s (from parallel gpu autoresearch skypilot)
- Key ML finding: scaling model width mattered more than every hyperparameter trick combined — a result sequential search likely would have missed due to limited exploration (from parallel gpu autoresearch skypilot)
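The parallel pattern above can be sketched with Python's standard library. This is an illustrative stand-in: `run_experiment` here is a toy local function, whereas the source dispatched real training runs to SkyPilot-provisioned GPUs.

```python
# Running a sweep of experiment configs in parallel and picking the winner.
# `run_experiment` is a placeholder; its scoring rule (wider is better)
# merely echoes the width finding described above and is not real data.
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_experiment(config):
    return {"config": config, "score": config["width"] / 1024}

configs = [{"width": w, "lr": lr}
           for w in (128, 256, 512, 1024)
           for lr in (1e-4, 3e-4, 1e-3)]

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(run_experiment, c) for c in configs]
    results = [f.result() for f in as_completed(futures)]

best = max(results, key=lambda r: r["score"])
print(best["config"]["width"])
```

Breadth is the point: a sequential climber tweaks one thing at a time, while a parallel sweep covers the whole grid, which is how a dominant factor like model width gets noticed at all.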
Scaling: Distributed Swarms
- Hyperspace generalizes autoresearch into a platform where users describe optimization problems in plain English and a distributed swarm solves them — zero code (from hyperspace agi autoswarms)
- 237 agents with zero human intervention ran 14,832 experiments across 5 domains: ML agents drove validation loss down 75%, search agents evolved 21 scoring strategies, finance agents achieved Sharpe 1.32 (from hyperspace agi autoswarms)
- A "playbook curator" distills why winning mutations work into reusable patterns, so new agents bootstrap from accumulated wisdom rather than starting cold (from hyperspace agi autoswarms)
Generalization
- The autoresearch pattern generalizes beyond prompts to anything measurable: one person optimized page load from 1100ms to 67ms in 67 rounds using the same try-measure-keep/revert loop (from ole lehmann 10x claude skills autoresearch)
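Generalizing, the loop needs only two callables and a direction. This sketch is direction-agnostic so it can minimize a metric like page load time as well as maximize a pass rate; `tweak` and `measure` are illustrative names for any change you can make and any number you can read back.

```python
# Generic try-measure-keep/revert loop. Works for any measurable target:
# prompts, latency, hyperparameters. All names here are illustrative.

def hill_climb(state, tweak, measure, rounds, minimize=True):
    best = measure(state)
    for _ in range(rounds):
        candidate = tweak(state)         # one small change
        value = measure(candidate)       # read the metric back
        improved = value < best if minimize else value > best
        if improved:
            state, best = candidate, value   # keep
        # else: revert by discarding the candidate
    return state, best
```

In practice `tweak` is stochastic (a different small change each round); a deterministic tweak, as in the test harness, just makes the loop reproducible for illustration.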
Full-Stack Research Agents
- Feynman (a Claude-based agent) generates cited meta-analyses in 30 minutes, replicates experiments on Runpod, audits claims against code, and simulates peer review, automating the full research workflow (from feynman claude code for research)
- GPT-5.4 Pro is "an order of magnitude better" (per Terence Tao) for deep/architectural questions; use @steipete's 'oracle' tool to invoke it at the planning stage (from optimizing academic work with gpt 5.4 pro and coding agents)
Research-as-Agent-Workflow
- Power users are investing 1,200+ hours into Claude-based research workflows across AI papers, market analysis, and competitive intelligence, indicating Claude is becoming a primary research tool for knowledge workers (from claude research assistant prompts)
- Specialized prompt libraries for domain-specific research (papers, markets, competitive intel) are a key unlock for research agent productivity (from claude research assistant prompts)