Autoresearch

19 sources Updated May 15, 2026

Autoresearch is a hill-climbing optimization loop originated by Andrej Karpathy: make one small change, test it against a binary checklist, keep it if the score improves, revert it if not, and repeat. The method has since been adapted from ML research to several other settings:

  • Claude skill tuning: 56% → 92% in 4 rounds for one skill, and 32/50 → 47/50 overnight for another.
  • Parallel GPU experiments: 910 experiments in 8 hours, a 9x speedup over sequential runs.
  • Distributed swarm optimization (Hyperspace).
  • Web automation: the /autobrowse skill iterates until it converges, then graduates the winning workflow into a reusable browser skill.
  • Full end-to-end ML research: ml-intern beat Claude Code on GPQA (32% vs 22.99%) by autonomously walking citation graphs, pulling datasets, reformatting them, launching training jobs on HF Jobs, monitoring runs, and retraining on failure.

OpenAI's "Lord Bottleneck" demonstrates the bootstrap path: rather than automating the whole pipeline upfront, accelerate individual tasks, connect the working pieces into a skill, then schedule it daily. Hermes-based research agents serve as the substrate for downstream content, trading, sales, and coding agents.

The loop is now packaged as installable tooling: the Evo plugin (open source, for Claude Code and Codex) turns a codebase into an autonomous research loop — it auto-discovers metrics worth measuring, instruments benchmarks from codebase analysis, and runs tree search with parallel subagents to optimize performance. This removes the two highest-friction manual steps (deciding what to measure and wiring up the benchmark harness) that previously gated every autoresearch run.

The key methodological insights remain: quality scoring must use binary yes/no checklists (3-6 questions), not vague ratings. The most valuable artifact is the changelog — institutional knowledge about what works. The pattern generalizes to anything measurable: prompts, page load times, ML hyperparameters, trading strategies, equity-research synthesis (Perplexity Council pulls GS/JPM/MS/Evercore agreement-vs-disagreement views in 2 minutes), and personal research workflows (give the agent feedback in plain phrases — "more like this," "this source is noisy," "this is useful," "this is mid").

Insights

Core Method

  • Karpathy's autoresearch method adapted for Claude skills: make one small change to a skill prompt, test against a binary checklist, keep if score improves, revert if not — this hill-climbing loop took a landing page skill from 56% to 92% pass rate in 4 rounds (from ole lehmann 10x claude skills autoresearch)
  • Quality scoring for AI outputs should use specific yes/no checklist questions, not vague 1-10 ratings — binary criteria like "Does the headline include a specific number?" produce consistent, automatable evaluation (from ole lehmann 10x claude skills autoresearch)
  • The sweet spot for skill evaluation checklists is 3-6 questions; more than that and the skill starts gaming the checklist (from ole lehmann 10x claude skills autoresearch)
  • Effective skill prompt improvements are specific and concrete: banned buzzword lists, requiring specific numbers in headlines, worked examples of good output — not abstract instructions like "write better copy" (from ole lehmann 10x claude skills autoresearch)
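
The keep-or-revert loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from the cited source: the three checks, the rule names, and `toy_skill` (a deterministic stand-in for a real model call) are all hypothetical.

```python
from typing import Callable, List, Tuple

Check = Callable[[str], bool]  # one yes/no question about one output

# Hypothetical binary checklist (3 questions).
CHECKS: List[Check] = [
    lambda text: any(ch.isdigit() for ch in text),   # includes a specific number?
    lambda text: "synergy" not in text.lower(),      # avoids a banned buzzword?
    lambda text: len(text.split()) <= 12,            # at most 12 words?
]


def pass_rate(outputs: List[str], checks: List[Check]) -> float:
    """Fraction of (output, check) pairs that pass."""
    results = [check(out) for out in outputs for check in checks]
    return sum(results) / len(results)


def toy_skill(rules: List[str]) -> List[str]:
    """Stand-in for a model call: the output depends only on the rules present."""
    headline = "Unlock synergy for your growing team today"
    if "use a number" in rules:
        headline = headline.replace("growing team", "team of 40")
    if "ban buzzwords" in rules:
        headline = headline.replace("synergy", "faster onboarding")
    if "add filler" in rules:
        headline += " and so much more besides, really and truly, we promise"
    return [headline]


def hill_climb(
    rules: List[str],
    candidates: List[str],
    run_skill: Callable[[List[str]], List[str]],
    checks: List[Check],
) -> Tuple[List[str], float, List[Tuple[str, float, bool]]]:
    """Try one small change at a time; keep it only if the score improves."""
    best_score = pass_rate(run_skill(rules), checks)
    changelog = []
    for change in candidates:
        trial = rules + [change]
        score = pass_rate(run_skill(trial), checks)
        kept = score > best_score
        changelog.append((change, score, kept))  # record every attempt
        if kept:
            rules, best_score = trial, score     # otherwise: implicit revert
    return rules, best_score, changelog


final_rules, final_score, changelog = hill_climb(
    [], ["use a number", "add filler", "ban buzzwords"], toy_skill, CHECKS
)
```

In this toy run the "add filler" change lowers the score and is reverted, while the other two are kept. With a real model the outputs are noisy, so each trial would likely need several runs per prompt before comparing scores.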

Artifacts and Feedback Loops

  • The most valuable output of autoresearch is the changelog documenting every attempted change and its result — institutional knowledge that lets future models pick up where the last left off (from ole lehmann 10x claude skills autoresearch)
  • The system catches changes that improve individual checklist items but hurt overall output quality — a tighter word count improved conciseness but degraded CTA quality, so the system reverted it (from ole lehmann 10x claude skills autoresearch)
  • Running autoresearch autonomously with a live dashboard that auto-refreshes every 10 seconds lets you walk away while the agent iterates; the loop stops once the score hits 95% or higher three times in a row (from ole lehmann 10x claude skills autoresearch)
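
A changelog entry and the "95%+ three times in a row" stop rule might look like the sketch below. The field names, the JSON-lines serialization, and the example scores are assumptions made for illustration; the source does not specify a changelog format.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class ChangelogEntry:
    round_no: int
    change: str
    score: float  # overall checklist pass rate, 0.0 to 1.0
    kept: bool    # False means the change was reverted


def should_stop(scores: List[float], threshold: float = 0.95, streak: int = 3) -> bool:
    """True once the last `streak` scores are all at or above `threshold`."""
    if len(scores) < streak:
        return False
    return all(s >= threshold for s in scores[-streak:])


# Made-up entries, for illustration only.
changelog = [
    ChangelogEntry(1, "require a number in the headline", 0.96, True),
    ChangelogEntry(2, "tighten word count", 0.90, False),
    ChangelogEntry(3, "add a worked example", 0.97, True),
]

# One JSON object per line keeps the log easy to append to and to read back.
log_lines = [json.dumps(asdict(entry)) for entry in changelog]
stop_now = should_stop([entry.score for entry in changelog])
```

Logging reverted changes alongside kept ones is the point: a later run can read the log and skip ideas that already failed.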

Packaged Tooling: Evo Plugin

  • The Evo plugin transforms a codebase into an autonomous research loop that automatically discovers metrics to measure and instruments benchmarks — eliminating manual benchmark creation by auto-instrumenting measurements from codebase analysis (from evo claude autoresearch orchestrator)
  • Evo runs tree search algorithms with parallel subagents to optimize code performance and research outcomes, and ships as an open-source orchestrator plugin for both Claude Code and Codex (from evo claude autoresearch orchestrator)

Scaling: Automated Skill Tuning

  • Combining autoresearch with Hamel Husain's evals-skills framework creates an automated pipeline where skills are continuously evaluated and improved — "auto-evals" for agent skill quality (from autoresearch skill tuning evals)
  • Tuning 190+ skills requires automated pipelines rather than manual prompt engineering — running autoresearch "all day in the background" as a continuous improvement loop (from autoresearch skill tuning evals)

Scaling: Parallel Execution

  • Parallel experiment execution (910 experiments in 8 hours) achieved 9x speedup over sequential autoresearch, at ~$300 compute + $9 Claude API (from parallel gpu autoresearch skypilot)
  • The Claude Code agent developed a resource-allocation strategy without being instructed to: it screened experiments on cheaper H100s and promoted winners to H200s (from parallel gpu autoresearch skypilot)
  • Key ML finding: scaling model width mattered more than every hyperparameter trick combined — a result sequential search likely would have missed due to limited exploration (from parallel gpu autoresearch skypilot)
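
The screen-then-promote strategy above can be sketched as a two-stage search: evaluate every candidate in parallel with a cheap proxy, then spend the expensive evaluation only on the top few. This is a sketch under stated assumptions, not the SkyPilot setup from the source: threads stand in for GPU jobs, and `cheap_loss` and `full_loss` are made-up toy functions that do not reproduce the reported result.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from typing import Callable, Dict, List, Tuple

Config = Dict[str, int]


def run_parallel(fn: Callable[[Config], int], configs: List[Config], workers: int) -> List[int]:
    """Evaluate all configs concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, configs))


def screen_then_promote(
    configs: List[Config],
    cheap_eval: Callable[[Config], int],
    full_eval: Callable[[Config], int],
    top_k: int,
    workers: int = 4,
) -> Tuple[Config, int]:
    """Stage 1: cheap screen of everything. Stage 2: full eval of the top_k."""
    cheap = run_parallel(cheap_eval, configs, workers)
    ranked = sorted(range(len(configs)), key=lambda i: cheap[i])  # lower is better
    finalists = [configs[i] for i in ranked[:top_k]]
    full = run_parallel(full_eval, finalists, workers)
    best = min(range(len(finalists)), key=lambda i: full[i])
    return finalists[best], full[best]


def full_loss(cfg: Config) -> int:
    """Toy 'expensive' loss: width dominates, warmup adds a small penalty."""
    return 1000 // cfg["width"] + (5 if cfg["warmup"] == 0 else 0)


def cheap_loss(cfg: Config) -> int:
    """Toy 'cheap' proxy: sees the width effect but not the warmup effect."""
    return 1000 // cfg["width"]


configs = [{"width": w, "warmup": s} for w, s in product((64, 128, 256, 512), (0, 100))]
best_config, best_loss = screen_then_promote(configs, cheap_loss, full_loss, top_k=3)
```

The design trade-off is that the cheap stage can discard a config the full evaluation would have preferred, so `top_k` should be large enough to absorb the proxy's blind spots.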

Scaling: Distributed Swarms

  • Hyperspace generalizes autoresearch into a platform where users describe optimization problems in plain English and a distributed swarm solves them — zero code (from hyperspace agi autoswarms)
  • 237 agents with zero human intervention ran 14,832 experiments across 5 domains: ML agents drove validation loss down 75%, search agents evolved 21 scoring strategies, finance agents achieved Sharpe 1.32 (from hyperspace agi autoswarms)
  • A "playbook curator" distills why winning mutations work into reusable patterns, so new agents bootstrap from accumulated wisdom rather than starting cold (from hyperspace agi autoswarms)

Generalization

  • The autoresearch pattern generalizes beyond prompts to anything measurable: one person optimized page load from 1100ms to 67ms in 67 rounds using the same try-measure-keep/revert loop (from ole lehmann 10x claude skills autoresearch)

Full-Stack Research Agents

  • Feynman (a Claude-based agent) generates cited meta-analyses in 30 minutes, replicates experiments on Runpod, audits claims against code, and simulates peer review, automating the full research workflow (from feynman claude code for research)
  • GPT-5.4 Pro is "an order of magnitude better" (per Terence Tao) for deep or architectural questions; use @steipete's 'oracle' tool to invoke it at the planning stage (from optimizing academic work with gpt 5.4 pro and coding agents)

Research-as-Agent-Workflow

  • Power users are investing 1,200+ hours into Claude-based research workflows across AI papers, market analysis, and competitive intelligence, indicating Claude is becoming a primary research tool for knowledge workers (from claude research assistant prompts)

  • Specialized prompt libraries for domain-specific research (papers, markets, competitive intel) are a key unlock for research agent productivity (from claude research assistant prompts)

  • At ~100 articles and ~400K words, LLMs can handle complex Q&A against a personal wiki without fancy RAG — auto-maintained index files and brief document summaries are sufficient for the model to navigate the space (from llm powered personal knowledge bases)

  • File query results back into the wiki after each research session — explorations and questions "add up" in the knowledge base, so every LLM interaction enhances future queries rather than disappearing into chat history (from llm powered personal knowledge bases)

  • As a knowledge base repo grows, the natural evolution is synthetic data generation + finetuning to have the LLM "know" the data in its weights rather than relying on context windows — a long-term trajectory for personal knowledge systems (from llm personal knowledge base workflow)

  • Monthly wiki health check: "flag contradictions between articles, find topics mentioned but never explained, list claims not backed by a source in raw/, suggest 3 new articles for gaps" — prevents error compounding when LLM outputs get filed back into the knowledge base (from nick spisak shared link)

  • LLM Council method (Karpathy-inspired): instead of trusting one model's answer, 5 advisors with different thinking styles argue the same question, then anonymously peer-review each other — catches blind spots no individual perspective reveals (from link share without context)
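
One item from the monthly health check above, "find topics mentioned but never explained", can be approximated without an LLM. The sketch below assumes the wiki is available as a mapping from article title to text and that mentions use a `[[Topic]]` link syntax; both are assumptions, since the source does not describe the wiki's format.

```python
import re
from typing import Dict, List

# Assumed link syntax: [[Topic name]]
LINK = re.compile(r"\[\[([^\]]+)\]\]")


def unexplained_topics(articles: Dict[str, str]) -> List[str]:
    """Return linked topics that have no article of their own (case-insensitive)."""
    titles = {title.lower() for title in articles}
    mentioned = {
        match.strip()
        for text in articles.values()
        for match in LINK.findall(text)
    }
    return sorted(topic for topic in mentioned if topic.lower() not in titles)


# Tiny made-up wiki for illustration.
wiki = {
    "Autoresearch": "Uses [[Hill climbing]] scored by [[Binary checklists]].",
    "Hill climbing": "The search strategy behind [[autoresearch]].",
}
gaps = unexplained_topics(wiki)
```

Contradictions and unsourced claims need model judgment, but a mechanical pass like this gives the model a concrete list of gaps to start from.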

Self-Improving Skills Loops

  • The Karpathy autoresearch pattern is now operational for Claude Code skills: define 3-5 binary eval criteria, run the skill 10 times with varied inputs, evaluator scores every output, identify failure patterns, rewrite the prompt, retest, keep the winner — a hook-writer skill went 32/50 → 47/50 overnight; works for hooks, briefs, ad copy, scripts, reports (from claude code self improving skills automation)

  • /autobrowse skill (inspired by Karpathy's autoresearch harness): agent explores web pages via Browserbase CLI, learns from failed attempts, iterates until it converges on a reliable workflow, then graduates the winning approach into a reusable browser skill once token usage is optimized — autoresearch applied to web automation specifically (from autobrowse skill web automation agent)
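
The scoring step of the self-improving skill loop above (a handful of binary criteria applied to 10 runs, giving a score out of 50) can be sketched as follows. The five criteria and the generated hooks are hypothetical; in practice the evaluator would usually be a model answering each yes/no question rather than a string check.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Criterion = Callable[[str], bool]

# Hypothetical binary criteria for a hook-writing skill.
CRITERIA: Dict[str, Criterion] = {
    "has_number": lambda t: any(ch.isdigit() for ch in t),
    "no_question": lambda t: not t.endswith("?"),
    "short": lambda t: len(t.split()) <= 8,
    "on_topic": lambda t: "onboarding" in t.lower(),
    "no_exclamation": lambda t: "!" not in t,
}


def score_runs(
    outputs: List[str], criteria: Dict[str, Criterion]
) -> Tuple[int, int, Counter]:
    """Return (passed, total, failures per criterion) over all runs."""
    passed = 0
    failures: Counter = Counter()
    for out in outputs:
        for name, check in criteria.items():
            if check(out):
                passed += 1
            else:
                failures[name] += 1
    return passed, len(outputs) * len(criteria), failures


# Ten made-up outputs standing in for ten runs with varied inputs.
outputs = [
    f"{i} ways to fix your onboarding" if i % 2 else "Ways to fix your onboarding?"
    for i in range(10)
]
passed, total, failures = score_runs(outputs, CRITERIA)
```

The per-criterion failure counts are what drive the next prompt rewrite: here the two failing criteria point at exactly which instructions to add.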

End-to-End Research Agents

  • ml-intern automates the post-training research loop: reading arXiv papers, walking citations, pulling HF datasets, reformatting them before training (so GPU hours aren't wasted on bad data), training on HF Jobs when no local GPUs are available, monitoring runs, reading eval output, diagnosing failures, and retraining. It beat Claude Code on GPQA (32% vs 22.99%, in under 10 hours) by finding OpenScience and Nemotron-CrossThink and running 12 SFT runs on Qwen3-1.7B (from ml intern automated research agent)

  • ml-intern recognizes when datasets are too low-quality and writes scripts to generate synthetic replacements: 1100 synthetic healthcare data points, upsampled 50x, beat Codex on HealthBench by 60%. It also runs full GRPO with autonomous ablation cycles (from ml intern automated research agent)

  • See Hermes Agent for the Hermes v0.12.0 research-agent recipe (pick domain → sources → signal → evidence vault → daily briefs → plain-English feedback) and the "research-as-substrate-for-other-agents" principle — those insights are primarily about Hermes; the autoresearch lens is the same Karpathy-style hill-climbing loop applied to domain monitoring.

  • Perplexity Computer's Equity Research Council pulls research from GS/JPM/MS/Evercore in 2 minutes and surfaces where they agree and where they disagree, a multi-analyst comparison modeled on how top hedge fund PMs evaluate research (from perplexity equity research council workflow)

Incremental Automation

  • OpenAI's "Lord Bottleneck" pattern: a growth-team staffer used Codex for individual experiment steps (analyze data, write experiment code, interpret results, produce a deck), chained them into one large skill, then asked Codex to "do this every morning". The autoresearch loop was bootstrapped from incremental task acceleration rather than a top-down "automate everything" plan, and it produced significant company value (from openai lord bottleneck codex automation)

  • Don't try to automate the entire pipeline upfront — start with single-task acceleration, connect successful pieces into a skill once they work individually, then schedule it; naming the system ("Lord Bottleneck") makes it approachable and easier for teams to interact with (from openai lord bottleneck codex automation)
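
The "connect the working pieces" step can be pictured as plain function composition: each task is first made to work on its own, then the tasks are chained into one callable. The step functions below are toy stand-ins with made-up logic, not the actual Codex skill; scheduling (for example a daily cron job) is left out.

```python
from functools import reduce
from typing import Any, Callable, Dict, List

Step = Callable[[Any], Any]


def chain(steps: List[Step]) -> Step:
    """Compose steps left to right: each step receives the previous result."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)


# Hypothetical single-task steps, each usable and testable on its own.
def analyze(lifts: List[float]) -> Dict[str, Any]:
    return {"lift": sum(lifts) / len(lifts)}


def interpret(stats: Dict[str, Any]) -> Dict[str, Any]:
    return {**stats, "verdict": "ship" if stats["lift"] > 0 else "hold"}


def make_summary(report: Dict[str, Any]) -> str:
    return f"Lift {report['lift']:+.1f}% -> {report['verdict']}"


# Only after each piece works individually are they joined into one skill.
morning_skill = chain([analyze, interpret, make_summary])
summary = morning_skill([1.5, 2.5, -1.0])
```

Because every step keeps working in isolation, a failure in the chained skill can be debugged by rerunning just the step that broke.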

Voices

21 contributors

Ole Lehmann

@itsolelehmann

I help non-technical people make more money with AI agents. AI connoisseur, robotics maxi, eu/acc supporter, dad, techno optimist

135.5K followers 2 tweets

Andrej Karpathy

@karpathy

I like to train large deep neural nets. Previously Director of AI @ Tesla, founding team @ OpenAI, PhD @ Stanford.

2.0M followers 2 tweets

Alex Prompter

@alex_prompter

Marketing + AI = $$$ 🔑 @godofprompt (co-founder) 🎥 https://t.co/IodiF1Ra5f (co-founder)

91.0K followers 1 tweet

Cody Schneider

@codyschneiderxx

folllow for shiposting about the growth tactics i'm using to grow my startup building @graphed with @maxchehab Get Started Free - https://t.co/stXlkQBlSj

59.9K followers 1 tweet

Nick Spisak

@NickSpisak_

| AI Transformation Engineer | Seven Figure E-Commerce Business Owner

9.5K followers 1 tweet

George from 🕹prodmgmt.world

@nurijanian

Can I make everyone a great product manager? I will do my best | Get my product management OS + AI skills for Claude Code/Cursor: https://t.co/ngCnvp77SD

43.8K followers 1 tweet

TBPN

@tbpn

Technology's daily show. Hosted by @johncoogan & @jordihays. Streaming live 11a-2p PT every weekday. Sign up for TBPN's daily newsletter at https://t.co/Nhf5ohjInO.

301.1K followers 1 tweet

Advait Paliwal

@advaitpaliwal

disciple of experience

12.6K followers 1 tweet

Aksel

@akseljoonas

building AI agents @huggingface 🤗

1.3K followers 1 tweet

Alok Bishoyi

@alokbishoyi97

building something new. capital allocator @dzerovc. @iitbombay alum. tweets on AI, India, travel, lifting, NFL, occasional high-temp sampling.

2.9K followers 1 tweet

Aniket Panjwani

@aniketapanjwani

I teach agentic coding to economists || PhD Economics Northwestern || Director of AI/ML @ Payslice || ex-MLOps @ Zelle

4.9K followers 1 tweet

Graeme

@gkisokay

AI agent enjoyer | Founder @amplifi_now | Building better agents

24.9K followers 1 tweet

Griffin Hilly

@GriffinHilly

Bond trader by day, pro-humanism/nuclear on my own time. @MadiHilly’s biggest fan

637 followers 1 tweet

James Bedford

@jameesy

engineering @zerion, building https://t.co/3RQYuZAcrt

5.2K followers 1 tweet

Jeff Grimes

@jeffgrimes9

Head of Live Events product at Perplexity. Working on Finance Computer & Perplexity Finance.

15.8K followers 1 tweet

Ellaa

@learnwithella

AI Enthusiast | Content Creator & Digital Publisher 📰✨ | Making complex tech simple. Follow for daily AI updates, prompt guides, and future tech! 🚀

245 followers 1 tweet

Zhanghao Wu

@Michaelvll1

Building SkyPilot @skypilot_org | Co-creator of @lmsysorg, PhD @Berkeley_EECS @ucbrise. Prev: @MIT, @sjtu1896

1.8K followers 1 tweet

rahul

@rahulgs

head of applied ai @ ramp

13.1K followers 1 tweet

Shiv

@shivsakhuja

Pontificating... / Vibe GTM-ing / Making Claude Code do non-coding things building a team of AI coworkers @ Gooseworks / prev @AthinaAI /@google / @ycombinator

52.2K followers 1 tweet

Shrey Pandya

@shreypandya

building @browserbase // @neo, @v1michigan 〽️

1.3K followers 1 tweet

Varun

@varun_mathur

Agentic General Intelligence @HyperspaceAI (Co-founder and CEO)

34.8K followers 1 tweet