AUTORESEARCH
19 SRC
Autoresearch
Autoresearch is a hill-climbing optimization loop originated by Andrej Karpathy: make one small change, test it against a binary checklist, keep it if the score improves, revert it if not, and repeat. The method has since spread from ML research to Claude skill tuning (56% → 92% in 4 rounds in one case; 32/50 → 47/50 overnight in another), parallel GPU experiments (910 experiments in 8 hours, a 9x speedup over sequential runs), distributed swarm optimization (Hyperspace), and web automation (the /autobrowse skill iterates until it converges, then graduates the winning workflow into a reusable browser skill). It also extends to full end-to-end ML research: ml-intern beat Claude Code on GPQA (32% vs 22.99%) by autonomously walking citation graphs, pulling datasets, reformatting them, launching training jobs on HF Jobs, monitoring runs, and retraining on failure. OpenAI's "Lord Bottleneck" demonstrates the bootstrap path: rather than automating the whole pipeline upfront, accelerate individual tasks, connect the working pieces into a skill, then schedule it daily. Hermes-based research agents serve as the substrate for downstream content, trading, sales, and coding agents.
The loop is now packaged as installable tooling: the Evo plugin (open source, for Claude Code and Codex) turns a codebase into an autonomous research loop — it auto-discovers metrics worth measuring, instruments benchmarks from codebase analysis, and runs tree search with parallel subagents to optimize performance. This removes the two highest-friction manual steps (deciding what to measure and wiring up the benchmark harness) that previously gated every autoresearch run.
The key methodological insights remain: quality scoring must use binary yes/no checklists (3-6 questions), not vague ratings. The most valuable artifact is the changelog — institutional knowledge about what works. The pattern generalizes to anything measurable: prompts, page load times, ML hyperparameters, trading strategies, equity-research synthesis (Perplexity Council pulls GS/JPM/MS/Evercore agreement-vs-disagreement views in 2 minutes), and personal research workflows (give the agent feedback in plain phrases — "more like this," "this source is noisy," "this is useful," "this is mid").
Guides
Designing a Second Brain for AI Agents: The Vault-as-Database Pattern
How to architect a local knowledge base that AI agents can reason over — starting from Karpathy's three-folder reference architecture, through the three-layer memory stack, MCP bridges, quality maintenance, and the scaling path from flat files to SQLite.
What Is Autoresearch and How to Use It
Autoresearch is Karpathy's hill-climbing optimization loop — make one change, test against a binary checklist, keep or revert, repeat. This guide covers the core method, practical implementation with Claude skills, scaling to parallel GPU clusters and distributed agent swarms, and the emerging pattern of full-stack AI research agents.
Insights
Core Method
- Karpathy's autoresearch method adapted for Claude skills: make one small change to a skill prompt, test against a binary checklist, keep if score improves, revert if not — this hill-climbing loop took a landing page skill from 56% to 92% pass rate in 4 rounds (from ole lehmann 10x claude skills autoresearch)
- Quality scoring for AI outputs should use specific yes/no checklist questions, not vague 1-10 ratings — binary criteria like "Does the headline include a specific number?" produce consistent, automatable evaluation (from ole lehmann 10x claude skills autoresearch)
- The sweet spot for skill evaluation checklists is 3-6 questions; more than that and the skill starts gaming the checklist (from ole lehmann 10x claude skills autoresearch)
- Effective skill prompt improvements are specific and concrete: banned buzzword lists, requiring specific numbers in headlines, worked examples of good output — not abstract instructions like "write better copy" (from ole lehmann 10x claude skills autoresearch)
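The keep-or-revert loop described in the bullets above can be sketched in a few lines. This is a minimal illustration, not the implementation from the source: the names `pass_rate`, `hill_climb`, and `mutations` are hypothetical, the three checks are invented examples of binary criteria, and the toy "skill" simply echoes its prompt where a real run would call a model.

```python
from typing import Callable, List, Tuple

Check = Callable[[str], bool]                 # one yes/no checklist question
Mutation = Tuple[str, Callable[[str], str]]   # (changelog note, edit to apply)


def pass_rate(output: str, checklist: List[Check]) -> float:
    """Fraction of binary checks that the output passes."""
    return sum(check(output) for check in checklist) / len(checklist)


def hill_climb(prompt: str, mutations: List[Mutation],
               run_skill: Callable[[str], str], checklist: List[Check]):
    """Try one change at a time; keep it only if the pass rate improves."""
    best = pass_rate(run_skill(prompt), checklist)
    changelog = []
    for note, change in mutations:
        candidate = change(prompt)
        score = pass_rate(run_skill(candidate), checklist)
        kept = score > best
        changelog.append((note, round(score, 2), kept))
        if kept:  # otherwise the previous prompt stays in place (the revert)
            prompt, best = candidate, score
    return prompt, best, changelog


# Hypothetical checklist with three yes/no questions.
checklist = [
    lambda out: any(ch.isdigit() for ch in out),   # includes a specific number?
    lambda out: "synergy" not in out.lower(),      # avoids a banned buzzword?
    lambda out: len(out.split()) <= 8,             # at most 8 words?
]
mutations = [
    ("remove buzzword", lambda p: p.replace("synergy", "focus")),
    ("pad with filler", lambda p: p + " and so much more right now"),
    ("add a specific number", lambda p: p.replace("today", "in 14 days")),
]

# Toy skill: echo the prompt as the output.
final_prompt, final_score, changelog = hill_climb(
    "Unlock synergy for your team today", mutations, lambda p: p, checklist
)
```

In this toy run the second change is reverted because it lowers the pass rate, while the first and third are kept. The `changelog` list records all three attempts, including the failed one.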
Artifacts and Feedback Loops
- The most valuable output of autoresearch is the changelog documenting every attempted change and its result — institutional knowledge that lets future models pick up where the last left off (from ole lehmann 10x claude skills autoresearch)
- The system catches changes that improve individual checklist items but hurt overall output quality — a tighter word count improved conciseness but degraded CTA quality, so the system reverted it (from ole lehmann 10x claude skills autoresearch)
- Running autoresearch autonomously with a live dashboard that auto-refreshes every 10 seconds lets you walk away while the agent iterates — stops when hitting 95%+ three times in a row (from ole lehmann 10x claude skills autoresearch)
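The stopping rule and the changelog described above are simple enough to sketch. The function names and the JSON field names below are assumptions made for illustration; the source only states the rule (95%+ three times in a row) and that every attempted change and its result is recorded.

```python
import json
from typing import List


def should_stop(scores: List[float], threshold: float = 0.95, streak: int = 3) -> bool:
    """True once the last `streak` scores are all at or above `threshold`."""
    return len(scores) >= streak and all(s >= threshold for s in scores[-streak:])


def changelog_line(round_no: int, change: str, before: float, after: float) -> str:
    """One JSON line per attempt, including the ones that get reverted."""
    return json.dumps({
        "round": round_no,
        "change": change,
        "before": before,
        "after": after,
        "kept": after > before,  # compare overall scores, not single checklist items
    })
```

Comparing overall scores rather than individual checklist items is what lets the loop revert a change that helps one criterion while hurting the output as a whole.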
Packaged Tooling: Evo Plugin
- The Evo plugin transforms a codebase into an autonomous research loop that automatically discovers metrics to measure and instruments benchmarks — eliminating manual benchmark creation by auto-instrumenting measurements from codebase analysis (from evo claude autoresearch orchestrator)
- Evo runs tree search algorithms with parallel subagents to optimize code performance and research outcomes, and ships as an open-source orchestrator plugin for both Claude Code and Codex (from evo claude autoresearch orchestrator)
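To make "tree search with parallel subagents" concrete, here is a generic beam-style sketch. It is not Evo's implementation, whose internals the source does not describe: `benchmark`, `expand`, and `tree_search` are hypothetical names, the metric is a toy function to minimize, and a thread pool stands in for parallel subagents.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def benchmark(config: int) -> float:
    """Hypothetical metric to minimize; stands in for a real benchmark run."""
    return float((config - 7) ** 2)


def expand(config: int) -> List[int]:
    """Hypothetical child variants proposed from one node."""
    return [config - 1, config + 1, config * 2]


def tree_search(root: int, depth: int, beam_width: int, workers: int = 4) -> Tuple[float, int]:
    """Expand the frontier, score children in parallel, keep the best few."""
    best = (benchmark(root), root)
    frontier = [root]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(depth):
            children = sorted({c for node in frontier for c in expand(node)})
            scores = list(pool.map(benchmark, children))  # parallel evaluation
            ranked = sorted(zip(scores, children))
            best = min(best, ranked[0])
            frontier = [child for _, child in ranked[:beam_width]]
    return best


best_score, best_config = tree_search(root=1, depth=4, beam_width=2)
```

Unlike the single-path hill climb, a beam keeps several candidates alive per level, so one unlucky step does not end the search.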
Scaling: Automated Skill Tuning
- Combining autoresearch with Hamel Husain's evals-skills framework creates an automated pipeline where skills are continuously evaluated and improved — "auto-evals" for agent skill quality (from autoresearch skill tuning evals)
- Tuning 190+ skills requires automated pipelines rather than manual prompt engineering — running autoresearch "all day in the background" as a continuous improvement loop (from autoresearch skill tuning evals)
Scaling: Parallel Execution
- Parallel experiment execution (910 experiments in 8 hours) achieved 9x speedup over sequential autoresearch, at ~$300 compute + $9 Claude API (from parallel gpu autoresearch skypilot)
- The Claude Code agent developed an optimization on its own: it screened candidates on cheaper H100s and promoted winners to H200s, allocating resources autonomously without being instructed to (from parallel gpu autoresearch skypilot)
- Key ML finding: scaling model width mattered more than every hyperparameter trick combined — a result sequential search likely would have missed due to limited exploration (from parallel gpu autoresearch skypilot)
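The screen-then-promote behavior described above resembles a two-tier evaluation: score everything with a cheap run, then spend the expensive run only on the finalists. The sketch below assumes lower scores are better; `screen_then_promote` is a hypothetical name, and the candidate labels and loss values are made up purely to exercise the function. The agent's actual policy is not given in the source.

```python
from typing import Callable, Dict, List, Tuple


def screen_then_promote(candidates: List[str],
                        cheap_eval: Callable[[str], float],
                        full_eval: Callable[[str], float],
                        promote_k: int) -> Tuple[str, Dict[str, float]]:
    """Rank all candidates with the cheap evaluation, then fully evaluate the top k."""
    ranked = sorted(candidates, key=cheap_eval)            # lower is better
    finalists = ranked[:promote_k]
    full_scores = {c: full_eval(c) for c in finalists}     # expensive tier
    winner = min(full_scores, key=full_scores.get)
    return winner, full_scores


# Made-up numbers for illustration only.
cheap = {"a": 3.1, "b": 2.9, "c": 2.7, "d": 2.8}
full = {"a": 2.9, "b": 2.6, "c": 2.4, "d": 2.3}
winner, full_scores = screen_then_promote(list(cheap), cheap.get, full.get, promote_k=2)
```

Here the cheap tier ranks "c" first, but the full evaluation prefers "d", which is why promoting more than one finalist can matter.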
Scaling: Distributed Swarms
- Hyperspace generalizes autoresearch into a platform where users describe optimization problems in plain English and a distributed swarm solves them — zero code (from hyperspace agi autoswarms)
- 237 agents with zero human intervention ran 14,832 experiments across 5 domains: ML agents drove validation loss down 75%, search agents evolved 21 scoring strategies, finance agents achieved Sharpe 1.32 (from hyperspace agi autoswarms)
- A "playbook curator" distills why winning mutations work into reusable patterns, so new agents bootstrap from accumulated wisdom rather than starting cold (from hyperspace agi autoswarms)
Generalization
- The autoresearch pattern generalizes beyond prompts to anything measurable: one person optimized page load from 1100ms to 67ms in 67 rounds using the same try-measure-keep/revert loop (from ole lehmann 10x claude skills autoresearch)
Full-Stack Research Agents
- Feynman (a Claude-based agent) generates cited meta-analyses in 30 minutes, replicates experiments on Runpod, audits claims against code, and simulates peer review, automating the full research workflow (from feynman claude code for research)
- GPT-5.4 Pro is "an order of magnitude better" (per Terence Tao) for deep/architectural questions; use @steipete's 'oracle' tool to invoke it at the planning stage (from optimizing academic work with gpt 5.4 pro and coding agents)
Research-as-Agent-Workflow
- Power users are investing 1,200+ hours into Claude-based research workflows across AI papers, market analysis, and competitive intelligence, indicating Claude is becoming a primary research tool for knowledge workers (from claude research assistant prompts)
- Specialized prompt libraries for domain-specific research (papers, markets, competitive intel) are a key unlock for research agent productivity (from claude research assistant prompts)
- At ~100 articles and ~400K words, LLMs can handle complex Q&A against a personal wiki without fancy RAG: auto-maintained index files and brief document summaries are sufficient for the model to navigate the space (from llm powered personal knowledge bases)
- File query results back into the wiki after each research session: explorations and questions "add up" in the knowledge base, so every LLM interaction enhances future queries rather than disappearing into chat history (from llm powered personal knowledge bases)
- As a knowledge base repo grows, the natural evolution is synthetic data generation + finetuning to have the LLM "know" the data in its weights rather than relying on context windows, a long-term trajectory for personal knowledge systems (from llm personal knowledge base workflow)
- Monthly wiki health check: "flag contradictions between articles, find topics mentioned but never explained, list claims not backed by a source in raw/, suggest 3 new articles for gaps". This prevents error compounding when LLM outputs get filed back into the knowledge base (from nick spisak shared link)
- LLM Council method (Karpathy-inspired): instead of trusting one model's answer, 5 advisors with different thinking styles argue the same question, then anonymously peer-review each other, which catches blind spots no individual perspective reveals (from link share without context)
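One item from the monthly health check above, "find topics mentioned but never explained", can be partly mechanized before handing the rest to an LLM. The sketch assumes the wiki uses `[[wikilink]]` syntax and is loaded as a title-to-text mapping. Neither detail comes from the source, and `unexplained_topics` is a hypothetical helper.

```python
import re
from typing import Dict, List

# Captures the target of [[Target]], [[Target|label]] or [[Target#section]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def unexplained_topics(articles: Dict[str, str]) -> List[str]:
    """Return linked topics that have no article of their own."""
    known = {title.lower() for title in articles}
    mentioned = set()
    for text in articles.values():
        for target in WIKILINK.findall(text):
            mentioned.add(target.strip())
    return sorted(t for t in mentioned if t.lower() not in known)


wiki = {
    "Autoresearch": "See [[Hill climbing]] and [[Changelog|the changelog]].",
    "Changelog": "Feeds back into [[Autoresearch]].",
}
gaps = unexplained_topics(wiki)
```

The resulting list is a natural input for the "suggest 3 new articles for gaps" step, while contradiction and sourcing checks still need a model or a human reader.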
Self-Improving Skills Loops
- The Karpathy autoresearch pattern is now operational for Claude Code skills: define 3-5 binary eval criteria, run the skill 10 times with varied inputs, have an evaluator score every output, identify failure patterns, rewrite the prompt, retest, and keep the winner. A hook-writer skill went 32/50 → 47/50 overnight; the approach works for hooks, briefs, ad copy, scripts, and reports (from claude code self improving skills automation)
- /autobrowse skill (inspired by Karpathy's autoresearch harness): the agent explores web pages via Browserbase CLI, learns from failed attempts, iterates until it converges on a reliable workflow, then graduates the winning approach into a reusable browser skill once token usage is optimized. This is autoresearch applied specifically to web automation (from autobrowse skill web automation agent)
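The scoring step in the skills loop above (binary criteria, several runs, a total such as 32/50) reduces to a pass/fail grid. The sketch below is an assumption-laden illustration: `score_runs`, the three criteria, and the sample outputs are all invented. Per-criterion failure counts are one simple way to "identify failure patterns".

```python
from typing import Callable, Dict, List, Tuple


def score_runs(outputs: List[str],
               criteria: Dict[str, Callable[[str], bool]]) -> Tuple[int, int, Dict[str, int]]:
    """Return (checks passed, checks possible, failures per criterion)."""
    names = list(criteria)
    grid = [[bool(criteria[name](out)) for name in names] for out in outputs]
    passed = sum(sum(row) for row in grid)
    failures = {name: sum(not row[i] for row in grid) for i, name in enumerate(names)}
    return passed, len(outputs) * len(names), failures


# Hypothetical criteria for a hook-writing skill.
criteria = {
    "has_number": lambda out: any(ch.isdigit() for ch in out),
    "under_12_words": lambda out: len(out.split()) < 12,
    "no_question": lambda out: "?" not in out,
}
outputs = [
    "Cut churn 30% in one week",
    "Why does nobody talk about this?",
    "3 mistakes that quietly drain your ad budget",
    "The secret to growth",
]
passed, possible, failures = score_runs(outputs, criteria)
worst = max(failures, key=failures.get)  # criterion to target in the next rewrite
```

With 5 criteria and 10 runs the denominator would be 50, which matches the 32/50 → 47/50 framing in the text.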
End-to-End Research Agents
- ml-intern automates the post-training research loop: arXiv reads, citation walking, HF dataset pulls, dataset reformatting before training (so it doesn't waste GPU hours on bad data), HF Jobs training when no local GPUs are available, run monitoring, eval-output reading, failure diagnosis, and retraining. It beat Claude Code on GPQA (32% vs 22.99% in <10h) by finding OpenScience+NemoTron-CrossThink and running 12 SFT runs on Qwen3-1.7B (from ml intern automated research agent)
- ml-intern recognizes when datasets are too low-quality and writes scripts to generate synthetic replacements: 1100 synthetic healthcare data points, upsampled 50x, beat Codex on HealthBench by 60%; full GRPO with autonomous ablation cycles (from ml intern automated research agent)
- See Hermes Agent for the Hermes v0.12.0 research-agent recipe (pick domain → sources → signal → evidence vault → daily briefs → plain-English feedback) and the "research-as-substrate-for-other-agents" principle. Those insights are primarily about Hermes; the autoresearch lens is the same Karpathy-style hill-climbing loop applied to domain monitoring.
- Perplexity Computer's Equity Research Council pulls research from GS/JPM/MS/Evercore in 2 minutes and surfaces where they agree vs disagree: multi-analyst comparison as a 2-minute workflow, mimicking how top hedge fund PMs evaluate (from perplexity equity research council workflow)
Incremental Automation
- OpenAI's "Lord Bottleneck" pattern: a growth-team staffer used Codex for individual experiment steps (analyze data, write experiment code, interpret results, produce deck), then chained them into one giant skill, then asked Codex "do this every morning". The autoresearch loop was bootstrapped from incremental task acceleration, not a top-down "automate everything" plan, and produced significant company value (from openai lord bottleneck codex automation)
- Don't try to automate the entire pipeline upfront: start with single-task acceleration, connect successful pieces into a skill once they work individually, then schedule it. Naming the system ("Lord Bottleneck") makes it approachable and easier for teams to interact with (from openai lord bottleneck codex automation)
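The "connect pieces only once they work individually" rule above can be expressed as a guard on composition. This is a sketch under stated assumptions: `Step`, `build_skill`, and the `proven` flag are hypothetical, and each step just appends its name to a log where a real step would call an agent. Scheduling (for example a daily cron job) is left out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    run: Callable[[List[str]], List[str]]
    proven: bool = False  # set to True once the step works reliably on its own


def build_skill(steps: List[Step]) -> Callable[[List[str]], List[str]]:
    """Chain steps into one callable, refusing any step not yet proven alone."""
    unproven = [s.name for s in steps if not s.proven]
    if unproven:
        raise ValueError(f"not yet reliable on their own: {unproven}")

    def skill(data: List[str]) -> List[str]:
        for step in steps:
            data = step.run(data)
        return data

    return skill


def tag(label: str) -> Callable[[List[str]], List[str]]:
    """Toy step body: record that the step ran."""
    return lambda log: log + [label]


names = ["analyze_data", "write_experiment_code", "interpret_results", "produce_deck"]
morning_run = build_skill([Step(n, tag(n), proven=True) for n in names])
result = morning_run([])
```

The guard makes the bootstrap order explicit: a step joins the chained skill only after it has been accelerated and validated as a single task.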
Voices
21 contributors
Ole Lehmann
@itsolelehmann
I help non-technical people make more money with AI agents. AI connoisseur, robotics maxi, eu/acc supporter, dad, techno optimist
Andrej Karpathy
@karpathy
I like to train large deep neural nets. Previously Director of AI @ Tesla, founding team @ OpenAI, PhD @ Stanford.
Alex Prompter
@alex_prompter
Marketing + AI = $$$ 🔑 @godofprompt (co-founder) 🎥 https://t.co/IodiF1Ra5f (co-founder)
Cody Schneider
@codyschneiderxx
folllow for shiposting about the growth tactics i'm using to grow my startup building @graphed with @maxchehab Get Started Free - https://t.co/stXlkQBlSj
Nick Spisak
@NickSpisak_
| AI Transformation Engineer | Seven Figure E-Commerce Business Owner
George from 🕹prodmgmt.world
@nurijanian
Can I make everyone a great product manager? I will do my best | Get my product management OS + AI skills for Claude Code/Cursor: https://t.co/ngCnvp77SD
TBPN
@tbpn
Technology's daily show. Hosted by @johncoogan & @jordihays. Streaming live 11a-2p PT every weekday. Sign up for TBPN's daily newsletter at https://t.co/Nhf5ohjInO.
Advait Paliwal
@advaitpaliwal
disciple of experience
Aksel
@akseljoonas
building AI agents @huggingface 🤗
Alok Bishoyi
@alokbishoyi97
building something new. capital allocator @dzerovc. @iitbombay alum. tweets on AI, India, travel, lifting, NFL, occasional high-temp sampling.
Aniket Panjwani
@aniketapanjwani
I teach agentic coding to economists || PhD Economics Northwestern || Director of AI/ML @ Payslice || ex-MLOps @ Zelle
Graeme
@gkisokay
AI agent enjoyer | Founder @amplifi_now | Building better agents
Griffin Hilly
@GriffinHilly
Bond trader by day, pro-humanism/nuclear on my own time. @MadiHilly’s biggest fan
James Bedford
@jameesy
engineering @zerion, building https://t.co/3RQYuZAcrt
Jeff Grimes
@jeffgrimes9
Head of Live Events product at Perplexity. Working on Finance Computer & Perplexity Finance.
Ellaa
@learnwithella
AI Enthusiast | Content Creator & Digital Publisher 📰✨ | Making complex tech simple. Follow for daily AI updates, prompt guides, and future tech! 🚀
Zhanghao Wu
@Michaelvll1
Building SkyPilot @skypilot_org | Co-creator of @lmsysorg, PhD @Berkeley_EECS @ucbrise. Prev: @MIT, @sjtu1896
rahul
@rahulgs
head of applied ai @ ramp
Shiv
@shivsakhuja
Pontificating... / Vibe GTM-ing / Making Claude Code do non-coding things building a team of AI coworkers @ Gooseworks / prev @AthinaAI /@google / @ycombinator
Shrey Pandya
@shreypandya
building @browserbase // @neo, @v1michigan 〽️
Varun
@varun_mathur
Agentic General Intelligence @HyperspaceAI (Co-founder and CEO)