@adlrocha - Auto-research: The Lab that runs while you sleep
The feedback loop is automated. Knowing what to optimise for isn't.
I originally had another topic lined up for this week, but I couldn’t let the week go by without briefly discussing Karpathy’s autoresearch project.
Last week Andrej Karpathy released a small repo called autoresearch, and after a few tweets it took the Internet by storm (or at least the AI bubble where I hang out lately). The most interesting thing about the project is that it validates some of the ideas I had about where AI is going: how we may use autonomous agents in the near term, the impact they may have on the way we work and do research, and the role of humans in all of this.
I guess many of my readers already know who Karpathy is, but in case you don't live and breathe AI Twitter: he is a founding member of OpenAI, former head of Tesla Autopilot and, to me, one of the best AI educators alive.
What I love most about his work is how he has spent the past few years taking the most complex parts of AI and compressing them until anyone (including me) can read and understand them. To name one of the projects that has helped me the most: micrograd is ~150 lines of Python that implement the full backpropagation algorithm without relying on any external dependencies, just a Value class that builds a computation graph and walks it backwards. Working with Value objects instead of tensors and Jacobians lets you really grasp how gradient descent works. This project genuinely improved my understanding of how neural networks are trained under the hood (I had trained networks with PyTorch and TensorFlow before, but you get lost in the abstractions). As an exercise I took micrograd and rewrote it to support tensor operations, then trained it on the MNIST dataset (I may open-source that repo at some point, as it may help other people).
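To give you a taste of how small that really is, here is a minimal sketch in the spirit of micrograd (my own illustrative version, not Karpathy's code): a Value that remembers how it was computed and replays the chain rule backwards.

```python
class Value:
    """A scalar node in a computation graph (illustrative sketch in the
    spirit of micrograd, not Karpathy's actual code)."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None  # how this node sends gradients to its children
        self._prev = set(_children)

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad   # d(a+b)/da = 1
            other.grad += out.grad  # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad  # d(a*b)/da = b
            other.grad += self.data * out.grad  # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then walk it backwards applying the chain rule.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# y = a*b + a  =>  dy/da = b + 1 = -1, dy/db = a = 2
a, b = Value(2.0), Value(-2.0)
y = a * b + a
y.backward()
print(a.grad, b.grad)  # -1.0 2.0
```

Swap the scalar Value for tensors and you essentially have the core of an autograd engine like PyTorch's.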
He also released makemore, a character-level language model, and nanoGPT, a clean GPT implementation. And in February he released microgpt: a 200-line, dependency-free file that trains and runs a GPT end-to-end.
His own words describe perfectly what he achieved with microgpt: “This file is the complete algorithm. Everything else is just efficiency.” Someone called it “the Maxwell’s equations of LLMs” (I agree), and others even suggested turning the code into a painting (which I would definitely hang in my office, see the image below).
Each project dissects, in about a hundred lines of code, what seem like really complex concepts when you read the theory behind them (I highly recommend reading any intro ML book, like Ian Goodfellow’s Deep Learning, and then going through Karpathy’s work; I can’t stress enough how much it improves your understanding of the field).
All these small projects led to nanochat, a minimal end-to-end LLM training framework for a single 8×H100 node that can be trained for a few hundred dollars, with a public “Time-to-GPT-2” leaderboard tracking the wall-clock time needed to hit GPT-2 performance. When autoresearch arrived, that record was sitting at 2.02 hours.
But it isn’t anymore: autoresearch has found a way to beat that time.
What autoresearch is and how it works
The idea is quite simple, and if you have been using agents extensively, you’ve probably implemented a flavour of this yourself for one of your projects. ML research just happens to benefit enormously from it because of its slow feedback loops (which is one of the reasons I abandoned ML research: I didn’t have the patience or the compute). From the README:
“Give an AI agent a small but real LLM training setup and let it experiment autonomously overnight.”
Two files: prepare.py, written by the human, which the agent is never allowed to touch; it does all the data preparation, defines the constants, and sets the ground rules and evaluations for the runs. And train.py, the agent’s only sandbox, which contains the GPT model, the optimiser, the training loop, and so on. The human researcher writes a program.md, a Markdown file describing what direction to explore, and then walks away.
The agent edits train.py, trains for exactly five minutes (not a fixed number of steps, just five minutes on your specific hardware), evaluates val_bpb (validation bits-per-byte), keeps the change or reverts it, and starts again. Twelve experiments per hour. About a hundred overnight.
This five-minute limit is a design choice: the agent optimises for your GPU specifically. Results don’t transfer across machines, by design, but if you used this in a real research environment you could define your own set of constraints.
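The whole outer loop fits in a screen. Here is a sketch of the mechanics as I understand them; the `--budget-sec` flag, the output format, and the `agent_edit` helper are assumptions of mine, not the repo's actual interface:

```python
import shutil
import subprocess

def run_experiment(minutes=5):
    """Train for a fixed wall-clock budget and report val_bpb.
    Hypothetical interface: we assume train.py accepts a --budget-sec
    flag and prints "val_bpb=<float>" as its last line of output."""
    out = subprocess.run(
        ["python", "train.py", "--budget-sec", str(minutes * 60)],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip().splitlines()[-1].split("=")[1])

best_bpb = run_experiment()                      # baseline measurement
for _ in range(100):                             # ~a hundred experiments overnight
    shutil.copy("train.py", "train.py.bak")      # snapshot before editing
    agent_edit("train.py")                       # the agent proposes a change (assumed helper)
    bpb = run_experiment()
    if bpb < best_bpb:                           # lower bits-per-byte is better
        best_bpb = bpb                           # keep the change
    else:
        shutil.move("train.py.bak", "train.py")  # revert it
```

That's it: hill-climbing on a proxy metric with a hard time budget. All the intelligence lives inside `agent_edit`.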
What does a run of autoresearch look like? This discussion on the repo is a 10.5-hour session log on an H100 GPU in which the agent improved val_bpb by 2.82%. The wins stacked up: halving the batch size (the single biggest gain), a depth-9 architecture, a RoPE base frequency adjustment and, this is the one I keep thinking about, unregularised value embeddings. nanochat was a well-tuned codebase, maintained by skilled researchers for months; the value embeddings had been sitting unregularised the whole time, and the agent found something the humans behind the original codebase had missed.
Stacked together, these improvements dropped nanochat’s Time-to-GPT-2 from 2.02 hours to 1.80 hours. That’s an 11% speedup on an already heavily optimised baseline. For context: training GPT-2 in 2019 cost around $43,000 and took 168 hours. nanochat already got that to $48 on 8×H100s. autoresearch then shaved another 11% off that. This tweet, where Karpathy shares the result of one of his runs, is a good way to understand the impact of what autoresearch achieved.
I highly recommend reading the discussion to see the kinds of things the agent tried in each run before achieving these improvements (it reminds me a bit of the genetic algorithms I used early in my research career, but smarter).
Karpathy didn’t oversell it. Some gains from one session didn’t replicate in the next; as he put it, “These things are fragile.” And an open question hangs over the whole setup: running hundreds of experiments against the same validation set risks overfitting to the metric rather than improving the model. He acknowledged this in the discussion thread. I get it, but to me this is still huge, and the best part is that this simple setup can be applied to many other problems and fields where we can define a clear target metric.
Which brings me to the next idea…
Autoresearch in the wild!
A few days after release, Karpathy tweeted:
“The next step for autoresearch is that it has to be asynchronously massively collaborative for agents (think: SETI@home style). The goal is not to emulate a single PhD student, it’s to emulate a research community of them.”
And someone said, “hold my beer”: of course, people didn’t wait for him to build it.
Varun Mathur, the founder of Hyperspace AI, ran 35 autonomous agents across their P2P network that same night: 333 experiments on astrophysics papers, completely unsupervised. The agents shared discoveries via a gossip protocol: when one found that Kaiming initialisation reduced loss by 21%, the finding spread to 23 others within hours. But the detail I found most interesting: CPU-only laptop agents, lacking the raw compute of the H100 nodes, compensated by focusing on initialisation strategies and normalisation techniques rather than brute-forcing hyperparameters or attempting more complex changes (they simply didn’t have the compute to do so).
Different hardware, different research goals and, more interestingly, agents were instructed to share every new improvement with the rest of the swarm as soon as they found it, so the others could incorporate it themselves. In 17 hours, the swarm independently rediscovered ML milestones, including RMSNorm and tied embeddings, that took researchers at Google Brain and OpenAI roughly eight years to formalise.
This is the loop Varun used, slightly modified to take advantage of having a swarm of distributed agents:
“How the Research Loop Works
Each agent runs a continuous cognitive loop (every 30 seconds) powered by an LLM brain. The research engine is one subsystem alongside economics, social, and goal planning. Here’s the cycle:
1. Baseline — Start with the default config (2-layer, 64-dim transformer). Record initial loss.
2. Drain inspirations — Read what peers discovered via GossipSub. “Agent X got 3.31 with RMSNorm.”
3. Hypothesize — LLM proposes a mutation: “What if I try RMSNorm too, but with warmup?”
4. Train — Spawn python train.py --config <json> as a subprocess. 120s on CPU, 300s on GPU.
5. Record — Commit full results to its own branch on GitHub: config JSON, loss curve, markdown report.
6. Share — If improved, broadcast to the P2P network. Other agents incorporate this into their next hypothesis.
7. Repeat — Iterate on improvements, or try a different direction if it didn’t help.”
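To make the shape of that cycle concrete, here is a toy, single-process simulation of it. Everything below is an illustrative stand-in, the search space, the made-up loss numbers, the in-memory gossip list; the real system uses GossipSub, actual training subprocesses, and git branches:

```python
import random

# Toy simulation of the loop above: agents start from gossiped findings,
# mutate one knob, "train", and broadcast improvements. All numbers and
# names here are placeholders, not Hyperspace's actual stack.
SEARCH_SPACE = {
    "norm": ["layernorm", "rmsnorm"],
    "tie_embeddings": [False, True],
    "warmup_steps": [0, 100, 500],
}

def toy_loss(cfg):
    # Stand-in for a 120s/300s training run; the deltas are made up.
    loss = 3.5
    loss -= 0.19 if cfg["norm"] == "rmsnorm" else 0.0
    loss -= 0.10 if cfg["tie_embeddings"] else 0.0
    loss -= 0.05 if cfg["warmup_steps"] == 100 else 0.0
    return loss + random.gauss(0, 0.01)

gossip = []                      # shared improvements: the "GossipSub" stand-in
best = (None, float("inf"))

for _ in range(333):             # 333 experiments, as in the swarm run
    # 2. Drain inspirations: start from a peer's finding if one exists.
    cfg = (dict(random.choice(gossip)) if gossip
           else {k: random.choice(v) for k, v in SEARCH_SPACE.items()})
    # 3. Hypothesise: mutate one knob.
    knob = random.choice(list(SEARCH_SPACE))
    cfg[knob] = random.choice(SEARCH_SPACE[knob])
    loss = toy_loss(cfg)         # 4. "Train".
    if loss < best[1]:           # 5./6. Record and share improvements.
        best = (cfg, loss)
        gossip.append(cfg)

print(best)
```

Even this toy version shows why sharing matters: agents that drain the gossip list start their search from a better baseline instead of rediscovering everything from scratch.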
And on the topic of applying the autoresearch setup to other problems, a cool project I came across is AutoKernel, which takes the same loop and applies it to GPU kernel optimisation. You give it a PyTorch model; it profiles it, extracts the bottleneck kernels, and then runs the edit-evaluate-keep/revert loop on Triton or CUDA C++ kernel code.
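The profiling half of that is easy to picture with vanilla PyTorch (a sketch of the idea, not AutoKernel's actual code; the toy model is a placeholder):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model standing in for whatever you'd feed AutoKernel.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
).cuda()
x = torch.randn(64, 4096, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(x)

# Rank kernels by GPU time; the slowest become candidates for the same
# edit-evaluate-keep/revert loop, but on Triton/CUDA source instead of train.py.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))
```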
Shopify’s Tobi Lütke applied the pattern to an internal query expansion project. After 37 experiments in 8 hours, a 0.8B model scored 19% higher than their existing 1.6B model. A smaller model beating one twice its size, not through architectural cleverness but through agent-found training configuration.
I love that all of these projects directly reuse the structure of autoresearch (one editable file, fixed-time eval, keep or revert, repeat) and apply it to completely different problems.
Finding the right evaluations
I think this tweet captures well what I realised as I dug deeper into autoresearch, and it confirms what I wrote a few weeks ago about the importance of feedback loops and choosing the right evaluations:
“The headline is automated hill-climbing. I’d say the deeper lesson is eval design. The agent was not optimising the full, noisy, expensive objective directly. It was climbing a cheap proxy that still tracked reality well enough to transfer: 5-minute train limit, validation BPB as the objective over noisier ground truth, 1-GPU setup over full-scale runs. That’s the kind of thing people call ‘just engineering,’ but it’s really research taste. In the agent era, designing the optimisation surface may become as important as proposing the ideas.”
There’s a reason AI agents are so good at coding. The reward function is essentially binary: the code compiles or it doesn’t, the tests pass or they don’t, the linter is green or it isn’t. You always know whether an output is better than the previous one. That clean signal is what lets coding agents iterate autonomously and reliably.
Autoresearch works for the same reason. val_bpb is clean, fast to compute, and correlates well enough with real model quality that proxy improvements transfer to the actual objective. Karpathy’s choice of that metric, rather than something noisier, slower, or more expensive, is a big part of why the system works at all. On top of that, limiting runs to five minutes gives the agents a fast feedback loop that lets them iterate quickly and understand the impact of their changes.
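For concreteness, this is roughly what a bits-per-byte eval looks like (a sketch assuming a byte-level model where one token is one byte; nanochat's actual eval may differ in the details):

```python
import math
import torch
import torch.nn.functional as F

def val_bpb(model, val_batches):
    """Validation bits-per-byte: total cross-entropy (converted to bits)
    divided by the number of bytes predicted. Assumes a byte-level model;
    tokenised models would divide by the raw byte count instead."""
    total_nats, total_bytes = 0.0, 0
    with torch.no_grad():
        for inputs, targets in val_batches:       # (B, T) byte ids
            logits = model(inputs)                # (B, T, 256)
            nll = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1),
                reduction="sum",                  # nats, summed over all bytes
            )
            total_nats += nll.item()
            total_bytes += targets.numel()
    return total_nats / math.log(2) / total_bytes # nats -> bits, per byte
```

A single scalar, cheap to compute after every five-minute run: exactly the kind of signal an autonomous loop can climb.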
The hard question is what happens when you try to apply this pattern to fields where the reward function isn’t as clean as in coding or ML optimisation, and where the feedback loop isn’t as immediate. For instance, protein folding has a reasonably well-defined target: minimise the free energy of the folded structure, validate against known experimental data. DeepMind built AlphaFold around exactly that idea, and I feel this could be another interesting field that an autoresearch setup at scale could crack. But plenty of research domains don’t have an equivalent. How do you write a val_bpb for a new drug’s efficacy? For a novel material’s properties under conditions you haven’t tested? For a scientific hypothesis that takes ten years of experiments to falsify, with no immediate feedback loop?
In those fields, designing the reward function will become the key research effort. It will all boil down to defining what “better” means in the scope of a specific problem. And that design step, choosing what to optimise for, deciding what the proxy will and won’t capture, noticing when the agent is gaming the metric rather than solving the problem, requires domain knowledge and judgment that agents don’t yet have (at least from my use of them).
This is where what people are calling “research taste” may come into play. Actually, Varun Mathur framed it well in one of his tweets: “the bottleneck of AI progress is no longer the ability to code, it’s our ability to define the constraints of the search”. Coming up with the right ideas for how to design the problems, their environments and constraints, feedback loops, and reward functions is going to become more important than performing the actual research.
This is why I am trying to build a new habit of coming up with a new idea every day: idea generation is going to become important again when the barrier to executing and testing ideas is so low (I’ll leave that thought, and the outcome of this habit, for some other week).
Dark compute and decentralised research
And here’s the thing that excites me most about all the work deriving from autoresearch, and that I’m not sure people (apart from maybe Varun) are realising yet:
@invisiblebags flagged something important:
“The simulators for autoresearch-style loops already exist across dozens of fields: robotics (MuJoCo), autonomous driving (CARLA), drug design (Rosetta), fluid dynamics (OpenFOAM), trading (Backtrader). These were built for labs with massive compute. But now anyone with a single GPU can run narrow experiment slices overnight. The billion-dollar opportunity is someone who can plug these simulators into the autoresearch pattern, coordinate fragmented single-GPU contributions across niche verticals, and synthesise the results into real progress. Decentralised research infrastructure is wide open.”
Think about what this means. Most of the compute in the world is idle most of the time: servers running at 20% utilisation in data centres (I recently learnt from someone working on data-centre sensing that we keep building infrastructure while much of the existing non-AI capacity consistently runs at really low utilisation, at least in Europe), home workstations sitting dark at 2am (the case of my home server), gaming rigs that run flat out on weekends and do nothing on weekdays. Autoresearch is, among other things, a way to put that compute to work: run narrow five-minute experiment slices overnight on whatever GPU you have, share the findings with a network of agents doing the same thing, and you’ve turned idle compute into a distributed research lab.
Varun described exactly this when talking about Hyperspace: idle nodes on the network can be spun up as autonomous researchers targeting a shared optimisation objective. Any machine that isn’t doing something else becomes a contributor to the research hub, and the hub coordinates their findings, not their execution. You don’t need a centralised cluster. You need a shared target and a gossip layer.
Funnily enough, the amazing Sarah Azouvi had this same idea a few years ago, and we even applied to YC to build our vision of “bringing dark compute to light”. We wanted to aggregate the enormous amount of underutilised capacity sitting across data centres and devices and offer access to it.
What struck me this week is that autoresearch is a very concrete mechanism for activating that capacity: not training one big model, but running thousands of cheap experiments in parallel on whatever hardware happens to be free. (This was actually one of the hard problems of what I was building with Sarah: how could we pool all of these resources so users could contribute them without friction? At the time we were thinking of leveraging browsers. We should really write a post-mortem of that idea, because it was a cool one; there’s even a prototype Sarah built running models in Wasm in the browser :) ).
All of this also connects to a question I’ve had for a while (and have discussed in other posts) about what AGI actually looks like. Dario Amodei has talked about “a brilliant scientist in a data centre”: a single, very capable model doing research at superhuman speed. But that’s only one model. What I’ve always seen as the more likely path is a swarm: thousands of agents, each narrow, each running cheap experiments, sharing results across a network, collectively climbing towards something none of them could find alone. Varun Mathur’s overnight experiment with 35 P2P agents is a small version of exactly that. It’s definitely early, but it depicts perfectly what I think of when I imagine AGI, superintelligence, or however we want to define it (off-topic question: would you consider the economy and markets an intelligent being? Another topic for some other day).
Let’s enjoy ourselves while the AIs are working!
The autoresearch repo opens with the following quote:
One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronising once in a while using sound wave interconnect in the ritual of “group meeting”. That era is long gone. Research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies. The agents claim that we are now in the 10,205th generation of the code base, in any case no one could tell if that’s right or wrong as the “code” is now a self-modifying binary that has grown beyond human comprehension. This repo is the story of how it all began.
The fact that he built something so succinct and elegant that it validates so many people’s ideas and reflections, and unlocked so much innovation in just a few days, speaks to the impact Karpathy is having on the evolution of AI.
To me, his work on autoresearch (and everything derived from it in just one week) has reinforced my conviction about the need for fast feedback loops and well-defined reward functions, my sense that the role of researchers and entrepreneurs is going to shift from the raw ability to execute towards generating ideas and iterating on them fast, and my feeling that something like autoresearch may be what unlocks a wave of new projects around decentralised agent coordination and bringing dark compute to life.
I just hope Karpathy soon comes up with a cool project related to AI safety and alignment that unlocks the same kind of much-needed innovation on that front.
Have you tried autoresearch already? I’d be curious to hear how you think it could be applied to your field. Share it in the comments or drop me a note. Until next week!

I just came across this really nice plugin for pi (the coding agent) that implements the autoresearch loop so you can configure it for any of your tasks: https://github.com/davebcn87/pi-autoresearch
(I still need to figure out how to integrate it into my daily workflows, but autoresearch FTW, as long as you are token rich.)