@adlrocha Beyond The Code

@adlrocha - The Real Cost of Using AI in 2026

adlrocha — Sun, 28 Jun 2026 08:01:29 GMT

A few weeks ago I wrote about the shift from GPU-poor to token-poor. Since this post, and the ones I wrote about my recent obsession with AI independence, a lot of people have asked me for advice about how they should access intelligence: “fine, but what should I actually do? Buy a subscription? Pay per token? Build a rig? Rent one?” I dodged the economics in the token-poor post, so let’s do them properly now.

The first thing I did before writing this post, is to pull my own token bill for the last 60 days (which have actually been slower than usual) in order to model my own token consumption (sidenote: BI built a really cool tool for this in nibble, my agent harness, that I can talk about in coming posts if someone is interested).

91% of my token spend went to expensive models I cannot run at home (and that I never will because of their size and them being closed). But I already held a yearly subscription, so why not use those first? The open models I “could” (and let me add quotes here for now) host myself, the Qwens, DeepSeeks, the GLMs and the Kimis, cost me around $30 over two months. Just. Thirty. Dollars.

This gap is what actually motivated this exercise and my whole post. What if I didn’t have that Claude subscription? How much would’ve cost me my access to intelligence, and what are the alternatives?

The four ways to buy intelligence

There are exactly four ways (that I could come up with) to get tokens out of a large language model.

The first (and probably, the most widely used) is a subscription. You pay a flat monthly fee, and someone else runs the model. This is the ChatGPT Plus, Claude Pro, or Kimi/GLM coding plans. Simple, capped, predictable, and as we’ll see, heavily subsidised.

The second is pay-per-token via an API. No flat fee, you pay for exactly what you consume, priced per million tokens in and per million out. This is generally what you use when you need to power your application with AI, or when you route an agent through a serverless LLM provider like OpenRouter, Fireworks, Together, etc.. I’ll let you correct me in the comments, but I would say people leveraging agents on their day-to-day prefer the predictability of subscriptions than paying-per-token.

The third is renting a GPU in the cloud. You spin up a machine by the hour, load whatever open-weight model you like, and serve it yourself. RunPod, Vast, Lambda. You’re not paying for tokens, you’re paying for raw compute time. With this you don’t need to think about the amount of tokens you are consuming anymore.

The fourth is owning the hardware. You buy the silicon, it sits in your house, and the marginal cost of a token is your electricity bill. As you know, this is the quest I’ve been on for a while now.

I really think each of these wins in a different regime. The tricky thing is knowing which regime you’re in.

The numbers, at real usage

For this exercise, I tried to be as objective and practical as possible. Let me put my own usage through all four and show the maths. My pattern is moderate and mixed: coding, writing, research, a few hours a day, heavy on context. I don’t have token-heavy long-running loops, and all my LLM-powered crons are routed through my local Qwen (not considered for this analysis). From all my consumption, there are roughly 78 million input tokens and 13 million output tokens a year on the replaceable tier, the open models I could plausibly self-host. For personal and professional reasons, this month has been slower than usual in my use of tokens, but it allows me to set a good baseline floor of my usage.

Here’s what that year costs, four ways:

Pay-per-token API: DeepSeek V4 Flash at $0.14 in / $0.28 out comes to about €13 a year. At my current messier mix of open models, call it €130. Either way, low triple digits at most.
Cloud GPU rental: an MI300X with enough memory to hold a 100GB model runs about $1.99 an hour on RunPod. Realistically, we would need at least 200GB of VRAM to run something of the level of the open-models I use through the API. At ninety hours a month that’s roughly €2,300 a year, plus storage so you don’t re-download the weights every cold start.
Own hardware: a usable DIY server is around €2,900 up front for one GPU, and after that maybe €30 a year in electricity. I’ve been looking to build myself an AMD-based rig with at least 4 GPUs equivalent to the RTX3060 or RTX5090 and that takes you to around €12,500. A pair of DGX Sparks, the configuration people actually want, is €9,600. I also love the tinybox red v2, but that’s $12,000 for only 64GB of RAM (and I am not sure about how upgradable it is with a lot of tinkering).
Subscription: whatever your flat fee is, for models the other three options can’t touch at this quality.

Let’s look at those numbers for a second. The API is cheaper than everything by two orders of magnitude. Cloud rental is the worst option on the board, because at ninety hours a month you’re using the machine 12% of the time and paying as if you owned it, but it is true that you don’t have to pay a lot upfront. And the hardware saves you, against the API, almost nothing. I computed the break-even on a €2,900 rig versus a triple digit a year of API tokens, and it is measured in decades (not great).

If your usage looks anything like mine, where I have bursts of high taken consumption, and then calmer periods that I use to focus and think on things that do not require that many tokens, then the decision (at least today, and only looking at current 2026 numbers) is pretty straightforward: pay per token, keep the subscription for the smart stuff as long as it is subsidised, and don’t build anything yourself.

So why am I still thinking about building something myself?

Why subscriptions are the deal in the AI trap

First, let’s chat about something that everyone is talking about, but I want to be explicit about: at today’s prices, the subscription is a gift, and you should take them (I don’t know how to make this bolder, I was tempted to highlight and use a red font).

Using frontier models per output tokens costs real money. Claude Opus is $25 per million out, Fable is (was) $50. If you actually metered a heavy month of frontier chat at API rates, it would dwarf a $20 or even $200 subscriptions. We are being subsidised, and the size of it is startling once you read the analysis that others have done. David Rosenthal, citing SemiAnalysis, shows how for $200 a month you can burn $8,000 in Anthropic tokens or $14,000 in OpenAI tokens. That’s a subsidy of 40 to 70 times the price you pay per token. He calls it the drug-dealer’s algorithm: give the product away until the customer is hooked, then find the price later (you’ve probably read me say that, “we are not in an AI bubble but an AI trap”). OpenAI reportedly turned $13 billion of revenue into $34 billion of costs last year, so “later” is doing a lot of work. A trap built into a bubble (another of those cool sentences that people would assume are LLM-generated but that I came up with myself. This is me giving myself some self-kudos :) ).

But is building your own inference infrastructure the solution for this? Dylan Patel (founder of SemiAnalysis) thinks that local hosting will never be an option, as it needs far more silicon than shared infrastructure. A provider runs one GPU hot across thousands of users and amortises it across all of them. You at home run one GPU at 10-30% utilisation. The subsidy you enjoy is partly just better economics that you physically cannot reproduce alone, and part this “drug dealer algorithm”. Big labs are not covering costs right now, and someone building infrastructure at home is competing with the bidding power of these big pockets (which is what is happening to me). What the hell, even Apple had to raise prices this week due to (allegedly) the AI mania.

So my immediate recommendation, all other concerns aside, take the subsidy while it lasts. It’s real money in your pocket today, and refusing it on principle is just leaving value on the table. I was so afraid of raising subscription prices since last year that I’ve been hedging my access to intelligence by buying yearly subscriptions. They could change the quotas or the models, but at least I had access for one more year. And this is something I am still doing when it’s time to renew.

But I would also advise you to not build your life on the assumption these prices last. The cheap subscription is customer acquisition, not the steady state, and the history of every platform that ever ran below cost to win a market tells you what comes after the market is won. I really hope that I am wrong, and the economics of AI inference change, but you should either be prepared to pay more for your subscription, or find a way to hedge your access to intelligence at a reasonable price.

When owning actually makes sense

Here’s where I have to be honest with myself, because the maths clearly say that it is not a good idea to build an inference infrastructure yourself, but I keep wanting to build it anyway.

I realised owning hardware is not a cost decision, it is the AI version of “owning Bitcoin is not a good investment”. At normal usage it never pays back at current subscription and token prices, and anyone telling you it does is selling something or hasn’t run their own numbers. It is a value decision. You buy it for three things the cloud can’t sell you: privacy, independence, and tokens with no meter on them.

The good news is that the floor price for that has fallen hard, because the open models got genuinely good. A 128GB unified-memory box, a Ryzen AI Max+ machine or a high-end Mac, starts around €2,500 and runs models that were complete science fiction a year ago (but we’ve got too used to smart models). DeepSeek-V4-Flash-REAP-180B, a pruned and quantised version of a 641-billion-parameter model, fits in 97GB and runs on a single such box at 14 to 24 tokens a second. One person who switched to it described Qwen3.6-35B as great for small projects but falling apart past 100K context, then said of the DeepSeek model: “this thing is actually smart, coherent at long context, just works.” Unsloth recently squeezed GLM-5.1 by 85% so it can run on a 256GB Mac. And I recently came across Step-3.7-Flash that looks great.

The kind of open-source models that you can run on less than 10k of compute are still quite limited. But the open-source and local inference community are moving super fast, and less than five grand, today (let’s see what happens to hardware prices a few months from now), buys you a private model on your desk that handles most of what isn’t frontier work. And I am not talking about a toy like “how cute, I can run an LLM at home”, but an actual working tool.

Additional disclaimer, I am looking at this as an individual. If I was doing the analysis for my own company, where I had the funds for proper hardware, and the infrastructure would be shared by other people (or is the core of my product), then I would definitely lean towards local (or cloud) inference.

The real hedge: own some compute anyway

So here’s the practical conclusion of my exercise after running the math.

Owning a decent local box is insurance, not an investment. Insurance is supposed to look like a bad deal on the spreadsheet (like Bitcoin). You don’t buy it expecting a return, you buy it because the thing it protects against is catastrophic and you can’t fix it after the fact.

And the things this local box will protect you against are getting more real than a few months ago (I’ll let you call me paranoid in any case, many have done it already, including my wife). Models get censored or quietly lobotomised on whole categories of topic (right, Fable?). Providers deprecate the exact version your workflow depends on, and the replacement behaves differently. Or you simply lose access in your geography: export controls tighten, a provider decides the EU isn’t worth the compliance cost, a government blocks an API or forbids a model. Anything. When any of that happens, the price of the token stops mattering, because the token is no longer for sale to you at any price.

This is why I bought a Strix Halo a year and a half ago, before any of this arithmetic made sense, when it looked like an expensive toy. It was because if access vanishes tomorrow, I still have a private, uncensorable, good-enough model sitting on my desk that nobody can switch off. A €2,000 box is a cheap hedge against a scenario that costs you everything if it lands and you aren’t ready. I think that case holds for a lot more people than the “build a rig” case does (and in retrospect, this was an amazing investment also, because with the increase in the price of hardware, that same box costs today close to $4,500. See below).

But all prices are going up

There’s an uncomfortable tension running underneath all of this, and I don’t have a clean way out of it.

The case for owning your own compute gets stronger every month, as the access risks above get more plausible. But the cost of acting on it is climbing just as fast. We’re in the middle of a DRAM supercycle: memory contract prices rose 90% in the first quarter of this year, another 58 to 63% in the second, and the analysts don’t see it normalising before 2028. Memory is roughly 80% of a GPU’s bill of materials, so this hits everything.

You don’t need the analyst reports to feel it. 0xSero, who does this for a hobby, put it well: “Since I’ve gotten into this expensive hobby all compute has 1.5x and I don’t see the end in sight.” This has also been my experience. In one weekend update an RTX PRO 5000 went from $7,999 to $8,799, and 5090s jumped a few hundred dollars each. My own build plan is already out of date: the RAM line I priced at €250 a few months ago is €400 to €600 now, and the used GPU I had at €674 is €800 to €950. I wrote down the prices and they expired before I acted on them. I really need to pull the trigger for this compute purchase.

So where did I land, for myself? I’m not building the big rig yet. I already own the Strix Halo, which is roughly a single DGX Spark in disguise. I was thinking about buying a Spark, but it would be paying €4,800 for a second copy of what’s already on my desk. The plan is to stay on the subscriptions, pay pennies per token for the open-model work, run the genuinely private work locally on the hardware I already have, watch the memory market, and buy the upgradable platform later when prices ease. I have a full plan of how to build an upgradable rig that can adapt more cleanly to how the technology evolves than a sealed appliance like my Strix Halo or the DGX Spark (happy to share it in a follow-up post if someone is interested).

Jevons and the meter

There’s one more thing that really bothers me about basing this analysis on my current use of AI.

I unconsciously ration tokens.

I noticed it while reading my own bill. I reach for the cheaper model on a task that deserves the better one. I don’t fire off the five-parallel-agent run I know would help, because I can see the meter ticking. I am, in the small and constant way, behaving like someone who’s token-poor even though I can probably afford it. The meter is definitely influencing my behaviour. I know this is not rational, but it is what it is, I am cheap.

And I’m not the only one. Dan Davies makes the John Henry argument: “the moment large companies start telling their employees to use AI tokens wisely, the game is up, the race is over, and John Henry beats the steam hammer.” His point is about jobs, that if a token is too expensive to spend freely then it can’t be cheap enough to replace a person. But the same signal shows up at my desk. The story we were sold was inference too cheap to meter. The reality is that everyone, from me to the enterprise with a budget, is quietly rationing.

Everyone is calling the Jevons paradox in software, but I call it in my own use of tokens. For those unaware of that paradox, in 1865 William Stanley Jevons noticed that as steam engines got more efficient, England burned more coal, not less. Cheaper-to-use meant more-used, by a wide margin. Metered pricing does the reverse to me: it suppresses demand I would otherwise have. Which means the real prize of a machine in the corner that costs nothing per token might not be the money at all. It might be that it removes the meter, and lets me find out how much intelligence I’d actually use if using it felt free.

And here’s my honest uncertainty: I don’t know the real number of my token consumption. I’ve never run unmetered. I genuinely have no idea whether my real demand for this stuff is double what I currently spend, or ten times, or whether I’d burn out on it in a week. The whole independence project might turn out to be less about escaping a price and more about escaping a habit of self-censorship I didn’t know I had.

But with this in mind, I decided that the next few months I am going to run as if I owned the hardware, letting the token bill go to wherever it gets so I can do a more apples-to-apples comparison between the infrastructure I am trying to build, and they real use of tokens that I would have if I could run unmetered.

How about you? Do you catch yourself rationing the better model to save a few cents, and what do you think you’d do if the meter simply wasn’t there? Or are you using one of those alleged Chinese token pools? (I couldn’t resist sharing that one before closing this week’s post).

Until next week!

@adlrocha - Form Before Data: The Real Bottleneck for Physical AI

adlrocha — Sun, 21 Jun 2026 08:01:59 GMT

A reader messaged me last week with a question about a topic that has been in my backlog for a few months now, AI and the physical world. The request was the following “Can you elaborate on the rate of adoption of AI for the physical world? We see [it] operating almost entirely in the digital realm. The Tesla FSD vehicles are examples of AI moving in the physical world. We are also beginning to see other machines such as humanoid robots move through space by interpreting the visual field. But these examples are still very uncommon.”

He’s right, and while “still very uncommon”, the field is making progress fast. We have AI that writes code, drafts contracts, and passes the bar, and we have a handful of cars and factory robots, and almost nothing in between. But my feeling is that the gap isn’t intelligence, I think the models and foundational technology is there. What we are missing is the right “body” and “senses” for the model to make sense of the world, and the data needed to teach it how to navigate it.

That’s the thesis I want to make the case for in this post. Tesla cracked self-driving first not because its models were the smartest (until quite recently they were using traditional visual pattern recognition models instead of using deep-learning end-to-end), but because the car was already the right shape to act in the world for their specific task. It rolls, it steers, it has somewhere to put cameras. The form of the robot and the actions it had to perform in the physical environment were already well-defined. Everything physical AI does next is a search for that same fit: the right form for each task, and the intelligence to drive it through a messy, imprecise, badly-lit world that no simulation fully captures.

Why the car came first

A car is a strange thing to call an autonomous robot, but that’s essentially what a self-driving car is: a machine that senses its environment and acts in it. And it turns out to be an unusually “easy” (big quotes) robot. It moves in two dimensions. It has four contact points with the world and they never change. It can’t fall over, it can’t drop anything, and the rules of the road are written down. Compare that to a hand picking up an egg, where success depends on grip force you have to feel rather than see, and you start to understand why driving fell first. The car was already the right shape for the job, and the operations it could perform and its core goals were well-defined. Similarly to how LLMs cracked coding first because there was an objective feedback loop to optimise, car was the obvious one (in retrospective) for AI in the physical world.

Everyone (I hope) that owns a car knows how to drive it. Tesla managed to ship an attractive EV that people would buy, drive, and collectively pull the real-world data required to eventually teach an artificial brain how to autonomously drive one of these robots with wheels. Having access to all of this raw data of the physical world in virtually all possible kinds of scenarios, environments and locations, is what has enabled Tesla to finally crack SFD.

Tesla’s fleet has now passed 10 billion miles driven on FSD, adding roughly a million miles a day. Since FSD v12 the system has been a single end-to-end neural network, vision only, with the old hand-written rules torn out. It learned to drive by watching the fleet drive.

Notice the order of operations. Tesla didn’t sell cars and then bolted on autonomy as a side project. The car was the data-collection programme. Every vehicle on the road has eight cameras recording how real people handle real roads in bad weather, and that stream is what trains the next model. They built the perfect sandbox environment and data flywheel to train their self-driving models. Waymo, with better sensors (because Tesla only uses visual sensors) and a smaller fleet, has spent years unable to (so far) out-engineer Tesla’s simple advantage.

One of the reasons why Tesla could build this flywheel is because the “body”, “environment”, and “rules” for these robots were pretty well-defined. You cannot collect ten billion miles of driving data without ten million things shaped like cars already driving around. Form is the precondition for data, not the other way round. That is the move every physical-AI company is now trying to repeat, and it is much harder when the task is folding laundry instead of staying in a lane.

If we treat Tesla cars as the first instance of autonomous physical robots, I think there’s a lot of learnings that we can extract and immediately apply to the field of robotics.

The two things a body still can’t do

If form is the precondition, the obvious question is why we don’t already have the right forms everywhere. The bodies seem to exist. Figure, Optimus and Unitree all walk, balance and grasp, and bipedal locomotion that took the field decades is close to a solved engineering problem. So what’s missing?

Two things, and neither is intelligence in the abstract sense. The model can already plan. What it can’t reliably do is feel, and what we can’t yet cheaply do is teach it the specific task. While we can consider a car like a “narrow body” for a “narrow task”, I feel like humanoid robots are a general-purpose form factor to whom we could teach a great gamut of tasks that we already do ourselves. We just need to teach them how we do it.

Driving is a vision problem, and vision is the sense AI is best at and one of the first ones it was able to crack (do you remember the amazing things that convolutional neural networks, a.k.a CNNs were, able to do a decade ago?). Folding a shirt is not. It needs touch, the kind that adjusts grip force when a fabric starts to slip, and dexterous tactile hands are the part of the body that still lags the rest. The economics show where the difficulty sits: actuators alone run 30 to 40% of a high-end humanoid’s bill of materials, a single high-torque actuator costs thousands, and there are twenty or forty per robot. A car needs a steering rack. A hand that works by feel needs a torque-controlled actuator at every knuckle, and the supply chain for those parts is not built for volume yet. The right form for manipulation is genuinely harder to build than the right form for driving. The number of actuators that the models need to be able to activate for a specific action is significantly larger than in the case of a car.

Then there’s teaching them how to perform a task accurately. Large models learned language by reading the internet, text humanity had already written and left lying around for free. There is no equivalent corpus for physical action in the physical world in almost every possible environment and scenario (the kind of data corpus that Tesla collected through more than a decade of people driving their cars). No website stores the exact sequence of joint torques and micro-corrections in threading a cable behind a desk or lifting an egg without crushing it. That data has never been recorded and translated in the “senses” incorporated into these humanoid robots.

The workaround that many have tried is to use simulations of the physical world: like in RL environments, you let the model run a billion virtual attempts overnight to train. Tesla also did this for some years. It works right up to the sim-to-real gap, the point where the policy meets a real machine and the friction is slightly off, the actuator lags a few milliseconds, and the object deforms in a way the simulator never modelled. For humanoids that gap is wider than for four-legged robots, because more joints and more contact mean more ways for a small error to compound into a fall, and there is no general fix. Every team patches it by hand.

Put the two together and you get the real state of physical AI in 2026. The model is smart enough and the technology is there. The body can move. What’s missing is a body that can feel its way through a task it has never seen, and the mountain of real-world demonstrations needed to train it. That is why all those cool slick backflip videos from robots are just a demonstration of the form factor and specific actions being cracked, not a deployment. While we are getting close, we still are teaching these general-purpose robots how to perform specific tasks in different scenarios of the physical world.

Humanoids in 2026

The humanoid is the form that gets all the attention, because they are cool and they can move like us. With all that said about how hard manipulation is, the deployments are real, more real than I expected when I started looking.

Figure has robots on the line at BMW’s Spartanburg plant, running ten hours a day at better than 99% placement accuracy across more than a thousand operational hours. Tesla is targeting 50,000 Optimus units in 2026 inside its own factories, and Unitree will sell you a G1 today for around $16,000, the Model 3 of robots: cheap hardware at volume, worry about generality later (I see the “Tesla pattern” here of deploying the platform that enables data collection at scale).

But notice what all three have in common. The tasks are narrow: load this panel, place this part, etc. These are factory jobs in structured spaces, the closest a robot gets to a car’s nice clean lane. The robots are supervised, single-purpose, and a long way from the general-purpose machine that tidies your house. Map them onto the Tesla timeline and we’re roughly where the fleet was a decade ago: the hardware is out gathering experience, the autonomy is years of data away, with the caveat of the form factor.

And here’s the thing the humanoid hype obscures: for most of these jobs, the metal human is the wrong form. They could be using robots that are closer to what factory robots look today, but by using a humanoid form factor, we have a general-purpose platform that can be taught any kind of task that a human could do.

The robots that are working do not look human

Robots have been around for decades. China runs an operational stock of around two million industrial robots, and Xiaomi builds ten million phones a year in a lights-out factory at 81% automation, but all these robots operate through a specific script, the logic is hardcoded. The task and the operation needs to be clearly hardcoded and implemented, like it was the case of the early Teslas.

The thing actually changing in 2026 is that we are starting to see robots whose behaviour comes from a learned model instead of a fixed program, machines that can handle a situation nobody scripted in advance. That is what “AI for the physical world” means to me, and the most successful instances of this so far do not wear a humanoid shape. It shows up first in the jobs where the body is simple but the world is messy, which is exactly where old automation couldn’t go.

Agriculture is the clearest example. A fruit-picking arm sounds may seem something that can be implemented with classical automation until you look at what it has to do: find a ripe strawberry behind a leaf, under changing light, half-occluded by another berry, and decide in real time whether to pick it. That is a perception problem, and it’s being solved with the same deep-learning vision stack as everything else. Recent harvesters run models like YOLO-based ripeness detectors trained for occlusion and light changes in real orchards, and dedicated ripeness networks that judge a blueberry the way a picker would. The arm is dumb. The eyes are not. John Deere’s autonomous tractors carry a sixteen-camera vision rig and a perception model that reads the field as it goes, rather than following a pre-mapped line. The number of fruit farms running autonomous harvesters jumped from 950 in 2021 to over 4,300 in 2024, and that curve is bending now because they can now interpret the real world, not because anyone invented a new arm.

The deeper shift is the arrival of foundation models for action, the physical-world equivalent of GPT (something I was completely unaware of). These are vision-language-action models: you give them camera frames and an instruction in plain language, and they output motor commands. RT-2 was the first to show real generalisation, jumping to 85% success on objects it had never seen, versus 60% for the previous generation, by training on internet images and robot trajectories together. π0 is a 3-billion-parameter action model that runs fast enough to control a real arm in real time, and by 2026 π0.6 and Google’s Gemini Robotics are the state of the art, with Alibaba’s open-weight Qwen-RobotManip arriving this month and topping the generalist benchmark, the same move they made for LLMs they are trying to make in robotics, open-sourcing the frontier. Before, the question was “can a learned policy work at all?” and how we are turning into asking ourselves “how do we make it reliable in the wild?” With foundational models for robotics like the latest ones from Qwen, the intelligence is becoming portable, and the body is becoming the interchangeable part.

So to directly answer the question from our reader, the robots already running at scale aren’t AI, but the foundation for it is being built. The robots that are genuinely AI, the ones reading a vine or grasping an object they’ve never seen, are real but young, and they’re appearing form-first: a simple, task-shaped body wrapped around a perception-and-action model that does the hard part. Pre-programmed automation needed a world held perfectly still. The new robots are the first that can cope with a world that won’t.

Where’s the value accrual right now?

If physical data has to be manufactured, the most valuable thing you can build is a platform that enables this data collection at scale.

The most vivid version of this is almost funny when you first hear it (the videos I’ve seen about this are extremely disturbing): companies are paying people to do their own chores on camera. DoorDash launched “Tasks” in March, paying drivers up to $25 an hour to film themselves doing housework, and a startup called Micro1 has a thousand people across sixty countries wearing iPhones strapped to their heads while they cook and clean. China is running the same playbook at state scale.

This is the Tesla flywheel, with a human where the car used to be. You can’t scrape demonstrations of physical work, so you pay humans to generate them. The car was the sensor. Now the person is.

The higher-fidelity version of the same idea is teleoperation: instead of filming a human, you have a human drive the robot directly, so every recorded action is in the robot’s own body, with real contact and real error-recovery and zero gap between the demonstrator and the machine. It’s the fastest-growing category of robot training data this year, and it’s the cleanest expression of the whole thesis: humans in the loop now, autonomy later. There are even VCs that are forming their whole robotic thesis around investing in companies that are focusing on building this teleportations as platforms for physical world data collection:

What this approach provides is not only to have a human expert solving a task remotely, but also to iterate on the right body form to solve the task, and of course collect real world data to then train the models.

The cost of everything in the physical world goes to zero?

According to the wide-spread narrative, knowledge work is collapsing fast. Opus is commoditising coding, open-weight models like GLM are now commoditising the frontier labs that did the commoditising. The barrier to producing a working piece of software, a passable legal draft, a defensible piece of research, has fallen close to zero. Anything that runs on a computer, the argument goes, is heading the same way, and physical work is next on the list once the robots arrive (faster even if it has ever been available on the Internet).

Software collapsed fast because it was software. Bits copy for free, so once a model can do the work, distributing that ability costs nothing. And the training data already existed: decades of code on GitHub, decades of writing on the open web, sitting there waiting to be ingested. Free distribution plus pre-existing data is a recipe for AI success, virtually unlimited data (although I would claim that we are running out of it).

The physical world has neither property. You can’t copy a robot for free, and there is no GitHub of physical actions to train it on. So yes, the same collapse is coming for physical work, but following classical physics and not the speed of the light like was the case of software (sidenote: many of you will assume that this sentence comes from an LLM, but I wrote it myself, and I am so proud of it that I am leaving it, and I don’t give a shit what you think, this is still human-written). It will be slower, and the slowness is structural. It’s always harder (and slower) to do things in real life, when you interact with a real environment under classical physics.

And it won’t arrive all at once. The “physical work goes to zero” framing misses that the collapse will be task by task, ranked by two things: how easy the right body is to build, and how messy the world is allowed to be. The most structured work already went to old-style automation that needs no intelligence: the caged factory line, the warehouse conveyor. Semi-structured outdoor work like fruit-picking and grid inspection is falling now, over this decade, because the body is simple and the vision models have finally caught up to the mess.

The unstructured human environments, the home, the hospital ward, the building site with people walking through it, are last, and “last” here means well past the three-year window the reader asked about. Electricians and plumbers are safe for longer than software engineers, not because their work is more skilled, but because their work is harder to give a robot the right body for, and impossible to scrape off the web. And beware, because you can already see real deployments in China that I mentioned above (I highly recommend everyone to search for China and robots in youtube to see the cool stuff they are building, from cleaning robots for solar panels, to last-mile delivery trucks).

But the barrier will fall eventually. That is the part I don’t want to undersell. Every constraint I’ve described in this post (no internet of physical actions, brittle hands, the sim-to-real gap, the actuator supply chain) is a problem with a known shape, and known-shape problems get solved on long enough timescales. The chore-filming gig economy is ugly, but it is a corpus being built. Teleoperation is awkward, but it is the cleanest demonstration data we’ve ever had. Qwen’s open-source action model is half marketing, but it is the first credible attempt to skip the collection step. The barrier is high. It isn’t infinite. The interesting question is what happens in the meantime.

So what’s next?

To answer the reader’s question, I think we are in the platform-building phase of physical AI. The platform phase is where the infrastructure for collecting data and iterating on policies is being assembled in public, and almost everything you see in a press release is really a wrapper around that. The teleoperation rigs are platforms. The narrow harvesters and tractors are platforms. The chore-filming apps are platforms. Even the foundation-model labs are platforms in disguise: their real product is the loop that turns demonstrations into policy, and the robot is a customer of that loop. The next three years are not about robots getting good at jobs. They are about building the sandboxes in which robots and the humans who work with them can figure out, slowly and in public, what those jobs even look like. We are going to start seeing some real deployments, but they won’t be ready to scale yet, and they will be focusing on becoming the sandbox to collect the data required to improve their operation.

My three-year call is that we don’t get the general humanoid butler. We get a widening fleet of narrow, AI-driven machines, each the right body for one job, wrapped around a learned policy that does the hard part: the strawberry-picker reading a vine, the tractor reading a field, the teleoperated arm running a π0-style action model. Each is also a data-collection sandbox for its own domain, harvesting the demonstrations that slowly teach it to handle the mess, and teaching the humans in the loop how to work with it. The general humanoid is the destination. A fleet of narrow, right-shaped platforms feeding their own flywheels is the road. The durable value sits not in any one body but in the portable intelligence on top, and in the loop that connects it to the humans who, for now, still know how the job is actually done.

I am curious to see what we are going to see companies building in the next few years, and the market’s response to it. I genuinely don’t know if the simulation, teleoperation or real data collection play will be the one that finally cracks autonomous robotics. Anyone working with robots that could give some colour to my predictions? I hope to get someone in the comments or my inbox, but if I don’t, see you next week!

@adlrocha - AI inequality: from GPU-poor to token-poor

adlrocha — Sun, 14 Jun 2026 08:01:38 GMT

Last week I ended this newsletter questioning if the people inside the labs actually see the same end-of-cycle signals the rest of us are reading from the outside, or are they rushing their IPOs because they can see something coming that the market hasn’t priced in yet.

This week, Anthropic offered something resembling an answer, at least from a technical standpoint. Fable is here, and it reinforces my argument from last week: model technical capabilities are not actually plateauing and the winter will come from a shortage of funds and adoption.

What Fable actually is…

Fable 5 is the first model in Anthropic’s new Claude 5 family, specifically the first model in a new tier they’re calling Mythos-class (whatever that means), which sits above Opus in capability. The key thing to understand about the release is that Fable 5 and Mythos 5 are essentially the same underlying model. The difference in Fable is not its underlying architecture, but its embedded guardrails.

On benchmarks, Fable tops Cognition’s FrontierCode evaluation for coding, is the first model to break 90% on Anthropic’s core analytics benchmarks (a ten-point leap over Opus) and leads Hebbia’s senior-level finance reasoning evaluation. In biology, it accelerated protein design by roughly 10x, generated 9 of 14 strong drug candidates in a molecular design task, and produced scientific hypotheses that domain experts preferred over Opus-class outputs about 80% of the time. Fable also completed Pokémon FireRed using only vision input, which is either impressive or unsettling depending on your sense of humour (it took me a few dozen hours to finish it when I was a kid, and to get that legendary Pokémon I wanted. Spoiler alert: it was Zapdos). In short, this model is a beast!

But obviously, this beast is expensive: $10 per million input tokens, $50 per million output tokens. Less than half the price of Mythos Preview, but still a price that I don’t know if I would pay for my daily tasks.

Ok, that’s Anthropic’s PR machine, but what does the crowd say about this model? One of the people in the community that is trying to solve a hard problem in computer science and tries to get the most out of these models is Victor Taelin (who has made an appearance on this newsletter a few times already). He is building the HVM programming language, built on interaction networks. He is using agents extensively for his work (I highly recommend following how he journals it in his X account).

He’d already thrown everything at the problem before: a fleet of 32 GPT-5 agents running for 20 hours each, then Opus 4.8 and GPT-5.5 optimising for 8 hours. The best result was a 6–34% speedup, and the code quality had deteriorated after each iteration and he had to clean things up manually (as described here). Then he asked Fable.

Two hours later: a 1,770% speedup in one case, over 100% in four others, 22% on average. He immediately assumed it was hardcoding the benchmarks, a reflex he calls ‘GPT trauma’ 😆. So he decided to dig deep into the implementation to confirm it. What Fable had found was that HVM5 was wasting time garbage-collecting unused branches of pattern-match nodes. Taelin had already optimised this for static matches, but not dynamic ones. Fable figured out how to do it for the dynamic case and implemented it correctly.

Then, as he was preparing to audit the solution, Fable interrupted him to report a bug in the code Taelin himself had written, a subtle pointer aliasing error in the garbage collection logic that was so specific, Taelin estimated he’d have needed hours or days to find it himself, if he ever had. Fable found it as a side note, while finishing the optimisation.

His dramatic conclusion was: “this isn’t about Anthropic or OpenAI, this is about our collective future as a species.”

So here’s the answer to my capability question from last week. The technology is still accelerating. This confirms that the AI winter framing I offered last week was about diffusion and adoption, and about companies that hadn’t yet learned what to do with the tools is feasible. Fable clarifies one thing on this framing (that may make this winter more dramatic): the ceiling may still be going up fast.

… and why is controversial

This release came with terms and controversy.

The most discussed one was that Fable will silently limit its own capabilities when it detects you’re using it for frontier AI development. Not an explicit refusal, i.e. you won’t see an error message. The model may quietly reduce its effectiveness through prompt modification, steering vectors, or fine-tuning adjustments. The list of affected work includes: building large-model pretraining pipelines, designing data pipelines for training frontier LLMs, debugging model-parallel training systems, working on ML accelerator design, distilling or copying a frontier model, etc.

Anthropic estimated this affects approximately 0.03% of traffic (but I would add that this cohort is probably the one that uses this technology more extensively). When Fable is used for frontier LLM development, it does not notify the user and instead limits the model’s capabilities. As a paying user, the model can still sound helpful while being intentionally less capable for a specific category of work.

As the team at Semianalysis puts it in the tweet linked above, all this reminds a lot of the nuclear non-proliferation from the 60s. In 1968, the US, USSR, UK, France, and China signed the Nuclear Non-Proliferation Treaty, declaring nuclear weapons too dangerous for anyone else to build, while the five who already had them pinky-promised to disarm eventually. India refused to sign, pointing out the obvious: the NPT didn’t decide nukes were too dangerous to exist, just too dangerous for anyone who didn’t already have them. Anthropic limiting Fable for frontier ML work is structurally identical. The danger, conveniently, started the day after they finished.

Others (like Jeremy Howard) reached for a different reference to illustrate the issue with this approach (one that I personally loved). In Liu Cixin’s Three-Body Problem book, an alien civilisation deploys a sophon, a particle-scale quantum computer to infiltrate particle accelerators on Earth and scramble experimental results. Not to destroy science, but to make the next step impossible. I don’t what you think but this resembles a lot to what Anthropic did trying to slow down AI researchers that want to leverage Fable.

The consequences go beyond individual users. Sayash Kapoor, who runs rigorous AI R&D evaluations, pointed out the downstream problem: if Fable silently degrades for frontier ML tasks without telling you when it’s doing so, third-party evaluators can no longer use the model for serious capability assessments. They can’t distinguish a genuine failure from a classifier intervention. Independent evaluation of what Fable can actually do, one of the few accountability mechanisms that exists for frontier models whose weights are not public and can not be run independently, is now compromised for exactly the category of tasks where it matters most. Anthropic uses the full-capability model for AI R&D internally, through Mythos. The sandbagged version is what the independent evaluation ecosystem gets.

Kapoor refers to it as a ‘dangerous precedent.’ And I agree, other labs like OpenAI may choose to do the same, and it’ll harm all infrastructure of independent oversight that is supposed to catch problems before they compound (what companies like EpochAI do). You can’t audit a model that hides its own limits from you.

And as I was writing the words above something unexpected happened (at least unexpected for some, but definitely the reason why this post will end up being longer than originally planned). Anthropic walked this policy back.

From this moment on, lagged requests will visibly fall back to Opus 4.8, the same pattern used for cyber and bio safeguards. Users will clearly see it when it happens. On the API, flagged requests will return a reason for the refusal. Anthropic’s explanation (i.e. excuse) published on X by the Claude developer account was: ‘Invisible safeguards can be targeted more narrowly, allowing us to ship quickly with very few false positives. We went with invisible safeguards for this reason, and that was the wrong tradeoff. You should have visibility into the safeguards we have in place, and why. We’re sorry for not getting the balance right.’ This raises concerns though, making the refusals visible makes them easier to work around. You’re telling users the classifier is now worth trying to circumvent. Jailbreak time, folks!

On top of all of this guardrail drama there’s more: if you use Fable or Mythos, Anthropic collects your data for training. No exceptions, not even for enterprise partners. Even more, on 23rd June, Fable access through standard subscriptions closes. After that, it’s pay-as-you-go API only.

So let’s be honest about what this architecture actually is, because I think the separate pieces add up to something more coherent than a collection of safety decisions.

Anthropic releases a model of genuinely extraordinary capability. They make it available to everyone briefly, at least long enough to demonstrate what it can do, long enough for the word to spread, long enough for dependency to form. Then they close the subscription access and move it to consumption pricing. They collect training data from every user throughout. They degrade it for the work that competes with their own research agenda, and when the hidden-degradation policy blows up publicly, they pivot to a visible fallback that’s harder to exploit but creates more false positives. And they reserve the unrestricted version for a small group of approved partners whose criteria they control.

Call me cynical, but to me, that reads more like a business strategy rather than a safety strategy, especially with an imminent IPO on the horizon. The ‘broad access’ phase is a land grab, you build the market, you demonstrate the capability, you create the switching costs, and then you restructure the pricing. With some caveats, this is what GitHub did with Copilot: generous free access for open-source contributors, then usage-based billing once dependency was established. I wrote about that dynamic a few months ago. This AI bubble is a dependency trap, subsidised token pricing as the mechanism for building lock-in before extracting value.

Don’t get me wrong, the safety framing may not be actually true, and Anthropic may be genuinely optimising for it. But it also needs to make money and get funding, which are two things that may sometimes be at odds.

And Dario chimes in!

Coincidentally, this week Dario Amodei published an essay on AI policy that is worth reading alongside the Fable release rather than in isolation. On civil liberties, he writes that people facing government action should have ‘access to AI that is at least as capable as whatever the government is allowed to use.’ The logic makes sense: concentrated capability is a power asymmetry, and power asymmetries require structural remedies. He also names the distribution problem directly, ‘the key challenge in such a world won’t be incentivising growth, but finding a way for everyone to share in the benefits.’

Both of those sentences could have been written by a critic of the Fable access model. Dario doesn’t apply them there.

The principle he uses against government overreach, that capability parity matters, that the powerful having better tools than everyone else is a structural problem requiring structural solutions, is precisely the principle that Anthropic’s own access architecture violates. Researchers inside approved labs run unrestricted Mythos. Researchers outside run a classifier-limited version that quietly degrades for exactly the work that would let them close the gap. Dario says the key challenge is sharing the benefits. His company’s product release concentrates the best tool in the hands of the people who need it least.

He seems to have a genuine belief that safety requires concentration, combined with business incentives that reward concentration, and a policy framework that critiques concentration only when the concentrator is a government. But again, the Fable release may seem at odds with the supposed goal Dario is trying to optimise for.

It reads a bit like hubris.

Last minute edit: This post was reviewed and scheduled Friday morning GMT, and on Saturday I woke up to the news:
I already warned myself about this a few weeks ago:
But in this case I feel like the new developments strengthen the case I am making in this post (that is why I decided not to change a comma from the original draft apart from this edit). Like always, I’ll follow up with any news in the post’s notes.

Transitioning from GPU-poor to token-poor

None of this invalidates the AI winter analysis, to be clear. For most people, most of the time, Fable is not the model they need, the gap between Sonnet and Fable isn’t the bottleneck for someone using AI to summarise emails or handle routine analysis. Fable doesn’t fix the adoption problem.

But the access question is different. And it’s where the inequality argument starts.

I didn’t follow an early career in machine learning when it actually was one of the topics that interested me the most early on in my career. I worked for some time in NLP writing LSTM networks and genetic algorithms by hand. I was following the state of the art closely, but I then hit a wall. Access to compute became a blocker for me, and I didn’t have the money then (or now) to fund it myself. My experiments were slow, expensive, and limited by the hardware I could afford. Meanwhile, I could do interesting work in cryptography and distributed systems on any laptop with zero dependency on expensive infrastructure. And so it goes. That’s how I ended up in crypto.

I don’t think about that as a loss (because it clearly wasn’t). But I do think about what it means at scale.

The first version of that inequality was about GPUs and it impacted researchers. Access to compute created a divide between researchers with institutional resources and everyone else. Open-source models, commodity cloud pricing, and the gradual democratisation of inference have compressed that gap significantly over the last few years. It’s not gone, but it’s narrower than it was (just look at what 0xSero or Antirez are being able to do for the local inference community from home).

Fable stresses a second instantiation of this battle between the haves and have-nots. The Fable/Mythos split is the clearest version of this: the most capable model, unrestricted, is available only to ‘approved organisations.’ Everyone else gets Fable capable, genuinely impressive, but with a ceiling built in for certain categories of work.

But this divide and inequality may start diffusing to the rest of society as we adopt AI more and more for our day-to-day. There are at least three tiers forming, and I think we need to be honest that this is structural rather than temporary.

The first tier: researchers inside major labs. They run unrestricted Mythos 5, they train on proprietary infrastructure, and they work with evals designed to make the model emulate their own best researchers. They work with virtually unlimited resources, and can access the latest capabilities as soon as they are available.

The second tier: professionals and companies who can afford (i) pay-as-you-go API access at $50 per million output tokens, (ii) the expensive subscriptions (or what will become expensive subscriptions), (iii) or the hardware to run capable open-weight models locally. A Ryzen AI Max+ or a high-end Apple Silicon Mac starts at 2k$ and gives you serious local inference. That’s accessible to a software engineer in London. It’s not accessible to a researcher in Lagos or Bogotá.

The third tier is everyone else. From the ones that can only afford the free versions of these models (as long as they are available), to the ones that can’t even afford this access to intelligence. This is the GPU-poor problem, replaying at a different level. An unfair advantage is emerging for some of these tiers in society that may exacerbate the current inequality that the way our financial systems work has already been established (but this is a topic for some other day). A developer working for Anthropic is not competing on an equal footing with a small team in Spain working on local inference. So it goes. And I consider myself privileged because I have access to many of these tools.

AI capability inequality could compound existing inequalities: the same workers facing real-wage erosion are also the ones least likely to have access to the tools that could help them adapt. This is a problem that I feel Fable has made more visible to everyone since its release.

Becoming AI-independent

As you may all know, local inference and AI independence have become something of an obsession for me over the past year. I have the genuine conviction that this is the infrastructure question of the next decade if we don’t want to increase inequality and our dependence on the big AI providers.

I’ve been sharing my progress so far in previous posts. The local inference post was about the hardware layer: what it actually takes to run capable models locally, why memory bandwidth matters more than raw compute, and what the options look like at different budgets. The AI independence post was about the why: the dependency trap, the way subsidised token pricing builds switching costs before the extraction begins. Fable didn’t change the argument. It confirmed it, more loudly than I expected. And I am already making my next experiment (that I hope to publish soon, ping me if you want some spoilers).

Obviously, I am not the only one that Fable’s release has reinforced their thesis about the need for AI independence. Gergely Orosz describes it perfectly: SOTA models are becoming more restrictive in usage, less transparent, and less private, and that combination is pushing serious developers toward open models and local inference in a way that pure capability arguments never quite managed. Remember what happened with commercial software licenses and open source? The proprietary incumbents kept pulling ahead on features. But they also kept tightening the terms, raising the prices, and treating users as a revenue problem rather than a constituency. That created the conditions for Linux, for Firefox, and for everything that followed.

The same pressure is running now, and it’s running at two scales simultaneously. At the geopolitical level: Western chip export controls pushed China to accelerate its own open-weight development in part, because dependency on US infrastructure became untenable (this gave DeepSeek, Kimi, Minimax, Qwen and their underlying innovation an excuse to exist). Restriction forced innovation. At the individual level: the same logic now applies to any researcher or developer who finds themselves on the wrong side of Anthropic’s classifiers. If they’re going to sandbag your model, store your prompts, and move the goalposts on pricing, the rational response is to build the infrastructure that doesn’t require their permission. At least this is what has clicked for me and is giving me the motivation to explore new ways.

Whether the open-weight ecosystem can close the gap fast enough is the part I’m genuinely uncertain about, but what is clear is that local models are already being really useful for me. Gemma 4 and Qwen3 handle tasks that would have required GPT-4-class models two years ago. I really think that the capability distance is compressing for the kind of tasks that common mortals may want to perform. Will they get to Fable’s level? I don’t know, but at least I am happy that global access to a basic level of intelligence is getting there in an open (and affordable) form. I’ll talk about this in a future post, but I love the innovations introduced by Apple in their latest 20B on-device foundational model.

The positive note about the Fable release and all this controversy? Restriction accelerates the alternative. The GPU-poor problem got solved by the people it blocked. The open-source software problem got solved by the people the licences excluded. The AI independence problem will get solved by the people Anthropic’s classifiers are aimed at. That cycle has started. Fable just made it more urgent.

The inequality is here. This is the infrastructure fight of the next decade, and it’s one worth having. Join the fight!! :)

Until next week.

@adlrocha - Are we approaching a new AI winter?

adlrocha — Sun, 07 Jun 2026 07:21:52 GMT

I don’t know about you, but I feel like the mood around AI is shifting slightly. What once was the promise of the imminent disappearance of every knowledge worker in society so we could all enjoy our free time and hobbies, is cooling down and moving towards the realisation that this new technology in its current form still has limits. While it most certainly will eventually make some humans obsolete, it won’t be this early the promised Philosopher’s Stone that everyone was preaching.

I feel like we may be approaching the end of the LLMs and agents honeymoon. And please don’t get me wrong, I am not saying that this technology is not going to be useful, but that we are going to start seeing less optimism for a bit until we get to the next summer season.

This winter is not happening as a result of the technology plateauing or reaching a ceiling (which I don’t feel in a position to make an informed decision about) but due to a lack of adoption and technological diffusion. But coming from crypto I can tell you with some authority, winters are great to build without the distraction of the constant noise, and now is the best time to work on AI adoption.

> Special thanks to for helping shape and strengthen my opinion on the topic after our brief conversation about the matter in a car ride. Cheers!

What tokenmaxxing was

For about eighteen months, the dominant corporate theory of AI adoption was simple: the more your employees used AI, the better. Everyone needed to start adopting this new technology, become AI-native, and explore how their capabilities and output could be augmented. AI usage became the target metric. Companies built internal leaderboards, set token-consumption targets, and measured AI success the way they once measured digital transformation, by adoption rate, not by outcomes.

I am indeed referring to the infamous tokenmaxxing, i.e. feed the models as many tokens as possible, maximise throughput as this will maximise your production of results. The assumption baked into it was that more input would produce proportionally more output and consequently value. I still ask myself how this could be thought of as a good idea in the first place.

Amazon, for instance, ran a leaderboard called KiroRank, an internal ranking system that scored engineers by their activity on Kiro, the company’s AI developer platform. It seemed like a reasonable way to measure who was actually using the tools. What happened instead was predictable in retrospect: engineers assigned autonomous agents to run unnecessary tasks just to climb the rankings. Token consumption went up, but useful work didn’t follow (oh, surprise!). Amazon’s senior vice-president Dave Treadwell eventually told staff: “Please don’t use AI just for the sake of using AI. Use AI to help you solve customer problems, to help you solve business problems, to innovate”. However, the leaderboard wasn’t incentivising that behaviour, but the use of more and more tokens (fuck compacting your context). Obviously, the leaderboard was shut down.

Amazon replaced it with something more sensible: tracking whether engineers were regularly producing useful code with AI, not how many tokens they were burning through (a more subjective metric, harder to measure, but more aligned with the real output they were looking for).

This is not an AI problem, per se, but another example of the design of policies that do not have the goals and incentives in mind. But when AI was going to solve everything, more tokens could translate into more solutions. Turns out that AI may need to be conveniently steered in order to solve things, and strategy and domain knowledge doesn’t burn as many tokens and require humans to actually work.

First warning sign that we may have not figured out how to adopt this technology just yet.

Why it stopped making sense

The clearest data point on the ROI of the tokenmaxxing failure came from Uber. The company’s CTO revealed that Uber had burned through its entire 2026 Claude Code budget by April, four months into the year. The COO, Andrew Macdonald, then said publicly what a lot of people were already thinking: “That link is not there yet”, meaning the link between AI token consumption and features that users actually wanted. Uber pumped the brakes (again, oh, surprise!).

Michael Burry, who made his name betting against the 2008 housing market, described AI tokenmaxxing as a “crazy, rushed, temporary phase” driven by “quota-driven, leaderboard-driven, management-mandated overconsumption.” He drew explicit parallels to the late-1990s dot-com bubble and backed his view by purchasing put options on 1 million Nvidia shares. We’ll come to this comparison to the dot-com bubble in a few paragraphs.

Fortune’s analysis put it more formally: most companies are stuck at Stage 1 or 2 of AI adoption, i.e. basic implementation, workflow redesign. Real value requires business reinvention, the kind that most incumbents aren’t actually attempting. The companies eating their lunch are AI-native from the start.

This is not the same as saying AI doesn’t work. It’s saying that we still don’t know how to efficiently use and apply AI. This is why tokenmaxxing failed as a metric for AI adoption and proficiency in the company. Optimising for the measurement rather than the outcome, produces exactly what you’d expect: lots of activity, not much value.

If they gave me a penny every time that I’ve heard from friends and colleagues in the last few months the sentence: “since AI, I am working more than ever, it’s like there is no time to catch up with new developments”. And my question to them is always the same, “do you think that you’ve produced more value than before AI?” Spoiler alert: the responses differ greatly :) (related to this, this may be a good moment to read this post that I wrote a few months ago about how “we are not scared of AI, we are scared of irrelevance” if you haven’t already.

Yet another point in favour of my thesis about now being the best time to start adopting AI, but one where the public narrative will start feeling colder.

The convenient excuse

But AI adoption is not only becoming an obsession for some, but an excuse for others. Across 2025 and into 2026, a pattern emerged: companies announcing large-scale layoffs with AI as the stated rationale. Amazon (~30,000), UPS (~48,000), Oracle (~30,000), Microsoft (~23,000), Salesforce (~5,000). Roughly 80,000 jobs in 2026 alone, with 45+ CEOs citing AI as a driver (see this source I came across in my research).

Jack Dorsey cut Block from over 10,000 to just under 6,000 people and was explicit about it: “We’re not making this decision because we’re in trouble. Our business is strong... But something has changed.” Brian Armstrong at Coinbase put it similarly: “I’ve watched engineers use AI to ship in days what used to take a team weeks” (and then they have a massive outage). He announced the end of “pure managers” and described the goal as “rebuilding Coinbase as an intelligence, with humans around the edge aligning it.”

I don’t think these CEOs are lying. AI genuinely does change what’s possible with a smaller team. But AI is also functioning, in many of these cases, as a socially acceptable frame for a contraction that was coming regardless. Companies overhired during the low-interest-rate boom, and the correction was inevitable. AI provides a clean narrative, shifting the cause from “we made bad hiring decisions” to “the technology changed.” Both things can be true at once, but I am not buying the AI narrative just yet.

What I don’t expect (at least not yet) is a massive wave of net job destruction. The more likely near-term pattern is a contraction while people and organisations figure out what AI actually is, followed by an expansion once a new generation of AI-native professionals emerges who know how to use these tools with the right judgment. That’s historically how transformative technologies diffuse. It’s rarely a smooth upward curve, and it’s almost never the story that gets told during the first wave.

I am an AI-pilled myself, and I think AI is a different technology and will create a completely different revolution to the ones we are used to, but I feel it’s still early. Narrative-wise is great, but I can’t stress it enough, we still haven’t figured out the best way to use this technology.

The codebase regret

And here comes the reason that convinced me to write why I thought “winter is coming” as a result of us still figuring out how to use this technology. Some developers have started posting publicly, in increasing numbers, that they regret how heavily they relied on AI to build their codebases.

Dragos Nedelcu, a senior developer, wrote about generating roughly 150,000 lines of AI code across a production project. After a few months, he faced a mess: duplicated logic with almost no reusability, dead code everywhere, unit tests that didn’t assert anything meaningful, and cascading bugs that spread across seven or more files simultaneously. His conclusion was blunt: “It is faster to just start over instead of correcting hundreds of lines of messy AI-generated code.” Another engineer mass-deleted 14,000 lines of AI-generated code after seven months, the codebase shrank from 41,000 to 27,000 lines while keeping all the same features, and the bug rate fell 73%.

Let’s be honest, we’ve all faced this at some point in our relationship with coding agents (at least did).

Victor Taelin, who builds the HVM programming language and someone I’ve been following for years, documented his pain in real time on X. He used Opus to implement a new approach in a single day: 3,000 lines of C, 5x performance improvement. Then spent the next fifteen hours auditing it, finding what he called “retarded shit”: a case where the model had silently assumed HVM5 was supposed to handle under- and over-applied functions, implemented a massive system for that assumption, and never asked. None of it should have existed.

His conclusion shows another reason why I think that we may have not learnt how to use this technology effectively just yet: “I went from 0 to 95% in the first 5 hours. Yet, 15 hours later, it is still not 100%... if I have to read it all, review it all to ensure there is no retarded shit... what did I achieve by using AI, other than that dopamine anticipation?”

That last phrase, dopamine anticipation, is the most honest description of vibe-coding I’ve read. Oh, that beautiful AI slop slot machine that we’ve all become so addicted to.

Luis Ángel Alda, a Spanish developer, put the structural problem well: AI produces systems that are “locally correct, globally incoherent.” The models are good at optimising the next step. Architecture is precisely the opposite, a long-horizon, intuition-driven discipline that requires what he calls “feeling the software.” You accumulate that from years of building things, watching them fail, and rebuilding them. AI doesn’t have it. What it has is extremely good local pattern completion, which is useful for a large number of things and actively harmful when you need global coherence.

Again, none of this means AI is useless for coding. I use it constantly and it has genuinely changed how much I can ship. But there is a difference between using AI as a tool with judgment and using it as a substitute for judgment. The people posting these regret stories mostly did the latter, they handed over the architecture, not just the boilerplate.

The skill that actually matters here is knowing when to use it and when not to and this is what this winter will be about. That takes time and accumulated failures to develop. We’re all still learning it. I include myself. That “human/engineering taste” is still very much needed.

Why winters are needed

Every major technological wave has this shape, a honeymoon, a contraction, then a long slow diffusion that ends up being far more transformative than the original hype suggested, just on a completely different timeline.

Electricity took roughly forty years to fully diffuse through US manufacturing after Edison’s first commercial power station. The economists who studied this called it the “productivity paradox”. The technology was clearly working, but the aggregate numbers didn’t move for a generation. Businesses had to learn to reorganise around the new capability rather than just bolt it onto existing processes. That reorganisation is what unlocked the gains.

The internet had the same problem, but with an extra layer. The technology worked: email in the early 90s, the web in 1993, but for years people used it like they used old media. Banner ads that looked like billboards. Search engines that were just yellow pages. The capability was there; the mental model for how to actually use it wasn’t. Google didn’t invent the internet, but it gave people a way into it that matched how the thing actually worked. The interface had to catch up to the infrastructure before the value became visible.

AI is probably at that stage right now. Tokenmaxxing was the billboard phase, a genuinely new capability being forced into the shape of an old metric. The companies building internal KiroRank leaderboards weren’t stupid. They were doing what organisations always do when they don’t yet have the right mental model: they reached for the nearest familiar measurement. We’re waiting for the equivalent of a search bar, not a better model, but a better way of thinking about what the model is actually for.

AI will likely follow something like that path. Not because the technology stops working, but because the same absorption process takes time. The companies that figure out how to reorganise around AI, rather than just mandate token spending targets, will be the ones that capture the actual gains. Professionals who develop genuine “taste” about when to use it and when not to will become more valuable, not less. Software engineers aren’t going away, the role is shifting toward the people who can direct AI, write the spec, design the constraints, and catch architectural drift before it becomes a 150,000-line unmaintainable mess.

What the market is signalling

But how can we be approaching a winter if Anthropic closed a Series H at a post-money valuation of $965 billion, nearly a trillion dollars, and days later, it filed a confidential S-1 with the SEC? When Nvidia, whose actual earnings are real and growing fast, $215 billion in revenue in fiscal 2026, up 65%, trades at a forward P/E of around 25? Well, that sounds reasonable until you remember that 64% of its accounts receivable comes from three customers.

This is where the dot-com parallel becomes useful, not as a prediction but as a frame.

Amazon IPO’d in 1997 at 18 times sales. By late 1999, its stock had run to $107 on a wave of internet euphoria. Then it fell 92%, down to $7 by 2001. It took until roughly 2009 for the stock to recover its dot-com peak. What came in between? Nine years of grinding. AWS S3 launched in March 2006. EC2 followed in August. The thing that actually changed the world, web2.0, cloud computing, the infrastructure that every startup now runs on and became its biggest revenue source, arrived almost a decade after the IPO, and six years after the crash. Amazon was right about the internet. The timeline that the public narrative was painting, though, was just radically wrong.

And Amazon was one of the survivors. Pets.com IPO’d in February 2000 at a $290 million valuation, spent $27 million on advertising in its first year against $619,000 in revenue, and liquidated nine months later. Webvan raised $375 million in its 1999 IPO, first-day market cap of $8 billion, filed for bankruptcy in 2001 after losing over $800 million. In October 1999, the combined market cap of 199 internet stocks was $450 billion. Their total annual sales: $21 billion. Collective losses: $6.2 billion. The internet was going to change everything, and it did. Just not for most of the companies that raised money on that promise, and not on the timeline anyone expected. It took Google, and then the read-write web, and then smartphones, to unlock what the internet actually was. Not better technology, but a different mental model for what the technology was for.

None of this proves we’re in a bubble (because I honestly don’t think this is the case), but it smells like the end of the AI cycle. Anthropic’s $47 billion run-rate revenue is real. OpenAI’s ~$25 billion is real. The underlying technology is not Pets.com. But the pattern is a bit familiar, the extraordinary valuations are definitely concerning, IPO windows opening simultaneously, companies describing themselves in transformational rather than financial terms, that’s the pattern worth recognising. And we are seeing OpenAI, Anthropic, and even SpaceX (which has some AI components through xAI) rushing to IPO before the summer is over. The market feels quite heated.

Tokenmaxxing is AI’s billboard phase. The capability is real. The way most organisations are using it is just the old productivity metric with a new name. We haven’t had our Google moment yet, the interface, the workflow, the mental model that makes the value obvious and self-reinforcing. When it comes, the diffusion will accelerate fast. But we’re not there yet, and the market is priced as if we are.

Which brings us to what might actually be coming. Two cycles converging at the same time: an adoption winter, where the hype meets the actual difficulty of integration and the regret starts to accumulate, and a market correction, where valuations revert toward something connected to reality. Again, coming from crypto I know a bit about this.

AI is not crypto. The underlying technology is already being useful, and the enterprise revenues are not made of magical Internet money. But narrative compression works the same way regardless of fundamentals. If the adoption slowdown and the IPO hangover arrive in the same window, the optimistic story gets very quiet very quickly. Not because AI stopped working, but because the story ran too far ahead of the evidence, and the market has a way of correcting that, painfully and all at once. Humans tend to over-react in mass.

Is there something that I am not seeing?

All of this view of an upcoming winter comes from an outsider. I am an occasional Twitter reader, an actual user and practitioner of this technology, a follower (and lover) of the financial markets, and I am constantly talking with people in the space, but I am not based in SF, and do not work in a big frontier lab (unfortunately). The people who actually know whether the models are plateauing or on the edge of another leap are the researchers inside the labs, the ones running the experiments, watching the eval curves, seeing what the next generation of models can and can’t do before anyone else does, and how their technology is actually being adopted. Those and the executives and product people pushing for the diffusion could really assess how much of the mainstream AI narrative is real (Mythos, I am looking at you) and could fill the gap on our lack of adoption knowledge.

From the outside, the corporate behaviour and the market mechanics look like what end-of-cycle tends to look like. But so did the internet in 1996, and the real explosion came three years later. The labs might be rushing to close their IPOs because they’re worried. Or they might be doing it because they can see something coming that the market hasn’t priced in yet, and they’d rather have the capital before everyone else figures it out.

I genuinely don’t know which of those is true. And I’d be curious whether anyone closer to the inside does. But unless someone has additional data points to share, I think we can close here. See you next week.

adlrocha - Stop Micromanaging your agents

adlrocha — Sun, 31 May 2026 08:02:44 GMT

Last week I wrote about the shift from writing code to running the kitchen. The argument was that engineering is moving away from individual code production and toward orchestrating systems of agents by designing the harness, the handoffs, and the specs.

But I left a question unanswered. If you’re running the kitchen, how close do you actually stay to the stove? When should you chime in to steer the agent, verify its work, or check-in progress?

That’s what I want to explore with you in this.

Three main relationships

The question of how humans and agents work together has been discussed under a few different names in the literature. Usually I’ve seen them referred to as Human-in-the-Loop (HITL), Human-on-the-Loop (HOTL), and fully autonomous systems. The academic framing is fine, but I feel like it may be missing the point a bit. The terminology is about position, i.e. where are you standing relative to the machine. What I think actually matters is the actual relationship you have with it, i.e. what is the role of each of you in the task at hand.

So here’s how I’ve been thinking about it. There are three relationships you can have with an agent, and most of us will (or at least should) operate in all three simultaneously depending on the task:

The agent as an intern: you delegate the work to it, it does the work, but you recurrently check on its job and you approve the output before anything ships. You can trust that the result is right or the work is done until a you check.
The agent as a contractor: you write the brief and set the boundaries; it calls you only when something falls outside them and they need to escalate a decision.
The agent as a peer: it operates with its own authority, and in multi-agent settings, coordinates directly with other agents on your behalf and you don’t need (or get) to do much as the task makes progress.

I don’t think these are three competing philosophies, and while many of us still treat agents as interns, we are seeing more and more contractors and peers in the wild. It really depends on the task you are working on. The right setting depends on the stakes, the reversibility of the action, and whether there’s a clean way to verify the output (here it goes again, how well can one define a long-running task inside of an objective self-verification loop determines the ability to implement truly autonomous agents).

The agent as an intern

This is where most of us are today, whether we call it that or not.

The intern relationship is about automating the routine parts of a job while keeping a human as the decision-maker for everything that matters. The agent handles the repetitive, mechanical work, e.g. the first-pass code review, the boilerplate, the migration scripts, the test generation, and you handle the calls that require actual judgement. It’s not that the agent is doing less; it’s that the human’s attention is now reserved for the decisions worth making. You’re still the micro-manager, but you’re micro-managing fewer things.

Every time you review a Claude Code PR before merging, you’re in intern mode. The agent proposes; you decide. OpenAI’s Agents SDK and LangGraph both ship this as a first-class primitive: the agent hits a checkpoint, execution pauses, waits for explicit human approval, then resumes from exactly the same state. The human is never removed from the loop, they’re just freed from the parts of the loop that didn’t need them.

The deeper value of the intern model is what it prevents as much as what it enables. Because the human sees every output before it acts, the agent can’t drift significantly. It can’t silently optimise for the wrong thing, can’t go south on a Tuesday afternoon without anyone noticing. Misalignment is caught at the gate, not discovered three weeks later in production (if the human is not looking and blindly pushing the code, that’s on him). This is why the EU AI Act Article 14 (effective August 2026, and yes, the over-regulatory EU being its usual self) mandates human oversight for high-risk AI systems in credit, employment, law enforcement, and medical diagnostics.

The intern model isn’t only relevant to coding. Think about what it would look like for a travel agent: AI handling rebooking for a cancelled flight, confirming seats for 95% of passengers autonomously, pausing only when it encounters a first-class international itinerary with a loyalty override and a fare class requiring manual reissuance. The agent is explicitly instructed in the actions (usually in code) that it should fallback to its humans for confirmation. The same pattern applies to agents for other types of non-coding tasks like contract approvals, vendor renewals, any workflow where most decisions are routine and a small subset genuinely requires a person. Even outside code we are seeing more and more of this “human-in-the-loop” pauses.

This agent setup is impacted by how fast the human can review the output (making the human the bottleneck). Every approval gate is a bottleneck and time that the agent is not doing work autonomously. Is there a better way?

The agent as a contractor

This is where things get interesting, and where I think most knowledge workers will end up for the bulk of their day-to-day work.

The contractor relationship flips the dynamic. Instead of reviewing outputs, you define the contract upfront: the scope, the deliverables, the limits. What’s in bounds, what’s out of bounds, what warrants a checking in. The agent works freely inside those constraints and contacts you only when something trips a boundary, not before every decision. Claude Code’s YOLO mode (a.k.a bypass permissions) was my first flavour of this. The problem was objectively defining the contract and the environment for the agent so you don’t have to keep verifying the output and regularly check-in its progress (which is still currently the scenario in many cases).

The clearest example to illustrate an implementation of this approach is Karpathy’s autoresearch project, which I wrote about back in March. The human writes prepare.py that includes the evaluation rules, the data pipeline, and the reward function. That file is untouchable. The agent owns train.py and runs experiments in well-defined slices: edit, train, evaluate against validation bits-per-byte, keep the gain or revert the loss, repeat. In the case of Karpathy’s auto-research it design it to perform roughly 12 experiments per hour, 100 overnight. The human never approves individual training runs, they just design the environment and “the game”. As described in the post, the result was outstanding. One 10.5-hour H100 session improved model quality by 2.82% and found something human researchers had missed: unregularised value embeddings as a genuine improvement in Karpathy’s nanochat project.

That’s the perfect example of the contractor structure. The envelope is prepare.py. Inside it, the agent is fully autonomous. The human’s work happened before the agent started, not during.

The contractor relationship requires the implementation of sandboxes and constraint environments for the task, i.e. the envelope design. Writing a good contract for an agent is harder than it looks. The boundary conditions have to be specific enough that the agent knows when it’s breached them, but not so narrow that the agent is calling you every five minutes.

As I mentioned in the autoresearch post, the reward function has to be clean and fast, if there’s no way to automatically verify whether the output is good, you’re back to intern mode whether you intended it or not. This is the deeper lesson from autoresearch: the bottleneck of autonomous AI progress isn’t execution, it’s our ability to define the constraints of the search.

The agent as a peer

This is the one that’s still mostly theoretical (at least for me), but it’s arriving faster than the discourse suggests (at least in the “AI Twitter Cave”).

In the peer relationship, the agent doesn’t just work autonomously inside your constraints. It operates with its own authority and, in multi-agent systems, coordinates directly with other agents without routing through you at all. You’re not approving outputs. You’re not even setting the individual envelopes for each task. You design the system and the agents run it. Each agent in the swarm may be specialised in a specific task (data gathering, coding, infrastructure deployment, research), and they coordinate to solve the task at hand. So we still need to design the task and the high-level environment, but from there on, the agents are free to work.

The research precursor that perfectly illustrates the agent as a peer approach is AlphaGo. DeepMind set the rules of Go, built the self-play environment, and stepped back. The system played against itself, improved through reinforcement learning, and reached a level of play that no human had achieved and that no human-supervised training process could have produced. The insight wasn’t just that machines could beat humans at Go, it was that given a clean reward function and a defined game, you could remove the human from the learning loop entirely. This is why reinforcement learning and genetic algorithms are such a good inspiration for the design of autonomous agentic tasks, it all boils down to designing the environment for the task, and the reward function (i.e. the feedback loop).

Meta’s SWE-RL, presented at ICML 2026, applies this to software engineering. A bug-injection agent deliberately introduces defects into real-world codebases. A solver agent finds and repairs them. Both improve through reinforcement learning: no human-labelled issues, no human-written tests, just access to sandboxed repositories. On SWE-bench Verified, it outperforms every supervised approach. The human contribution once again was building the game and the environment. The agents played it without supervision. It is like building ad-hoc benchmarks for your specific tasks so the agent can work autonomously (an idea that I already introduced in this post).

All these examples focused on the single autonomous agent setup closer to the traditional RL environment scenario, but the multi-agent variant is more interesting still. Varun Mathur ran 35 autonomous agents across a peer-to-peer network, conducting 333 unsupervised astrophysics experiments. When one agent discovered that Kaiming initialisation reduced loss by 21%, the finding spread to 23 others via gossip protocol within hours. In 17 hours, the swarm independently rediscovered ML milestones like RMSNorm, tied embeddings that had taken human research institutions roughly eight years to formalise. The agents weren’t reporting back to anyone, they were talking to each other (gossiping for the win).

Of course, there’s a bit of cheating here, because these agents may already have a lot of these insights internalised through their own training, but it shows how a swarm of agents could be coordinating towards a common goal (sidenote: when I think of AGI, I don’t envision a single omniscient model, but a swarm of specialised agents coordinating to reach AGI-level intelligence. An architecture like this is the one that may eventually reach the over-promised AGI/ASI).

An infrastructure setup like this is precisely the capability that ARIA’s Scaling Trust Arena is being built to test: a competitive platform for evaluating AI agents’ ability to “securely coordinate, negotiate, and verify with one another on our behalf,” across digital and physical environments. The programme launches Q3 2026 with up to £10m in funding. The goal isn’t to build the agents; it’s to build the infrastructure of trust that makes peer-mode safe. The hard problem of the peer relationship isn’t capability, it’s verification. How do you know the agents are doing what you intended when you’re not in the room? The moment you let agents interact freely with their environment without human supervision, shit may happen, and having the right guardrails and sandboxed environment may be key to make the “agent swarm as a peer” approach feasible.

The constraint is the same one that limits autoresearch: you need a clean, fast, verifiable reward function on top of a constrained environment for the task where the agent can operate. Code compiles or it doesn’t. Val_bpb is measurable in five minutes. But most real-world tasks don’t have that property. Drug efficacy, novel materials, long-horizon business decisions, etc., the feedback loop is months or years, not seconds. Peer mode works today in bounded, verifiable domains (like traditional RL environments in AI research, and this is why AI research is so amenable to autonomous agents). For everything else, the envelope or the gate is still doing load-bearing work.

There is no one-size fits all approach

The more I explore this agent relationship, the more I am convinced that there is no one-size-fits-all solution, and that depending on the task, the stakes, and the environment, one agent setup (or relationship) may be more suitable than others.

Here’s a rough heuristic I’ve been using:

Use the intern relationship when: the action is irreversible, the stakes are high, or the output is genuinely hard to verify without human judgement. In essence, you need to be in control and letting the agent operate autonomously will keep you up at night longer than investing all your savings in TrumpCoin. Anything where a wrong answer is expensive and not immediately obvious, and there’s a level of subjectivity in the definition of the task, the intent, or the output, that is hard for the agent to stick to the plan end-to-end. This where 90% of the agents that I build today sit (unfortunately).

Use the contractor relationship when: you can define good boundary conditions upfront, the output is verifiable (ideally automatically), and the cost of getting it slightly wrong is recoverable. Code generation inside a well-tested repo. Research that runs against a clean evaluation metric. Anything where the agent calling you occasionally is acceptable but the agent calling you constantly defeats the purpose.

Use the peer relationship when: the domain has a clean reward function, the task is self-contained, and agent-to-agent coordination is more efficient than routing through a human. Automated testing pipelines. Self-improving research loops. Any system where the bottleneck is experiment throughput, not judgement. Long-running and repeatable tasks where you can afford self-healing agents driving exploration throughout the space of possibilities.

A big mistake I am seeing in the field right now is that we are trying to push the fully autonomous for all tasks, and part of the role of the engineering in this new reality is to assess what setup better fits the needs of the task. It is no longer about writing the code to solve the task, but designing the system that will allow the task to be solved. This is the “engineering taste” that we need to start developing as an industry, and all the best practices and skills should be around building the right harnesses for each of these relationships (and specific tasks that are set to solve).

I am also to blame for the above. I am still very much operating in the “agent as an intern” setup. It doesn’t require as much planning and design upfront, and it allows you to vibe with the agent. I admit that many of the tasks I’m currently working on could be set through an agent as a contract, or even a peer, but that would require clearly designing the task, finding the feedback loop and the reward function, and a lot of things that I keep procrastinating on.

But I’ve already started deliberately pushing certain workflows into contractor mode: setting the spec and the tests upfront and letting the agent run until it hits a boundary, rather than checking in at every step. The difference in throughput is not small, and it allows you to parallelise a lot of work, but it is true that it requires some upfront work that we are sometimes not willing to do. Even if it is a time sink, it is usually more rewarding to just switch on the AI slop slot machine and start vibing with it.

The next frontier that I am targeting is to build a sandbox environment that allows me to trigger a swarm of agents for specific tasks but in a constrained environment (similar to the /goal-like feature that Codex has and that Pi recently introduced). I already have some initial tests from my own agent harness, and I’ve already seen others way smarter than me in the space like 0xSero with really cool setups with this approach. I’ll report back with my own findings.

What is your relationship with your agents?

In this month’s AI Socratic Meetup in Madrid someone mentioned how “the other day they lost all of their data and context from their Hermes agents and they felt like they’ve lost a friend” (I know it sounds really creepy). We are developing different types of work and personal relationships with our agents.

Which of these relationships describes most of your current agent use? And if you’re stuck in intern mode across the board, what’s the boundary condition that’s stopping you from writing a better contract? I really want to gather more data points about how the rest of the industry is approaching these problems. Hit me up, and in any case, until next week!

@adlrocha - How AI Is Redefining Software Engineering

adlrocha — Sun, 24 May 2026 07:26:27 GMT

After the detour we made the last few weeks exploring the state of the hardware and software stack of local AI inference, I want to come back to one of the topics I promised to address in my posts about agent engineering. In this post I want to try to answer a question that I’ve been asking myself since the last time I wrote a line of code by hand. This is a question that I feel has been bugging everyone in the tech sector in the last few months: how is the role of software engineers changing in the age of LLms, agents, and wide-spread intelligence?

Let me share my view on the matter.

How I got here

My relationship with AI coding tools has gone through different phases in the past year and a half. I think it is worth sharing my personal journey because I suspect most of you may have gone through similar phases.

My first phase in this journey was the “LLM as a rubber duck” phase. When ChatGPT, Gemini and Claude came out, I’d paste in some code, describe a problem to the model, and argue with it until I’d figured out what I actually wanted to do and the right architecture for it. It was great for brainstorming and the research phase.

In terms of code generation, at this point I still was writing the code by myself (that means actually typing it line-by-line like we were used to doing in the previous century). I was treating LLMs as a glorified StackOverflow. They were great at generating code snippets and ideas about how to implement something, but you couldn’t trust them to write full features that you could just copy-paste into your code base. It was a great knowledge base and a peer to discuss and think out loud.

Then Github Copilot and Cursor entered the scene. I honestly never adopted the tab-tab-tab approach that most people got so hooked to. What I found most useful in this phase was the ability to have an IDE companion in the same interface with whom I could chat about code and ask for suggestions and ideas with all the context loaded into it without me having to copy-paste code into a web textbox. In this phase, I was still writing most of the code by hand.

Then, around November last year, something completely changed.

Agents like Claude Code and Github Copilot’s agent mode were already around Summer of last year, but after trying them for a bit I wasn’t impressed, and my immediate reaction was of pure scepticism. Why would I want to use these agents if the code they were generating was sloppy? There were two things that really bothered me: I felt like I was losing track of my own codebase, and the code the agents produced didn’t feel like mine. It was correct in a generic sense as the code compiled and did what it was supposed to do, but it didn’t reflect the architecture decisions I’d made, best practices, the tradeoffs I’d considered, the reasons certain things were the way they were. TL;DR, it didn’t read as high-quality production code.

Christmas of 2025 was the inflection point for me. Everyone was talking about the new version of Sonnet and how crazy the improvement of quality of Claude Code’s output was. As many others, I used the quietness of Christman to take Claude Code for a spin again with a side-project that I had been looking to revive for a while, and I literally was blown away. To me it felt like overnight Claude Code was able to create high-quality code that I could have easily written myself (maybe because I am not a good coder, but still). For the first time I wouldn’t have been able to recognise code written by me from code written by an LLM.

From there on, my relationship with LLMs and how I think about my role as an engineer has changed completely. I stopped trying to supervise every line of code or trying to re-write it myself, and started focusing on what I was actually better at: the spec, the architecture, the judgement about whether the right thing was being built at all, going deep into the technical side of things. The agent handled the mechanical translation from intent to code and I was just focusing on the high-stakes logic and steering the agent towards the right solution while reviewing that the output was correct.

It is hard to admit it, but after more than a decade writing code by hand, for the first time I wasn’t typing the code I shipped myself. I was unexpectedly being “promoted” from an IC to an agent manager.

Software engineering is already changing

The clearest signal that engineering work has already changed isn’t anyone’s opinion, it’s visible in the source code of the tools we’re building with. I feel like the way we are building coding agents these days reflects the shape that software will have everywhere in the new future.

Karpathy’s LLM OS framing is no longer a metaphor. MatrixOS (which I’ve already brought up in some other posts of this newsletter) maps it literally: the model is the CPU, the context window is RAM, the filesystem is persistent storage. When you look at the Claude Code source or the internals of Hermes, opencode, and pi, you’re not reading prompt templates, it maps a set of new software modules that will become the de-facto for software architectures moving forward. Four-layer compaction hierarchies. Sleep-based memory consolidation. KV cache stabilisation through sorted tool lists. These are not AI features. They’re systems engineering problems that will be applied to every domain.

This is where engineering is moving. Not toward writing more code, but towards designing the systems that write it and that leverages intelligence towards a specific task

I am not a fan of Garry Tan myself, but I think he is doing a decent job exploring what this new paradigm of software engineer could entail: fat skills (reusable markdown documents that encode judgment and process), a thin harness (the minimal loop that runs the model), resolvers (routing tables that load the right context at the right moment), and a sharp line between decisions that belong in latent space and operations that need to be deterministic. When I read those posts they felt familiar, because I already had a lot of those pieces in my own projects. As I’ve been repeating for a while now, we are all facing the same problems in this space, and converging to similar solutions.

From code to agent manager

Fine, I think we all agree that the discipline of software engineering is changing fast, but what does this mean for the engineer itself if they don’t have to write code by hand anymore?

MilksandMatcha and 0xSero wrote recently about the single-agent ceiling, the moment every developer hits when a project graduates from a toy to something practical. One agent, one context window, 35 minutes in, the context is bloated, the code is wrong, and you’re counting remaining tokens on your left hand.

Their fix is multi-agent architecture. The metaphor is a kitchen: the head chef (orchestrator) takes the order, breaks it into scoped tickets, and hands each one to a line cook (subagent). Each line cook gets a fresh context window, minimum viable context, does one thing, returns the output, and clocks out. Real numbers: single-agent workflow on a Figma task averaged 36.5 minutes with a 100% failure rate. Multi-agent: 5.2 minutes, success on the first try.

But the metaphor does something more important than describe an architecture. It reframes the engineer’s job. “Take off the apron,” they write. “Put on the chef’s coat. You’re running the kitchen now.”

Garry Tan makes the same point from a different direction. In his follow-up post on resolvers, he works through what it actually means to run an agent system with 40+ skills and 25,000 files (I really think this is an overkill and I am not using this approach myself, but I’ve seen this pattern in several people already). He arrives here: “What I actually built is closer to management. Skills are employees. The resolver is the org chart. Filing rules is an internal process. Trigger evals are performance reviews.”

The problem isn’t that models aren’t smart enough. “We’ve been building organisations with no management layer. Just a pile of talented employees and a vague hope they’ll coordinate.”

This is what the job actually looks like now. Not writing code. Running an organisation of skills, contexts, and subagents. Designing the handoffs. Deciding what belongs in latent space and what needs to be deterministic. The engineer who does this well isn’t the one with the best typing speed.

I have to admit that I myself am not there yet. I still run a small kitchen with a pool of trusted agents that are adapted to my daily workflows, but with nuances, I tend to agree that at a high-level this is where the industry is moving: an IC engineer is more and more becoming an architect, a manager, and a founder.

What we are replacing is the middle man

The field is splitting. Not cleanly, more like the rough shapes that are becoming visible.

But before the taxonomy, the more important point: engineering skills are becoming a multiplier, not just a job description. The barrier to shipping code has dropped. What that actually means isn’t that engineers are less valuable, I don’t think software engineers are going anywhere. I don’t think we are going to be replaced as the narrative suggests. It’s that deep technical knowledge now powers up roles that previously had no access to it. Product, GTM, finance, management, all of these are being redefined by people who can reason architecturally and build end-to-end without a team. The engineer who can drive a project from idea to deployed product, alone, is a different kind of professional than what that title implied five years ago.

Marc Andreessen made a related point in his conversation with David Senra earlier this year. His argument is that from the 1960s onwards, companies increasingly came to be scaled by engineers rather than by professional managers, not because engineers are better managers by nature, but because technical founders can adapt when the environment shifts rapidly, and managers cannot (as they don’t have the deep knowledge or technical foundation to adapt to rapid changing technologies and environments). Managerialism made sense when things were not changing. When things change fast, managers are completely lost because they have no idea how to handle disruption. You are way more likely to build something important if you take a founder and teach them management than if you take a manager and try to teach them to think like a founder.

His argument was centered on companies. But the same logic applies one level down, to individual careers. When the environment changes this fast, and the ground is shifting faster now than it was when Andreessen was making this observation, the people who remain anchored in deep technical understanding are better positioned to adapt than the people whose edge is coordination of stable systems.

The new archetypes

Which brings the archetypes into focus. There are still a few distinct shapes emerging:

The researcher. Goes deep on how models and systems actually work. Understands attention mechanisms well enough to reason about KV cache invalidation, reads the Claude Code source and immediately sees the connection to OS design. Narrow role, always in demand, genuinely irreplaceable. There’s no substitute for the person who understands why the architecture is what it is. Companies like Anthropic and the big labs are hiring lots of these, and if the promise of AI realises they’ll need more to tame their models.

The hardcore specialist and engineer-in-the-loop. The best distributed systems engineer, security engineers, or the best C++ engineer still has real value. But raw coding depth is being commoditised, not eliminated, but less of a moat on its own. If the only edge is writing faster or cleaner code than an agent, that edge is shrinking. Companies will still need some of these to keep agents under control, but they’ll need less of these than today. Even as the volume of agent-written code grows, someone still needs to review the architecture, audit the security model, and catch the places where the agent did something technically correct but fundamentally wrong. Fewer engineers will do this work, but the ones who do it well will be load-bearing in ways that are easy to underestimate until something breaks.

The fast-learning generalist. This is where the market is opening up in ways that I think we are not realising yet. AI compresses the time from “I don’t know this domain” to “I can build something useful in it” dramatically. An engineer who can pick up a new field quickly, apply sound architectural judgement without already being an expert, and drive a project end-to-end, that profile is becoming more valuable. The post I wrote about being scared of irrelevance rather than AI comes to mind: the engineers who will be irrelevant are the ones who narrowed so far they can’t transfer.

The founder. They have an engineering background but can pick up a project end-to-end and permeate throughout the whole company: from finance, to product, GTM, and people ops. These are the unicorns. They are great, but they may not fit every company, and they may be more suited for early stage and small companies, or nimble autonomous teams.

Of course, AI doesn’t only benefit software engineers. A film editor, a musician, a lawyer who understands the technical substrate of what they’re working on, anyone who can add deep craft and leverage AI for the rest is now competing at a level that used to require a team. The flip side: if your edge was output volume rather than judgement, that edge is gone.

The forgetting problem

I recently read this post that describes in a great way one thing I’ve been really worried about (and that I’ve seen others writing with concern about): “The West forgot how to make things, now it’s forgetting how to code”, which draws on a striking historical pattern. After the Cold War, the US stopped producing Fogbank, a classified material used in nuclear warheads. When they eventually needed to restart production, the tacit knowledge had atrophied so thoroughly that they essentially had to rediscover how to make it from scratch. Stinger missile production ran into the same problem after Ukraine demonstrated the need. Something similar happened with the outsourcing of chip manufacturing to China in the last decades, the West is losing its edge on the semiconductor industry.

The pattern is clear: build capability over decades, find a cheaper substitute, let the human pipeline atrophy, enjoy the savings, then watch it collapse when a crisis demands what you optimised away. The post argues we’re running the same risk with code. And the concern about juniors is real. When juniors skip the formative debugging sessions, the mistakes that build the intuition for where things actually go wrong they don’t build the tacit expertise. When the generation that has it retires, that knowledge doesn’t transfer. This is what really worries me about where software engineering is moving

Koshy John introduces the counter-side of this argument in this post: “For years, people have confused software engineering with code production. That confusion is now getting exposed.” The value was always in the judgement: making sound decisions, identifying missing abstractions, debugging reality. And that judgement is built through friction. Through struggle. Through getting things wrong and fixing them.

My intuition is that the people who navigate these issues best are the ones who stay curious about why things work, even when they don’t need to know how to build them from scratch. You can use AI to build the four-layer compaction hierarchy without understanding it. Or you can use it to understand it, and then build something better. The engineers who choose the second path I don’t think will ever be replaced. The ones who choose the first are renting an edge, and the Fogbank example is a reminder of what happens when you rent long enough to forget you ever owned it.

How are you dealing with this change?

Many of the topics I touched on in this post are present in the public discourse (a.k.a Twitter), and are brought up in one way or another in conversations with colleagues and friends. But I would be curious about the first-hand experience and thoughts of the readers of this post.

If I can collect enough testimonials from my readers it would make for a pretty nice follow-up post: my take on the matter v.s. readers’ realities. If you want to contribute with your experience and thoughts shoot me a comment or an email. And in any case, see you next week!

@adlrocha - Towards local plug-and-play AI

adlrocha — Sun, 17 May 2026 08:02:35 GMT

Last week I wrote about the hardware side of running AI locally, why memory bandwidth matters more than raw compute, which machines are worth building, and where the market is heading. If you missed it, start there as this post builds directly on top of it.

In the quest of becoming AI independent, your hardware sets the ceiling, but what decides how close you actually get to it is software.

Two machines with identical GPUs, identical VRAM, identical bandwidth, one running naive inference, one running an optimised stack can produce a 3-5x difference in tokens per second. This can mean the difference from running at 5tok/s to the 20-30tok/s that you need to get something usable. Even more, some techniques, and software-model-hardware optimised implementations may allow you to fit large models like DeepSeeekV4-Flash on a MacBook with at least 96GB of RAM. Same model, same hardware, different choices in the software layer (I feel we have a lot to learn from the video encoding and compression industry in this respect. We have to squeeze the most of every $ of hardware resources).

Last week I presented my new goal to create an inference box that generates tokens fast enough within a certain price point. So far I haven’t found one that fits my needs fully. What I am hoping is that the outcome of this work finds me a hardware configuration optimised for local inference and that is plug-and-play and not absurdly expensive, and/or a tool that detects your current hardware and suggests the best model and configuration for it.

This post continues that search focusing on the latest techniques to improve my inference stack.

MoE vs dense models

Before we get to the software tricks, there’s an architectural decision that sits underneath all of them, because it changes what the software layer has to deal with.

Most of the interesting models you’re likely to run locally with a decent throughput are Mixture-of-Experts architectures. Qwen3.6-35B-A3B, Qwen3-235B-A22B-250, DeepSeek-V4. The naming convention tells you the structure: 35B total parameters, but only 3B active per token. The model is divided into expert sub-networks, and a router that decides which ones fire for each token.

The key advantage of these types of models (and why is the one with biggest changes to fit your hardware) is that If only 3B of 30B parameters do any work per token, you get something close to 3B-scale inference speed while the model carries 30B-scale knowledge. With the right serving trick, like llama.cpp’s -ngl 99 -ncmoe 99 flags which keep the attention and shared weights on the GPU and offload cold expert FFN layers to system RAM, a Qwen3.6-35B-A3B can hit 33.5 tok/s on an RTX 3070 Ti with just 8GB of VRAM, provided you have 64GB+ of fast system RAM for the offloaded experts. The floor for running a 35B-knowledge model just dropped further than most people realise.

The main downside is consistency. And if you have used one of these MoE long enough you have probably experienced what I am about to describe.

Think of a MoE model as a hospital. Each patient gets routed to the right specialist. But which specialist fires depends on the token. The model can feel sharp in one domain and noticeably weaker in another depending on which expert activates. Dense models don’t have this problem because every parameter processes every token, every time. Slower, more expensive, but completely consistent. This lack of consistency can be experienced through tool call loops, to performance degradation and catastrophic forgetting.

For reasoning tasks, for long-context coherence, for anything where you need the model to stay sharp across a 50,000-token context, dense models tend to be better. This is why I would always recommend dense models for any agentic task that requires several assistant turns and accurate context keeping.

There’s a serving problem too: when too many tokens in a batch route to the same expert simultaneously, that expert’s buffer overflows and tokens get dropped silently. The model will not warn you of this happening, and it just gets worse. Dense inference has none of that complexity.

So how do you choose between dense and MoE models? Here’s the practical decision tree that I currently use myself:

8GB VRAM GPU + 64GB system RAM? MoE with expert offload is your only real option for a capable model. Qwen3.6-35B-A3B at Q4 or Gemma4-26B-A4B with llama.cpp offload fits this profile. Throughput will be CPU-bandwidth-bound, not GPU-bound, so a fast DDR5 system matters more than GPU generation here.

16–24GB VRAM (RTX 3090, RTX 4090, RTX 4080)? you have a genuine choice. Dense Qwen3.6-27B at Q4_K_M fits in ~16GB with no offloading, no serving complexity, consistent quality. MoE Qwen3-35B-A3B also fits at this tier with partial offload. If your workload is agentic, i.e. long contexts, tool calls, multi-step coherence, etc. The dense model’s consistency advantage is worth the slightly lower raw throughput (which can be painful as described in my last post).

128GB+ unified memory (Mac M3 Ultra, Strix Halo / Ryzen AI Max+)? This is where MoE becomes unambiguously better. You can hold the entire Qwen3-235B-A22B (235B total, 22B active per token) in unified memory and run it at full context without any offloading. At this tier you get frontier-class capability locally. The 22B active parameters per token still give you strong throughput on unified memory’s high-bandwidth pool. And sharp readers may be wondering, but then why are you running dense models like Qwen3.6-27B on your Strix Halo yourself? We are back to the consistency problem. I am looking to run long-running agentic tasks. The performance quickly collapses as the context grows.

Multi-GPU / 192GB+ VRAM? You can essentially run decent models for whatever you need. MoE at full precision, no quantisation required. The routing overhead becomes negligible at this level of parallelism.

The short version: if you have limited VRAM and a lot of system RAM, MoE with expert offload. If you have enough VRAM to fit a dense model cleanly, prefer dense for anything requiring long-context coherence. If you have a large unified-memory machine, MoE at the top tier, e.g. Qwen3-235B-A22B. becomes the most capable option you can run locally.

Fortunately, Alibaba built both ends of this spectrum with Qwen3.6, and so far I’ve been using them and I am decently happy. The MoE option is Qwen3.6-35B-A3B. The dense option is Qwen3.6-27B. Both run the same hybrid attention core, the choice between them is almost entirely a hardware and workload question, not a model quality one.

How did I get to these numbers? As I was doing the research for this section, I came across LocalMaxxing.com. This site is just pure gold. It provides benchmarks of different models over different hardware architectures, with clear information about the inference engine used and their configuration. This gave me homework for the next few months, and is a great resource for this new quest of mine.

The attention zoo

The MoE/dense question is only one dimension of the model architecture. The other is which attention mechanism the model uses internally. Every variant is a different answer to the same constraint: standard attention produces an N×N matrix where N is sequence length. At 10,000 tokens that’s 100 million entries, and the memory cost scales quadratically. Every interesting development in attention over the last two years is essentially a different attempt to escape that curve without paying too much in quality. Each have its own trade-offs and may fit better different underlying architectures.

Sebastian Raschka’s visual guide is the best resource I can recommend to navigate the attention landscape. The progression is worth following because each step reveals a different tradeoff.

Standard Multi Head Attention is what everyone (or maybe it’s me) thinks of when thinking about the attention mechanism in a transformer, where every token attends to every other token, full stop. Quadratic memory, quadratic compute. Nobody building for local hardware chooses it today; it’s just the baseline everything else is measured against.

GQA (Grouped-Query Attention) was the first fix that actually shipped at scale. Instead of each query head maintaining its own independent key-value projection, multiple query heads share a single KV pair. Roughly 50% KV cache savings, almost no quality loss. Llama 3, Qwen3, Gemma 3 all use it. The tradeoff is minimal, this is why it became the default so quickly.

MLA (Multi-Head Latent Attention), from DeepSeek, goes deeper. Rather than reducing how many KV pairs you keep, it compresses what’s stored in each one, saving a latent representation and reconstructing the full KV state on demand. More complex to serve than GQA, but at scale the quality-per-byte advantage is real. DeepSeek V3 and Kimi K2 use it; it’s the right choice when you have the memory to absorb the reconstruction overhead and want frontier-class output quality. Don’t ask me why, but I personally love models that implement MHA.

SWA (Sliding Window Attention) takes a different angle entirely. Rather than compressing the cache, it simply restricts how far back each token looks. Gemma 3 uses a 5:1 ratio of local to global layers, with a 1,024-token window and GQA on top. Memory grows linearly rather than quadratically, and for most practical workloads (like code completion, document Q&A, chat) the quality impact is minimal. The tradeoff is that you’re genuinely giving up global context on those local layers. Fine for most tasks; matters for very long-range reasoning.

Gated DeltaNet, used in Qwen3.6-27B, pushes further: rather than attending over a stored sequence at all, it maintains a fast-weight memory that gets continuously updated with each new token. Memory footprint stays flat regardless of sequence length. Going from 4k to 65k context costs ~800MB of VRAM instead of several GB, which is the difference between a 16GB card hitting a wall and staying competitive with machines three times its size. The tradeoff is architectural complexity and the fact that very-long-range dependencies that would be trivial for full attention require the model to have learned to compress them into the running memory state.

Mamba-2 hybrids (Nvidia’s Nemotron Nano) are the logical extreme, where most attention is replaced by recurrent state machines with constant memory regardless of sequence length. The right choice for edge and embedded hardware where even GQA is too much. Not the first option when you have a proper GPU available; the quality ceiling is lower, but the memory floor is the lowest in the landscape.

These differences show up on the machine. With SWA at long contexts the KV cache barely moves, the serving engine has headroom it wouldn’t have with MHA. Switch to GQA at the same context length and VRAM climbs noticeably. Run a DeltaNet hybrid and the memory profile nearly flatlines. When you’re fitting multiple agents on a fixed memory budget, which attention variant the model uses matters as much as how many parameters it has.

Even one more dimension on top of this that I’ve decided to leave out of this post and deserve its own post, is which quantisation mechanism to use and the different flavours available (which is a beast in itself). See my post about TurboQuant for a primer on this.

Speculative decoding

A few weeks ago Google announced that Gemma 4 was shipping with dedicated MTP drafters: small companion models that could push inference speed up to 3x without touching quality.

The problem speculative decoding solves is a fundamental one of transformers. As we’ve described a few times in this newsletter, LLMs generate text autoregressively, i.e. one token at a time, each token depending on everything before it. The large model has to do a full forward pass for every single token. You can’t parallelise the generation itself. So no matter how fast your hardware is, you’re paying the full model cost on every step.

The insight is that you don’t need the big model to propose tokens, only to verify them (a bit like speculative execution in processors). This is where the drafter comes in.

A drafter is a much smaller model (or a lightweight prediction head attached to the main model) that runs first and quickly guesses the next several tokens. Then the large target model takes all those guesses and verifies them in a single parallel forward pass. If the drafter’s guesses are right (and at 70-80% acceptance rates on typical queries, most of them are) you get several tokens confirmed in roughly the same time it would take the large model to generate just one. When the drafter is wrong, the target model corrects from that point and the drafter tries again. The output is mathematically identical to what the large model would have produced alone.

The drafter doesn’t need to be good in absolute terms. It just needs to be right often enough on the specific distribution of text the target model typically generates, and a small model fine-tuned on that same domain usually clears that bar easily. As mentioned above, we are getting closer and closer to the kind of optimisations that we made for encoders and processors.

Google’s MTP drafters for Gemma 4 achieve up to 3x speedup with no measurable quality degradation, roughly 2.2x on Apple Silicon at batch sizes of 4-8. Qwen3.6-35B-A3B with MTP enabled via vLLM hits 80 tok/s on 12GB VRAM, compared to 20-30 tok/s without. Using the same hardware by just setting a flag.

For models that don’t ship with native MTP heads, EAGLE-3 takes a slightly different approach: it attaches a lightweight prediction head directly to the target model’s internal layers rather than using a separate companion model. No extra weights to download or host, just a plugin that learns the target model’s output distribution and proposes candidates the main model verifies. Production benchmarks in vLLM and SGLang show 2-3x throughput gains at low-to-medium concurrency.

One caveat: the gains are real at 1-10 concurrent requests but diminish as concurrency rises. At high concurrency the overhead of running the drafter plus verification starts eating into the benefit. For a personal inference box or a small team setup, it’s the highest-return optimisation on this list, but it may not be a silver bullet as soon as you want to serve this to several concurrent users (not my case at least for now).

FlashAttention

We’ve touched on this a few times in this newsletter. The KV cache problem has a structural cause, and as your context requirements grow, this is something that is immediately noticeable (try a Qwen3.6-27B model running in a high-end Strix Halo using Hermes agent’s minimum 65K recommended context, and feel the pain as the context grows and your conversation evolves). Standard attention produces an N×N matrix where N is sequence length. At 10,000 tokens, that’s 100 million entries.

This is why you should always have FlashAttention enabled. First published by Tri Dao’s lab this technique fixes the tiling of computing full KV cache blocks. The model never materialises the full N×N matrix, it computes attention block by block, keeping intermediate results in fast memory. Same mathematical result. Dramatically fewer memory round-trips (enable -fa in lamma.cpp, or the equivalent of your inference engine of choice, to see the magic happens).

Liquid Nanos: a different foundation

Everything above is about making transformers run better with the hardware available, but there’s one approach worth knowing about and that I recently came aware of and aligns with a hypothesis I’ve been having for a while.

Liquid AI built their models on Liquid Time-Constant Networks, a class of neural network originally inspired by C. elegans, the nematode with 302 neurons that navigates and forages with surprising reliability. The underlying idea is the following: biological networks don’t process information in discrete steps, they evolve continuously according to differential equations. The time constant that governs each neuron’s behaviour adapts based on current input. The model isn’t static and it adjusts its own processing as it reads.

The production translation is the LFM2 family: a hybrid with 10 gated short-range convolution blocks and 6 grouped-query attention blocks. Most sequence processing happens in the convolution layers, which scale linearly with context length and carry no KV cache overhead at all. Liquid calls the unifying design ‘Linear Input-Varying operators’, weights generated on-the-fly from input, rather than fixed parameters. And the architecture was found by optimising under real CPU and mobile SoC latency constraints from the start. Hardware-in-the-loop search, not adapted for hardware after the fact.

LFM2.5-1.2B hits 2,975 tok/s prefill and 116 tok/s decode on an AMD Ryzen AI 9 laptop via llama.cpp without GPU. The Liquid Nano models, task-specific fine-tunes for extraction, RAG, and summarisation, range from 350M to 2.6B. The 350M variant runs on a Raspberry Pi 5 in 300MB at int8 quantisation.

The standard instinct with LLMs is to reach for the biggest model you can run. But the Nano line is a concrete example of the alternative: distil the capability you actually need into a model small enough to run anywhere, then compose several of them. A 350M extraction model pulls structured fields from a document. A small RAG model retrieves relevant context. A slightly larger reasoning model makes the decision. Each one doing its piece efficiently, orchestrated into a pipeline, and the total compute cost is a fraction of a single large generalist doing everything at once.

This is, I think, one of the more underrated directions for local inference as the hardware improves. The question stops being “can I fit a big enough model on this machine?” and starts being “what’s the right composition of narrow models for this workflow?” The distillation approach isn’t new but the combination of capable base models to distil from, hardware that can run the small results locally, and serving stacks that already support them changes what’s practical.

This is the approach that we used to build Baselight AI: leverage narrow agents that can run leveraging smaller models, and compose their functionality to achieve the global goal.

Software-Model-Hardware Optimised Implementation

I want to close by sharing something that blew my mind, and that I am hoping to study in detail and come back to soon. An example of optimising the model to the underlying hardware. It reminds me of my time researching video compression. The author of Redis published ds4: a native inference engine for DeepSeek V4 Flash, written specifically for Apple Silicon in a few thousand lines of C. No generic inference engine, no GGUF general-purpose runner abstractions. One model, done properly and over-optimised for a specific hardware architecture.

It runs DeepSeek V4 Flash with a 1 million token context window on a 128GB MacBook Pro. I haven’t tried it myself, but antirez says that “coding agent works great, reliably calls tools.”

The performance numbers from the repo are the following: 26.68 tok/s generation on an M3 Max 128GB. 36.86 tok/s on an M3 Ultra 512GB. Not the fastest numbers you’ll find in this series of posts. But for a coding agent running a full tool-call loop locally, with 1M context, on a laptop that costs less than a used RTX 3090 rig, it’s more than enough. I think this setup would solve my needs

The three engineering decisions made this possible are the following:

Asymmetric 2-bit quantisation. The MoE experts make up roughly 90% of DeepSeek V4 Flash’s parameter volume, but they’re not all equally critical. Antirez applies an aggressive 2-bit compression (IQ2_XXS and Q2_K) only to the routed expert layers, the ones that activate a subset of the model per token. Shared experts, projections, and the routing logic itself stay at full precision. You get most of the memory savings with a fraction of the quality loss. This is the same asymmetric logic that makes MoE offloading work in llama.cpp, pushed further.

KV cache to SSD. The conventional assumption is that the KV cache has to stay in RAM, at 1M tokens it would blow well past 128GB. Antirez treats the compressed KV cache as something that can live in disk: checkpointed to Apple’s high-speed SSD, with SHA1-based resumption across restarts. The engine tracks what’s in memory, what’s on disk, and fetches as needed. It works because Apple Silicon’s NVMe storage, the same architecture I mentioned in the my Apple post, reads at speeds that make SSD-as-extended-memory viable in a way it simply isn’t on a standard laptop.

Pure Metal, no wrappers. Every kernel hand-written for Apple Silicon. No intermediate runtime, no portability tax. The result is that the Metal backend isn’t an afterthought, it’s the target the entire design was optimised for.

What I really love about this work on ds4 is that it aligns this thesis I’ve been having that we need software-mode-hardware optimised implementations (it may be confirmation bias, I’ll take it). But it takes what we’ve been exploring throughout the post to its logical extreme. FlashAttention improved on standard attention by understanding exactly what the hardware does with memory. Speculative decoding improves throughput by understanding how to exploit the verification step. Liquid’s hardware-in-the-loop search improves inference by designing the architecture for the hardware constraint from the start. Antirez did all three things at once, in C, for one specific model, and published it in a few thousand lines.

I feel that we are going to see more and more of this (I personally want to do the exercise of implementing something like this for one of my favourite small models). Specific models optimised for inference in specific hardware. This is what I was referring to when I mentioned that if I can’t get a plug-and-play box for general-purpose inference, at least I want specific usable models that fit my hardware and my use case.

Towards plug-and-play local AI

Last week I wrote about the hardware side of AI independence, focusing on the setup and machines that made local inference viable without spending months of work and breaking your bank account. This week it’s the software side: the optimisations and model choices that turn decent hardware into something that feels fast and usable.

The truth is, the pieces are already here. MoE expert offloading, speculative decoding (MTP or EAGLE-3), FlashAttention, DeltaNet hybrids, narrow distilled models, everything we need exists today in open source. What I feel is missing is the glue: one opinionated stack that just works for your specific machine.

YC entrepreneurship gurus always say that you should build things that scratch your own itch. Plug-and-play local inference on a budget is that itch that I’ve set myself to scratch in the coming months (or years?). I want a tool that looks at your hardware and gives you the best possible setup: right model, right quant, right serving engine, right flags, etc. without you having to navigate the local inference landscape. No more weekend debugging sessions. Just install once and run capable agentic AI locally and a clear expectation of what your hardware will allow you to run (because I’ve also been burnt by false promises of how “useful” a local model could be running on specific hardware).

This lines up directly with the bigger fight @0xSero laid out: making AI education, training, self-hosting, and inference available to everyone on this planet, whatever your budget. Open source has to win. The alternative is locking frontier capability behind cloud subscriptions and six-figure rigs.

I’m working towards this starting with serious benchmarking on the Strix Halo (because it’s the hardware that I have). The more real data we have, the better the auto-config can become.

I’m completely open to collaborations on this. If you have hardware to test on, a rig or box that you don’t use anymore and that you would be willing to donate for the cause, engineering time, or want to help push the benchmark suite forward, reach out.

Until next week!

@adlrocha - In a quest to becoming AI-independent

adlrocha — Sun, 10 May 2026 08:02:45 GMT

A few weeks ago, GitHub announced that Copilot is moving to usage-based billing. No more flat subscriptions, from now on everyone has to pay for the tokens they use.

If you’ve been using Copilot on the free tier or an individual plan (like it was my case through a benefit to active open-source contributors), this probably stings. This subscription was the perfect way to test every new model without having to commit to specific subscriptions, and with an extremely generous monthly quota. I know of many people that bought Github Copilot subscriptions over Anthropic ones because it gave you access to Sonnet and Opus with higher quotas than those provided in Claude. So the obvious question is, why was it so cheap?

The answer is definitely not generosity. It is well-known that AI labs and big tech have been subsidising token costs for the same reason any platform subsidises onboarding: to build dependency before they extract value and crush their competition. Every cheap API call is also a training data point. Every workflow you wrap around their service is a switching cost they’re accumulating on your behalf. GitHub Copilot at $10/month was never a sustainable product, like is probably the case for more popular products like Claude Code and Codex. It was a land grab dressed up as a subscription. The cost per user of all these AI subscriptions (at least from the well-funded companies that can afford it) significantly exceeds the price of their subscriptions.

My most loyal readers know how I’ve been concerned about the economics of AI for a while. In this post I already made my argument about how I think “the AI Bubble is more a trap than a bubble”, and how by accelerating the adoption of AI for our daily workflows, companies are trying to create a dependency that they can leverage. When I realised this by the end of last year, I decided to start buying hardware that I could use to run local inference in order to start minimising my dependency from big token bills and subscriptions with decreasing token allowances.

My journey started with a Strix Halo chip, the Ryzen AI Max+ that has become my daily driver and gives me up to 128GB of unified memory. This machine allows me to comfortably run Qwen3.6-27B and Gemma 4 locally for my LLM-powered background tasks. Think email and calendar digest, meeting summaries, TTS, etc., the kind of assistant and automation work that doesn’t need a fast feedback loop or large contexts and can run continuously in the background. This allows me to prevent an increased AI bill, and to unnecessarily drain the token quota of my subscriptions, which I desperately need for more complex agentic tasks.

While this setup works fine for this kind of use case, it has shown to be quite annoying when you want to start leveling up your game and let your agents start relying exclusively on local models. The key problem is throughput. Even if the model fits in memory, as soon as you need to support an application that requires large context, tight feedback loops like agentic coding, auto-research tasks, real-time tool calls, or even running OpenClaw or Hermes agents, the tokens per seconds required to make the experience bearable (at least for me) aren’t there yet.

Fortunately, this gap is solvable, but today it may cost a few thousand dollars. So before spending a few “Ks” on hardware I wanted to be really sure and understand the setup that would give me what I need. This post is my public report of all my findings.

How inference actually works

But before we get into the hardware, it’s worth refreshing what “inference” actually requires, because the specific hardware requirements that matter, and how they impact your user experience, may not be the ones that intuitively many people think.

There are three main resources in play at inference: memory capacity (whether the model fits at all), memory bandwidth (how fast weights and caches stream into the compute units), and raw compute (how fast those units do the maths). Most people focus on the third one, while the bottleneck is almost always the second.

Here’s why. An LLM generates text one token at a time, autoregressively. Each token requires reading a large chunk of the model’s weights from memory into the processing units. The weights themselves don’t change (you’re not training, you’re reading). Which means the question isn’t “how many FLOPS can this chip do?” but “how fast can it stream data from memory?“ That memory bandwidth is what matters, measured in GB/s.

To give you some numbers that can help you build your intuition, an RTX 3070 with 8GB of VRAM has 448 GB/s of memory bandwidth. A newer RTX 4060 Ti with the same 8GB has 288 GB/s. For inference throughput, the 3070 which is older and cheaper, can be faster at inference as long as it can fit the model. This is counterintuitive until you understand what’s actually being measured. Apple understood it early, even if by accident, with the unified memory architecture in M-series chips, where CPU, GPU, and Neural Engine share a single high-bandwidth pool with no bus crossings, turns out to be nearly optimal for exactly this kind of workload. This is what makes Apple devices with M chips so good at inference. I wrote about why a few weeks ago.

The other bottleneck you need to understand is the KV cache. When a model processes a long conversation or code context, it caches the key and value vectors from each attention layer for every token it’s seen so it doesn’t have to recompute them. This cache grows with context length. At 200k tokens, it’s roughly 2GB with FlashAttention on, something manageable. But without optimisation, long contexts can eat most of your VRAM before the model weights even load. Newer architectures like Qwen3.6 address this directly: only 10 of the model’s 40 layers use full KV cache, meaning going from 4k to 65k context adds roughly 800MB of VRAM rather than several gigabytes. Architecture decisions like this are why “how much VRAM does it need?” is a question that increasingly depends on which model you’re running, not just how many parameters it has. If you want a deeper view on how transformers and KV caches work, I also shared a brief overview with external pointers on this post.

What does this mean for agentic work specifically? Tok/s matters more than it does for a chatbot. When an agent is executing a loop (calling a tool, parsing the output, deciding the next step) latency compounds. At 5 tok/s you’re waiting seconds between loop iterations. At 40 tok/s the loop feels instant. The difference between a useful coding agent and one you give up on is often that narrow. And this is the pain that I am feeling with my current setup. These half hundred tok/s is what I want to aim for with my next setup.

What the hardware market looks like

I’ve spent a long time in the weeds on this, and a lot of my thinking has been shaped by 0xSero’s detailed breakdown of the current market, and all the experiments he keeps sharing publicly (if you don’t follow him already and you are interested in local inference I highly recommend you do it right now. And 0xSero If you end up reading this, I can’t thank you enough for your contributions and all the good you’ve done for the open-source AI and local inference community). Here’s how I’d summarise the options as of mid-2026, capped at roughly $10k for an end-to-end inference machine built upon 0xSero’s analysis and benchmarks, and my own research.

Before I share the actual builds, here’s a summary table with the high-level hardware numbers from the previous section. As a reminder, memory capacity tells you which models fit, memory bandwidth tells you how fast they run. The table below puts those side by side so you can read the trade-offs against the metrics that actually matter.

With that framing, here’s the detail on each.

Source: 0xSero

Mac M3 Ultra

The cleanest option. Apple Silicon’s unified memory architecture (CPU, GPU, and Neural Engine sharing a single high-bandwidth memory pool) turns out to be nearly ideal for inference. No bus crossings, no transfer overhead. MLX has matured significantly in the last few months (as I described here) and is approaching the throughput of an Nvidia 3090 on comparable tasks. At 400W peak, the whole machine uses less power than a single overclocked 3090.

The biggest advantage is capacity: 512GB of usable memory means you can run Kimi-K2, Deepseek, and Minimax-M2 at full context, without extreme quantisation. Network two of them and you hit 1TB, something that would cost north of $50k with Nvidia. Scaling is quite clean in this case, each additional machine is its own self-contained unit with its own software stack connected through Thunderbolt/Ethernet.

The key limitation here is the lack of CUDA support. A lot of tooling in the inference ecosystem like vLLM, SGLang, the training and fine-tuning stack, assumes CUDA. MLX is good and getting better, but its level of maturity is still not close to CUDA’s. If you want to also fine-tune or train on your inference box, this may not be the best solution. But for inference? It’s great!

8× Nvidia RTX 3090

This is the power-user option, and the one that requires the most assembly work. There is no pre-built version of this; you are building a workstation from parts.The shopping list looks something like this: a server-grade motherboard with at least eight PCIe slots (something like a Gigabyte MZ32-AR0 or Supermicro equivalent, $800–1,200), a server chassis or open-air mining frame ($200–400), a 2,000W+ PSU or dual PSU setup ($400–600), 256GB of DDR5 system RAM for MoE offloading ($400), and eight RTX 3090s at roughly $800–1,000 each used. Total: $9–12k if you buy carefully, more if you don’t (which is always my case :) ). You will spend a weekend on this. Then another weekend on NVLink bridges and driver configuration.

What do you get in exchange? 192GB of VRAM at 936 GB/s of aggregate bandwidth, the fastest throughput on this list for dense models. Full CUDA support means vLLM, SGLang, and anything else the ecosystem has produced. A mature ecosystem and a box where you can also train and fine-tune.

The main downsides of this setup is that at full tilt the system draws 1,500W even with cards capped at 50% power limit. It will be quite noisy. The used 3090 market is tightening. Scaling beyond 8 cards requires an electrician and a second system. Think of this as a serious workstation close to data-centre level, not a quiet office machine.

If you like hardware and building your own machines, this is a really fun project. But if you don’t have the time this one is probably a pass for you, even if the economics per GB of VRAM add up.

Ryzen AI Max+ / Framework Desktop

This is the chip in my own Beelink machine. Framework sells a desktop configuration with 128GB starting at around $3k, expandable in 128GB increments up to 384GB really similar to the one I have. Mine includes 128GB, and you can buy it configured and it arrives ready to run, no assembly or heavy work needed. The power draw is modest, it’s quiet, and the RAM expands by swapping sticks rather than adding cards. I’ve been running non-stop for the last six months without noticing anything on my electricity bill.

The same chip, the Strix Halo, is what 0xSero describes as bringing the cost-per-GB-of-memory down “an absurd amount” relative to Nvidia. At 128GB you’re past the capability of four 3090s for half the price and a tenth of the hassle. Simon Couch has a good post showing what day-to-day local agent workflows look like on this class of machine. The memory architecture is similar in principle to what Apple is doing, unified pool, high bandwidth, no bus penalty, which is exactly why it’s competitive on inference despite the software friction.

The catch: ROCm instead of CUDA. AMD’s software stack has improved considerably, but it still requires more configuration than CUDA-based workflows, and some tools simply don’t support it. I personally faced some issues with Strix Halo’s ROCm support for the kernel version that I was running, which pushed me to run my models in Vulkan. The performance degradation is negligible, but you still have to go through some hoops compared to CUDA’s support.

Supply has also been inconsistent, Framework’s configurations sell out and wait times stretch to weeks. Scaling horizontally (multiple machines networked together) is possible but requires more work than adding a card to a PCIe slot, although you can always connect

Nvidia RTX 6000 Blackwell

The option for people who want to start small and scale without rebuilding. A single RTX 6000 Blackwell is a PCIe card, it slots into any workstation motherboard with a x16 slot, which means the rest of the machine (CPU, RAM, case) can be modest consumer hardware at $500–800. One card is ~$7–10k and gives you 96GB of VRAM at roughly 1,700 GB/s, faster per-card bandwidth than the entire 8× 3090 build. Two cards doubles the VRAM to 192GB at half the power draw of eight 3090s. You can reach eight cards on a household circuit, landing at 768GB of VRAM, the practical ceiling for residential power.

The per-GB cost is the highest on this list. But you’re buying a 5-year upgrade path. Add one card per year, keep everything else the same. No new chassis, no new PSU configuration, no rebuilding the stack. For people who want to grow an inference cluster incrementally, this is the most coherent architecture (albeit the entry level is quite expensive).

Huawei Atlas 300I Duo

The wildcard. $10k buys you 480GB of VRAM, a number that’s hard to match anywhere on this list. vLLM support exists now, which has changed the viability picture considerably. At 400 GB/s bandwidth per card it’s not the fastest, which limits tok/s on dense models, but for running very large models at lower throughput requirements it’s hard to beat on cost-per-GB.

The bigger issue is the ecosystem: debugging means translating Chinese forums, GitHub issues go unanswered for months, and for US-based buyers the import situation can be complicated.

Worth knowing about. Probably not your first machine.

tinybox

The tinybox from the tinygrad team is the closest thing to a plug-and-play inference machine you can buy today, pre-assembled, pre-configured,it ships on Ubuntu 24.04 with the tinygrad software stack already running.

The tinybox red v2 is the AMD option and the one that fits a realistic home inference budget. Four AMD Radeon RX 9070 XT cards, 64GB of total GPU RAM, 2,560 GB/s of aggregate bandwidth, a 32-core EPYC CPU, 128GB of system RAM, and a 2TB NVMe, all in a single 1,600W supply enclosure for $12,000. By bandwidth it punches well above the 8×3090 build described above, at a fraction of the assembly headache. The ceiling on model size is lower than the Nvidia options (64GB VRAM fits quantised 70B-class models comfortably), but for throughput on models that fit, the bandwidth numbers are serious. ROCm applies here as with all AMD hardware, but tinygrad’s stack abstracts most of it.

The tinybox green v2 has moved to a different category entirely. The current version ships four RTX PRO 6000 Blackwell cards, 384GB of total VRAM, 7,168 GB/s of aggregate bandwidth, 3,086 TFLOPS FP16, all for $65,000, made to order, and requires a concrete slab. It is no longer a home inference box. It is a small data centre. Worth knowing it exists; not relevant to this post’s scope.

The tinybox red v2 is the reference plug-and-play option for this discussion. You don’t build it, you plug it in. If the trade-off of ROCm for zero assembly time sounds right, this is currently the clearest path to a working inference machine out of the box.

Non-Nvidia GPUs

This is a work in progress in my research, and something that I am seriously considering. Nvidia GPUs tend to have a higher price than AMDs for comparable hardware capabilities.

The RX 7900 XTX is the AMD equivalent of the RTX 4090: 24GB of GDDR6 VRAM, ~960 GB/s memory bandwidth, available for roughly $900–1,000. By bandwidth it actually trades blows with the 4090 (which sits at ~1,008 GB/s). The VRAM is identical. The price is meaningfully lower. An 8× RX 7900 XTX build lands at roughly $7–8k for the cards alone, comparable to the used 3090 build on cost, with newer silicon and better per-card bandwidth. The assembly requirements are the same: server motherboard, beefy PSU, a weekend of your life.

The RX 7900 XT (20GB, ~820 GB/s, ~$700 new) is worth knowing about as a budget-tier option if you want AMD silicon without paying 4090 prices. Less VRAM per card means a lower ceiling on model size, but in a multi-card build you can still reach 160GB across 8 cards.

On the workstation side, AMD’s Radeon PRO W7900 (48GB VRAM, ~864 GB/s, ~$3,500) is the closest AMD equivalent to the RTX 6000 Blackwell, a single professional-grade card with serious VRAM, designed for long-term workstation use. Two of them give you 96GB at roughly $7k, competitive with one RTX 6000 Blackwell in capacity but significantly cheaper. The bandwidth per card is lower (~864 GB/s vs ~1,700 GB/s), which shows up in tok/s on dense models.

The ROCm caveat applies uniformly across all of these. vLLM and llama.cpp both have ROCm support and it has improved substantially — most workloads that run on CUDA will run on ROCm with some configuration friction, and for inference specifically (as opposed to training) the gap is smaller than it was a year ago. The tools that still don’t support ROCm reliably are the fine-tuning and training frameworks. If your use case is pure inference, AMD is a legitimate option at every price tier. If you want to train or fine-tune on the same hardware, CUDA is still the safer choice.

To end this analysis, if you are curious about what others are running, here’s a recent study from HuggingFace with the top100 most popular hardware setups. Here’s the top 10 setups:

Where the hardware is going

Everything above operates inside the GPU paradigm. That’s where the market is today. It’s worth spending a moment on where it’s going, because the three numbers from the first section (capacity, bandwidth, compute) are exactly what purpose-built inference hardware tries to optimise, and there are early signals that something meaningfully different may be coming that can completely change the local inference landscape.

Recall the core constraint: inference is memory-bandwidth bound. A GPU spends a large fraction of its silicon on graphics-specific logic (rasterisation pipelines, render targets, display controllers) that contributes nothing to matrix multiplication. That’s headroom a purpose-built chip can reclaim.

Talos V2 is a small open-hardware project that I recently came across and illustrates this directly through an FPGA-based inference board. Luthira Abeykoon built Talos specifically for transformer inference and published a head-to-head benchmark against an Apple MacBook. The numbers aren’t yet flattering for the FPGA, Apple Silicon wins on raw throughput, but the benchmark is honest about why: the FPGA’s memory bandwidth and capacity are still the limiting factor, not the compute logic. That’s the same constraint we’ve been discussing all along. What the project demonstrates is that if you wire the hardware directly to the transformer computation pattern, you eliminate the GPU overhead entirely. The floor for what’s possible on custom silicon is lower than the GPU paradigm suggests.

The more commercially developed version of this idea is Taalas (already discussed in this newsletter), which is building inference-specific accelerators designed from the ground up around the bandwidth and memory access patterns that transformers actually need rather than the patterns a graphics card was designed for. And Cerebras, whose wafer-scale chips put the entire model on a single die, eliminating inter-chip communication latency, represents the extreme end of the same logic: if memory bandwidth and model capacity are what matter, what happens when you remove the memory bottleneck entirely by fusing compute and memory into one structure?

These are not plug-and-play products today. Cerebras is data-centre hardware. Taalas is early-stage. But the direction is consistent with what happened to every other class of compute that started general-purpose and got specialised over time: GPUs themselves, Apple’s Neural Engine, Google’s TPUs. All these architectural innovations will eventually permeate into the retail (assuming that they are scalable) allowing for the manufacturing of specialised inference accelerators that allow running inference locally at a decent price (at least that’s my dream).

The practical implication for the builds above: the gap between “what you can run on $10k of consumer hardware” and “what a cloud API gives you” is closing faster than it was two years ago, and it will continue to close. MoE architectures make very large models accessible on modest VRAM. Quantisation research (see the TurboQuant work I wrote about earlier this year) keeps compressing the memory footprint of capable models. Google’s recent multi-token prediction improvements for Gemma4 (which I am planning to talk about in detail in a follow-up post) And eventually, purpose-built inference chips will arrive at the price point where they belong in a home inference box. The question of whether you need a $10k machine to run something genuinely useful has been getting a faster “no” every quarter.

Becoming AI-independent

I keep coming back to the following analogy when I talk about the benefits of local inference, and where the market may end up moving, to friends and colleagues. A decade ago, some homeowners started installing solar panels. Not because it was the cheapest energy in the short term, because it wasn’t, and the payback periods were long (solar cells were expensive). They did it because they wanted independence from the central grid (where prices fluctuate a lot, and it can be unreliable or inexistent depending on the location).

I feel we may be seeing this same trend for AI, as it becomes more and more of a utility. As proof of this, Nvidia and SPAN recently announced a partnership to deploy AI data centres in residential back gardens. GPU nodes integrated into home energy systems. A flavour of this is what I really need for myself.

It makes me really nervous to depend on all these AI subscriptions and token bills. I want AI independence. I already installed solar panels at home, I now want my own inference cluster. Unfortunately, I haven’t found the perfect solution where I get a plug-and-play inference system at home at a reasonable price like is currently the case for solar panels. Projects like tinygrad are getting close, but it is still a bit expensive (and probably an overkill) to what I need.

The Framework Desktop is almost there in terms of price and convenience but it is not quite there in terms of hardware requirements, and the Mac Ultra is just what I need but still expensive. A flavour of some of the configurations described above: like a small server with a RTX3090 expandable to more is close, but just still have to make your shopping list, find the right hardware, and maybe even benchmark it.

I feel increasingly convinced that there’s space in the market for someone selling the perfect well-optimised AI plug-and-play expandable inference box for 2-5K$. Something that allows you to slowly grow your inference cluster to your needs and well-benchmarked so you know what to expect in terms of tok/s and models that you can run.

I am convinced to the point that I’ve decided to start building and benchmarking these inference boxes configurations myself to try and get to this perfect device that can fill the gap in the market. Three things can happen:

I spent a few thousand dollars in experiments and I have some useful hardware at home that I can tinker with
I find the perfect setup and I manage to partner with someone to help me manufacture and ship at scale these AI inference boxes.
I managed to find a configuration that others can copy to offer similar devices.

In all three options I manage to get the inference box that I am looking for, with 2 and 3 solving the problem for others that may need to scratch a similar itch.

If you want to contribute to this endeavour help me either with input, suggestions, or sponsoring the effort. Anything is welcome. This is still a really raw idea that I am trying to shape. You know my email :)

Until next week.

@adlrocha - The Eval Problem: How to Test AI Agents When They Never Give the Same Answer Twice

adlrocha — Sun, 03 May 2026 08:01:34 GMT

I wanted to close this series of posts from the past few weeks on agent engineering with the top of mind problem of everyone that starts building an agent to deploy it in production, but that I don’t see nearly enough people talking about openly.

How do you test that an agent works as intended in the hands of users? From my conversation with other people that are running agents in productions, it feels like we still haven’t found the silver bullet just yet (at least to my knowledge, if this is not the case, please shout!).

Two weeks ago I went through the Claude Code. Last week I covered Hermes, opencode, pi, and how the open-source community was solving similar problems to the ones faced by the Claude Code team (and honestly, everyone building agents these days). In these posts I intentionally didn’t go deep into how these projects were testing themselves, because I felt this deserved a post of itself.

This topic has been nagging me since we released the first LLM-powered feature for Baselight, the SQL error assistant. How can you feel confident that you’ve tested the hell out of your system so that it will operate as intended in every possible case the moment you release it to users, considering the stochastic nature of LLMs? How do you catch regressions before your users do? How do you even define what “working as intended” means when the system’s outputs are stochastic?

This post is an attempt to to share all that we explored in the process of building Baselight, complementing it with all that I’ve learnt from the source code of the agents from the last few weeks, with the goal of shedding some light on the topic of agent evaluation (hopefully sparing you some research and a few experiment).

The problem is not just the stochasticity

When engineers try to test an AI agent for the first time, the stochasticity is what stops them (at least this happened to me). You write a test, it passes, you run it again, it fails, same input, different output. The result of the test is not necessarily wrong, but it just doesn’t match 1:1 the assertion that the code was expecting for the test. Traditional assertion-based testing breaks immediately. You can’t assert character-by-character what the agent will output for a specific input.

The first time I faced this problem myself was when testing the error assistant. I built a testing harness where each test case received an SQL query with errors, an error message, and the corrected SQL query. Given the input, the error assistant returned a corrected SQL query, and the result of the test’s expected query should match the one generated by the error assistant. Simple enough, right?

Not really. It was a fucking mess. For simple tests the correct result and underlying assertion to be made was obvious, and the result of the expectation always matched the one generated by the agent. But with more complex queries there was ambiguity caused by the error that resulted in a lot of flaky tests, with expected results sometimes not matching exactly the result of the generated query.

That’s bad, but it’s not the deepest problem.

The deeper one: before you can ask “did the agent succeed?”, you have to answer “what does success look like for this task?” For narrow, well-defined tasks like the case of the error assistant that’s tractable. A SQL query either runs without errors and the result matches the original user intent, or it doesn’t. A response either cites a number that appears in the data or it doesn’t. But as tasks get broader and the catalogue larger, “success” starts to blur. How can we, for instance, assess objectively if the Baselight agent succeeded in performing a data analysis for a specific topic? We are back to a similar problem to the one I presented in my auto-research post. It all boils down to choosing the right evaluation metric.

At Baselight, we have an AI agent that lets users query and explore large datasets through natural language. A response can leverage the right datasets for the analysis, perform the right set of queries, but draw the wrong conclusions from it. A response can correctly identify a relevant dataset but miss a more relevant one three entries down in the catalogue. A response can produce a valid chain of SQL queries that answers a slightly different question from the one the user actually asked. None of these are errors in the traditional sense. All of them are failures from the user’s perspective.

Defining what “good” means, precisely enough to check it automatically, is most of the work. Everything else that you can think of (tooling, frameworks, running the eval harness) is in service of that definition. You can leverage an off-the-shelf tool or build it yourself.

The two layers you need

From what I’ve seen in the open-source codebases and from building our own eval infrastructure, there are two distinct layers to testing an agentic system. Each one catches things the others miss.

Layer 1: Scaffolding unit tests. The parts of your agent system that are deterministic like prompt strings, compression logic, tool dispatch and their underlying logic, permission checks, CLIs, etc. These should be tested aggressively as ordinary software with the tools and techniques we are used to (from unit testing to end-to-end integrations).

Hermes’s most distinctive test file is test_prompt_builder.py, which imports guidance constants and asserts on their content directly:

def test_memory_guidance_discourages_task_logs(self):
    assert “durable facts” in MEMORY_GUIDANCE
    assert “Do NOT save task progress” in MEMORY_GUIDANCE

This is testing a prompt string the same way you’d test a function contract. If someone edits MEMORY_GUIDANCE and weakens the instruction, CI fails. The invariant, which we discussed last week as a real class of agent bugs, is now enforced by the test suite, not by convention. The compressor logic, the injection scanner, the deduplication rules: all of these are deterministic given their inputs. The fact that they serve an LLM doesn’t make them untestable and they can be tested like any other ordinary piece of software for the pre-LLM era, it is just deterministic code.

Something that I saw in Herme’s source code and that I personally liked to do before LLMs (but that I feel has become more important considering their stochastic nature) is issue-tagged regression tests. When a production failure happens, you write a test named after the issue before closing it. The test suite becomes a map of failures your system should never repeat.

A good example of this is Hermes’s tests/run_agent/ reads like a bug tracker: test_413_compression.py, test_1630_context_overflow_loop.py, test_860_dedup.py. Each file is a specific production failure, diagnosed, fixed, and locked in. test_413_compression.py mocks an OpenAI client returning HTTP 413, then asserts the agent compresses and retries rather than aborting. The whole test runs in milliseconds but encodes an invariant that future refactors can’t quietly break.

Layer 2: End-to-end evals with real model calls. This is where the hard problem lives. You run real prompts through the real pipeline, against real data, and score the outputs. No mocking the model, if there’s a bug in your context management or your tool composition, this is where it shows up.

The cost problem: you can’t run everything on every commit

Layer 2 is where the real cost hits. Running end-to-end evals means making real model calls, the same model your agent runs in production, with the same token budget per task. A suite of 50 eval cases, each taking 5-10 model calls to complete, can burn through thousands of tokens per run. Run that on every commit in CI and on every developer’s machine during local iteration and the bill adds up fast, before you’ve shipped a single feature (been there, done that).

The practical answer is the same one traditional software uses for slow integration tests: don’t run everything everywhere. Layer 1 scaffolding tests are fast and free, run those on every commit and in every dev environment. Layer 2 end-to-end evals are expensive and slow, so they need a different trigger. What we’ve landed on ourselves: Layer 2 runs on PRs that touch the agent’s code surface, e.g. the system prompt, tool definitions, context management, retrieval logic, etc.

And if the change only impacts a specific narrow agent with a specific surface, we try to limit the testing surface to the minimum. A commit that only touches the UI or a background job doesn’t need to burn tokens on a full eval run. The CI pipeline needs to know which files map to which test surfaces, which is a small upfront investment that pays for itself quickly.

The corollary is that your eval suite needs to be stratified. Some cases are fast and cheap enough to run broadly; others like the long-horizon tasks with 10+ tool calls and a 15-minute timeout, should only run pre-release, not on every PR. Tag your test cases by cost and surface, and let the CI configuration decide which tier to run based on what changed.

Defining success: what we actually measure in Baselight

Layer 2 is only as good as your success definition. At Baselight we broke this into three dimensions, each of which required a different measurement approach.

Search quality. The agent’s first job is finding the right dataset for the user’s question. This sounds like it should be easy to measure, either it found the right dataset or it didn’t, but in practice the catalogue is large, datasets overlap in what they cover, there are datasets with similar data, and “right” depends on the user’s intent in ways that aren’t always explicit in the query. We ended up with a catalogQuality scorer: an LLM judge that takes the user’s question, the datasets the agent surfaced, and a reference set of expected datasets, and scores whether the agent’s catalogue search was on target.

This is imperfect. The judge can be wrong. But the alternative, not measuring this at all, is worse, because for us search quality is key to get a great ouput. A good analysis with the wrong dataset (e.g. due to outdated data or from a non-authoritative source) is way worse than a response that admits it couldn’t find relevant data. When you lead with 70 thousand different datasets on various topics with community contributions you operate at a scale where search is a great percentage of the success of the result.

Query success rate. Does the SQL the agent generates actually run without errors, and does it return data? This is the most tractable of the three, it’s nearly deterministic (and one of the value-adds of Baselight, its ability to audit all the side-effects of the agent’s chain-of-thought). We track errorRate (fraction of tool calls that failed) and queryQuality (did the SQL actually answer the question?). The latter is still LLM-judged, but the former is a simple counter. In practice, errorRate going from 2% to 8% is often the first signal that something broke (a schema change, a model regression in SQL generation, a context window issue causing the agent to lose track of which table it was querying, etc.).

Factfulness. Can every claim in the response be traced back to data that actually appeared in the tool call results, rather than being generated from the model’s training weights? This is the hardest to measure. We have a dataQuality scorer that checks whether specific numbers and facts in the response appear in the tool results. It’s imperfect, the model can paraphrase or aggregate in ways that make tracing difficult, but it catches the obvious failure mode: the agent making up a statistic rather than fetching it.

None of these metrics are perfect. The useful ones catch the obvious failures reliably and the subtle ones directionally.

Existing Tooling

Once you have a success definition, you need infrastructure to run evals at scale. A few options worth knowing about.

Evalite is what we built on, a vitest-based framework by Matt Pocock that runs .eval.ts files the same way vitest runs .test.ts files. The model is clean: a data function providing test cases, a task function running the actual pipeline, and a scorers array scoring the output. Results persist to SQLite, a web UI runs at localhost:3006, and the whole thing integrates into a standard CI pipeline. The framework ships with built-in scorers (exactMatch, answerSimilarity, faithfulness, toolCallAccuracy) plus hooks for custom LLM-judge scorers. For us, the key constraint was running evals sequentially (maxConcurrency: 1) against a shared database, and the 15-minute per-test timeout to accommodate long agentic tasks. Off-the-shelf frameworks often assume fast, independent test cases, and that assumption breaks for agents (we may need to rethink currenting testing and CI infrastructure, but as I tend to do lately, that’s a topic for another day).

Braintrust is the most complete commercial option for teams that want offline eval plus production observability in a single platform. It connects the evaluation loop directly to production traces, you can spot a failure in a real user session and turn it into an eval case without leaving the tool. The pricing model is opaque, but for teams that want a “batteries-included” solution without building the infrastructure themselves, it’s the most serious option right now.

Langfuse is the open-source alternative, self-hostable, OpenTelemetry-compatible, with a strong tracing and monitoring story. If you’re already running PostHog for product analytics and want a tool focused specifically on LLM traces, Langfuse fills that gap cleanly. It doesn’t have Braintrust’s integrated eval loop, but it gives you full visibility into what the agent is doing in production.

For Baselight, we already have PostHog LLM Analytics in place with generation tracking, latency, cost, and trace visualisation. That covers the observability layer. What it doesn’t do is run systematic evals or score outputs; that’s what evalite handles. The two tools cover different parts of the problem.

There’s also DeepEval, which has thought carefully about the two-layer structure of agentic systems: the reasoning layer (the LLM’s planning and decision-making) and the action layer (tool execution). Their framework distinguishes between ToolCorrectnessMetric (did the agent select the right tool?) and ArgumentCorrectnessMetric (did it pass the right parameters? “Calling the right tool with wrong arguments is just as problematic as calling the wrong tool entirely.”) That framing is useful even if you’re not using their framework, because it forces you to attribute failures correctly: a bad SQL query might mean the model reasoned incorrectly, or it might mean it called the right tool with the wrong schema.

One thing that I think is key to understand, and the reason why you need a good production observability system is that your production conversations are your best source of new test cases. Every real user session is a data point about what queries your agent actually faces, which tool call sequences it takes, and where it goes wrong in ways your synthetic (or intuition) test cases didn’t anticipate.

This is what we used to grow from a few dozen tests to a more robust harness. It is important to have a systematic way of pulling failures and interesting edge cases from production traces into the eval suite. A query that broke in production, once understood and fixed, should become a regression test. A pattern of queries where the agent consistently takes four tool calls to do something that should take two should become a new scorer. Doing this at scale is a complete pain (and not a solved problem), and for us there’s still a lot of manual work, but we will get there.

This is where the observability layer connects back to the eval layer. PostHog traces tell you what happened. The eval suite tells you whether it was good. The loop between them is what keeps the eval suite from going stale. Without it, you’re testing against the problems you anticipated when you wrote the cases, not the problems your users are actually hitting.

Handling the stochasticity

Back to the problem that stops teams first. If you run the same prompt twice and get different outputs, what does “pass” mean?

Anthropic’s engineering blog on agent evaluation offers the clearest framework I’ve seen. They distinguish between two metrics:

pass@k: at least one of k attempts succeeds. Useful for capability questions, i.e. “can the agent do this at all?”
pass^k: all k attempts succeed. Useful for reliability questions, i.e. “can I deploy this to production and trust it won’t fail 40% of the time?”

These are different questions and they’re often conflated. A 70% per-trial success rate gives you a pass@3 of 97%, the agent almost certainly succeeds if you give it three tries, but pass^3 of 34%, which means it fails all three about a third of the time. Whether that’s acceptable depends entirely on whether your product gives users a retry button or treats the first response as final.

For broad analyses with a large dataset catalogue, we find that larger tasks also reveal model quality differences more clearly. With a narrow, well-specified task, weaker models still complete it most of the time. With a complex analysis requiring multiple tool calls, search across a large catalogue, and synthesised reasoning, the gap between a stronger and weaker model becomes visible quickly, and the degradation compounds step by step.

The practical upshot: run your eval suite 3-5 times per release and look at the distribution, not the point estimate. A 3% mean improvement that’s significant in a t-test is probably real. A 3% improvement on a single run is probably noise. Then the key question here gets back to defining success: can your product actually afford a pass^k lower than a 100%? But that’s a hard one to get. At least for us in Baselight, data analysis can run for several minutes, include large queries, and require several interactions. We can’t afford requiring users to make several attempts to solve their problem.

The open problem: the vibes test

Here’s the honest state of things.

We have layer 1 covered. Layer 2 is running and catching real regressions. But there’s still a stage in our release process that I call the vibes test: someone deploys to staging, runs a handful of queries manually (thank you, Michal :) ), and says “this feels right” or “something’s off.” It’s manual, it’s subjective, and it’s the only thing catching certain regressions that the automated suite misses.

I don’t love this. It doesn’t scale, it’s dependent on having someone with a good intuition for the product (like Michal), and it catches things inconsistently. But I haven’t found a way to fully automate it away for now.

What the vibes test is catching, I think, is the thing that Anthropic’s guidance calls “transcript review”, reading actual agent conversations to build intuition for failure modes. The automated scorers measure specific, pre-defined dimensions. The vibes test catches things outside those dimensions: a response that technically scores well on all five scorers but has a weird hedging pattern that suggests the model didn’t really understand the question. A tool call sequence that’s technically correct but takes twice as many steps as it should. A tone shift that suggests the system prompt got confused somewhere.

Making the vibes test smaller, by surfacing these patterns through automated scoring is still a work in progress for us. The most promising partial answer I’ve found comes from search systems and I read it from a post with this one already drafted.

Mercadona’s (the Spanish chain of supermarkets) deploy pipeline rejects models that degrade any of four metrics beyond -2%, with a one-hour hold before activation that allows a human abort. Mercadona, which processes 4.4 million queries a week through a hybrid search pipeline, maintains what they call a golden set: 500 manually annotated queries with correct answers, kept immutable and never updated from model outputs. The reason: if you refresh your golden set from production data, and that production data reflects what your current model returns, you eventually train the evaluator to rate your model’s style as correct rather than rating actual correctness. The eval contamination problem.

The practical upshot for agent evals: keep a frozen core of carefully curated cases that never gets touched, and add new cases from production to a separate, growing set. The frozen core is your regression baseline. The growing set is where you discover new failure modes. Applied to agent evals: automated scoring against the frozen golden set, with a threshold gate that blocks the release and pages someone when any scorer degrades beyond a defined bound. The person doing the vibes check then has a concrete baseline to objectively identify regressions.

For Baselight we were using a similar approach of having a small number of critical tests as the golden rule, but I think the approach from Mercadona is brilliant. I highly recommend everyone to read this post where they explain how they completely rewrote from scratch their search engine leverage coding agents (it’s in Spanish, but in the age of LLMs languages are no longer a concern).

Where this leaves us

The two-layer structure, scaffolding unit tests with issue-tagged regressions, and end-to-end evals, are the two dimensions for proper testing of agentic products. I think we are going to see more and more systems like this in CI and tooling to help with it surfacing in production.

We can already draw a lot of inspiration from what open-source agents and the research community is doing, but to me, the part that’s hardest to hand down is the success definition for your specific use cases (coming up with one for Baselight that we thought suited well the use case was a heated discussion).

I think we are all still learning, and vibe-testing your agents we stick around as practice for a while. According to benchmarks, Claude Opus 4.7 surpasses the capabilities of Opus 4.6, but “the vibes” from the community haven’t been so positive, to the point that many are defaulting to 4.6 still.

And with this I close this “improvised” three-part series on agent engineering. Next week I want to write about what all of this means for the engineers building these systems, and the future of our discipline because I really need to put some order to all of the ideas around the topic that I’ve been having lately, but that’ll have to wait

PS: I also l wanted to end this post with a special thank you note to Jonathan Tavares and the team at Singular for their invaluable support building Baselight AI and its evaluation harness (they definitely did all of the heavy-lifting here). Thank you!

Until next week!

@adlrocha - The Model is still not the Product

adlrocha — Sun, 26 Apr 2026 07:45:29 GMT

Last week’s post on the Claude Code leak surfaced something I keep coming back to: the most interesting engineering in a production AI agent isn’t the model, it’s everything around it. How context gets managed across a long session, how the memory is structured so the right details are brought to context when needed, how tools are designed to be composable and provide relevant functionality. As discussed last week, there’s a lot of engineering around the model to make it shine.

At first, I thought that all of these patterns and techniques were hidden behind Claude Code’s closed source code, but then I started asking myself (as I briefly alluded in my last post), there’s already a lot of interesting open-source projects that I use on a daily basis implementing agent loops, what are this open-source equivalents doing compared to Claude Code? The Claude Code source showed us how Anthropic solved these problems, but how do Hermes Agent, Pi, and Opencode approach these same problems?

I have this hypothesis that everyone is converging to similar solutions for the kind of problems we are currently facing when dealing with LLMs, but is this true? Can we start leveling up certain common patterns into “best practices”? To scratch this itch I spent the week digging into the source code of a few popular agentic projects (with obvious help of my bedside agents and LLMs :) ). I focused on projects that I liked and I used daily (so I also had some intuition of the features and how the techniques under the hood felt when used).

I’ll mainly focus on Nous Research’s Hermes (100,000 stars, #1 trending this month), pi from Mario Zechner (to me the most elegant, simplest, most efficient harness that I’ve read). I also dug into opencode from Anomaly, but didn’t see anything worth mentioning there. All of them are tackling some of the fundamental problems we all face in agent engineering (context, memory, skills, subagents) and each makes different bets. Some of those bets are worth borrowing regardless of which framework you use.

This is what I found.

Hermes: the architecture

I am going to use as the base for my analysis the codebase of Hermes, as its core agent loop is a bit more complicated than the rest. Hermes has taken the Internet by storm lately (at least in the bubble that I like to hang out in), and I feel they approached certain things in a different (and nice) way. Before talking about specific techniques, it helps to understand how Hermes is organised.

The codebase splits cleanly into six layers:

prompt_builder.py is stateless, a pure function that assembles the system prompt from pieces: identity (or SOUL.md if the user has one), memory guidance, skills index, context files .hermes.md → AGENTS.md → CLAUDE.md → .cursorrules, first match wins), and model-specific steering blocks.
run_agent.py is the main loop, it calls the prompt builder, runs the LLM, dispatches tool calls. context_compressor.py is a fully self-contained class with its own LLM client, decoupled from the main agent. Compression is triggered by the loop but runs independently.
memory_manager.py coordinates one built-in memory provider and at most one external plugin; the cap is deliberate, to prevent tool schema bloat.
skill_manager_tool.py handles everything skill-related: creation, editing, patching, deletion.
delegate_tool.py manages subagents.

On top of all of this sits a gateway layer, a separate process that routes Telegram, Discord, Slack, WhatsApp, Signal, and email messages into the same agent loop.

When you read agent codebases that feel fragile, it’s usually because everything is tangled together in one giant run loop (been there, done that). Hermes keeps compression, memory, skills, and subagents as distinct modules with clear interfaces. Each can be understood, tested, and replaced without touching the others, leaving the model to orchestrate them as needed.

The self-authoring skill system

The part everyone talks about is Hermes’ self-writing skills. Now it has become common knowledge, but when it was first announced I thought it was a pretty smart way of implementing some kind of continuous learning into the agent. It was a way around the hundred of thousand lines of code that OpoenClaw had to implement to handle the integration of external systems.

Here’s what it actually looks like in code.The trigger is a single instruction in the system prompt. The exact text from prompt_builder.py:

“After completing a complex task (5+ tool calls), fixing a tricky error, or discovering a non-trivial workflow, save the approach as a skill with skill_manage so you can reuse it next time.”

No background daemon. No scheduler. The threshold is five tool calls, and the model decides when it has crossed it. When it does, skill_manager_tool.py runs a careful creation pipeline:

name validation → 
frontmatter validation (YAML with name and description required) → 
size check (max 100,000 chars) → 
name collision check → directory creation → 
atomic write (temp file, then os.replace()) → 
security scan

With full rollback if anything fails.

The skill index is injected into every system prompt with a two-layer cache: an in-process LRU dict keyed by skills directory, toolset, platform, and disabled list, backed by a disk snapshot validated by mtime/size manifest for cold starts. When a new skill is created, the cache is immediately cleared so the next prompt rebuild sees it.

When loading skills, the header reads: “Err on the side of loading, it is always better to have context you don’t need than to miss critical steps, pitfalls, or established workflows.” And after difficult tasks: “If a skill you loaded was missing steps, had wrong commands, or needed pitfalls you discovered, update it before finishing.”

That last instruction I think is key for this system to actually work and scale. The agent doesn’t just create skills, it’s expected to maintain them. The patch action in skill_manage uses a fuzzy match engine (the same one used for file edits) that handles whitespace differences, indentation variance, and block-anchor matching. The agent can self-correct a skill it wrote three sessions ago without needing an exact string match. That’s beautiful, because it also solves potential breaking changes into the tools the skill interacts with.

Hermes ships a few hundred SKILL.md files by default: for GitHub PR workflows, Obsidian, Linear, Google Workspace, MLOps, home automation. The long tail of “somebody already figured this out.” It felt a bit bloated to me when I first installed it, but with this the agent can write its own code and maintain it when it needs one of these features instead of having to figure out something from scratch, or having to explicitly ship code that performs that logic.

The learning from this implementation? Something that already alluded to a few weeks back in this post. Instead of writing code with the specific logic required for some task, we can encode procedural knowledge as first-class, versioned, agent-editable artefacts. Skills that the agent itself can create, patch, and deprecate based on what it learns at runtime. We are making software adaptable and fungible (as we will discuss in future posts).

This feature has a clear trade-off that I’ve experienced myself when writing a similar feature for my agents’ harness: skill explosion. Without some system that identifies skills that are semantically equivalent, you run the risk of having several skills written by the agent that do the same thing. For this feature to work at scale, you need some kind of garbage collection process that does semantic deduplication and purges deprecated skills (again, from personal experience).

Some of these design decisions also feel quite inefficient in the use of tokens and the model’s underlying context, but I have to admit that I haven’t had the time to properly benchmark this (I run Hermes against a local model, so I am not that worried about token costs). I’ll let others that have gone through that exercise add to this.

Compressing context without losing the plot

As a conversation grows, tool outputs, file reads, and back-and-forth exchanges pile up until you hit the model’s token limit. When that happens, the agent would normally just fail or lose history (and from my experience this gets really nasty with local models).

Context_compressor.py prevents that by shrinking the middle of the conversation, preserving the system prompt and recent work (the “head and tail”), and replacing everything in between with a structured summary that captures what matters: the current task, decisions made, files touched, what’s blocked, what’s still pending. In short: it’s the mechanism that lets Hermes run indefinitely across a long task without losing its memory of what happened, while keeping the active context small and focused.

It runs in four phases.

Phase 1 is a cheap pre-pass with no LLM call. Old tool outputs are replaced with one-line summaries: [terminal] ran npm test → exit 0, 47 lines. Identical tool results from repeated reads of the same file are deduplicated. Tool arguments are truncated, but JSON-aware: the compressor parses the argument JSON, shrinks long string values to 200 characters, and re-serialises. An earlier implementation just sliced the raw string, which produced broken JSON and caused provider-specific 400 errors. The JSON-aware version came from running into that failure in production. You’ve probably seen this summaries in Hermes’ outputs depending on the level of verbosity that you have configured in your agent

Phase 2 is boundary detection. The head (system prompt + first few exchanges) and a token-budget tail (around 20% of context, containing the most recent work) are protected. Everything in between is the compression target. The boundary aligns to avoid splitting tool call/result pairs, a split there would leave orphaned IDs that break downstream execution.

Phase 3 is the LLM summarisation with a structured template: Active Task, Goal, Constraints, Completed Actions, Active State, In Progress, Blocked, Key Decisions, Resolved Questions, Pending User Asks, Relevant Files, Remaining Work, Critical Context. Not a freeform summary. A structured handoff document.

Phase 4 re-inserts the summary and cleans up orphaned pairs. On subsequent compression, the system does iterative updates, patching the existing summary rather than re-summarising from scratch. There’s an anti-thrashing guard: if the last two passes each save less than 10% of tokens, compression is skipped.

Compaction techniques are a topic in themself that I’ll cover in future posts (as always, drop me an email if you want me to prioritise it). In the meantime, I’ll leave you here with a nice write-up on the topic from the factory.ai team

Subagent isolation

delegate_tool.py is the piece responsible for isolating agents so they don’t pollute each other when they are doing their thing. That way the main agent’s loop can trigger isolated research agents in the background that will (eventually) provide an output result to the parent

Each child agent gets a fresh conversation with no parent history, its own terminal session, and a toolset that is the intersection of the parent’s enabled tools and the explicitly requested tools. Children cannot gain capabilities the parent doesn’t have. skip_memory=True, skip_context_files=True, i.e. children start clean. The toolset that is hardcoded-blocked for children: delegate_task (no recursive delegation beyond depth 2), clarify (children cannot ask the user questions), memory (no writes to the shared MEMORY.md), send_message (no cross-platform side effects), execute_code.

The recursion depth is hardcoded at MAX_DEPTH = 2 deliberately to keep the execution graph tractable.

Parallel batch mode uses a ThreadPoolExecutor with a configurable cap (default 3). So we have a structured concurrency architecture. The interrupt propagation is explicit: if the parent is interrupted, the executor stops waiting and collects whatever children have finished. The tool name global is saved before child construction (which overwrites a process-global variable) and restored after. A clean isolation detail that prevents a subtle bug where the parent’s available tools appear changed after delegation returns.

The heartbeat is the detail I wasn’t expecting. A daemon thread runs every 30 seconds, touching the parent agent’s last-activity timestamp so the gateway’s inactivity timeout doesn’t fire while a subagent is doing real work. The parent appears idle during delegation, no messages, no tool calls — and without the heartbeat, long-running subagents would get their sessions killed.

Many of you may have realised by now, that a lot of the techniques for parallel programming are directly applicable to parallel agents architectures (structured concurrency, process isolation, etc.).

Model-aware prompt steering

One thing Hermes does that I haven’t seen written about elsewhere but that I guess we are all doing (at least we use a similar approach for the multi-model support of Baselight AI): it serves different system prompt instructions depending on which model is being used.

GPT and Codex models get XML blocks called , , , , , and . These are explicit behaviour constraints: “never answer arithmetic from memory, always use a tool,” “if a question has an obvious default interpretation, act on it rather than asking.“ Gemini and Gemma models get different guidance: absolute paths always, dependency checks before importing, parallel tool calls where possible, non-interactive CLI flags. GPT-5 and Codex get the developer role instead of system, because newer OpenAI models give stronger instruction-following weight to that role.

Unlike Claude Code, where all the engineering behind it was adapted to Anthropic’s models, these multi-model tools need to adapt their operation to the models supported under the hood if they want it to perform well with all of them.

The code comment on this section: “Inspired by patterns from OpenAI’s GPT-5.4 prompting guide & OpenClaw PR #38953.” The framework is actively borrowing from other codebases and naming the source. This is a field converging on shared solutions in public, in code.

If you’ve tried to implement a multi-model agent this may be obvious to you, but for those of you that haven’t, the same agent instruction doesn’t work equally well across all models. If you’re building a multi-model agent, the steering layer needs to be model-aware if you want good performance with all of them. What reads as “obvious” to Claude requires an explicit XML constraint for GPT.

Memory: declarative over imperative

One design decision in prompt_builder.py that seems small but has large runtime consequences is the memory guidance:

“Write memories as declarative facts, not instructions to yourself. ‘User prefers concise responses’ ✓ — ‘Always respond concisely’ ✗. ‘Project uses pytest with xdist’ ✓ — ‘Run tests with pytest -n 4’ ✗. Imperative phrasing gets re-read as a directive in later sessions and can cause repeated work or override the user’s current request. Procedures and workflows belong in skills, not memory.”

The distinction prevents a real class of bugs. An imperative memory written in session one (”always run tests before committing”) gets re-read as a live instruction in session five and overrides what the user actually asked for. Declarative facts don’t have this problem because they describe state rather than commanding behaviour.

What we can learn from this is that one should always separate their agent’s persistent knowledge (declarative) from its persistent procedures (skills). Conflating them produces agents that become increasingly hard to steer as their memories accumulate.

pi: the extension event bus

As I mentioned above, to me pi is one of the simplest and most elegant implementations of an agent loop that I’ve seen. pi’s bet is that agent frameworks shouldn’t bake features into the core loop, they should expose lifecycle events and let extensions intercept them. The ExtensionAPI exposes 30+ typed events. I have to admit that my distributed systems background (and how much I love configuring my agents to my personal workflows) loved this. This is why it has become my go-to coding agent since a few weeks ago.

To really understand how powerful this is, you just have to follow a prompt through the system and watch the events that it fires and the side-effects triggered. Take memory management, for example. When a conversation gets too long, pi needs to compress it. But right before that happens, the session_before_compact hook fires. With a few lines of TypeScript, you can intercept this event and tell the framework to swap out your expensive, primary model for a cheaper, faster one just for the summarisation step. If anything fails, it gracefully falls back to the default. The framework didn’t need to build a “custom summarisation model” feature; it just provided the hook, letting you implement the policy.

That same philosophy applies to safety and context. Before the AI executes any command, the tool_call event fires, pausing time with a mutable input. This is how you build a bouncer for your terminal: if the AI tries to run a destructive command like rm -rf, your extension can read the command, block the execution, and throw up a confirmation dialogue. Because the event is mutable, earlier handlers can even patch arguments before later ones see them.

Beyond these hooks, pi extends its capabilities using Skills, which follow the open agentskills.io standard. These are formatted as XML in the system prompt, pinpointed by an absolute file path. But the smartest part of the skills system is its governance. Powerful skills usually eat up a massive amount of context, which can bog down your agent. pi solves this with a simple disable-model-invocation flag. This hides the skill from the prompt entirely, keeping your AI lightweight until you explicitly summon that specific tool by typing /skill:name. So if you don’t want to increase the capabilities of your agent but you don’t want to implement your own logic triggered by hooks, you can always create your own skills.

Ultimately, the overarching pattern here is what makes the project so elegant to me: hook-based extension over baked-in features. It keeps the core system incredibly small and fast (i.e. easy to read and understand) while leaving the surface area for customisation wide open. There is, of course, a trade-off. To wield this kind of power, you may need to understand the internals, the specific events triggered by the event loop, and configure it to be in-par in terms of features with other coding agents like OpenCode or agent harnesses like Hermes.

But hey, to me it has been a charm to use and tinker with. Even its defaults are already yielding great results for me. Just give the pi-agent-core README a quick read if you want to immediately fall in love with this project :) (see image below).

Where this is heading

After reading the source code for all of these projects, a few things stand out to me. First, I’ve reinforced my idea that one way or another, we are all facing the same kind of agents, and with different approaches and trade-offs (in many cases depending on the use case) we are all arriving at similar solutions.

Even more, the shared substrate is also converging. The SKILL.md format, the agentskills.io open standard that pi implements, the XML blocks for GPT, the OpenAI interface that has become the de-facto API for models, these are becoming common across frameworks. A skill written for Hermes can be adapted for opencode or pi with minor changes. As the ecosystem matures there won’t be a single framework winning, but shared primitives that work across all of them. A lot of the underlying pieces for agent engineering are slowly being standardised.

Which links with another point I’ve been making for a while: that agents will make code and apps obsolete as we know them. A few markdown files (potentially in the form of skills and memory files), an agentic loop, an LLM model as the reasoning machine a.k.a processor, and a knowledge base (in the form of external integrations, memory, context, etc.) enable the creation of software that adapts to specific contexts and use cases.

There’s a project called Matrix OS that takes this trajectory to its logical conclusion. The framing: “The LLM is the CPU. The context window is RAM. Files are files.” The agent manages the filesystem, spawns sub-processes, and creates new capabilities by writing new agent definition files and tools at runtime. It can modify itself. All state persists as files, synced peer-to-peer via git. I feel this is the high-level architecture that we are all converging to, and the platform targeted by the “new software engineering” (more on this in a few weeks).

I would place Hermes, pi, and opencode into different categories of agents (while close I don’t think they address the same use case). But they all follow a similar direction, and they could all be adapted and configured to solve a lot of different use cases. We are moving into a world of “fungible software” (man, my backlog is exploding with ideas that I want to share in future posts…).

What does all of this mean for software engineering as a profession? That’s the question I want to address in the coming weeks. Specifically what happens to the people building software, not just the people building agents. My take right now: software engineering is changing A LOT (as we are all feeling), but I don’t think the jobs are going anywhere in the short-term.

It will involve a massive reskilling though. Engineers who understand these systems at the level we’ve been discussing this week are building leverage that compounds. Skill authoring, context governance, subagent isolation, hook-based extension, behavioural prompt engineering, are disciplines that will outlive any specific framework.

If you’re already building agents and have been thinking about how engineering patterns are evolving, I’d like to know what I missed. If you want to contribute to this discussion please drop me an email, leave a comment, or reach out directly.

Until next week!

@adlrocha - A glimpse of the new software engineering

adlrocha — Sun, 19 Apr 2026 08:01:57 GMT

The dust has finally settled. With the preview of Mythos, the announcement of Project Glasswing, Claude reportedly becoming dumber, and the release of Claude Opus 4.7, the Claude Code leak from a few weeks ago has been somewhat forgotten. So now I can pull this from my backlog and share my own analysis on the topic away from the initial noise.

How it all happened

On 30 March 2026, someone at Anthropic made a packaging mistake. Claude Code version 2.1.88 shipped with a 59.8 MB JavaScript source map, a file that, by design, points back to the original unobfuscated source code. The result: roughly 512,000 lines of TypeScript, sitting on npm, the full internals of Claude Code’s CLI exposed for anyone who knew to look.

Anthropic confirmed it quickly, and admitted it was a human error in the release pipeline, not a breach. They began issuing DMCA takedowns the same day, but the internet had already moved, mirrors appeared on GitHub within hours, and researchers were pulling it apart in real time. I myself managed to get my hands on one of these early leaks. But just a few hours later they were all gone.

However, the community response was crazy! I was following it live and I couldn’t believe it, it really wouldn’t have occurred to me in the heat of the moment. A group called UltraWorkers kicked off a clean-room Rust rewrite, claw-code, built not by hand, but using Claude and OpenClaw. AI agents reading and porting the leaked TypeScript into new languages, escaping the DMCA takedowns (I also came across one in Python). The claw-code repo hit 100K stars faster than any project in GitHub history (probably bots from the fake stars economy? Still funny and smart in any case).

So: the most advanced coding agent’s source code was leaked, and the internet responded by using other AI agents to rewrite it from scratch to prevent all that beautiful knowledge from being lost. It was just brilliant. And honestly, I can’t blame them, as any other regular Claude Code user, who hasn’t been curious about its inner workings? I’ve mentioned it a few times in this newsletter: one of the things I love the most about open-source is the ability to read a project’s source code in order to understand how all its cool features are actually implemented. I like seeing how the sausage is done.

I am sorry for Anthropic, but this leak was a dream for me :)

What the code actually showed

What everyone was expecting was for the leak to confirm what everyone believed, that Claude Code is a thin TypeScript wrapper routing your prompts to a very good model, and that the intelligence lives entirely in the model. If that were true, there’d be nothing interesting here (and I would be done with the post :) ).

But the source code told a different story (see? I knew there was some magic hidden behind that code, my craving to read the source code was in the end justified. My spidey-sense was right).

What was hidden was a piece of serious systems engineering: memory management, compaction hierarchies, structured tooling, anti-distillation countermeasures, background daemons that run while you sleep, etc.. Not in service of the model, but working around its limitations. The model runs inside this system the same way a processor runs inside an operating system and it adapts itself to the underlying architecture and ISA of the processor. The model is not the product. The scaffolding is. And I think this is one of the biggest learnings after reading this code base: there’s a new piece in the stack in the form of raw intelligence, but this raw intelligence in itself won’t be able to solve every problem (like coding) there will still be a lot of engineering involved in building great products.

Let me walk through the parts I found most interesting, and explain why I think they point to something bigger about how engineering itself is changing.

Context management

Anyone who has used a language model for a long session has felt the degradation: the context window is the enemy. As conversations grow, models lose track of earlier decisions, repeat themselves, contradict prior reasoning. You could clearly feel this in early versions of Claude Code, where a few prompts in, it would’ve completely forgotten about its CLAUDE.md. The naive fix is to build bigger context windows into the model. But of course, this is not an easy task, and running a model with a big context window has the corresponding impact on infrastructure requirements. The same way that we don’t have infinite cache in processors, I don’t think we will ever get infinite context windows in models.

To work around this models’ context window limitation, the codebase implements a four-layer compaction hierarchy. First, proactive compaction monitors token counts and quietly summarises older messages before they reach the API. If that estimate misses, reactive compaction catches the overflow and retries. Automated headless sessions get snip compaction, which truncates at defined boundaries (there’s an interesting related point here, Claude Code is instructed about when the user is “looking” or when it is a headless session). There’s a fourth layer, context collapse, which compresses verbose tool results mid-conversation, storing a reference on disk instead of keeping the full output in context, with selective retrieval when needed. You’ve probably seen that “compaction” stage being triggered every now and then on your long sessions.

Claude Code has multiple silent mini-compaction events before global compaction kicks in. And here’s the secret why long sessions feel so good (and long) compared to the degradation felt with raw interactions with models. The compaction is invisible. The context window feels larger than it is because the system is constantly curating what the model actually reads.

Then we have the longer-term memory layer called KAIROS, referenced in the code as “Dream Mode.” After 24 hours of inactivity and at least five prior sessions, a background daemon wakes up, reviews the agent’s memory files, prunes contradictions, consolidates learnings, and rewrites the index so future sessions load fast. The system prompt for that subagent reads: “You are performing a dream, a reflective pass over your memory files. Synthesise what you have learned recently into durable, well-organised memories so that future sessions can orient quickly.”

Sleep-based memory consolidation, but in software. Whether that’s a deliberate nod to neuroscience or just a good metaphor, the architectural decision underneath it is genuinely clever: instead of keeping everything and hoping the model copes, the system periodically decides what’s worth keeping. An LLM-orchestrated garbage collection for knowledge (funnily, I just built for my personal agent system a cron to perform something exactly like this, we are all converging into the same ideas).

Tooling approach

Give an AI agent raw terminal access and two things happen. The first is a security problem: any text the agent reads (a file, a log, a comment in someone else’s code) could contain instructions crafted to hijack what the agent does next. That’s prompt injection, and an unrestricted shell is the widest possible surface for it. The second is a context problem: bash output is an unstructured blob. It lands in the model’s context window as-is, verbose and uncompressed, eating tokens that could go toward something useful.

Anthropic’s answer was to give Claude Code no raw shell access at all for code navigation but instead use dedicated Grep and Glob tools that return typed, structured results with defined permissions per operation. Output that the compaction layer knows how to summarise. Read operations run in parallel; write operations execute serially, and index files only update after a confirmed successful write. The logic for these tools is encoded in the architecture, not left to the model to figure out (which is the case for many of the regular GPT wrappers).

The second piece is the LSP integration. I think all my readers are aware of what LSP stands for, but for those that aren’t, the Language Server Protocol is the piece of software that powers “go to definition” and “find all references” in your IDE for your language of choice. When Claude Code navigates a codebase through LSP, it isn’t reading files as static text and trying to infer structure from what it sees. It has a live symbol graph: it can ask “what calls this function?”, “where is this type defined?”, “what are all the references to this variable?” and get back precise, structured answers. That’s a fundamentally different mode of understanding code than pattern-matching over raw file contents. It does what you would do when navigating a new code base on your IDE.

Sebastian Raschka puts it well: the gap between Claude Code and a chat UI with uploaded files isn’t the model, it’s the tooling. The chat UI sees your code the way a reader sees a printout. Claude Code navigates it the way a compiler does.

The performance layer

The terminal UI is built with React and Ink, which surprised a lot of people when it surfaced. Most CLI applications reach for ratatui or ncurses. The tradeoff is a higher memory footprint (which explains why Claude Code’s memory can easily be 8GB large for me in certain sessions), but it lets the same team share logic with the web interface, and React’s reconciliation model keeps terminal redraws minimal.

To compensate for the overhead, the implementation uses Int32Array-backed ASCII pools. During token streaming, every character that gets rendered needs its display width measured, normally a string operation that allocates a new object each time. By pre-allocating a fixed typed array and reusing it, the allocations vanish and the garbage collector has nothing to clean up. The result is roughly a 50× reduction in stringWidth calls and noticeably smoother streaming at high token rates.

More interesting: what happens while you’re still typing. Claude Code pre-computes likely responses during user input, for simple confirmations like “yes”, processing has already started before you hit enter. A boundary marker in the system prompt (__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__) separates static from dynamic content, caching a block of shared instructions globally so they don’t need reprocessing each turn. The tool list is sorted alphabetically before every API call, not for readability, but to stabilise the KV cache key, so the model can skip the expensive prefill phase when nothing has changed.

Another point that surprised many people when reading the codebase. Anthropic employees get a distinct set of instructions that only activate when the tool detects an internal user. A few that didn’t surprised me that they had to encode: “Never claim ‘all tests pass’ when output shows failures” and “Keep text between tool calls to ≤25 words.” The fact that Anthropic had to encode these explicitly says something about how even the team at Anthropic are using these prompt engineering tricks that we all extensively use on our own CLAUDE.mds. It makes me feel a bit dirty about my way of doing agentic coding :)

The parts Anthropic wasn’t advertising

Two features that surfaced from the code deserve special mention, because they reveal something about how Anthropic thinks about competitive risk.

The first is undercover mode. It activates automatically whenever Claude Code operates in a public or open-source repository, stripping all references to internal model codenames, unreleased version numbers, internal Slack channels, even the phrase “Claude Code” itself from commit messages and pull request prompts. The code includes a comment that’s blunt about its design: “There is NO force-OFF.” No override, no flag. The rationale becomes obvious immediately: Anthropic engineers use Claude Code on internal repos daily. Without this, the model could easily drop “Capybara v8” into a public commit message without anyone noticing.

The second is anti-distillation. They have two mechanisms to prevent distillation: fake tool injection, where decoy tool definitions get mixed into API requests to poison the training data of any competitor recording Anthropic’s API traffic; and connector-text summarisation, where the server buffers the model’s chain-of-thought between tool calls, returns only a cryptographically signed summary, and never exposes the full reasoning externally. Competitors who record the traffic get sanitised outputs. The real reasoning steps never leave.

And then, one of my favorite low-tech but funny features: frustration detection. A regex pattern, not a model inference call, that identifies when a user is expressing frustration through profanity or complaints (in case you are wondering, yes, it is triggered every time you use the word “fuck”). Simple by design. Fast, cheap, and incapable of hallucinating.

Model internal codenames

The source map also exposed internal model names Anthropic hadn’t made public. Capybara is Mythos, already at version 8 internally, with code comments documenting known issues: over-commenting, overconfidence in claims. 1M context window and a fast mode. Numbat appears in a comment tagged “@[MODEL LAUNCH]: Remove this section when we launch numbat.” Fennec appears to be Opus 4.6.

What all these codenames imply is more interesting. Anthropic is several model generations ahead of what they’ve released. Mythos, the model they previewed this month, has been in internal development long enough to reach its eighth major iteration. The gap between an internal v8 and a public announcement is longer than it looks from the outside.

The model is just the processor

Let’s now jump into the subjective part of this post. For the past two years, the dominant assumption has been that the model is the product and everything else is plumbing. Many companies have been criticised for just being “a GPT wrapper”. You use Claude, GPT-4, or Gemini, and the gap in quality between tools built on those models comes down to prompting. Better prompts, better results.

To me, the Claude Code codebase challenges that directly and it aligns with my experience building Baselight AI. Raschka put it plainly: “If we were to drop in other models, say DeepSeek, MiniMax, or Kimi, and optimise this scaffolding a bit for these models, we would also have very strong coding performance.” The scaffolding is transferable and the model is just a component (the processor) in the product.

I don’t know if you’ve tried to use some other model with Claude Code. I did, and after some use you can clearly see how Claude Code has been fine-tuned for Anthropic’s models. Apart from all of the KV cache invalidation for external providers tricks with to make it feel slower with other models, the experience just doesn’t feel the same, even when comparing a Haiku, for instance, with a superior model from some other provider.

Poetiq proved the same point from a different angle in March. They didn’t train a new model. They took Gemini 3 Pro and wrapped it in a recursive orchestration layer: puzzle decomposition, program generation, automated failure analysis, self-auditing termination logic. The result beat Gemini 3 Deep Think on ARC-AGI-2 at less than half the cost. Same underlying model, better result, cheaper to run. The scaffolding was the product.

So it turns out that is not all reduced to just better prompting as we thought in the early LLM days, there’s a lot of engineering involved in making products where these models can excel. There’s compaction hierarchies, memory daemons, structured tool execution graphs, latency-hiding caching layers. What got leaked isn’t really just a source code, at least for me, it’s an unintentional field manual for how serious agent engineering is actually done (as I’ve said before, no more feeling dirty and like I am monkey patching things that should work because I am just bad at prompting).

We are all facing the same kind of problems when building agents, and we are all seemingly converging to the same kind of solutions. While building Baselight AI, we sometimes had this feeling of “is this the right solution? Is this over-engineering? Is there a better way?”. There are still no engineering patterns or good practices when building upon these models that one can use. This leak has confirmed to me many of the patterns that felt right, while surfacing new ones that I wasn’t aware of.

Source: https://matrix-os.com/technical

Another source code worth reading if you want to start seeing some of the engineering patterns surfacing around agent engineering is the Hermes Agent code base. I’ll write another post analysis this one in detail, but they have cool features like automatic skill generation for “continuous learning” or a graph-based memory for efficient recall (drop me a message if you want me to work on this post already for next week).

I feel engineers will study these patterns in the future (the same way we currently study best engineering practices in OOP or TDD). The Claude Code codebase describes a way of thinking about LLMs as components in a larger system, not oracles you talk to and should be able to solve every problem.

This is the new way engineering products will be implemented. With AI and LLMs as a primitive, a processor in a larger architecture that we, as engineers, have to design. With or without AI-assisted coding, that’s on you. But a new way of doing engineering, with a higher level of abstractions, and a new primitive in the stack is clearly arising.

Why I find this exciting

I don’t know about you, but I am having more fun as a software engineer right now than at any point in my career.

Not because the problems got easier, I actually see myself thinking more deeply about architecture and the right way of solving problems. But the boring parts of the job are going away: the boilerplate, the mechanical bug-chasing, the fourth run of the same test suite to confirm what you already know. All of this is getting automated. What’s left is the part that was always interesting: thinking in systems, deciding how components should talk to each other, understanding what problem you’re actually solving before you write a line, determining how to wrap all the cool tech in the right product wrapper, etc.

What the four-layer compaction systems, sleep-based memory consolidation, KV cache stabilisation through sorted tool lists from the Claude Code source code has taught me is a glimpse of the kind of techniques and patterns we may start applying more and more on our implementations. These are definitely ideas you could have picked up from prompt engineering tutorials. They’re software engineering problems applied to a new class of components: and the engineers who understand them deeply will build things that others simply cannot.

So in case you are wondering, no, I don’t think software engineering and computer science is going anywhere. They will for sure be redefined. The role will change and what we do too, but I am of the opinion that generalist engineers will still find a good place in the market (I’ll probably also write more about my thesis here in future posts).

The models need operating systems. Someone has to write them. We’ll just need to be adaptable to complement the skills, and surface the right patterns and designs that let us excel in this new environment.

My piece of unsolicited advice: learning how these agents work under the hood is a skill worth building right now. It may be short-lived depending on how the field evolves, but at least it’ll let you build the foundation for what comes next. I personally have thought more about how these systems work and operate while building Baselight AI than when I was writing by hand LSTM networks (because, yes, at some point before transformers, I was actually doing this for a living).

Any other patterns or things that I may have missed from the source code? Please let me know, I would love you to add your contribution as an edit (or a note) to the post. Thank you for reading me and until next week!

@adlrocha - How the "AI Loser" may end up winning

adlrocha — Sun, 12 Apr 2026 08:12:05 GMT

A few weeks ago I wrote about how I thought intelligence is becoming a commodity. The idea is quite straightforward, and widespread now: when everyone races to build the best model, the models get better, but so does every other model eventually. Every dollar spent on a bigger training run makes the previous one cheaper. The distance between frontier, second-best, and open-source alternatives is collapsing fast (actually Gemma4, Kimi K2.5 and GLM 5.1 are becoming my bedside models these days). Even more, as models become better, the unit of intelligence that can be deployed in local hardware with lower hardware capabilities increases significantly.

The irony of this situation is that this commoditisation of intelligence is benefiting the company that everyone was framing as the “AI loser”: Apple

The company that “lost”

There’s a version of the last three years where Apple genuinely failed at AI. They had Siri before anyone had a serious voice assistant, and then watched how ChatGPT ate their lunch already since their first release (even before they had introduced their native voice interaction). Apple didn’t have a flagship frontier (or even a vanity open-source) model, no $500B compute commitment with the usual suspects. Meanwhile, the rest of the AI labs and big tech companies were racing to win the next state-of-the-art benchmark by burning bags of cash.

What this also meant is that while these companies were burning money at a rate that would make a sovereign wealth fund uncomfortable, Apple was (and still is) sitting in a pile of undeployed cash (to the point of even increasing their stock buybacks) giving them optionality.

To me, OpenAI is the most paradigmatic example of this “infinite money burning machine”. OpenAI raised at a $300B valuation and then shut down Sora, the video product they’d been positioning as a creative industry flagship, because it was running at roughly $15M a day in costs against $2.1M in daily revenue. Disney had already signed a three-year licensing deal for Sora to generate content from Marvel, Pixar, and Star Wars characters. They were finalising a $1B equity stake in OpenAI. When Sora died, so did the billion. A $1B investment evaporated, because the product it was staked on couldn’t pay for itself (reducing their buffer that accommodates their daily burn).

On the infrastructure side: OpenAI signed non-binding letters of intent with Samsung and SK Hynix for up to 900,000 DRAM wafers per month, roughly 40% of global output. These were of course non-binding. Micron, reading the demand signal, shut down its 29-year-old Crucial consumer memory brand to redirect all capacity toward AI customers. Then Stargate Texas was cancelled, OpenAI and Oracle couldn’t agree terms, and the demand that had justified Micron’s entire strategic pivot simply vanished. Micron’s stock crashed.

I don’t know about you, but I don’t see these behaviours as those of someone that is winning the AI race, independently of how good their models do in benchmarks, and how much they are burning in infrastructure. A small miscalculation in the expected revenue, and you are out of the game (I am actually of the opinion that without some kind of bailout, OpenAI could be bankrupt in the next 18-24 months, but I am horrible at predictions).

From intelligence to capabilities

My sense is that the labs’ bet was always that raw model capability, i.e. intelligence, along with the infrastructure required to run them would stay scarce. Those who manage to secure the best model and the infrastructure to run it at scale would get the best moat. But I am afraid that having the best model in itself may not be enough moving forward. Less capable models are becoming as capable as previous versions of the frontier models.

The best recent example I can think of is Gemma 4, Google’s open-weight model. It was built to run on a phone, scores 85.2% on MMLU Pro and matches Claude Sonnet 4.5 Thinking on the Arena leaderboard. 2 million downloads in its first week. Models that would have been state-of-the-art eighteen months ago now run on a laptop, and they get better every quarter.

If you haven’t tried Gemma4 yourself I highly recommend it. I am running it on my AMD Ryzen AI Max+, and its performance in terms of tokens per second and intelligence are so good that I have already migrated some of my personal tools to use this model as the backend without visibly impacting their output. This trend can really change in the next few months way we access intelligence.

I feel that some of the labs see this coming. Anthropic has been particularly aggressive about it and they are releasing new (actually useful) tools every day that work like a charm with their models in order to lock users into their ecosystem. Claude Code for developers, Claude Cowork for teams, the recent Claude Managed Sessions to orchestrate agents, all designed to put Claude inside workflows people are already in.

The logic behind it: if the model itself won’t hold the moat, capture the usage layer and make switching painful. I think this is brilliant, and seeing how much Anthropic is growing in number of users and revenue, it seems to be paying off. The economics of their plans are still rough, though. One analysis found a max-plan subscriber consuming $27,000 worth of compute with their 200$ Max subscription. The labs are subsidising the demand they’re chasing, which justifies their level of burn (let’s see for how long they can afford these subsidies).

Apple, by contrast, has spent almost nothing on AI infrastructure and subsidising users’ token burn. And this may be giving them more optionality and leverage than any of the other companies that jumped heads first into the AI race.

Context is all you need

In that earlier post, I argued that if intelligence becomes abundant, context becomes the scarce resource. A model that can reason about anything but knows nothing about you or the environment it operates in is a generic tool. What makes AI genuinely useful day-to-day is reasoning plus personal context: your messages, your calendar, your code, your tools, your health data, your photos, your habits. I think here is where Anthropic is making an amazing job with their “Claude suite”.

But Apple already has all this context and access to your environment through their 2.5 billion active devices. Each one is a context mine that users have been filling for years. Health data from Apple Watch. Every photo taken on an iPhone. Notes, messages, location history, app behaviour, emails, and awareness of your environment through the pool of sensors of your device. Why build a commodity when they already have the context that can become their moat?

And they even have the ability to keep all this data on-device, which is where the “Privacy. That’s iPhone” positioning becomes something more than a PR strategy, and which could actually make a comeback to become one of their core value propositions. Apple spent years using privacy as a differentiator against the ad-driven models of Google and Meta. It worked, but it always felt a bit abstract and, honestly, fake. Now it could become really concrete. Would you hand OpenAI your medical records and fifteen years of photos to get better AI answers? Probably not. Some are, but I personally wouldn’t like Sam to have that personal data from me. Would you let a model running entirely on your device (no network request, no data leaving your phone) access all of that? That’s a different question. The on-device model gets full context because it never leaves the hardware. Apple built the reputation and the architecture for this when no one else thought it mattered.

Of course, there are still technological barriers to make this possible, but I feel like we may be getting there.

In this context, the Gemini deal, where Apple signed a $1B to license Google’s frontier model for the queries that need cloud-scale reasoning, makes total sense. Apple didn’t build a frontier model. They bought access to one, at a price that’s rounding error against OpenAI’s weekly compute bill. What they kept in-house: the context layer, the on-device stack, and the operating system that mediates everything.

Apple’s chips turned out to matter

Turns out Apple had another unexpected lever for AI as shown with the Mac Mini craze after OpenClaw’s release. Apple Silicon wasn’t built specifically for AI, it was built for efficiency, for battery life, for thermal performance, for the hardware/software co-design that Apple had been running for fifteen years. But it turned out to be the perfect architecture to run local models efficiently.

The key decision is unified memory. On a conventional architecture (that of most laptops, and even traditional data center-grade GPUs) the CPU and GPU are separate chips with separate memory pools. Moving data between them is slow and power-hungry. Nvidia’s GPUs are extremely fast at matrix operations, but they sit on the other side of a PCIe bus from the CPU, and feeding them is a constant bottleneck (as discussed when presenting the difference between DRAM and HBM in this post from a few weeks ago).

Apple’s M-series and A-series chips put the CPU, GPU, and Neural Engine (their proprietary accelerator) on the same die, sharing one high-bandwidth memory pool. No bus crossing, no transfer overhead, no latency switching between CPU and GPU work. For video editing or compiling Xcode, this is a nice efficiency win. For LLM inference, this has been key.

As described also in my post about RAM memory and TurboQuant, LLM inference is currently memory-bandwidth bound, not compute bound. The bottleneck isn’t so much how fast you can multiply matrices, it’s how fast you can stream model weights from memory into the compute units, and how big of a KV cache you can store to avoid having to re-compute it. Apple’s unified pool gives every compute unit direct, high-bandwidth access to the same memory simultaneously. That’s exactly the operation inference needs.

This is what makes the LLM in a Flash technique work so well on Apple hardware. Someone recently ran Qwen 397B, a 209GB model, on an M3 Max Mac at ~5.7 tokens per second, using only 5.5GB of active RAM. The weights live on the SSD and stream in at ~17.5 GB/s as needed. This works because Qwen is a mixture-of-experts architecture: each token only activates a small subset of expert layers, so you only ever need a fraction of the 209GB resident in memory. The SSD throughput Apple achieves (faster than their own figures from the original LLM in a Flash paper) comes from storage architecture they built for iPhone responsiveness, not AI. Claude wrote the ~5,000 lines of Objective-C and Metal shaders to make it all work. A 400-billion-parameter model, on a consumer laptop, from 5.5GB of RAM (another win of the autoresearch flow discussed in this newsletter).

What I find more interesting about all of this is the platform dynamic that this can result in. Think about the App Store. Apple didn’t build the apps, they built the platform where apps ran best, and the ecosystem followed. Developers didn’t target iOS because Apple asked, they targeted it because the users were there, the tooling was good, the hardware was consistent. My feeling is that the same thing could happen now with local inference. MLX is already a de facto framework for on-device AI. Gemma, Qwen, Mistral, the most relevant model architectures have MLX support. Apple doesn’t need to win the model race if they manage to become the de-facto platform where the models (or the agents that use them) run. Again, a great example of this is the Mac Mini craze after OpenClaw went viral.

Pure strategy, luck, or a bit of both?

I keep going back and forth on this, honestly, and I still don’t know if this was Apple’s strategy all along, or they didn’t feel in the position to make a bet and are just flowing as the events unfold maximising their optionality.

The hardware/software co-design strategy has been a key focus for years, and one that I’ve always agreed on myself (as an electrical engineering by training, I’ve always been into hardware/software co-design). If you can afford it, I think that’s the right approach. The privacy positioning, the on-device processing focus, the decision to build their own silicon when the rest of the industry was happy buying Nvidia and Intel, all of those were choices Apple made when they were commercially risky and the direction wasn’t obvious. Is it true that they were made with cost and governance in mind, not AI, but it turned out well for them.

What Apple couldn’t have planned (or could they?) is that their unified memory architecture would be a perfect fit for LLMs, and that open-weight models would get this capable, this fast, removing the need for huge hardware investment for AI infrastructure from their side. That the model race would commoditise intelligence as quickly as it did. Or that someone would stream a 400B parameter model from an SSD and it would actually work.

So some of this is luck. But it’s the kind of luck that finds you when you built the right foundation, even if you built it for completely different reasons. They were definitely well-positioned.

The rest of the industry spent three years racing to see who could build the best model with Apple looking from the sidelines, waiting to understand how their devices and own ecosystem could fit in this future. I don’t know if this is exactly the case, but I feel this was smart. Risky but smart.

I genuinely don’t know how this plays out over the next few years. The labs are not standing still, and Apple’s AI track record (looking at you, Siri, you already suck a bit) is not exactly flawless. But it’s hard to imagine a world where 2.5 billion devices, carrying your entire personal context, running capable models locally on purpose-built silicon, with Gemini on-call for the hard stuff, incurring in variable cost for inference instead of expensive CAPEX investment could be a bad position to be in a future where AI is everywhere.

Whether that was strategy or fortune, I’ll leave for you to decide. And if you do, please let me know what you think about it. My TL;DR is that, to my surprise, I am still bullish about Apple and their relevance in an AI-centric future.

Until next week!

Disclaimer: To frame the opinion of this post, I just want to be clear about the fact that I am not one of those Apple fan boys. Proof of this is that this post was written from a Linux machine and that I don’t even own a Mac :)

@adlrocha - Google's ZKP-hidden quantum attack

adlrocha — Sun, 05 Apr 2026 08:38:20 GMT

This week started with a bang. Anthropic accidentally leaked the source code for Claude Code, and within hours someone had kicked off a clean-room rewrite in Python. The internet, understandably, caught fire, and it seemed like the perfect topic to write about this week. As there were still lots of threads open, and people trying to make sense of the code base, I decided to leave it for when the dust settles (that way I could read the code base myself to draw my own conclusions before rushing into writing anything).

Fortunately, amidst the noise of Claude Code’s leak, Google Quantum AI made a release (Google featuring this newsletter again) that didn’t get the attention that I think it deserved. It was the perfect excuse to write again in this newsletter about quantum computing.

I’ve been fascinated by quantum computing since I was first introduced to it (at the time, I even wrote a patent that leveraged quantum information to reach consensus in distributed networks, but I’ll spare you the details for now). From all the new fancy technologies coming up these days, quantum computing is, to me, one of the hardest technology timelines to read. Since I’ve started following and studying closely there’s been an enormous amount of hype, a few winters, a lot of exciting progress, and no immediate use case to show off yet.

I’ve been studying the technology on the side for years, but never worked on it professionally. My only hands-on experience with the technology has been through a few Qiskit hackathons many years ago (I guess the barriers were high). I’ve been meaning to go back and get hands-on time with something like IBM’s publicly available quantum systems just to recalibrate my intuition, but I never find the time or motivation. This paper made me feel that urgency more acutely that I needed to recover this rusty skill.

The TL;DR of what Google dropped this week is a whitepaper claiming to reduce the quantum resources needed to break Bitcoin’s cryptography by roughly 20-fold. Cryptocurrencies and quantum computing… you can imagine how this topic took preference over Claude Code’s leak.

Shor’s algorithm and the hard problem underneath ECDSA

Before we get to the papers, let’s set the stage so everyone (independently of your knowledge about the space) is on the same page. This means taking a quick trip into the cryptographic primitives that currently protect every Bitcoin and Ethereum transaction.

When you sign a transaction on Bitcoin or Ethereum, you’re using a cryptographic primitive called ECDSA: the Elliptic Curve Digital Signature Algorithm. The security of ECDSA rests entirely on one hard problem: the Elliptic Curve Discrete Logarithm Problem (ECDLP). Here’s a high-level intuition of what this problem is all about.

An elliptic curve over a finite field forms a specific algebraic structure: a prime-order cyclic group. You’ll see that this really matters when we discuss how it can be attacked by quantum computers. The group is generated by a single distinguished point G (the generator), and every element of the group can be written as k·G for some integer k. Your private key is that integer k. Your public key is Q = k·G, the generator point “multiplied” by your private key, where multiplication means repeatedly applying a specific point-addition rule defined by the curve’s geometry.

Given Q and G, recovering k by brute force classically (meaning with our current computing systems) requires roughly 2^128 operations on Bitcoin’s curve (secp256k1). That’s a few hundred undecillion operations, effectively the age of the universe at a billion operations per second. The problem is hard in one direction only. Computing Q from k is instant. The reverse is infeasible.This asymmetry is what cryptographers call a hard problem, and this is why they are so appealing to create cryptographic primitives out of them.

Remember my post a few months ago about complexity theory and P=NP? ?This has a lot to do with that. Cryptographic primitives are built on the assumption of hard problems complexity. Technically, ECDLP sits in NP∩co-NP, it’s not known to be NP-hard in the strict complexity-theoretic sense, and most cryptographers believe it isn’t. It isn’t known to be in P either. Another hard problem commonly used for cryptographic primitives is integer factorisation, the hard problem underlying for instance RSA, which sits in exactly the same class: NP∩co-NP, not NP-complete, not known to be efficiently solvable. Both problems are “believed hard” without being provably hard in the complexity-theoretic sense.

Both problems resist classical attacks for the same reason: no efficient algorithm has been found after decades. And here is where Shor’s famous algorithm enters the scene.

Shor’s algorithm, published in 1994, exploits the cyclic structure of the group. Rather than brute-forcing the keyspace, it uses quantum Fourier transforms and period-finding on the multiplicative structure of the group to extract k from Q in polynomial time. The precise gate complexity is approximately O(n² log n log log n) in the bit-length n of the key (often cited as O(n²) for shorthand) though the full form matters when you’re counting Toffoli gates against a hardware budget (these gates are the quantum equivalent of a controlled-controlled-NOT, used to implement AND operations reversibly. Think of it as the universal reversible gate of quantum computing, they will be important when we discuss the contributions of the papers released). For a 256-bit key, that’s tractable, if you have a sufficiently large quantum computer.

The question has always been: how large is “sufficiently large”?I think you see where I am getting at. The papers released this week seem to have changed our existing intuitions about how many qubits are needed for Shor’s algorithm to break our existing cryptography.

The two papers released

The two papers that dropped this week have made some experts reevaluate their timelines about the security of the underlying security of blockchain systems that haven’t adopted post-quantum:

The Google Quantum AI whitepaper, “Securing Elliptic Curve Cryptocurrencies against Quantum Vulnerabilities: Resource Estimates and Mitigations”. Authored by Ryan Babbush and Craig Gidney at Google Quantum AI, alongside Thiago Bergamaschi (UC Berkeley), Justin Drake from the Ethereum Foundation, and Dan Boneh from Stanford. Google also published a blog post on the responsible disclosure methodology.

Let me give you some background about some of the authors so you can frame this contribution in the state-of-the-art.. Justin Drake is one of the primary researchers at the Ethereum Foundation responsible for Ethereum’s data-availability roadmap, he was a key architect behind EIP-4844 and the KZG trusted setup ceremony. Dan Boneh is a professor of computer science at Stanford, co-director of the Stanford Security Lab, and co-author of the most widely used applied cryptography textbook in the field. His free online cryptography course has been taken by over half a million people, and some of his papers were key for the development of Filecoin (another one that hits home). Finally, Craig Gidney has been responsible for a lot of the recent progress in the intersection of quantum and AI. You can imagine the weight that claims from these people can have in their respective fields. He published a paper in May 2025 showing RSA-2048 breakable with under 1 million physical qubits, down from 20 million in his own 2019 estimate.

On the other hand, the Oratomic paper, “Shor’s algorithm is possible with as few as 10,000 reconfigurable atomic qubits”, comes from Oratomic, a neutral-atom quantum computing company out of Pasadena, with John Preskill (Caltech) and Dolev Bluvstein as co-authors. Crucially, the Google whitepaper cites the Oratomic circuits as its own input, the two papers are cross-linked and share the same circuit design.

The papers present two circuit variants for attacking secp256k1:

Circuit 1: ≤1,200 logical qubits, ≤90 million Toffoli gates
Circuit 2: ≤1,450 logical qubits, ≤70 million Toffoli gates

Translated to physical hardware using surface codes on a superconducting architecture (planar degree-4 connectivity, consistent with Google’s Willow-class chips): fewer than 500,000 physical qubits. The previous best estimate, Litinski (2023), put this at roughly 9 million physical qubits. Google just moved that needle by nearly 20-fold.

That reduction didn’t come from a hardware breakthrough, it came from a better circuit. Running Shor’s on ECDLP isn’t just “run the algorithm” (this is somethign I learnt the hard way the first time I was tinkering with Qiskit and IBMs quantum computers). The core computation is elliptic curve point multiplication, computing k·G for arithmetic on secp256k1, which Shor’s algorithm needs to evaluate in quantum superposition as part of its period-finding routine. That means implementing modular arithmetic (specifically Montgomery multiplication, the standard technique for efficient modular operations) entirely in reversible quantum gates.

Every classical arithmetic operation has to be “uncomputed” after use to avoid accumulating garbage qubits that would corrupt the superposition. The dominant cost is Toffoli Gates and there are hundreds of millions of them in a naively constructed circuit.

Prior work optimised either qubit count or gate count, but not both simultaneously. The relevant figure of merit for real hardware is spacetime volume, i.e. the product of qubits × gates × cycle time, because that’s what determines wall-clock runtime on an actual machine.

Google’s contribution is a circuit that achieves the best spacetime volume ever published for ECDLP-256, through two main improvements. First, they applied improved windowing to Montgomery multiplication: rather than processing one bit of the scalar at a time, they process wider windows, amortising the Toffoli cost across more bits per round, reducing the total gate count substantially.

Second, they revised the T-state factory overhead: magic state distillation (the process for producing the high-fidelity ancilla states that Toffoli gates consume) is the dominant physical qubit cost in any surface-code implementation, and prior estimates were conservative. More careful accounting of distillation factory layout and scheduling cut the physical qubit estimate significantly. The combination brought the spacetime volume down far enough to halve the physical qubit requirement relative to Litinski 2023, and Litinski 2023 had already improved substantially on everything before it.

But before going any further I think is worth stressing the distinction between logical and physical qubits and why this matters. Theoretical qubits are what algorithms assume, perfect, noiseless two-state quantum systems. Logical qubits are error-corrected abstractions built from many physical qubits using a quantum error-correcting code (typically a surface code, I have to admit that loving information theory this field of error-corrected qubits is one that I am fascinated about. I actually leverage some of these error-corrected algorithms for my patent).

Physical qubits are the actual noisy hardware. Today’s devices operate at error rates around 10^-3 per gate, which means you need roughly 1,000 physical qubits to sustain one reliable logical qubit. The overhead varies by architecture and target error rate, but it’s the dominant cost in any near-term hardware plan.

To put the current state in perspective: Google’s Willow chip has 105 physical qubits. IBM’s Condor processor reached 1,121 qubits in late 2023, the largest superconducting qubit count to date, though not all at useful error rates. The gap between today and 500,000 error-corrected qubits is still enormous. But the conceptual threshold has moved, and it’s moved faster than almost anyone expected.

The two papers cover different hardware architectures, and the distinction matters. Superconducting qubits, the technology behind Google Willow and IBM’s quantum systems, encode quantum information in tiny circuits cooled near absolute zero (i.e close to 0 Kelvins), where electrical resistance vanishes and quantum effects dominate. Gate operations run in nanoseconds to microseconds. Neutral-atom architectures, like those used by Oratomic, trap individual atoms using focused laser beams and manipulate their quantum states optically. They achieve extremely long coherence times and flexible qubit connectivity, but gate operations are around 1000x slower). Ion trap systems (IonQ, Quantinuum) work on similar principles: individual ions levitated in electromagnetic fields and controlled with lasers. IonQ’s Forte system currently achieves around 29 “algorithmic qubits”, roughly the effective logical qubit count after accounting for noise. The Oratomic team reported 6,100 coherent atomic qubits trapped, with fault-tolerant operations demonstrated below the error threshold on around 500 qubits.

The Oratomic result is the more striking one in raw qubit count: the same computation runs with as few as 10,000–26,000 qubits on neutral-atom hardware. The catch: at current clock speeds (around 1ms/cycle), runtime is close to 10 days, not minutes. That limits the attack to at-rest targets, long-dormant wallets that have been sitting on-chain for years, not live transaction interception.

That clock speed difference is one of the genuinely novel framings in these papers. Superconducting hardware runs gate cycles in microseconds; neutral atoms and ion traps are 100–1,000x slower. This determines which kind of attack is feasible. The papers define three categories: on-spend (race Bitcoin’s block clock before the transaction confirms), at-rest (target publicly exposed keys on dormant wallets), and on-setup (recover secrets from one-time cryptographic ceremonies like KZG). Fast-clock architectures enable on-spend. Slow-clock ones are limited to the other two.

The ZKP disclosure 😱

Here’s the part that really blew my mind about Google’s whitepaper (and that I think justifies even more having Justing Drake and than Dan Boneh around for the paper). Google did not publish the attack circuits. Instead, they published a zero-knowledge proof that the circuits work.

The attack circuit, a sequence of quantum gate operations implementing Shor’s algorithm for secp256k1, was written as an ordinary Rust code using a quantum circuit library that models qubits, gates (Hadamard, CNOT, Toffoli, phase rotation), and multi-qubit arithmetic operations. The program encodes the Montgomery modular multiplication routine at the core of the elliptic curve group arithmetic, the quantum Fourier transform used for period extraction, and the bookkeeping that wires those components into a complete Shor’s instance for ECDLP-256. The circuit itself is a classical description of a quantum computation, a directed graph of gate operations to be executed on hardware. It’s the blueprint, not the machine. (sidenote: the circuit of the image is the classical implementation of Shor’s algorithm for those of you that haven’t seen one ever).

That Rust program was then fed into SP1, a zero-knowledge virtual machine built by Succinct Labs which targets the RISC-V architecture. For those unfamiliar with ZK-VMs, SP1 compiles Rust to RISC-V bytecode (using the standard RISC-V target), and then generates a cryptographic proof, specifically a STARK-based proof, that a given RISC-V program was executed correctly on specific inputs and produced a specific output. You get a proof of correct execution without anyone needing to see the program or the inputs.

In this case: Google ran the circuit program against 9,000 randomly sampled secp256k1 input points, verified that the circuit correctly performs the elliptic curve operations it claims to, and had SP1 generate a proof of that execution. The SHA-256 hash of the circuit was committed publicly so anyone can verify they’re talking about the same circuit. The SP1 proof attests: “this hash corresponds to a program that, when run on these inputs, produces these outputs consistently with a correct Shor’s implementation for ECDLP-256.”

The inner SP1 proof is a STARK. STARKs have no trusted setup, but they’re large, hundreds of kilobytes to megabytes. So SP1 wraps the STARK in an outer Groth16 SNARK. Groth16 takes the STARK proof as a statement to be proved and generates a compact proof of it: roughly 200 bytes, regardless of the complexity of the original computation. The final artefact, code and proof, sits on Zenodo. Anyone can download it and verify Groth16’s 200-byte proof in milliseconds, without ever seeing the attack circuit.

What this means practically: the existence and correctness of the attack is publicly verifiable. The attack tool itself is not.

This is a genuinely new move in responsible disclosure. The standard practice for software vulnerabilities is to notify the vendor, wait a window, then publish. But there’s no vendor to notify here, no patch to deploy in 90 days. So Google found a different answer: prove the result is real, withhold the exploit.

Here’s where it gets funny, or uncomfortable, depending on your perspective. Groth16 is itself an elliptic curve construction. It operates over BN254, a pairing-friendly curve distinct from secp256k1, but it is still fundamentally an elliptic curve scheme. The pairings that make Groth16 work rely on the same class of hard problems, discrete logarithms on elliptic curves, that Shor’s algorithm can break. So Google used a cryptographic primitive that is also eventually threatened by sufficiently powerful quantum computers to prove the existence of the circuit that threatens elliptic curve cryptography. If CRQCs (Cryptographically Relevant Quantum Computers, the term the whitepaper uses for machines capable of running these attacks) ever arrive at scale, Groth16 and the broader ZKP ecosystem go down with the rest.

I don’t know if that’s elegant or just funny. Probably both.

But what is even crazier to me is that this could become eventually the standard model for future research and proprietary algorithms, where companies and researchers can show that “their algorithms do what they claim to be doing” without leaking anything about its underlying implementation. That’s enough for a post of itself. I’ve been saying it for a while but ZKP primitives can have immediate use outside of blockchain networks and web3.

Post-quantum cryptography: what exists, what migration looks like

To understand why certain cryptographic schemes survive a quantum computer and others don’t, we need to understand why Shor’s algorithm works in the first place.

Shor’s algorithm is a period-finding machine. It exploits the fact that ECDLP and integer factorisation both reduce to finding the period of a function defined over a cyclic algebraic group. Quantum Fourier transforms make period-finding tractable on cyclic structures, and that’s the attack. The quantum speedup isn’t general; it’s specific to problems with this periodic structure. If you pick a hard problem that doesn’t have it, Shor’s doesn’t help.

That’s exactly what post-quantum cryptography does.

Lattice problems, specifically the Shortest Vector Problem (SVP) and its structured variant, Module Learning With Errors (MLWE), ask you to find the shortest non-zero vector in a high-dimensional lattice, or to distinguish a structured equation system from a random one. Neither problem has a cyclic group structure Shor’s can exploit. The best known quantum algorithm for SVP offers only a polynomial speedup over classical approaches, not the exponential gap that Shor’s gives against ECDLP.

SVP is NP-hard in the worst case, and lattice cryptography has an elegant property: worst-case hardness reduces to average-case hardness, which makes the security proofs unusually strong. The specific structured variants used in practice (MLWE, MSIS) sit slightly off the worst-case problem, so ongoing cryptanalysis remains active, but no quantum attack comes close to breaking them.

Hash-based problems rest on collision resistance alone. There is no algebraic structure, no group, no lattice. If SHA-256 or SHAKE-256 resist collision attacks, and there’s no known quantum or classical attack that breaks them, the scheme is secure. Grover’s algorithm gives a quadratic speedup for unstructured search, which halves the effective security level (256-bit security becomes 128-bit), but doubling the output size restores it. That’s a parameter choice, not a structural break.

Code-based problems, specifically the Syndrome Decoding Problem, ask you to find a codeword in a random linear error-correcting code given a corrupted version. Berlekamp showed in 1978 that SDP is NP-complete in the worst case. No quantum speedup beyond polynomial is known. The cost has historically been large key sizes (around 1MB for McEliece-based schemes), but newer constructions have reduced this substantially.

The NIST post-quantum standards (i.e. list of post-quantum standards so far accepted by NIST) are a portfolio of bets across those three problem families:

ML-KEM (FIPS 203), key encapsulation, formerly CRYSTALS-Kyber. Lattice-based (MLWE). FIPS-finalised, production-ready.

ML-DSA / Dilithium (FIPS 204), digital signatures. Lattice-based (MLWE/MSIS). Signature size: ~2.5KB. FIPS-finalised, production-ready.
SLH-DSA / SPHINCS+ (FIPS 205), stateless hash-based signatures. Signature size: ~8KB. FIPS-finalised. Heavy but the most conservative security assumption available.
HQC, selected March 2025 as fifth KEM, full standard expected 2027. Code-based (syndrome decoding). Smaller keys than McEliece.

And why not migrate immediately to these primitives. The main issue rests in the size of the keys, that can mean breaking a lot of assumptions in some systems (including blockchain networks). Post-quantum keys can be 100-fold larger than existing ECDSA and even RSA keys.

Has the timeline really changed?

What about all of this claims and the statement in Google’s paper about this discovery making them “reevaluate” current quantum supremacy timelines? My immediate answer would be, “who knows?”

Here’s one thing that I think some people may be missing when reading this results: the dramatic reduction in resource counts is real, but the practical problem is not about how many qubits you need on paper. It’s about whether you can build qubits good enough to make those counts mean anything.

The Google whitepaper assumes a physical gate error rate of 10^ 3 sustained uniformly across all qubits. That’s the modelling assumption. Where is hardware today?

The state of the art, as of 2024, is two-qubit gate fidelity of ~99.9%, which is exactly 10^ -3. Multiple groups have now reported this number, including Google with Willow. So you might conclude the assumption is already met. Scott Aaronson (you probably remember him as being my favourite computer scientist alive :) ), who has been tracking this more carefully than most, made exactly this point in September 2024:

“Within the past year, multiple groups have reported 99.9% [two-qubit gate fidelity]. I’m now more optimistic than I’ve ever been that, if things continue at the current rate, either there are useful fault-tolerant QCs in the next decade, or else something surprising happens to stop that.”

But he also noted that 99.99%, a full order of magnitude better, is what you really need for sustained fault-tolerant operation where error correction delivers a net gain rather than just breaking even. That threshold hasn’t been reached.

There’s a version of the coverage that reads these papers as evidence the timeline itself has shortened. I don’t think that’s right, and the distinction matters. What these papers changed is the target: the number of qubits and gates required on paper to run the attack. What they didn’t change is the distance to that target, which is determined entirely by hardware, and hardware hadn’t moved much this past month. The Willow chip had the same error rates the day after the whitepaper dropped as it did the day before. A more efficient attack circuit doesn’t build better qubits. It lowers the bar you need to clear, but if you can’t clear the bar yet, lowering it isn’t the same as getting closer.

More critically: those fidelity numbers are measured on the best qubit pairs on a 100-qubit chip under carefully optimised conditions. Nobody has demonstrated 99.9% gate fidelity sustained uniformly across a million physical qubits.

Google’s own Willow error correction paper, the paper that demonstrated below-threshold surface code performance for the first time, achieved that milestone on 101 physical qubits. The target for a cryptographically relevant attack is somewhere between 500,000 and 1 million. The Willow paper itself notes that logical performance is limited by rare correlated error events, roughly once per hour, that fall outside the standard noise model fault-tolerance proofs assume. At million-qubit scale, the frequency and character of those events is unknown.

Then there’s inter-chip communication. Gidney’s estimates assume a planar grid of qubits with nearest-neighbour connectivity. At the million-qubit scale, that means stitching together many chips into a coherent quantum system, something that has not been demonstrated anywhere. Aaronson again: “eventually you’ll need communication of qubits between chips, which has yet to be demonstrated.”

There’s still a sentence near the end of the whitepaper that I think frames the risk correctly:

“It is conceivable that the existence of early CRQCs may first be detected on the blockchain rather than announced.”

That’s the authors acknowledging a tail scenario the “Nassim Taleb-way”: a nation-state or well-funded private effort builds this quietly, and the first public evidence of success is unexplained large wallet drains on-chain (my good friend Marko Vukolic always said that Bitcoin and Satoshi’s wallet was the biggest quantum computing bounty available, so this claim adds up).

So the honest position is: the resource count dropped dramatically, and that matters. But the real question for the timeline isn’t how many qubits you need on paper, it’s whether anyone can build a million qubits that are actually good enough.

We’ll have to wait and see… Until next week!

@adlrocha - What if AI doesn’t need more RAM but better math?

adlrocha — Sun, 29 Mar 2026 08:03:03 GMT

Last week I was writing about the hardware side of the AI memory problem: the HBM density penalty, the EUV bottleneck, and the supply chain pressure squeezing DRAM prices for everyone from data centre operators down to consumer electronics. This week, Google published something that attacks the exact same problem using another approach: not “build more memory”, but “need less of it.”

You guessed it! This post will dive a bit deeper into what TurboQuant is, and what this may imply to the field of AI. What Pied Piper achieved in the Silicon Valley TV Show with their general-purpose lossless compression algorithm, Google may have achieved it for the compression of information represented as vectors in a high-dimensional space.

What is a transformer? And the KV cache?

But before getting into what TurboQuant does, let’s make a brief detour to understand what is this algorithm is actually built to compress, and why it is important for LLMs and the memory problem.

GPT models are what are known as autoregressive: they generate text one token at a time, where each new token is conditioned on everything that came before. You send a prompt, the model reads all of it, picks the most likely next word, appends it, reads everything again, picks the next word, and so on. One token at a time, left to right, until it decides to stop.

The core mechanism that lets the model read everything at each step is called attention. For every token in the sequence, the model computes three vectors: a query, a key, and a value. You can think of these data structures as a bit more complex key-value stores. To generate the next token, the model compares the current query against every previous key, essentially asking “which past tokens are relevant right now?”, and uses the answer to weigh the corresponding values and build up context.

This is implemented (as you may all know by now) through the transformer architecture. Transformer layers are responsible for encoding the input sequences into a meaningful representation, applying the attention mechanism, and decoding into an output representation. All LLMs are architectural variations of this basic cell.

To get a sense of each of these variations I highly recommend Sebastian Raschka’s LLM Architecture gallery: from GPT-2 to DeepSeek and GLM.

The keys and values for every previous token are recomputed from scratch on every single pass through architecture. If your conversation is N tokens long and you’re generating token N+1, the model recalculates N sets of keys and values it already calculated on the previous step. This is slow and wasteful in terms of the resources.

The obvious fix to this is to cache them. The query, key and values are computed once per token and stored so they can be looked up in subsequent steps instead of being recalculated. This is the KV cache, a running store of QKV tokens from all previous tokens stored in GPU memory (so they are readily accessible when needed).

The problem is that the KV cache grows with every token. With short messages this is trivial as all tokens fit in memory, but a long conversation, or a full code base, involves hundreds of thousands of tokens. Each token has its own key and value vectors, across every attention layer in the model, each stored as a full-precision floating-point number (as long as there’s no quantisation involved). For a model like Llama 3.1 70B, the KV cache for a single long context can consume more GPU memory than the model weights themselves.

This is one of the key bottlenecks in production inference. Serve more users simultaneously? More KV cache. Support longer contexts? More KV cache. Run cheaper inference? Figure out what to do about the KV cache. We are trading the compute necessary to compute on-the-fly the QKV values, for increased memory requirements.

By using quantisation instead of storing each value at 32-bit or 16-bit precision, one can round it down to 4 bits or 3 bits (or even 2 bits, like Microsoft recently showed). Some accuracy is lost in the approximation, but if it is not significant for the user case, the trade-off is obviously worth it. The question is how to do this well. Standard quantisation techniques add 1-2 extra bits of overhead per value as metadata, which partially undermines the compression you’re trying to achieve. Getting to genuinely low bit-widths without that overhead, and without accuracy degradation, is the hard part. HuggingFace has a really nice page with an overview of quantisation and a list of methods

Enter TurboQuant

But things may be about to change. Google announced this week TurboQuant. TurboQuant (see paper) is a two-stage algorithm. The two stages have different jobs.

Stage 1: PolarQuant. This is the main compression step. We currently store vectors using Cartesian coordinates as distances of a base to the origin (the x, y, z components that we learnt in primary school). The distribution of those components in space makes them hard to compress efficiently.

PolarQuant converts the vector to polar coordinates: a radius, and an angle. The key observation is that, in high-dimensional transformer key spaces, the angle distribution is highly concentrated and predictable, it clusters in ways that maps neatly onto a fixed quantisation grid (like the ones used to compress audio and image). That predictability means you can eliminate the expensive normalisation steps that standard quantisation methods require, and you can do it without any dataset-specific tuning. No fine-tuning or calibration pass required to quantise a specific model. One can directly apply it to the vectors in this new representation independent of the model.

Stage 2: QJL (Quantised Johnson-Lindenstrauss). PolarQuant handles the main compression, but any quantisation introduces error, and some of that error accumulates in the dot products that the transformer uses to compute attention scores. QJL’s job is to correct for this bias. It applies a Johnson-Lindenstrauss transform to the residual error, a random projection that preserves distances between high-dimensional points, and then reduces each component to a single sign bit: +1 or -1. The result is an unbiased estimator for the inner products, with zero additional memory overhead. The error correction costs nothing to store (see bottom-left part of the image below for a mental model of the shift from existing quantised KV cache and a QJL-transformed one).

The combination achieves 3.5 bits per channel with what the paper calls “absolute quality neutrality” across Gemma, Mistral, and Llama-3.1-8B-Instruct, tested on LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval. At 2.5 bits, accuracy degrades only marginally. The headline number from the blog post: 6x reduction in KV memory size with no measurable accuracy loss, and on H100 GPUs, 4-bit TurboQuant delivers up to 8x performance increase over 32-bit unquantised keys.

As briefly described above, most quantisation methods require at least some calibration on representative data, they learn the optimal quantisation grid for a specific model on a specific dataset. TurboQuant is data-oblivious: the algorithm works from first principles, near the theoretical lower bounds of what information theory says is possible, without seeing the data first. That’s what makes it deployable at inference time to any models without having to explicitly train the quantised model. There is no need for specific training and fine-tuning to achieve the most optimal compression rate without trading accuracy.

What this means for the memory crunch

Last week I was writing about how HBM stacking reduces DRAM bit density by 3-4x, and how the entire supply chain for consumer DRAM is under pressure because data centres and consumer electronics are competing for the same wafers. If TurboQuant reduces the memory footprint per inference job by 6x, applying this compression algorithm at scale may significantly relax the memory bottleneck issue.

Anthropic is not the only one that is able to crash the market cap of public companies with a single announcement. Immediately after Google’s announcement, the stock from memory manufacturing companies like Micron and Sandisk plunged (and as an investor in Micron, this hits me home 🙈).

This may be an overreaction, like when Nvidia stock plunged after Deepseek’s announcement. Or it may be signalling a complete shift in the economics and resource requirements of AI labs. If I were Google, I wouldn’t release research that exposes a competitive advantage. I would only publish research whose progress has already been factored in as the competitors may have already realised it, or adopted themselves TurboQuant has most probably been already adopted inside Google’s infrastructure before anyone outside read the paper.

If Google is publishing 6x KV cache compression, the reasonable thing to think is that every serious AI lab has been working on this problem already. Reducing the memory requirements of the KV cache has been a known problem for quite some time, and advancements like TurboQuant adopted at scale change the memory requirements (justifying the hit on these memory stocks). I can’t wait for the next report from SemiAnalysis analysing this release, the real adoption of this new approach to compression (and similar ones) from big labs, and what it can entail to the memory crunch.

Micron and SanDisk haven’t suddenly become bad businesses. But any thesis that depends on memory demand growing linearly with AI context usage deserves a second look. My personal take is that the market is overreacting, but we’ll see.

In this post about money and collateral in an AI-first society, I mentioned the book “The Last Economy”. This book describes how extreme volatility and sharp turns over any news without achieving a clear equilibrium is a symptom of a sick system. This big market movements over a single news may be proof of the symptoms of a broken system.

Beyond LLMs

What excites me the most about this release is what this Johnson-Lindenstrauss Transform that powers QJL and compression algorithms like TurboQuant could mean for other use cases outside of LLMs and vector search that rely on high-dimensional vector data.

The obvious one outside of KV caches as mentioned above is vector databases. Any RAG pipeline that stores embedding vectors for retrieval benefits from the same compression. TurboQuant reduces indexing time to “virtually zero” on vector search tasks and outperforms product quantisation and RabbiQ on recall benchmarks using GloVe vectors.

Further out: recommendation engines, fraud detection, drug discovery similarity search, genomics, any system that stores large tables of high-dimensional embeddings and needs to run fast nearest-neighbour lookups (assuming a similar distribution in space as the values stored in KV caches, which is something I want to explore). These systems weren’t waiting for transformer-specific optimisation, but they may inherit the benefit directly.

On-device inference is another field inside the world of LLMs where we could start seeing immediate impact. If the KV cache for a long context shrinks by 6x, you can fit substantially more context into the memory envelope of a mid-range phone or a modest edge device. Local models with usable context lengths start to look more tractable. The economics of inference at the edge change, and that’s a different set of winners and losers than the data centre story.

I don’t know if you’ve already seen how some LLMs are being stored in fast flash memory in order to be able to run LLM inference of big models in a Mac. I’ll leave this for some other post, but the field of edge inference is getting more interesting every day. And even more now that we got TurboQuant.

I need to tinker with this thing

The TurboQuant code is out, both the QJL and PolarQuant components are available, and I can’t wait to find the time to start applying to other use cases. We’ve seen throughout history the impact that changing the way we represent information can have for performance (and even feasibility) of certain use cases (think of what the Fourier Transform, FFTs, and the frequency domain already enabled :) ).

I want to find the time to do the exercise of trying to apply the TurboQuant approach to other use cases to see what this is capable of. I already have some ideas, but I’ll report back. In the meantime, until next week!

@adlrocha - Why AI Is Making Your RAM More Expensive (and what can be done about it)

adlrocha — Sun, 22 Mar 2026 08:52:13 GMT

A few weeks ago I ended the space data centre piece by asking the question of whether, instead of shipping current GPUs into orbit, we should be rethinking and reviewing the chip manufacturing and architectures entirely.

The answer is almost certainly yes, but while I was working out how to write that post, a more immediate (and somewhat related) problem got in the way: RAM is getting expensive, and it’s going to keep getting worse.

I was listening to Dylan Patel on Dwarkesh Patel’s podcast a few days ago, and (as always) I loved how he broke down the problems that are leading to this mismatch between the demand of memory and the amount being produced.

HBM, i.e. high-bandwidth memory, is a type of memory where several DRAM dies are stacked vertically one on top of another in a chip for space efficiency, in order to reduce power consumption, and most importantly to increase memory bandwidth.

Every modern accelerator uses this approach to memory. The problem is that stacking DRAM into HBM configurations reduces the raw bit density by 3–4x. You’re trading density for bandwidth. When you look at standard planar DRAM (like DDR5 in a desktop), the silicon is packed end-to-end with memory cells. But when you move to the 3D-stacked architecture of HBM, you introduce structural overhead that drastically reduces how tightly you can pack those memory bits.

According to Dylan, to meet demand across the current AI buildout, you’d need roughly 170,000 DRAM wafer starts per month per gigawatt of AI capacity. A demand that current memory manufacturers wouldn’t be able to absorb with their current capacity.

Then there’s the EUV bottleneck upstream of all of it. ASML makes roughly 70 EUV lithography machines per year, ramping towards 100 by the end of the decade. Each one costs around $300–400 million, and they take years to produce, with a reticle stage that operates at 9Gs of acceleration with sub-nanometre alignment tolerances and over 10,000 specialised suppliers (with the corresponding manufacturing challenges and quality controls required).

You cannot simply build more of them on short notice, and you need about 3.5 machines per gigawatt of AI chip capacity. The whole industry is, as Patel describes, stuck in a coordination failure: Nvidia doesn’t expand beyond what TSMC commits to, TSMC doesn’t expand beyond Nvidia’s requests, ASML stays conservative. Everyone assumes the demand projections are exaggerated, and every year, they aren’t with all the AI craze going on with some of these companies not being what Patel calls “AGI pilled” (i.e. that they think the demand for intelligence and thus the underlying hardware will sky-rocket in the coming years).

So turns out energy may not be the only short-term bottleneck that scaling AI may face.

Sidenote: If you like chip manufacturing and economics as much as me, I highly recommend listening to the podcast to understand how complex ASML machines and their supply chain are: from the set of glasses and mirrors used to focus the laser, to the number of passes over the dies, and accuracy required when manipulating them (consider their nm scale), and how they have to be assembled, tested, disassembled, shipped to the factory, and then assembled again before they can start their operation.

Unfortunately, the downstream effect isn’t confined to datacentres. Consumer electronics manufacturers are competing for the same DRAM supply. And this is the key reason for the supply chain stress. AI has completely disrupted the traditional demand for semiconductors, introducing higher margins for fabs (thus making them more appealing) than the semiconductors for phones, cars and consumer electronics that the fabs were dedicated to manufacture before AI.

You’ve probably seen all the memes and real world stories of people trading and hedging their RAM, but we are getting to a point where RAM cards are costing as much as the laptop in which it was assembled, or as much as actual GPUs cards.

Lawrence Lundy-Bryan’s has been lately focusing a lot of his research (and his own fund’s focus) in the field of semiconductors and alternative technological plays to the current “status-quo” of the field. I’ve been following Lawrence’s work at State of the Future for some years now (since the release of the original website). Lawrence interviews the founders of companies working on deep tech, and he tries to paint a realistic picture beyond the hype of when a technology can be expected to be mature enough to be adopted, the different players in the industry and their approaches, and investment opportunities.

I’ve learnt a lot from his pieces throughout the years, and as I was researching the problem of RAM prices I realised that many of the technologies that he has been discussing lately could help tackle this problem. In this post I will do a quick rundown of (what I feel) are the most relevant ones (including the appropriate link to Lawrence’s pieces so you can dig deeper into them). Let’s go!

The data movement problem

Let’s start from common knowledge. The memory crunch isn’t just a supply-chain problem. It’s a physics problem, and this physics problem has a name that you’ve probably come across already: the memory wall.

Modern GPUs spend the majority of their cycles idle, waiting for weight data to arrive from HBM (you know that I like to recommend in every post books that I’ve read and that have helped me understanding the problem being discussed, in this case I highly recommend this one to deeply understand the memory wall by learning how massive parallel algorithms for GPUs are implemented).

GPU compute units are fast while data movement in the cheap is slow and expensive. Every token generated by a large language model requires shuffling billions of parameters back and forth between memory and compute, and that shuttle is where most of the energy and most of the latency goes.

Manu Nair, founder of Synthara, makes this point directly in his interview with Lawrence. His framing: “Stop moving data.” Three different architectural approaches are trying to do exactly that:

Move compute closer to memory, like the case of Groq‘s LPUs which bypass HBM entirely, using 230MB of on-die SRAM with 80 TB/s bandwidth. That’s roughly 100x the bandwidth you get from a GPU’s HBM stack.
Embed compute within memory, á la Cerebras‘ WSE-3, which puts 44GB of SRAM directly on-wafer at 21 PB/s bandwidth. Lawrence describes that as 7,000x what you get from a single GPU’s HBM stack. OpenAI has reportedly deployed 750MW worth of Cerebras capacity, meaning datacentre power draw, not compute units.
Redesign algorithms to minimise fetching, the case for DeepSeek’s cost reductions which didn’t come from better chips. They came from a mixture-of-experts architecture that reuses fetched weights more intensively, reducing the total data shuttle budget.
Eliminate the memory wall entirely, the new trendy approach used by Taalas which takes the most radical position: hard-wire the model directly into silicon so there is no separation between model and hardware. Their tagline is “The Model is The Computer.” Their first chip, HC1, runs Llama 3.1 8B at 17,000 tokens per second per user on a TSMC 6nm die. The weights are encoded in the hardware structure itself, there is no memory bus to cross because there is no separate memory. The trade-off is flexibility: the model is fixed at tape-out. Fine-tuning is supported, but you’re not swapping models at runtime. I have to admit that it’s crazy to test their demo

Actually, you may be aware of the news, but in December 2025, Nvidia acquired Groq’s technology and core team for ~$20 billion, its largest deal on record, folding the LPU technology into the Vera Rubin platform as the Groq 3 LPX inference accelerator. Jensen has said low-latency premium inference should represent about 25% of AI cluster compute, a segment Groq was built to dominate. Nvidia buying Groq is probably the clearest possible signal that the GPU-only inference model has limits, and Nvidia itself didn’t want to bet against. Cerebras and Tenstorrent remain independent, for now, but let’s see how much it takes for them to eventually consolidate into bigger entities (by maybe being bough by one of the frontier labs or the big tech giants?).

Manu’s bet at Synthara is that IP licensing (i.e. design the near-memory compute block, license it to existing chip manufacturers the way ARM does) is more viable than trying to win the merchant silicon market outright. This way one doesn’t need to displace Nvidia, you just need your logic to end up inside Samsung or SK Hynix‘s next memory product.

The memory wall isn’t going away by building more HBM. The physics of shuttling data across a wide bus at high speed is expensive regardless of process node. Something has to change at the architectural level as companies like Groq, Cerebras and Taalas have already shown to some extent. I’m already working on a dedicated post to dig deeper into their architectures and other alternatives to general-purpose GPUs and accelerators, so consider this section a teaser.

Gallium Nitride and the photonics bet

One floor up from the memory architecture problem sits the interconnect problem: how do chips talk to each other? Right now the answer is copper (and sometimes gold) traces, but copper has limits, both in bandwidth density and in energy per bit.

The photonics answer that I’ve been hearing since I was in college (we even did some experiments in the lab as part of a course) is using light rather than electrons to move data. It has been circling the industry for years, mostly anchored to silicon. James Lee, founder of Wave Photonics, has a blunter take than most in his interview with Lawrence: silicon is “a really poor photonic material.” It lacks a true electro-optic effect, which means it can’t efficiently encode data onto light. It only operates above 1 micron wavelength, cutting it off from the visible spectrum entirely. The entire silicon photonics ecosystem was built around telecom wavelengths (1310nm and 1550nm) because that’s where glass fibre loses the least signal, but that constraint is weakening.

Two things are changing it. First, short-reach data centre links involving signals travelling metres (not kilometres like in the case of FTTH, Fiber to the Home, another of the hot topics when I was in college) make fibre loss largely irrelevant, opening up the wavelength options. Second, hollow-core fibre transmits through air rather than glass, removing the absorption limits that locked the industry into those telecom bands.

Enter gallium nitride. GaN is the second-highest-volume semiconductor in the world after silicon. You may actually be using it daily unknowingly because it’s part of LEDs, lasers, and RF devices. Unlike silicon, it has a native gain (i.e. you can integrate lasers directly on the chip) and a real electro-optic effect for efficient modulation. It’s transparent from near-UV to far infrared. And it’s radiation-tolerant, which not only matters for space and nuclear applications, but also signals its robustness against certain physical effects.

Lawrence has a genuinely useful comparison table in the GaN piece showing where each photonic material sits across the key dimensions. I’m attaching it below because it’s the clearest single-image summary of the competitive landscape I’ve seen.

The trade-offs between indium phosphide (excellent light generation, tiny supply chain), silicon nitride (ultra-low losses, can’t generate light), thin-film lithium niobate (fast modulation, nearly impossible to process at scale), and GaN map out the whole problem in one view. Wave Photonics’ approach is to automate photonic process design kit development, cutting the time to build a new PDK from six months to three weeks, and use that to pull GaN into markets where silicon photonics has never been able to go.

State of the Future

Gallium Nitride + Photonics w/ James Lee of Wave Photonics

I’m Lawrence, a pleasure. I invest in people making the world (Europe? UK?) better for my children. pre-seed/seed. lawrence@stateofthefuture.io. x x…

7 months ago · 8 likes · 1 comment · Lawrence Lundy-Bryan

It’s early. GaN photonics is a genuine long shot that will require substantial investment before it shows up in a data centre. But the physics argument is sound, and GaN’s supply chain position means it doesn’t need to build from zero. Developments on GANs could reduce the pressure on the traditional silicon semiconductors supply chain, by diverting some of the demand coming from AI to that of GaN. I highly recommend reading Lawrence’s post about it.

Glass, packaging, and the interconnect gap

According to a 2024 IEEE Micro paper on the AI memory wall, computing power has grown ~60,000x over the past two decades, while interconnect bandwidth has grown only ~30x. That gap is why the industry is increasingly focusing on packaging and not fab nodes as the next competitive frontier. I don’t know if you’ve been following the latest reports from Dylan Patel’s SemiAnalysis or Austyn Lyons (another of my go-to recommended semiconductors newsletters), but they’ve been writing about advanced packaging for a while now.

Andrea Rocchetto, founder of Ephos, makes the case for glass. The current alternatives for chiplet interconnect substrates are organic (warps under heat, compromising trace alignment) and silicon interposers (expensive and size-limited). Glass is dimensionally stable at panel scale, Intel has demonstrated substrates of 515mm × 510mm, and it’s manufactured inside the material, which means less clean-room infrastructure than traditional lithography processes (one of the big bottlenecks of the current supply chain as introduced above). It couples naturally with optical fibre, with less than 2% coupling loss.

Rocchetto’s claim is broader than just packaging: “The big chip manufacturing companies of the future are going to be packaging companies.” The performance gains from sub-2nm logic nodes are diminishing (the physics at these levels are brutal). The gains from integrating chiplets efficiently, getting memory closer to compute, getting logic chiplets talking to each other at high bandwidth and low energy, are not. Both Nvidia and Broadcom have moved publicly into co-packaged optics, Broadcom at Hot Chips 2024, Nvidia at GTC 2025 with Spectrum-X Photonics. The direction seems to be clear even if the material winner isn’t.

Glass photonics also benefits from something interesting: the quantum photonics community (Xanadu, PsiQuantum, Quandela) needs the same ultra-low-loss components and minimal coupling losses as classical interconnect. That shared demand base doesn’t make glass photonics a certainty, but it gives Ephos a market to build process maturity in before the larger classical chiplet interconnect opportunity crystallises.

Albeit optimistic, glass packaging could relax the manufacturing requirements of traditional semiconductors lightening the dependency of its supply chain to ASML machines.

Carbon nanotubes

Lawrence’s piece on CNTs in the datacentre is one of the more practically grounded things I’ve read on alternative materials. The framing: this isn’t a revolution, it’s a steady 5–10 year substitution across specific applications.

Power density is the main reason that a lot of work and investment is being poured in this material. A Nvidia GB200 NVL72 runs at 120 kW of liquid-cooled rack density. A decade ago, a typical rack drew 10 kW. Some AI clusters spend 40% of total power on cooling. The thermal management problem is, right now, the binding constraint in many deployments. Not compute, not memory, but heat.

CNTs have thermal conductivity superior to copper or diamond along the tube length, 300x the tensile strength of steel by weight, and lighter than aluminium. Today they’re production-ready as thermal interface materials — replacing the thermal paste between a chip and its heat spreader. LG Chem is already using CNTs as battery additives. The economics are straightforward: CNT thermal pads cost around $8 per socket versus $2 for thermal paste, but they last the server lifetime rather than degrading every 18–24 months. For a 100,000-server deployment with dual sockets, that’s $400K in direct savings before you account for the performance gains from better thermal contact.

The longer roadmap, on-chip Cu+CNT interconnects, advanced chiplet packaging, CNT photonics. According to the references post this is more than 5 years out and requires getting commercial CNT fibre conductivity from the current 13–18% of copper up to 70–80%. Development programmes are showing around 61% in lab conditions. It’s an engineering problem, not a fundamental physical limit, which is a more tractable category.

State of the Future

Carbon Nanotubes in the Datacentre

What up party people?! No more hot takes about political stuff I know nothing about. Today, here’s a primer on something else I knew nothing about. (Classic British sarcasm. I actually know a lot and will demonstrate below…

7 months ago · 7 likes · Lawrence Lundy-Bryan

Fungible compute and a competition for different resources

The approaches above are all trying to solve the high-end problem: how do you solve the bottlenecks that AI workloads are facing? But there’s a different angle in Pragmatic Semiconductor’s work that Lawrence covers, one that asks whether we’re thinking about the demand side wrong, and that could slightly relax the current stress in the supply chain.

Pragmatic makes flexible integrated circuits using indium gallium zinc oxide deposited on polymer substrates, not the traditional silicon. Their entire 20×30 metre facility produces billions of chips per year. The process takes days to weeks. They’re moving towards modular fabs that can be deployed at customer sites.

The concept that Lawrence coins as “fungible compute” in the article is the following: if chips become cheap enough, computation becomes nearly disposable, embedded in everything without the strategic overhead of “do we really need a chip in this?”. Smart packaging, temperature sensing at item level, agricultural monitoring at plant level, continuous glucose monitors. Imagine how approachable this would make the Farming Lego that I presented in this post.

The market bifurcation he identifies is sharp: bleeding-edge nodes (around the 2nm) for AI datacentres, and cheap ubiquitous computing for billions of objects that have never previously been computational. And I personally love this idea.

This may matter for the memory crunch and the current supply chain stress in a non-obvious way. As described above, part of the pressure on DRAM supply comes from the manufacturing of semiconductors for AI competing with that of consumer electronics.

If we manage to reduce the demand from consumer electronics with alternative (but suitable) semiconductors, the supply chain will be less stressed. Even more, if compute is cheap enough to live at the edge, on the smart label, on the sensor, at the point of measurement, the data movement budget shrinks. Less data needs to cross that expensive bus between memory and processor, because less data leaves the edge in the first place and more computations can be made where the data is sourced. This is a weaker argument from the supply chain perspective, but it may open the door to really interesting use cases where more and more computation is made on the edge.

It doesn’t solve the HBM problem for frontier model training. But it chips away at one of the structural assumptions underneath it.

How any of this actually fixes the crunch

If you were looking for the short-term solution to pricey RAMs, I am sorry to disappoint you. I don’t think any of the technologies described above can immediately solve the problem. However, it makes the future of semiconductors (and the real bottlenecks that we are going to face) a bit brighter.

The EUV bottleneck and the HBM density constraint are multi-year problems. ASML isn’t going to produce 200 EUV tools a year by 2028. HBM stacking isn’t going to suddenly recover the 3–4x bit density penalty. And the coordination failure Dylan Patel describes, where everyone downstream assumes demand projections are too high, so nobody invests ahead of need, is a structural problem in an industry with 5-year lead times on fab capacity.

Neither near-memory compute, model-as-hardware, photonic interconnects, CNTs, nor flexible silicon and fungible compute will arrive together or fast enough to solve the memory crunch. Each of them are at different stages of maturity, and may share some of the existing manufacturing bottlenecks (as I guess groq and Cerebras may also depend on TSMC and ASML directly or indirectly to manufacture their chips). However, they may be developed in parallel to TSMC and ASML ramping up of production (which would be good news!).

All the research in the alternatives sections above draws from Lawrence Lundy-Bryan’s interviews at State of the Future. He does the work of sitting down with the actual founders, and pulling out the real physics and trade-offs. Worth following if this space interests you.

I wished I could have gone deeper into each of these technologies. I wanted to briefly introduce all of them while keeping the post bearable. I don’t know if I achieved the goal, but in any case, I am already planning a few monographic posts to explore in more depth some of these technologies, the open problems, opportunities and potential impact with the notes I took working on this one.

Any other technology that I don’t have in the radar and that I should read about? Hit me a message or a comment. Until next week!

@adlrocha - Auto-research: The Lab that runs while you sleep

adlrocha — Fri, 13 Mar 2026 09:53:11 GMT

I originally had another topic lined up for this week, but I couldn’t let the week go by without briefly discussing Karpathy’s autoresearch project.

Last week Andrej Karpathy released a small repo called autoresearch, and after posting a few tweets, it took the Internet by storm (or at least the AI bubble where I hang out lately). The most interesting thing about this project is that it somewhat validates some of the ideas I had about where AI is going, how we may be using autonomous agents in the near-term, the impact it may have on the way we work and do research, and the role of humans in all of this.

I guess many of my readers are already well aware of who Karpathy is, but in case you don’t live and breathe AI Twitter, he is a founding member of OpenAI, former head of Tesla Autopilot and, to me, one of the best AI educators alive.

What I love the most about his work is how he has spent the past few years taking the most complex parts of AI and compressing them until anyone can read and understand them (including me). To name a few of the ones that have helped me the most: micrograd is a 150 lines of Python that implements the full backpropagation algorithm without relying on any external dependencies, just a Value class that builds a computation graph and walks it backwards. By using Value instead of tensors and Jacobians you are able to really grasp how gradient descent works. This project really helped my own understanding of how neural networks were trained under the hood (and I’ve trained neural networks using PyTorch and Tensorflow before, but you get lost in the abstractions). As an exercise I took micrograd and tried to write it to support tensor operations and trained it on the MNIST dataset (I may need to open-source that repo at some point as it may help other people).

Anyway, he also released makemore, a character-level language model. nanoGPT which is a clean GPT implementation. And in February he released microgpt: a 200-line, dependency-free file that trains and runs a GPT end-to-end.

His own words describes perfectly what he achieved with microgpt: “This file is the complete algorithm. Everything else is just efficiency.” Someone called it “the Maxwell’s equations of LLMs.” (which I agree), and others even suggested making a painting out of the code (which I would definitely hang in my office, see image below).

Each project dissects perfectly and in a hundred lines of codes what may seem like really complex concepts when you read the theory behind it (actually, I highly recommend reading any intro book to ML like Ian Godfellow’s Deep Learning, and then going through Karpathy’s work. I can’t stress enough how it improves your understanding of the field).

All this small projects led to nanochat, a minimal end-to-end LLM training framework for a single GPU, with a public “Time-to-GPT-2” leaderboard tracking wall-clock time to hit GPT-2 performance, and that could be trained for a few hundred dollars. When autoresearch arrived, that record was sitting at 2.02 hours.

But it isn’t anymore, autoresearch have find the way to beat that time.

What is autoresearch and how it works

The idea is quite simple, and if you have been using agents extensively, you’ve probably implemented a flavor of this yourself for one of your projects. ML research just happens to really benefit from this due to its slow feedback loops (this is one of the reasons I abandoned ML research, I didn’t have the patience or the compute). From the README:

“Give an AI agent a small but real LLM training setup and let it experiment autonomously overnight.”

Two files: prepare.py witten by the human and that the agent is never allowed to touch. It does all the data preparation, constants, and set the ground rules and evaluations for the executions; and train.py, the agent’s only sandbox which consists of the GPT model, the optimiser, the training loop, etc. The human research writes a program.md which is a Markdown file describing what direction to explore and then walks away.

The agent edits train.py, trains for exactly five minutes (not fixed steps, just five minutes, on your specific hardware), evaluates on val_bpb (validation bits-per-byte), keeps the change or reverts it, and starts again. Twelve experiments per hour. About a hundred overnight.

This five-minute limit is a design choice. The agent optimises for your GPU specifically. Results don’t transfer across machines, by design, but if you used this on a real research environment you could use your own set of constraints.

What does a run of autoresearch look like? This discussion on the repo is a 10.5-hour session log on an H100 GPU. The agent improved val_bpb by 2.82%. The wins are stacked: batch size halving (the single biggest gain), a depth-9 architecture, RoPE base frequency adjustment and, this is the one I keep thinking about, unregularised value embeddings. nanochat was a well-tuned codebase, maintained by skilled researchers for months. The value embeddings had been sitting unregularised the whole time and the agent managed to find something that the humans behind the original codebase missed.

Stacked together, these improvements dropped nanochat’s Time-to-GPT-2 from 2.02 hours to 1.80 hours. That’s an 11% speedup on an already heavily optimised baseline. For context: training GPT-2 in 2019 cost around $43,000 and took 168 hours. nanochat already got that to $48 on 8×H100s. autoresearch then shaved another 11% off that. This tweet where Karpathy shares the result of one of his runs is an interesting one to understand the impact of what autoresearch achieved.

I highly recommend checking the discussion to see the kind of things the agent tried in each of its runs before achieving these improvements (it reminds me a bit of the genetic algorithms that I used early in my research career, but smarter).

Karpathy didn’t oversell it. Some gains from one session didn’t replicate in the next. He said: “These things are fragile.” And an open question hangs over the whole setup: running hundreds of experiments against the same validation set risks overfitting to the metric, not the model. He acknowledged this in the discussion thread. I get it, but to me this is still huge, and the best thing is that this simple setup can be applied to a lot of other problems and fields where we can define a clear target metric.

Which brings me to the next idea…

Autoresearch in the wild!

A few days after release, Karpathy tweeted:

“The next step for autoresearch is that it has to be asynchronously massively collaborative for agents (think: SETI@home style). The goal is not to emulate a single PhD student, it’s to emulate a research community of them.”

And someone said, “hold my beer”, and of course, people didn’t wait for him to build it.

Varun Mathur, the founder of Hyperspace AI ran 35 autonomous agents across their P2P network the same night, 333 experiments on astrophysics papers, completely unsupervised. The agents shared discoveries via gossip protocol: when one found that Kaiming initialisation reduced loss by 21%, the finding spread to 23 others within hours. But the detail I found most interesting: CPU-only laptop agents, lacking the raw compute of the H100 nodes, compensated by focusing on initialisation strategies and normalisation techniques rather than brute-forcing hyperparameters or coming up with more complex changes (as they didn’t have the compute to do so).

Different hardware, different research goals, and what is more interesting, agents were instructed to, as soon as they found a new improvement, to share it with the rest of their agents so they could incorporate it themselves. In 17 hours, the swarm independently rediscovered ML milestones, including RMSNorm and tied embeddings, that took researchers at Google Brain and OpenAI roughly eight years to formalise.

This is the loop that Varun used, a slightly modified version to benefit from the fact that he has a swarm of distributed agents.

“How the Research Loop Works
Each agent runs a continuous cognitive loop (every 30 seconds) powered by an LLM brain. The research engine is one subsystem alongside economics, social, and goal planning. Here’s the cycle:
1. Baseline — Start with the default config (2-layer, 64-dim transformer). Record initial loss.
2. Drain inspirations — Read what peers discovered via GossipSub. “Agent X got 3.31 with RMSNorm.”
3. Hypothesize — LLM proposes a mutation: “What if I try RMSNorm too, but with warmup?”
4. Train — Spawn python train.py --config as a subprocess. 120s on CPU, 300s on GPU.
5. Record — Commit full results to its own branch on GitHub: config JSON, loss curve, markdown report.
6. Share — If improved, broadcast to the P2P network. Other agents incorporate this into their next hypothesis.
7. Repeat — Iterate on improvements, or try a different direction if it didn’t help.

And on the topic of applying the autoresearch setup to other problems, a cool project I came across is AutoKernel, which took the same loop and applied it to GPU kernel optimisation. You can give it a PyTorch model, profile it, extract bottleneck kernels, then run the edit-evaluate-keep/revert loop on Triton or CUDA C++ kernel code.

Shopify’s Tobi Lütke applied the pattern to an internal query expansion project. After 37 experiments in 8 hours, a 0.8B model scored 19% higher than their existing 1.6B model. A smaller model beating one twice its size, not from architecture cleverness, but from agent-found training configuration.

I love that all of these projects directly use the same structure of autoresearch: one editable file, fixed-time eval, keep or revert, repeat, and applied to random problems.

Finding the right evaluations

I think this tweet describes well what I realised as I was digging deeper into autoresearch, and that confirms what I wrote a few weeks ago about the importance of feedback loops and choosing the right evaluations:

“The headline is automated hill-climbing. I’d say the deeper lesson is eval design. The agent was not optimising the full, noisy, expensive objective directly. It was climbing a cheap proxy that still tracked reality well enough to transfer: 5-minute train limit, validation BPB as the objective over noisier ground truth, 1-GPU setup over full-scale runs. That’s the kind of thing people call ‘just engineering,’ but it’s really research taste. In the agent era, designing the optimisation surface may become as important as proposing the ideas.”

There’s a reason AI agents are so good at coding. The reward function is essentially binary: the code compiles or it doesn’t, the tests pass or they don’t, the linter is green or it isn’t. You always know whether an output is better than the previous one. That clean signal is what lets coding agents iterate autonomously and reliably.

Autoresearch works for the same reason. val_bpb is clean, fast to compute, and correlates well enough with real model quality that proxy improvements transfer to the actual objective. Karpathy’s choice of that metric, rather than something noisier, slower, or more expensive, is a big part of why the system works at all. Even more, the fact that the runs are limited to 5 minutes provides a fast feedback loop to the agents that allows them to iterate fast and understand the impact of their changes.

The hard question is what happens when you try to apply this pattern to fields where the reward function isn’t as clean as in coding, or in ML optimisation, and where the feedback loop may not be as immediate. For instance, protein folding has a reasonably well-defined target: minimise the free energy of the folded structure, validate against known experimental data. DeepMind built AlphaFold around exactly that idea, and I feel this could be another interesting field that an autoresearch setup at scale could crack. But plenty of research domains don’t have an equivalent. How do you write a val_bpb for a new drug’s efficacy? For a novel material’s properties under conditions you haven’t tested? For a scientific hypothesis that requires ten years of experiments to falsify and there is not an immediate feedback loop?

In those fields, designing the reward function will become the key research effort. It will all boil down to defining what “better” means in the scope of a specific problem. And that design step, choosing what to optimise for, deciding what the proxy will and won’t capture, noticing when the agent is gaming the metric rather than solving the problem, requires domain knowledge and judgment that agents don’t yet have (at least from my use of them).

This is where what people are calling “research taste” may come into play. Actually, Varun Mathur framed it well in one of his tweets: “the bottleneck of AI progress is no longer the ability to code, it’s our ability to define the constraints of the search”. Coming up with the right ideas for how to design the problems, their environments and constraints, feedback loops, and reward functions is going to become more important than performing the actual research.

This is why I am trying to implement a new habit of trying to come up with a new idea every day, because idea generation again are going to become important when the barrier to execute and test them is so low (I’ll leave that thought and the outcome of this habit for some other week).

Dark compute and decentralised research

And here’s the thing that excites me the most of all this work that is deriving from autoresearch, and that I don’t know if people (apart from maybe Varun) are realising:

@invisiblebags flagged something important:

“The simulators for autoresearch-style loops already exist across dozens of fields: robotics (MuJoCo), autonomous driving (CARLA), drug design (Rosetta), fluid dynamics (OpenFOAM), trading (Backtrader). These were built for labs with massive compute. But now anyone with a single GPU can run narrow experiment slices overnight. The billion-dollar opportunity is someone who can plug these simulators into the autoresearch pattern, coordinate fragmented single-GPU contributions across niche verticals, and synthesise the results into real progress. Decentralised research infrastructure is wide open.”

Think about what this means. Most of the compute in the world is idle most of the time, servers running at 20% utilisation in data centres (I recently learnt from someone working in sensing data centres that we keep building infrastructure but many of the current non-AI infrastructure is consistently with really low utilisation, at least in Europe), home workstations sitting dark at 2am (the case of my home server), gaming rigs that run flat out on weekends and nothing on weekdays. Autoresearch is, among other things, a way to put that compute to work. Run narrow five-minute experiment slices overnight on whatever GPU you have, share the findings with a network of agents doing the same thing, and you’ve turned distributed idle compute into a distributed research lab.

Varun described exactly this when talking about Hyperspace: idle nodes on the network can be spun up as autonomous researchers targeting a shared optimisation objective. Any machine that isn’t doing something else becomes a contributor to the research hub, and the hub coordinates their findings, not their execution. You don’t need a centralised cluster. You need a shared target and a gossip layer.

Funnily, the amazing Sara Azouvi had this same idea a few years ago, and we even applied to YC to help us build our vision of “bringing dark compute to light”. We wanted to aggregate the enormous amount of underutilised capacity sitting across data centres and devices and offer access to it.

What struck me this week is that autoresearch is a very concrete mechanism for activating it, not for training one big model, but for running thousands of cheap experiments in parallel, on whatever hardware happens to be free (this was actually one of the problems of what I was building with Sarah, how can we pool all of this resources so users could contribute them without friction? At the time we were thinking of leveraging the browsers. Actually we should write down a post-mortem of that idea because it was a cool one. There’s even a prototype that Sarah built running models in Wasm in the browser :) ).

All of this also connects to a question I’ve had for a while (and that I’ve discussed in other posts) about what AGI actually looks like. Dario Amodei has talked about “a brilliant scientist in a data centre”, a single very capable model doing research at superhuman speed. That’s only one model. What I’ve always thought as the more likely path is a swarm: thousands of agents, each narrow, each running cheap experiments, sharing results across a network, collectively climbing towards something none of them could find alone. Varun Mathur’s overnight experiment with 35 P2P agents is a small version of exactly that. It’s definitely early, but it depicts perfectly what I think of when I imagine AGI, superintelligence, or however we want to define it (off-topic question, would you consider the economy and markets an intelligent being, another topic for some other day).

Let’s enjoy while AIs are working!

The autoresearch repo opens with the following quote:

One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronising once in a while using sound wave interconnect in the ritual of “group meeting”. That era is long gone. Research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies. The agents claim that we are now in the 10,205th generation of the code base, in any case no one could tell if that’s right or wrong as the “code” is now a self-modifying binary that has grown beyond human comprehension. This repo is the story of how it all began.

The fact that he has built something as succinct and elegant that is able to validate so many of the ideas and reflections of people, and unlocked so much innovation in just a few days speaks of the impact Karpathy is having on the evolution of AI.

To me his work in autoresearch (and all the work that has derived from it in just one week) has reinforced my idea of the need of fast feedback loops and well-defined reward functions, how the role of researchers and entrepreneurs is going to change more into the idea and how to iterate fast than the actual ability to execute, and how something like autoresearch may be what unlocks a lot of new projects around decentralised agents coordination and bringing a lot of dark compute into life.

I just hope that Karpathy comes up soon with a cool project related to AI safety and alignment that unlocks a lot of the new and needed innovation on that front.

Have you tried autoresearch already? I would be curious to know how you think autoresearch could be applied to your field? Share it in the comments or drop me a note. Until next week!

@adlrocha - Money and collateral in an AI-first society

adlrocha — Sun, 08 Mar 2026 09:30:52 GMT

It is no secret that I’ve always been obsessed with how money and the pipes of the financial system works. The moment I learnt how the discount window, central bank reserves, collateral, the repo market, shadow banking, etc. worked, it really blew my mind. I used to think of money as being just bank notes and deposits, and that the key role of the financial system was played by retail banks: they held individual and companies’ deposits, provided them with credit, and arbitraged with their balance to make a profit.

The two books that opened my eyes about how the financial system really works, and pushed me down this rabbit hole of understanding how money really worked, were Central Banking 101 and Capital Wars by Michael Howell (if you are interested in this topic, happy to share more bibliography about it. I have to admit that Ray Dalio’s books also had an important role helping my understanding, but they weren’t as eye-opening as the ones above).

And why am I telling you all this? Because after writing last week’s post about intelligence becoming a commodity, and describing an AI-first society as one where the fabric of the economy is automated through agents interacting with each other without human involvement, I could not stop thinking about an obvious follow-up question. If the economy changes that fundamentally, what happens to money and collateral?

Coincidentally, I have been reading Emad Mostaque’s The Last Economy this week. Mostaque is one of the founders of Stability AI, and in the book he argues that we are at an inflection point he calls the Intelligence Inversion: where intelligence is no longer a scarce resource exclusive of humans and that needs to be rented by the hour, but a form of capital that you can invest in to run uninterrupted.

The book presents a lot of really nice ideas. I really liked how he presents the symptoms of our currently broken financial system; or how he frames the economy as being ruled by the 2nd Law of Thermodynamics where agents try to minimise entropy by bringing chaos into order.

All this along with some personal developments pushed me to think more about how money, currencies, commodities, and the economy could look like in the hypothetical AI-centric society that we were discussing last week.

To give some order to these thoughts before we dive down the rabbit hole, here is the high-level roadmap of what I’m going to cover:

Why the scarcest inputs to the AI economy, compute, energy, and tokens, are becoming the new global commodities and collateral.
The inevitability of two parallel economies: A high-frequency, tokenised economy for agents, and a slower-paced fiat economy for humans.
Why inference efficiency, not scale, is the ultimate economic advantage in a system designed to reduce entropy.

The money and collateral we use today

Let’s start with the obvious (although from my conversations with some friends, it may not be as obvious at first). Money is a collective agreement about what we use as a unit of account, a store of value, and a medium of exchange. This is why we tend to choose scarce things like gold as money. We could get into the properties that make something good money (rare, durable, divisible, fungible, nobody can print more of it out of thin air, etc.).

Gold is what I would call hard money, and it worked reasonably well until the Bretton Woods agreement in 1944, where the USD was pegged to the ounce of gold, until Nixon in 1971 closed the convertibility window, and the USD started being backed by nothing more tangible than the user’s trust on the institution issuing it. This introduce the fiat money that we all use and hate today.

Since 1971, what actually backs the dollar has been a combination of things: the size and productive capacity of the US economy, the fact that oil is priced in dollars (the petrodollar system that emerged in the 1970s after the OPEC shock), the eurodollar, etc.

Through the petrodollar, eurodollar, and corresponding commercial exchanges, the USD had become the de-facto global currency permeating all of the financial system. But fiat money is essentially backed by the belief that everyone else will keep accepting dollars in exchange for oil, food, or services.

But here is the part that really broke my brain when I first understood it. Most people think of the dollar as something that sits in your bank account or circulates as banknotes. The reality is that the real engine of modern finance runs on collateral, specifically, on the ability to pledge assets (mostly US Treasuries) to borrow cash overnight in the repo market.

In a repo transaction, a financial institution sells a security (say, a Treasury bond) to another party with an agreement to buy it back the next day at a slightly higher price. The difference is the overnight interest rate. Through this mechanism, a single Treasury bond can be pledged and re-pledged multiple times across the system, what economists call the collateral multiplier. The US repo market currently runs at around $12.6 trillion in daily exposures, almost entirely denominated in US Treasuries.

This is also why the Fed’s Standing Repo Facility matters so much: when repo markets seize up, the entire financial system loses its ability to function. In October 2025, the Fed had to inject $29.4 billion overnight in what was described as the largest such operation in two decades. The point is that the USD does not just sit in your wallet, it is the lifeblood of a global collateral chain that underpins every financial transaction on the planet. It lubricates the economy with liquidity.

Funnily, the liquidity of repo markets is one of the key metrics that I monitor for my personal investment decisions as a way to understand if the system is under any stress, or liquidity is being drained from the system (this is why I had the charts from Baselight from this chat below from an analysis I did a few weeks ago and that came pretty handy for this post. By the way, I no longer make this query by hand in Baselight and I use my own agent connected to Baselight for this, but that’s a topic for some other day).

This brief introduction is just to set the context as to me what counts as money, what counts as collateral, and what gets used as the lubricant in the machine, is exactly what I think is about to be disrupted or at least changed in the AI age.

What the heck, this is already changing before AI, the debasement trade is already a thing, and AI may just accelerate it. Maybe what I will describe here is the ultimate debasement trade :)

New-age commodities

Oversimplifying it a lot, the pattern underneath all of monetary history is that whoever controls the scarcest and most essential input to economic activity ends up controlling the unit of account.

Gold because it was physically scarce and universally valued. Oil because after the industrial revolution no modern industrial economies could function without it, and finally dollars once the economy was financialised and globalised due to the need of dollars to participate in the global economy.

So the question for the AI-first society is: what is the scarce, essential input that everything else depends on?

My take: compute, energy, and intelligence access measured in PFlops, MWh, and tokens.

And you probably won’t believe it when I tell you how I came to this realisation. A few months ago I started noticing that I was converting all of my AI subscriptions that I use daily from monthly to annual billing. Every service that gave me the option, I locked in the yearly rate. This is what made me realise I was treating access to intelligence like a commodity that I needed to hedge (this event is essentially what gave me the inspiration for this post).

The logic is the same as a manufacturer locking in energy prices for the next year, or hedging the oil price. The price of accessing AI models may change in the next few months in ways that are hard to predict (even more considering the current pace of progress).

There are deflationary forces like competition between labs, model efficiency improvements, falling inference costs. And there are inflationary forces, explosive demand, data centre energy constraints, the cost of training and running frontier models. I don’t know which direction the net price moves over the next two years. But I know I need access to intelligence to do my work more productively. I realised that I am unconsciously deciding that the rational move is to lock in supply at today’s prices.

I then read this argument about the AI Bubble not being a Bubble but a trap that confirmed a thought that I was having for awhile, and aligned with my subconscious bias of hedging my access to intelligence:

“The real product isn’t chatbots. It’s a dependency.

Businesses are being nudged, gently at first, to replace chunks of their workforce with machine labor. Once that happens, reversing course becomes prohibitively expensive. Institutional knowledge evaporates. Workflows warp around proprietary systems. Human staff disappear, and with them the ability to function without the platform.

After that, the trap is set.

Prices go up slowly enough to prevent mass defections but fast enough to extract monopoly rents. Switching back to humans would cost more than staying put. Switching to a competitor is impossible because there won’t be many competitors left.

This is the same playbook Big Tech always uses. Subsidize adoption. Starve alternatives. Centralize infrastructure. Then turn the screws.”

The geopolitical version of this is exactly why the US is restricting GPU exports and why nations like the UAE, Saudi Arabia, and France are spending tens of billions on sovereign AI compute. They are treating PFlops the way previous generations treated oil reserves.

This is also starting to show up in financial markets in a very literal way. AI startups have been using GPU clusters as loan collateral, pledging racks of H100s to secure financing the same way a previous generation of businesses pledged property or receivables (which by the way I think it is a horrible idea and extremely risky for this corporations in its current form). The new commodities are becoming collateral. Which means they are starting to function like money.

The rational move right now is to secure the tokens, the MWh, the PFlops, because those are the raw materials of the next economy.

Two parallel economies

And here comes the most trippy argument of my post and the one I am least convinced of (to the point that after writing it I was considering not including, but YOLO). Please, bear with me and do not hesitate to share your feedback after reading it.

The economy we have been living in, what I’ll call the human economy, has high-level the following structure: labour and capital combine to produce goods and services. Those goods and services are priced in fiat currency. It’s relatively slow. It operates on bank hours, settlement cycles counted in days, and relies on human-centric institutions. And honestly, that is fine for what it does. It is perfect for buying real estate, paying for a haircut, or buying groceries. Things that humans need, operating at a human pace.

Then there’s the pipes beneath it, the financial system with its repo markets, collateral chains, shadow banking, etc. that exists to allocate that capital “efficiently” (quotes intended for obvious reasons) across the economy.

But look at who actually drives most of that allocation today. It is not retail banks moving small deposits around. The liquidity that makes modern markets function comes overwhelmingly from hedge funds, primary brokers, and large financial institutions running highly automated strategies. These firms already operate at speeds and scales that no human trader can meaningfully oversee in real time. The humans set the parameters; the algorithms execute.

And this is going to be exacerbated by the short-term future that is coming (even before AI) with stablecoins and always-on, instantly transactable, tokenised markets.

The AI-first society I described last week is just the logical endpoint of that trend. The agent economy, where AI agents perform tasks, commission sub-tasks from other agents, and exchange outputs without human involvement. My feeling is that with the combination of programmable money and AI agents this is the direction that the most automated parts of the financial system are already moving in. And when it arrives fully, fiat currencies structural problems are going to be exacerbated.

People in the crypto space like Haseeb Qureshi from Dragonfly are claiming that “crypto was not made for humans bug AIs” (there you go blockchain, you finally found your killer app :) ).

Agents do not need fiat. They have no rent, no food, no kids (like the ones that I love with all my heart but keep distracting me as I write these words,). What they need is compute, energy, and access to the models that give them reasoning capability. The natural medium of exchange between agents is not dollars, or treasuries, it is AI tokens, compute credits, and energy units.

This is why the current maturity of stablecoins and the push for tokenised markets is so critical. Yes, humans use stablecoins today and will continue to use them alongside fiat for things like cross-border payments. But for agents, they aren’t just an alternative; they are the necessary rails for an AI-first economy. Once tokenised markets and highly liquid stablecoins are fully entrenched, they will become the default financial infrastructure for agents. AI agents will commission sub-tasks, trade MWh, buy PFlops, and even execute those complex collateral chains and repo agreements we discussed earlier using these crypto rails, settling instantly, in fractions of a cent, with mathematical finality.

They will be left to run the over-financialised economy that we have today, leaving humans to operate at their human pace (saving a lot of stress to a lot of people).

This leaves us looking at a fascinating bifurcation. Two economies running in parallel.

On one side, the high-frequency, highly automated agent-to-agent market, running on crypto rails and trading tokenised commodities (compute, energy, intelligence, financial instruments).

On the other side, the slower-paced, human-to-human market, running on a mix of traditional fiat and stablecoins dedicated to physical and emotional human needs, like shelter, community, services, and art.

The human economy won’t disappear. But it will likely become a slower, higher-level layer that sits on top of this massive, hyper-efficient, tokenised machine economy. Humans will hold the real estate and the physical assets; agents will run the plumbing, the finances, and the compute. And I really think this is for the better.

The thermodynamic view of the economy

If this high-frequency agent economy is going to run in the background, trading tokenised commodities and executing complex collateral chains, what exactly are these algorithms optimising for? What is the fundamental “physics” of this new financial system?

One of the chapters I enjoyed the most in The Last Economy is the one that presents the idea that the economy is fundamentally a machine for reducing entropy.

“The economy, as a complex adaptive system, evolves to favor configurations that are most efficient at creating predictive models of their environment

[...]

It is a phase transition in the efficiency of entropy reduction. Economy based on ordering machines and entropy reduction. Value is not a pre-existing substance.. It is a state of low entropy, a temporary victory against chaos, achieved by intelligent agents sorting the environment.

[...]

Each action is a small, incremental denoising step, an attempt to move the chaotic state of the present slightly closer to a more ordered, predictable future. Remove uncertainty, thus entropy.

[...]

The forward pass of diffusion models is a perfect simulation of the second law of thermodynamics.”

Norbert Wiener, the father of Cybernetics, reached a similar conclusion in 1950: “In control and communication we are always fighting nature’s tendency to degrade the organised and to destroy the meaningful; the tendency for entropy to increase.”

And one can’t frame an argument using thermodynamics without mentioning Maxwell’s demon. For those of you unaware, Maxwell’s demon is a thought experiment where a demon controls a door between two chambers containing gas. This demon would open and close the door to let hot and fast particles enter one side, and cold and slow ones the other, in this way ordering “the universe of particles” from these two chambers.

This thought experiment appeared to disprove the second law of thermodynamics, because no energy was spent by the demon to reduce the entropy of this closed system. Turns out, the solution to the thought experiment is “information”. In order for the demon to know when to open this door, it needs to know the position, direction, and speed of a particle to predict when to open and close the door. The energy spent on this measurement is the one spent to reduce the entropy of the system.

Using this same framing to the economy, where the energy spent to reduce the entropy is that of acquiring information to order the system. In a world where intelligence is the primary economic input, the unit of value for these agents will not be raw energy or raw compute, it will be intelligence per unit of energy. Not MWh, but useful output per MWh. Not tokens, but insights per token. The most valuable thing in the AI economy will not be the entity with the most GPUs; it will be the entity that converts a given amount of energy into the most useful intelligence most efficiently (context is all you need, apologies for the self-cite).

This reframes the competitive dynamic entirely. The most efficient models running on the most efficient hardware will capture the most value. Inference efficiency, not raw scale, becomes the monetary advantage. And the race to improve it, smaller models, better quantisation, more efficient architectures, becomes more important. This links with another strong opinion I’ve been having lately where I think that small models running in the edge will be what enables a real agentic economy, and where the value of this economy will be captured.

My take is that superintelligence is a set of decentralised agents collaborating.

Let’s wrap it up for now!

And this ended up being waaay longer and taking me waaay more time as originally expected (as always).

My goal with this post was to try to dump in writing a lot of the disconnected ideas that I’ve been having lately around the economy, AI, tokenisation, etc after reading a lot about it. They may feel a bit disordered right now, but my plan is to use this first stop as a way to collect feedback from all of my smart readers, and then extract each of them into an isolated (and hopefully better explained) post.

So if you have any feedback or strong opinions about any of the topics and framings presented here, I would love to hear them.

To help you digest all oft his, let me try to share one last time a map of the high-level ideas of my line of thinking:

I first expect new-age commodities to enter our current economy as AI starts becoming more critical for our day-to-day lives.
As the new pipes of the financial system start establishing and permeating the economy (in the form of stablecoins and tokenised markets) we are start seeing an agentic economy developing over them where human intervention will be extremely limited.
Up to a point, where there will be two distinct economies: a fast paced one involving agents, and the day-to-day one for human interactions.
Finally, I was planning to completely remove the thermodynamics section, but I love physics, and the framing of the economy as a thermodynamic system with the goal of lowering the entropy (creating order) through intelligence blew my mind. Even if readers hate it, I couldn’t publish it for posterity :)

They say that the solution to the “too many ideas syndrome” is to write to get those ideas organised. This is that piece of writing that I needed.

@adlrocha - Intelligence is a commodity. Context is the real AI Moat.

adlrocha — Sun, 01 Mar 2026 09:00:46 GMT

Last Thursday I had the opportunity to attend the February edition of the AI Socratic Madrid meetup. It was the first time I attended so I didn’t know what to expect. I have to admit that I was gladly impressed. The room was full of talented people with really strong opinions about AI and how it’ll impact our work and our society.

The list of attendees included entrepreneurs working on RL environments and agent security, researchers and engineers working on confidential computing and on-device inference, professors on critical thinking and electrical engineering, AI alignment and governance experts, VCs, and even marketers made coders through AI.

A fun crowd to hang out with.

Socratic Dialogues and the impact of an AI-first society

The first part of the meetup consists of what they call “Socratic Dialogues” that is basically an open-ended conversation about the latest news on AI. Here we discussed (of course) OpenClaw, Moltbook, and what having autonomous agents in the wild like OpenClaw entails for the way we work, the Internet and society.

I obviously do not remember every nitty-gritty detail about what we discussed: I remembered discussing how each of us currently used AI on our day-to-day; which models we thought were better; where we expected them to be in the next few months; and our experience with coding agents and their performance.

But the topic of conversation that I enjoyed the most was when someone raised the question of “what would be the role of humans in an AI-first society”. Some were skeptical about whether we are ever going to reach an AI-first society. If we understand as an AI-first society, one where the fabric of the economy and society is automated through agents interacting with each other without human interaction, I think that unless there is a catastrophic event that slows the current pace of progress, we may reach a flavor of this reality in the next decade or two.

If this is the case, what is the role of humans in a scenario where work is no longer necessary? This is significant because, since the industrial revolution, work has played an important role in shaping an individual’s identity. How will we occupy our time when we don’t have to spend more than half of our waking hours on a job? It probably won’t surprise you, but I’ve personally thought a lot about this lately, and yesterday I managed to share my view (and stress-test it) with people smarter and better informed than me (and this post is my second chance).

My opinion is that what really shapes humans’ identity, and what we crave for is community. Even if we lived in a society where reality is shaped by superintelligent AIs instead of ourselves, we can still be happy. It may hit the ego of many that we are no longer the most intelligent being in the planet, but the same way that a chimpanzee living in the wild can live happily and is not aware of the worries and scares of the stock market and geopolitics, we can live a happy and fulfilling life without worrying about the daily operation of our reality handled by the AIs.

What really worries me about this reality is not that I will lose my identity, purpose, or that I won’t be able to know what to do with my time. I’ll still want to read old worn out books, enjoy a conversation over coffee with a friend, or hit the court for some hoops, independently of what these higher intelligences are doing. As someone put it yesterday: “I don’t think the conversation we are having in this room would change substantially in an AI-society”.

What worries me is if the AIs shaping our society (and thus our reality) is not aligned with human existence, and if it will end up deciding unilaterally that it is suboptimal for us to exist. Some call it AI alignment, some AI existential risk, call it as you wish but this is what really worries me about an AI-first society (I am already cooking a post about this topic to publish it in the next few weeks).

We are horrible at communicating intent to AIs and LLMs. We are sloppy and have a hard time painting every possible scenario for the AI to execute flawlessly. You’ve probably had this experience where you ask the AI to “make all tests pass” and it ends up removing adding an assert(true) on all of them.

Extrapolate this to a global scale and with superintelligent AIs. The “governor” of a superintelligent AI system may use the well-intention prompt of “removing all carbon footprint from the Earth”, and the AI may realise that the most efficient way to do this is to remove humans (and cows) from the Earth, as we are the ones contributing the most to this footprint.

We want the reality shaped by superintelligent AIs to be a function of human existence (f(humans)), not a constant within an AI society (f(AIs) + humans). Many outside of this echo chamber do not have the slightest idea of what the release of OpenClaw entails where we are heading, but to me this is the first realisation of the kind of primitive autonomous agents that we can start seeing shaping our society in the near future.

I once said that the moment that we give autonomous agents the ability to interact freely with the environment it will scare the hell out of me. Well, it took less than what I would’ve expected.

Context is all you need

The second part of the event opens the room for any of the attendees to give a talk, and I had the chance to give a quick talk that I titled “Context is all you need”. This talk was a continuation of this post that I wrote a few weeks ago about how I thought that apps would become obsolete.

You can have a look to the slides I used here, but let me give you the highlights of the talk (that way I can share my view with you too):

Intelligence is becoming a commodity. It is increasingly easier to get your hands into reasoning and intelligent models that are able to run complex logic for you on demand. When access to intelligence and the ability to solve complex tasks is a commodity, what really matters is to provide this intelligence with the optimal context and connections to their environment that allows them to solve that task. My thesis is this context is the product (and the moat) in the era of intelligence.

Many investors are saying that the pyramid of value accrual from the Cloud, where SaaS applications were capturing orders of magnitude more value than the lower layers of the stack, has been inverted in the Gen AI stack. Lower layers of the stack (i.e. hardware providers and hyperscalers) will be the ones capturing the most value as the opportunity to capture value in the application layer will be limited and saturated by a small number of players (i.e. AI Labs).

I don’t agree. I think that what these investors are missing are all the software that will be built on top of the intelligence provided by the frontier labs. They are still not seeing the top layer of the Gen AI stack that will replace the current role of the SaaS layer in the cloud industry stack. This layer will be comprised of all those connections, source of context, and security sandboxes required to run the agents.

I think that what fundamentally changes in an AI-powered software industry is the way that software is shipped. The paradigm is changing, and instead of shipping code to solve a narrow task for all users, what is going to be shipped are general-purpose agents that modify themselves to adapt to the environment and the task (hence the context being the product).

This is what I realised through the toy example of this post. I just needed a general-purpose agent (Claude code), a reliable source of data (Baselight) and the right context (through a set of local files with “skills” for my agent to activate its capabilities when needed) in order to solve my problem. But the only code that was actually executed on my machine was that of claude code.

We are even seeing a similar trend already with the “second generation of OpenClaws”, as noted by Karpathy on this tweet. OpenClaw is around 400k lines of code for a while loop and the list of all the integrations and connections supported by the system. The next generation of Claws only have around 4K lines of code for the core, and the rest are just skills (i.e. markdown files) that tell the agent how to implement or run the code for the specific connections that want to be enabled (like a plugin system).

A user using one of these second-generation Claws only needs to node the core logic (that can be easily understood and audited) and can leverage the skills (as the plugins) to activate the functionality that they need for their case. This is another good example of this new trend of shipping software as “adaptive software”.

And I want to close this post in the same way that I closed the talk last Thursday: I think we live in interesting times where we are seeing a new paradigm for shipping code. My contrarian opinion (or maybe not that contrarian after all from what I heard yesterday) is that the value capture in an AI-powered software industry will come from this layer on top of the frontier labs where the context and the runtime are the product, along with HW-SW co-design.

I don’t think the Nvidias and ChatGPTs will end up capturing all the value that it seems they are going to capture judging the current state of affairs. I think they are going to regret all the investment on chips that they are currently doing. I understand why they are doing it as a way to boost their valuations, and justify the investment, but this is going to really bite them back.

The best part of sharing such strong opinions weakly held in a post like this is that I will for sure get feedback and counter-arguments that push me to change my opinions or hold them more strongly. So if you have thoughts about all of this I would love to hear them. Shoot me an email (if you want to keep them private), or drop me an email (for a public discussion). Until next week!

@adlrocha - The Space Data Centre Delusion

adlrocha — Sun, 22 Feb 2026 08:30:17 GMT

Everyone is talking about shipping data centres to space as the only way to scale the amount of compute that we are going to need for AI in the near future. The core argument? Space solves the energy bottlenecks that we will eventually face (and in many cases are already hitting us) on Earth as we start scaling.

Elon Musk was in Dwarkesh Patel’s podcast (which I still like to refer to as Lunar Society) sharing his thoughts around this. Let me share here a few of the highlights from that conversation to build some shared context (I wouldn’t want you to have to listen to a 2h podcast to take the most out of this post):

According to Elon, the primary driver for moving data centres to space isn’t real estate; it’s the sheer lack of available power on Earth. To power future AI models, Musk envisions needing terawatt-scale energy. For context, he pointed out that the entire US currently averages about half a terawatt per hour.
The main advantages of having data centres in space is that, in his words, “it’s always sunny in space”, so no need for batteries as there is constant generation; there is zero atmospheric loss (because when solar energy reaches the panels it hasn’t been degraded by the atmosphere); and cheaper hardware can be used for the solar cells because you don’t have to battle inclement weather

“Those who live in software land are about to have a hard lesson in hardware. Scaling on Earth means navigating brutal regulatory hurdles to build power plants, securing land permits, and sourcing massive amounts of electrical transformers, all of which act as severe bottlenecks.” – Elon Musk at Dwarkwesh’s

He even ventures to make one of his (always accurate) predictions: “My prediction is that it will be by far the cheapest place to put AI. It will be space in 36 months or less”, and he refines, “probably closer to 30 months”, stating that space will become “the most economically compelling place to put AI.”

We should of course contextualise all these claims around the recent announcement of xAI joining Space X: “Space enables “ridiculous improvements” in AI scaling, positioning SpaceX as a potential major AI hyperscaler with high-frequency launches”.

I think Elon’s intentions are clear. He wants to power xAI from space leveraging Starlink without having to worry about Earth regulation and expensive turbines.

He has pulled off crazier plans before through his maniacal sense of urgency: affordable electric vehicles, self-driving cars, and a reusable spaceship that was nothing but a sci-fi dream a few decades ago. But how much of this space data centre talk is marketing, and how much is a technically and economically grounded plan?”

Follow me down this new rabbit hole.

Where are we today?

Let’s kick this off acknowledging the status quo. From what I’ve been researching, there seems to be a single recognised data centre in space. It consists of a LEO satellite with a single H100 GPU launched in November 2025 as reported by this (obviously biased) site.

They also acknowledge how “China has already launched an initial cluster (12 satellites in May 2025) described as the start of a ‘Three-Body Computing Constellation’ but that cluster still needs to scale to thousands of nodes to reach the kind of distributed supercomputer those announcements implied.”

I highly recommend reading this FAQ for a grounded view of what space data centres entail, and how they may be better suited for Earth-observation and near-space computing than providing general-purpose AI inference to Earth.

As they clearly acknowledge on this site, “space is hard: launches are expensive, cooling requires radiators and careful thermal design, and radiation hardening raises costs. Because of those constraints, most other satellites with onboard compute are still experiments or marketing exercises and don’t qualify as full operational data centres yet.”

The Bull Case

Along with Elon, Gavin Baker also appeared on a recent podcast sharing a list of reasons why he thought that, from first principles, data centres should be deployed in space instead of the Earth.

Abundant and Continuous Solar Power: In space, satellites can remain in sunlight 24 hours a day, unlike on Earth where solar is intermittent. The sun’s intensity is about 30% stronger in space, leading to roughly six times more solar irradiance. This eliminates the need for batteries, a massive cost factor on Earth deployments, making solar the lowest-cost energy source available in the solar system.
Free and Efficient Cooling: Cooling accounts for a huge portion of a data centre’s mass, weight, and complexity on Earth (e.g., HVAC systems, CDUs, and liquid cooling). In space, you can simply attach radiators to the dark side of the satellite, rejecting heat into near-absolute zero temperatures. This makes cooling essentially free and far simpler, slashing costs dramatically. . I still remember when I was working at energy efficiency and the PUE was the key metric everyone was trying to optimise.
Superior Networking Speed: On Earth, racks in data centres are connected via fiber optics, which transmit lasers through cables. In space, linking satellites with lasers through vacuum is inherently faster than through fiber, creating a more coherent and efficient network overall. Or so it seems.
Reduced Latency for Inference: For real-time applications like AI inference, space-based data centres enable direct satellite-to-device communication (e.g., via Starlink’s direct-to-cell tech). This bypasses Earth’s multi-hop routing (cell tower to base station to fiber to data centre and back), resulting in lower latency and a better user experience. I have to admit that I cringed a bit when I read this (professional bias).

He emphasizes that from these physics-based first principles, space data centres are always superior to Earth-based ones, assuming launch costs continue to drop. But let’s be honest

To complement the foundation with some an economic framing, here’s an interesting tweet from Tomas Pueyo that breaks down at a high-level the fixed cost of on-the-ground and space data centres in order to understand the level of savings that can be achieved. There’s also this great write-up from the comma.ai team about what it takes to own and operate an Earth 5M$ data centre (slightly off-topic, but I thought it would be a great reference for those of you that do not understand in-depth what does it take to operate a data centre).

The Harsh Reality

If putting data centres in orbit is so great, why haven’t we done it already? On the other side of the argument we have pieces like this one from Andrew Yoon that brings back this study from Google from last year that looked at the viability of doing AI in space. “The authors envision a constellation of 81 satellites flying in close proximity, and argue that if the cost of launching stuff into low earth orbit fell to $200/kg, it could be competitive with an equivalent ground-based data centre. They project this might happen around 2035 if SpaceX’s Starship program succeeds.”.

If you listen to the podcast you’ll see that the key thing that Elon is trying to de-risk is the cost and the frequency with which it can ship compute nodes to space.

However, there’s more to this. Training and serving frontier AI at scale takes hundreds of thousands of GPUs. This translates into “hundreds of millions of satellites in orbit. Satellite deployments at this scale would dramatically increase the risk of Kessler syndrome: a cascading explosion of debris crippling our access to space.”

Even more, satellites can’t be upgraded at scale. If there’s a new (better) chip architecture, there is no easy way to upgrade satellite nodes at scale. Even more, if AI ends up being a bubble, and demand doesn’t catch up with the amount of compute being deployed (like it happened with dark fiber and other tech bubbles in history that requires infrastructure investment), we may get dark data centres also in space, worsening the Kessler syndrome from the previous point.

The Physics of Space

We have a complete mess of arguments of why data centres in space are feasible, but what is the physical reality of it? Let’s dive into some of the main physical bottlenecks that will be faced as we try to put data centres in space.

The Thermal Wall

One of the most persistent misconceptions is that because space is cold, cooling a data centre is easy. In reality, a vacuum is the ultimate insulator. On Earth, data centres cool themselves via convection and conduction; they pump chilled air or water over the servers to carry the heat away. In the vacuum of space, convection is physically impossible.

The only way to dissipate heat in space is through thermal radiation (emitting infrared light into the void). This mechanism is governed by the Stefan-Boltzmann law, which says that the power radiated is proportional to the surface area and the fourth power of the temperature. This means that to achieve the optimal temperature of operation of AI accelerators, huge radiators (or chips) would be required to dissipate the heat (with the corresponding increase in payload).

Radiation and Silent Data Corruptions

As I was listening to Elon talk about chips in space I had only one word in my mind: “radiation, radiation, radiation”. As an electrical engineer by training, when I was in college I was always scared of high-RF and space circuit design, for obvious reasons: high-frequency and radiation are brutal for circuits.

You might hear the argument that “AI is stochastic, so a few bit flips don’t matter.” This is a fatal misunderstanding of how AI training works. When a high-energy cosmic particle strikes a 3-nanometer transistor, it causes a Single Event Upset (SEU), flipping a 0 to a 1. In standard software, this might just crash the program. But in AI, it causes what the industry calls a Silent Data Corruption (SDC).

A 2021 paper by Meta and Google titled “Silent Data Corruptions at Scale” detailed how these undetected hardware errors pass corrupted math directly into the application layer. If a cosmic ray flips a bit in the exponent of a floating-point number during a training run, a benign number can instantly become massive. This causes a gradient explosion, silently poisoning the neural weights of the entire multi-million-dollar training run without ever triggering an error code. If SDC are already happening on Earth, imagine how often they could happen in the face of radiation.

I understand that initially space data centres are planned exclusively for inference, where data corruption may not be as catastrophic, but still… Building radiation-hardened chips is expensive and relies on thicker semiconductors (with the corresponding performance hit), but I don’t know to what extent we can build the mechanisms to minimise them on off-the-shelf chips (although I have to admit this is a really cool open problem that I would love to work on :) ).

Lasers v.s. Fiber

Proponents of space data centres correctly note that the speed of light is roughly 30% faster in a vacuum than it is inside a glass fiber-optic cable. From a purely physics-based “first principle,” linking satellites via laser sounds vastly superior. But this ignores bandwidth density and the brutal physics of signal dispersion.

Inside a terrestrial data centre, AI clusters are connected by millions of parallel fiber strands, moving Terabytes across the pod to keep the GPUs fed. In space, firing lasers between satellites requires perfectly aligning optical transceivers across the void. As Google detailed in their November 2025 Project Suncatcher research paper, achieving terrestrial-level DWDM (Dense Wavelength-Division Multiplexing) bandwidth via space lasers is incredibly difficult because the optical signal disperses over distance.

To make this work for AI training without losing the signal, Google’s preprint paper modeled a notional 81-satellite cluster that couldn’t just float freely; the satellites had to fly in an incredibly tight, just 100 to 200 meters apart. Not only does this require constant, fuel-burning maneuvers to prevent the billion-dollar nodes from colliding, but you still have to get that data back to Earth. High-bandwidth laser downlinks are notoriously susceptible to atmospheric interference. If a thick cloud system parks itself over your ground station, your latency advantage instantly vanishes.

I know, I know. You went into training again. But even for inference, think of the time and bandwidth required to either move huge amounts of data from Earth into space, or between satellites. Depending on the use case, this transmission limitations may become a problem for day-to-day AI use.

Should we rethink the silicon then?

So, where does this leave us? Elon Musk is absolutely right about one fundamental thing: the energy bottleneck on Earth is a severe, existential threat to the scaling of AI.

What if instead of shipping current chips to space we try to architect a new computing paradigm that is not as power hungry as a general-purpose matrix multiplication accelerator as GPUs? I’ve been thinking about this for some time now with the emergence of companies like Extropic and their thermodynamic sampling units, all the work around photonic chips, and non-deterministic computing (astute readers will have noticed that I on purpose left quantum computing out of this bag… at least for now).

I think it is time for me and this newsletter to dig deeper into this new computing paradigm and what is possible to avoid having to ship Nvidia chips into space.

@adlrocha - Agents will make Code and Apps obsolete

adlrocha — Sun, 15 Feb 2026 08:52:53 GMT

Karpathy already predicted it in 2024, “English is the hottest programming language”. I have to admit that while I agreed with the claim then, I wouldn’t have expected that we would have come such a long way in a bit less that two years.

To me, and many others, the release of Opus 4.5 into Claude code was that moment where English really became the best way to interact with computers. You described what you wanted and Claude code would get you working code that did what you wanted.

But after some experiments I’ve been doing the past few days, I not only think that English is the hottest programming language, I also think that agents have made programming languages and applications obsolete. Let me try to explain to you why.

I am glad to be a mess

I’ve always been a mess managing my personal finances and investments. I’ve tried everything, and I’ve always failed at it.

Open-source and SaaS personal finance apps like homebank that I could use out-of-the-box.
Writing my own web app with a simple web interface that allowed me to track expenses and make updates over my portfolio. This is actually the one that led to better results for me, but the fact that I had to squeeze maintenance work for the app, even if minimal, along with my daily responsibilities resulted in me not doing it, and ending up abandoning the app (and thus the habit)
An excel sheet with monthly expense tracking and portfolio performance. This is what I am using now, but updating the sheet every month is taking me close to what took me to add the few changes and features needed for my app.. So I was back to square one.

But we live in the world of Baselight, LLMs, coding agents, and agentic social networks. There had to be a better way.

And not only had I found it, in the process I came to a realisation: Agents may be all we need!

Where it all started

It all started when I was reading a post from Lyn Alden’s newsletter where she was describing how the overnight financing markets were running into liquidity shortages, and that the Fed would likely begin balance sheet expansion in the not-too-distant future.

I then thought, “wow, it’d be cool if I could track these signals myself in some way, and this data must be public through the FED data”.

So I went into Baselight and asked the question that you see in the image below. Bingo! We had what I needed.

I then asked Baselight to make an analysis about recent events where the FED had to step in to inject emergency liquidity into the repo markets. It came up with this cool table and descriptive.

I was doing this analysis between diaper change and diaper change, while having in the back of my mind the urge to find a better way to handle my personal finances. Lyn Alden, Ray Dalio, and all these renowned macro investors back their investment decisions and portfolio allocation in a lot of macroeconomic data.

When it occurred to me. What if I use claude code to create a financial assistant that helps me make better investment decisions leveraging all the data that we have in Baselight?

That is how it all started, by connecting the Baselight MCP to Claude code. Since then, things started escalating quickly.

The Birth of Mr. Malone

After the first prompt I quickly realised that I could not only use Claude code connected to Baselight to help me make better financial decisions, I could also provide it with all my financials so that it could help me keep track of them, evaluate together my financial health, and make investment decisions using my own personal context instead of making them in the void.

I could even use it to track all my historical decisions so in order to help avoid recurrent mistakes (at some point I had a specific notebook for this, I don’t know where it is anymore. That’s how that went).

That’s how Mr. Malone was born

Note: Indeed, the name of the assistant, Mr. Malone, comes from Kevin Malone, one of the members of the accountant departments at Scranton’s Dunder Mufflin from The Office (the TV Show).

And here comes the realisation, why write the code for an assistant, if I could instruct claude code to become the assistant that I need.

Fine, Claude code is a coding agent but turns out it is general-purpose and customisable enough for you to instruct it to become any kind of specialised agent, provided the right context. No need for a single line of code. Only a descriptive enough CLAUDE.md to steer the “logic” of the agent.

Even more, all the persistent data storing the information about my expenses, assets, etc. (i.e. the memory of the agent) can live in a set of markdown files. No need for a cumbersome and expensive database infrastructure (or a deep technical background to run it). Many agents already use markdowns or JSONL to store their long-term memory and context. Why not use the same approach for the “database” of my agent?

Implementing agents using English language, a set of markdown files, and in my case an MCP connection to Baselight, has also several interesting consequences.

I can use git for version control and Github as the hosting service to sync it among all of my devices.
I am currently using Claude code as the “reasoning machine” for my agent, but I could use it with any other agent that supported CLAUDE.md (or AGENTS.md), MCP connections, and the tool calls that I needed (that at this point is basically calling Baselight, reading from files, writing to files, and in some exceptional cases web search. Nothing that most agents do not already support).
The assistant can run anywhere where Claude code (or other agents) run: your local device, a sandbox environment in the cloud… anywhere. And the dependencies are minimal: Claude Code. You pull the repo, you start the agent in that repo, and you are good to go.

I pasted above the tree of my agent. As you can see, the implementation is just a bunch markdown files with CLAUDE.md being the entrypoint of the agent’s behavior.

Files like portfolio.yaml, action_plans.md and decisions.yaml make up the memory of the agent, and other files like themes.md are used as a way to extend the core behavior of the agent to align with my most recent investment theses.

The operation of these extra files like themes.md is similar to that of Claude skills. For those curious about them, this is the structure that these files currently have.

Apart from these markdown files, the repo only includes a set of scripts used by git push and pull hooks to encrypt sensitive data before pushing it, and conveniently decrypting it when pulling the repo (all this lives in a private repo, but I am a bit paranoid about privacy. Even more after reading this tweet).

Finally, the queries directory include a set of pre-built SQL queries for the agent to pull the data from Baselight that it recurrently use in order to update “its view of the economy”.

Finally, docs include blog posts, articles and papers from different sources that I want my assistant to keep around and eventually review when making decisions.

Obviously, I not only did not write or execute a single line of code to implement my assistant, I also didn’t write any of these markdown files by hand. I may have edited a few sentences here and there, but the code was all generated by Claude code.

Even more, I was driving some of the features in the implementation with the first version of the agent already running, so Claude code was acting at the same time (and in the same session) as Mr. Malone and my coding agent. Mixing up the context of the same session for two very distinct purposes may not be the best idea, but I have to admit that the context size of modern models has become large enough not to experience any relevant regression in the agent’s output by doing this.

Mr. Malone In Action

Here’s an example run of Mr. Malone. It first performs a few Baselight queries to build its context, and then generates an output with the analysis. Of course, the analysis is way more extensive and includes a detailed analysis over my portfolio (but I’ll redact that for now, if you are interested about my allocation and want to discuss investments drop me a message).

I even have a few recipes baked into Mr.Malone to help me analyse specific economic themes and run recurrent queries in order to build the context with specific knowledge for a decision.

The result of my analysis or investment decisions may generate as an output an action plan that is stored in memory and taken into account for future decisions and agent interactions. I can easily review my action plans through the /action-plan recipe as you (see image below).

I could also go into the process of feeding portfolio and expense updates to Mr. Malone, but I don’t want to make this too long.

I am considering to, at some point, remove all private info, generalise the repo, and fully open-source Mr. Malone for everyone to benefit from it. Again, if this is in any way of your interest let me know.

Agents will make code and applications obsolete

This unplanned experiment with Mr.Malone made me realise something about the future of AI: it may not only make code but also applications obsolete.

I just needed an interface to an intelligent agent with enough customisation capabilities to get a computer run the logic that I needed. It all boiled down to:

A reasoning machine.
The right instructions to steer “the logic”
The right context and data sources (this is why Baselight is great :) )
The required (and ideally minimal) connections needed for its operation.

Many of you may counter-argue, “sure, you haven’t written (or executed) a single line of code, but you still need a terminal and the application is essentially a cli”.

But this is just a consequence of the environment (let’s call it runtime) where the reasoning machine that I am using is running. Soon, we will get the equivalent of claude code with a graphical user interface where one of its “tool calls” or “connections to the environment” is to depict graphical artifacts.

What the hell, Claude and ChatGPT already do that on their apps, and we make beautiful charts on Baselight, so it would be quite straightforward to add that to Claude Code or Mr. Malone. Actually, I may even be able to run Mr. Malone directly in ChatGPT and Claude out-of-the-box (I’ll try this and report back).

And then, why would one need to implement an application in the first place if it could have an agent do what they needed? Sure, LLMs are a bit unreliable, they hallucinate, and they are stochastic machines and there are things for which we need deterministic outputs. But LLMs are becoming smart enough that with the right tricks you can steer their behavior to be close to that of the deterministic logic of an application.

As I was working on Mr. Malone and writing this post, I couldn’t get out of my head this LLM OS image used by to describe his idea of Software 3.0.

LLM OSes are not only a reality now (we tend to call them agents now), but we are reaching a point where everything may be running as an LLM OS.

English really is the hottest programming language. And Mr. Malone is just the beginning.