@adlrocha - A glimpse of the new software engineering
What the Claude Code leak reveals about how agent engineering actually works
The dust has finally settled. With the preview of Mythos, the announcement of Project Glasswing, Claude reportedly becoming dumber, and the release of Claude Opus 4.7, the Claude Code leak from a few weeks ago has been somewhat forgotten. So now I can pull this from my backlog and share my own analysis on the topic away from the initial noise.
How it all happened
On 30 March 2026, someone at Anthropic made a packaging mistake. Claude Code version 2.1.88 shipped with a 59.8 MB JavaScript source map, a file that, by design, points back to the original unobfuscated source code. The result: roughly 512,000 lines of TypeScript, sitting on npm, the full internals of Claude Code’s CLI exposed for anyone who knew to look.
Anthropic confirmed it quickly, and admitted it was a human error in the release pipeline, not a breach. They began issuing DMCA takedowns the same day, but the internet had already moved: mirrors appeared on GitHub within hours, and researchers were pulling the code apart in real time. I myself managed to get my hands on one of these early mirrors. But just a few hours later they were all gone.
However, the community response was crazy! I was following it live and I couldn’t believe it; what came next really wouldn’t have occurred to me in the heat of the moment. A group called UltraWorkers kicked off a clean-room Rust rewrite, claw-code, built not by hand, but using Claude and OpenClaw. AI agents were reading and porting the leaked TypeScript into new languages to escape the DMCA takedowns (I also came across a port in Python). The claw-code repo hit 100K stars faster than any project in GitHub history (probably bots from the fake stars economy? Still funny and smart in any case).
So: the most advanced coding agent’s source code was leaked, and the internet responded by using other AI agents to rewrite it from scratch to prevent all that beautiful knowledge from being lost. It was just brilliant. And honestly, I can’t blame them. What regular Claude Code user hasn’t been curious about its inner workings? I’ve mentioned it a few times in this newsletter: one of the things I love the most about open-source is the ability to read a project’s source code in order to understand how all its cool features are actually implemented. I like seeing how the sausage is made.
I am sorry for Anthropic, but this leak was a dream for me :)
What the code actually showed
Everyone was expecting the leak to confirm the common belief: that Claude Code is a thin TypeScript wrapper routing your prompts to a very good model, and that the intelligence lives entirely in the model. If that were true, there’d be nothing interesting here (and I would be done with the post :) ).
But the source code told a different story (see? I knew there was some magic hidden behind that code, my craving to read the source code was in the end justified. My spidey-sense was right).
What was hidden was a piece of serious systems engineering: memory management, compaction hierarchies, structured tooling, anti-distillation countermeasures, background daemons that run while you sleep, etc. All of it exists not in service of the model, but to work around its limitations. The model runs inside this system the same way a processor runs underneath an operating system, with the system adapting itself to the underlying architecture and ISA of the processor. The model is not the product. The scaffolding is. And I think this is one of the biggest learnings after reading this code base: there’s a new piece in the stack in the form of raw intelligence, but this raw intelligence in itself won’t be able to solve every problem (like coding). There will still be a lot of engineering involved in building great products.
Let me walk through the parts I found most interesting, and explain why I think they point to something bigger about how engineering itself is changing.
Context management
Anyone who has used a language model for a long session has felt the degradation: the context window is the enemy. As conversations grow, models lose track of earlier decisions, repeat themselves, contradict prior reasoning. You could clearly feel this in early versions of Claude Code, where a few prompts in, it would’ve completely forgotten about its CLAUDE.md. The naive fix is to build bigger context windows into the model. But of course, this is not an easy task, and running a model with a big context window has the corresponding impact on infrastructure requirements. The same way that we don’t have infinite cache in processors, I don’t think we will ever get infinite context windows in models.
To work around this context window limitation, the codebase implements a four-layer compaction hierarchy. First, proactive compaction monitors token counts and quietly summarises older messages before they reach the API. If that estimate misses, reactive compaction catches the overflow and retries. Automated headless sessions get snip compaction, which truncates at defined boundaries (an interesting related point: Claude Code is told whether the user is “looking” or whether it is running a headless session). The fourth layer, context collapse, compresses verbose tool results mid-conversation, storing a reference on disk instead of keeping the full output in context, with selective retrieval when needed. You’ve probably seen that “compaction” stage being triggered every now and then in your long sessions.
Claude Code has multiple silent mini-compaction events before global compaction kicks in. And here’s the secret to why long sessions feel so good (and long) compared to the degradation you feel when interacting with raw models: the compaction is invisible. The context window feels larger than it is because the system is constantly curating what the model actually reads.
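To make the first two layers more concrete, here is a minimal Python sketch of proactive compaction with a reactive fallback. Everything in it is an assumption for illustration: the names (estimate_tokens, summarise, compact, send_with_retry), the 4-characters-per-token heuristic and the budgets are hypothetical and not taken from the leaked code.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    text: str


class ContextOverflow(Exception):
    """Raised by the (hypothetical) model client when the prompt is too long."""


def estimate_tokens(messages):
    # Crude heuristic (assumption): roughly 4 characters per token.
    return sum(len(m.text) // 4 + 1 for m in messages)


def summarise(messages):
    # Stand-in for an LLM call; a real system would ask a model for a summary.
    return Message("system", f"[summary of {len(messages)} earlier messages]")


def compact(messages, budget, keep_recent=2):
    """Proactive layer: fold older messages into a summary once over budget."""
    if estimate_tokens(messages) <= budget or len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarise(old)] + recent


def send_with_retry(messages, budget, call_model):
    """Reactive layer: if the estimate missed, compact harder and retry once."""
    prepared = compact(messages, budget)
    try:
        return call_model(prepared)
    except ContextOverflow:
        return call_model(compact(prepared, budget=0, keep_recent=1))
```

The point of the sketch is the ordering: the cheap estimate runs on every turn, and the expensive recovery path only runs when the estimate turned out to be wrong.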
Then we have the longer-term memory layer called KAIROS, referenced in the code as “Dream Mode.” After 24 hours of inactivity and at least five prior sessions, a background daemon wakes up, reviews the agent’s memory files, prunes contradictions, consolidates learnings, and rewrites the index so future sessions load fast. The system prompt for that subagent reads: “You are performing a dream, a reflective pass over your memory files. Synthesise what you have learned recently into durable, well-organised memories so that future sessions can orient quickly.”
Sleep-based memory consolidation, but in software. Whether that’s a deliberate nod to neuroscience or just a good metaphor, the architectural decision underneath it is genuinely clever: instead of keeping everything and hoping the model copes, the system periodically decides what’s worth keeping. An LLM-orchestrated garbage collection for knowledge (funnily enough, I just built a cron job for my personal agent system that does exactly this; we are all converging on the same ideas).
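As a rough illustration of the trigger logic, here is a small Python sketch. The two thresholds (24 hours of inactivity, at least five prior sessions) come from the description above; the function names, the tuple layout of the memories and the last-write-wins rule are my own assumptions, since the real pass delegates the judgement to a subagent.

```python
from datetime import datetime, timedelta

# Thresholds as described in the post; all names here are hypothetical.
MIN_IDLE = timedelta(hours=24)
MIN_SESSIONS = 5


def should_dream(last_activity, session_count, now):
    """Return True when the background consolidation pass is due."""
    return (now - last_activity) >= MIN_IDLE and session_count >= MIN_SESSIONS


def consolidate(memories):
    """Toy consolidation: keep the newest value per key.

    `memories` is a list of (timestamp, key, value) tuples. Older entries
    that contradict a newer one for the same key are dropped. A real pass
    would let a model decide instead of this last-write-wins rule.
    """
    latest = {}
    for _ts, key, value in sorted(memories):
        latest[key] = value
    return latest
```

For example, two conflicting notes about the package manager collapse into the most recent one, which is the "prune contradictions" step in miniature.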
Tooling approach
Give an AI agent raw terminal access and two things happen. The first is a security problem: any text the agent reads (a file, a log, a comment in someone else’s code) could contain instructions crafted to hijack what the agent does next. That’s prompt injection, and an unrestricted shell is the widest possible surface for it. The second is a context problem: bash output is an unstructured blob. It lands in the model’s context window as-is, verbose and uncompressed, eating tokens that could go toward something useful.
Anthropic’s answer was to give Claude Code no raw shell access at all for code navigation, and instead provide dedicated Grep and Glob tools that return typed, structured results with defined permissions per operation. This is output that the compaction layer knows how to summarise. Read operations run in parallel; write operations execute serially, and index files only update after a confirmed successful write. The logic for these tools is encoded in the architecture, not left to the model to figure out (which is the case for many of the regular GPT wrappers).
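The "parallel reads, serial writes" rule is easy to sketch with asyncio. This is only an assumed shape: the run_tools function and the (kind, coroutine function) pairs are hypothetical, and the sketch simply runs all reads before any write, which the real scheduler may or may not do.

```python
import asyncio


async def run_tools(calls):
    """Run read-only calls concurrently, then writes one at a time, in order.

    `calls` is a list of (kind, coroutine_function) pairs where kind is
    "read" or "write". Names and shape are hypothetical.
    """
    reads = [fn for kind, fn in calls if kind == "read"]
    writes = [fn for kind, fn in calls if kind == "write"]

    # Reads cannot conflict with each other, so they can overlap freely.
    read_results = await asyncio.gather(*(fn() for fn in reads))

    # Writes are awaited one by one so each sees the effect of the previous one.
    write_results = []
    for fn in writes:
        write_results.append(await fn())
    return list(read_results), write_results
```

The design choice is that the ordering guarantee lives in the scheduler, so the model never has to reason about race conditions between its own tool calls.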
The second piece is the LSP integration. I think all my readers are aware of what LSP stands for, but for those that aren’t, the Language Server Protocol is the protocol that powers “go to definition” and “find all references” in your IDE for your language of choice. When Claude Code navigates a codebase through LSP, it isn’t reading files as static text and trying to infer structure from what it sees. It has a live symbol graph: it can ask “what calls this function?”, “where is this type defined?”, “what are all the references to this variable?” and get back precise, structured answers. That’s a fundamentally different mode of understanding code than pattern-matching over raw file contents. It does what you would do when navigating a new code base in your IDE.
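For readers who have never looked at LSP on the wire, this is roughly what a "find all references" question looks like. The message shape (JSON-RPC 2.0 body, Content-Length header, the textDocument/references method) follows the public LSP specification; the helper names and the example file URI are my own, and nothing here claims to show how Claude Code itself issues these requests.

```python
import json


def lsp_request(request_id, method, params):
    """Frame a JSON-RPC 2.0 request the way LSP expects: header, blank line, JSON body."""
    body = json.dumps(
        {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}
    ).encode("utf-8")
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body


def find_references(request_id, uri, line, character):
    # Positions are zero-based in LSP.
    return lsp_request(request_id, "textDocument/references", {
        "textDocument": {"uri": uri},
        "position": {"line": line, "character": character},
        "context": {"includeDeclaration": False},
    })
```

The language server answers with a list of locations (file URI plus range), which is the kind of precise, structured result the paragraph above is talking about.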
Sebastian Raschka puts it well: the gap between Claude Code and a chat UI with uploaded files isn’t the model, it’s the tooling. The chat UI sees your code the way a reader sees a printout. Claude Code navigates it the way a compiler does.
The performance layer
The terminal UI is built with React and Ink, which surprised a lot of people when it surfaced. Most CLI applications reach for ratatui or ncurses. The tradeoff is a higher memory footprint (which explains why Claude Code’s memory usage can easily reach 8 GB for me in certain sessions), but it lets the same team share logic with the web interface, and React’s reconciliation model keeps terminal redraws minimal.
To compensate for the overhead, the implementation uses Int32Array-backed ASCII pools. During token streaming, every character that gets rendered needs its display width measured, normally a string operation that allocates a new object each time. By pre-allocating a fixed typed array and reusing it, the allocations vanish and the garbage collector has nothing to clean up. The result is roughly a 50× reduction in stringWidth calls and noticeably smoother streaming at high token rates.
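The underlying idea translates to any language: precompute the widths of the common characters once into a fixed typed array and only fall back to the slow path for everything else. Below is a Python sketch of that idea using the standard array module. It is an analogy under my own assumptions, not a port of the TypeScript code, and the fallback is a simplified version of what real width libraries do.

```python
from array import array
import unicodedata

# Pre-allocated table of display widths for the 128 ASCII code points:
# control characters take 0 columns, printable ones take 1.
ASCII_WIDTHS = array("i", (1 if 32 <= cp < 127 else 0 for cp in range(128)))


def slow_width(ch):
    # Fallback for non-ASCII (simplified): combining marks take 0 columns,
    # wide and fullwidth East Asian characters take 2, everything else 1.
    if unicodedata.combining(ch):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def display_width(text):
    """Sum column widths, hitting the typed-array table for ASCII."""
    total = 0
    for ch in text:
        cp = ord(ch)
        total += ASCII_WIDTHS[cp] if cp < 128 else slow_width(ch)
    return total
```

Since most streamed tokens in a coding session are plain ASCII, almost every character takes the table lookup and never touches the slow path.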
More interesting is what happens while you’re still typing. Claude Code pre-computes likely responses during user input: for simple confirmations like “yes”, processing has already started before you hit enter. A boundary marker in the system prompt (__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__) separates static from dynamic content, caching a block of shared instructions globally so they don’t need reprocessing each turn. The tool list is sorted alphabetically before every API call, not for readability, but to stabilise the KV cache key, so the model can skip the expensive prefill phase when nothing has changed.
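Why sorting matters is easiest to see with a toy cache key. In this sketch a hash over the static part of the request stands in for the real prompt cache, which is an assumption on my side (actual caches match on the token prefix); the cache_key function and the tool dictionaries are hypothetical.

```python
import hashlib
import json


def cache_key(static_prompt, tools):
    """Hash the static prefix of a request.

    Sorting the tools by name makes the key independent of the order in
    which they happened to be registered.
    """
    ordered = sorted(tools, key=lambda t: t["name"])
    payload = json.dumps({"system": static_prompt, "tools": ordered}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Without the sort, the same three tools registered in a different order would produce a different prefix and therefore a cache miss, even though nothing meaningful changed.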
Another point surprised many people when reading the codebase: Anthropic employees get a distinct set of instructions that only activate when the tool detects an internal user. A few that I wasn’t surprised they had to encode: “Never claim ‘all tests pass’ when output shows failures” and “Keep text between tool calls to ≤25 words.” The fact that Anthropic had to encode these explicitly shows that even the team at Anthropic relies on the same prompt engineering tricks that we all use extensively in our own CLAUDE.md files. It makes me feel a bit less dirty about my way of doing agentic coding :)
The parts Anthropic wasn’t advertising
Two features that surfaced from the code deserve special mention, because they reveal something about how Anthropic thinks about competitive risk.
The first is undercover mode. It activates automatically whenever Claude Code operates in a public or open-source repository, stripping all references to internal model codenames, unreleased version numbers, internal Slack channels, even the phrase “Claude Code” itself from commit messages and pull request prompts. The code includes a comment that’s blunt about its design: “There is NO force-OFF.” No override, no flag. The rationale becomes obvious immediately: Anthropic engineers use Claude Code on internal repos daily. Without this, the model could easily drop “Capybara v8” into a public commit message without anyone noticing.
The second is anti-distillation. They have two mechanisms to prevent distillation: fake tool injection, where decoy tool definitions get mixed into API requests to poison the training data of any competitor recording Anthropic’s API traffic; and connector-text summarisation, where the server buffers the model’s chain-of-thought between tool calls, returns only a cryptographically signed summary, and never exposes the full reasoning externally. Competitors who record the traffic get sanitised outputs. The real reasoning steps never leave.
And then, one of my favorite low-tech but funny features: frustration detection. A regex pattern, not a model inference call, that identifies when a user is expressing frustration through profanity or complaints (in case you are wondering, yes, it is triggered every time you use the word “fuck”). Simple by design. Fast, cheap, and incapable of hallucinating.
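Something in this spirit takes only a handful of lines. The pattern below is a made-up example to show the shape of the approach; the actual expression from the codebase is not reproduced here.

```python
import re

# Hypothetical pattern list, for illustration only.
FRUSTRATION = re.compile(
    r"\b(fuck\w*|wtf|useless|this (?:is|still) (?:broken|wrong)|not working)\b",
    re.IGNORECASE,
)


def is_frustrated(message):
    """Cheap, deterministic check: no model call, no latency, no hallucination."""
    return FRUSTRATION.search(message) is not None
```

The tradeoff is obvious: a regex will miss polite sarcasm and will fire on quoted text, but it costs nothing and behaves the same way every time.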
Model internal codenames
The source map also exposed internal model names Anthropic hadn’t made public. Capybara is Mythos, already at version 8 internally, with code comments documenting known issues (over-commenting, overconfidence in claims) and mentioning a 1M context window and a fast mode. Numbat appears in a comment tagged “@[MODEL LAUNCH]: Remove this section when we launch numbat.” Fennec appears to be Opus 4.6.
What all these codenames imply is more interesting. Anthropic is several model generations ahead of what they’ve released. Mythos, the model they previewed this month, has been in internal development long enough to reach its eighth major iteration. The gap between an internal v8 and a public announcement is longer than it looks from the outside.
The model is just the processor
Let’s now jump into the subjective part of this post. For the past two years, the dominant assumption has been that the model is the product and everything else is plumbing. Many companies have been criticised for just being “a GPT wrapper”. You use Claude, GPT-4, or Gemini, and the gap in quality between tools built on those models comes down to prompting. Better prompts, better results.
To me, the Claude Code codebase challenges that directly and it aligns with my experience building Baselight AI. Raschka put it plainly: “If we were to drop in other models, say DeepSeek, MiniMax, or Kimi, and optimise this scaffolding a bit for these models, we would also have very strong coding performance.” The scaffolding is transferable and the model is just a component (the processor) in the product.
I don’t know if you’ve tried to use some other model with Claude Code. I did, and after some use you can clearly see how Claude Code has been fine-tuned for Anthropic’s models. Apart from all of the KV cache invalidation tricks for external providers that make it feel slower with other models, the experience just doesn’t feel the same, even when comparing Haiku, for instance, with a superior model from some other provider.
Poetiq proved the same point from a different angle in March. They didn’t train a new model. They took Gemini 3 Pro and wrapped it in a recursive orchestration layer: puzzle decomposition, program generation, automated failure analysis, self-auditing termination logic. The result beat Gemini 3 Deep Think on ARC-AGI-2 at less than half the cost. Same underlying model, better result, cheaper to run. The scaffolding was the product.
So it turns out that it doesn’t all reduce to just better prompting, as we thought in the early LLM days. There’s a lot of engineering involved in making products where these models can excel: compaction hierarchies, memory daemons, structured tool execution graphs, latency-hiding caching layers. What got leaked isn’t really just source code. At least for me, it’s an unintentional field manual for how serious agent engineering is actually done (as I’ve said before, no more feeling dirty, as if I were monkey patching things that should just work because I am bad at prompting).
We are all facing the same kind of problems when building agents, and we are all seemingly converging on the same kind of solutions. While building Baselight AI, we sometimes had this feeling of “is this the right solution? Is this over-engineering? Is there a better way?”. There are still no established engineering patterns or good practices that one can rely on when building upon these models. This leak has confirmed to me many of the patterns that felt right, while surfacing new ones that I wasn’t aware of.
Another code base worth reading if you want to start seeing some of the engineering patterns surfacing around agent engineering is Hermes Agent. I’ll write another post analysing this one in detail, but it has cool features like automatic skill generation for “continuous learning” or a graph-based memory for efficient recall (drop me a message if you want me to work on this post already for next week).
I feel engineers will study these patterns in the future (the same way we currently study best engineering practices in OOP or TDD). The Claude Code codebase describes a way of thinking about LLMs as components in a larger system, not as oracles you talk to that should be able to solve every problem.
This is the new way engineering products will be implemented: with AI and LLMs as a primitive, a processor in a larger architecture that we, as engineers, have to design. Whether you do it with or without AI-assisted coding is up to you. But a new way of doing engineering, with a higher level of abstraction and a new primitive in the stack, is clearly arising.
Why I find this exciting
I don’t know about you, but I am having more fun as a software engineer right now than at any point in my career.
Not because the problems got easier. I actually find myself thinking more deeply about architecture and the right way of solving problems. But the boring parts of the job are going away: the boilerplate, the mechanical bug-chasing, the fourth run of the same test suite to confirm what you already know. All of this is getting automated. What’s left is the part that was always interesting: thinking in systems, deciding how components should talk to each other, understanding what problem you’re actually solving before you write a line, determining how to wrap all the cool tech in the right product wrapper, etc.
The four-layer compaction system, the sleep-based memory consolidation and the KV cache stabilisation through sorted tool lists in the Claude Code source code have given me a glimpse of the kind of techniques and patterns we may start applying more and more in our own implementations. These are definitely not ideas you could have picked up from prompt engineering tutorials. They’re software engineering problems applied to a new class of components, and the engineers who understand them deeply will build things that others simply cannot.
So in case you are wondering, no, I don’t think software engineering and computer science are going anywhere. They will for sure be redefined. The role will change and so will what we do, but I am of the opinion that generalist engineers will still find a good place in the market (I’ll probably also write more about my thesis here in future posts).
The models need operating systems. Someone has to write them. We’ll just need to be adaptable to complement the skills, and surface the right patterns and designs that let us excel in this new environment.
My piece of unsolicited advice: learning how these agents work under the hood is a skill worth building right now. It may be short-lived depending on how the field evolves, but at least it’ll let you build the foundation for what comes next. I personally have thought more about how these systems work and operate while building Baselight AI than when I was writing LSTM networks by hand (because, yes, at some point before transformers, I was actually doing this for a living).
Any other patterns or things that I may have missed from the source code? Please let me know, I would love for you to add your contribution as an edit (or a note) to the post. Thank you for reading and until next week!




A really nice article to complement this post (and the rest of the series on agent engineering): https://arxiv.org/pdf/2604.14228
Somewhat related to this analysis of the Claude Code source code, Anthropic apparently downgraded the cache TTL on March 6th, increasing creation costs: https://github.com/anthropics/claude-code/issues/46829