@adlrocha - The Model is still not the Product
What the agent loops of open-source projects tell us about the future of software engineering.
Last week’s post on the Claude Code leak surfaced something I keep coming back to: the most interesting engineering in a production AI agent isn’t the model, it’s everything around it. How context gets managed across a long session, how the memory is structured so the right details are brought to context when needed, how tools are designed to be composable and provide relevant functionality. As discussed last week, there’s a lot of engineering around the model to make it shine.
At first, I thought that all of these patterns and techniques were hidden behind Claude Code’s closed source code. But then I started asking myself (as I briefly alluded to in my last post): there are already a lot of interesting open-source projects implementing agent loops that I use on a daily basis, so what are these open-source equivalents doing compared to Claude Code? The Claude Code source showed us how Anthropic solved these problems, but how do Hermes Agent, Pi, and Opencode approach them?
I have this hypothesis that everyone is converging on similar solutions for the kind of problems we currently face when dealing with LLMs, but is this true? Can we start promoting certain common patterns to “best practices”? To scratch this itch I spent the week digging into the source code of a few popular agentic projects (with the obvious help of my bedside agents and LLMs :) ). I focused on projects that I like and use daily (so I also had some intuition for the features and for how the techniques under the hood feel when used).
I’ll mainly focus on Nous Research’s Hermes (100,000 stars, #1 trending this month) and pi from Mario Zechner (to me the most elegant, simplest, most efficient harness that I’ve read). I also dug into opencode from Anomaly, but didn’t see anything worth mentioning there. All of them are tackling some of the fundamental problems we all face in agent engineering (context, memory, skills, subagents) and each makes different bets. Some of those bets are worth borrowing regardless of which framework you use.
This is what I found.
Hermes: the architecture
I am going to use the Hermes codebase as the base for my analysis, as its core agent loop is a bit more complicated than the rest. Hermes has taken the Internet by storm lately (at least in the bubble that I like to hang out in), and I feel they approached certain things in a different (and nice) way. Before talking about specific techniques, it helps to understand how Hermes is organised.
The codebase splits cleanly into six layers:
prompt_builder.py is stateless, a pure function that assembles the system prompt from pieces: identity (or SOUL.md if the user has one), memory guidance, skills index, context files (.hermes.md → AGENTS.md → CLAUDE.md → .cursorrules, first match wins), and model-specific steering blocks.
run_agent.py is the main loop: it calls the prompt builder, runs the LLM, dispatches tool calls.
context_compressor.py is a fully self-contained class with its own LLM client, decoupled from the main agent. Compression is triggered by the loop but runs independently.
memory_manager.py coordinates one built-in memory provider and at most one external plugin; the cap is deliberate, to prevent tool schema bloat.
skill_manager_tool.py handles everything skill-related: creation, editing, patching, deletion.
delegate_tool.py manages subagents.
On top of all of this sits a gateway layer, a separate process that routes Telegram, Discord, Slack, WhatsApp, Signal, and email messages into the same agent loop.
When you read agent codebases that feel fragile, it’s usually because everything is tangled together in one giant run loop (been there, done that). Hermes keeps compression, memory, skills, and subagents as distinct modules with clear interfaces. Each can be understood, tested, and replaced without touching the others, leaving the model to orchestrate them as needed.
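To make the prompt builder’s “first match wins” rule concrete, here is a minimal Python sketch. It is not Hermes’ code: the function names (find_context_file, build_system_prompt) and their signatures are my own, and only the file priority order comes from the description above. Strictly speaking, reading a file makes it less than a pure function, so treat “stateless” here as “no hidden state between calls”.

```python
from pathlib import Path
from typing import Optional

# Priority order as described in the post; the first file that exists wins.
CONTEXT_FILE_CANDIDATES = (".hermes.md", "AGENTS.md", "CLAUDE.md", ".cursorrules")


def find_context_file(project_dir: Path) -> Optional[Path]:
    """Return the highest-priority context file present in project_dir, or None."""
    for name in CONTEXT_FILE_CANDIDATES:
        candidate = project_dir / name
        if candidate.is_file():
            return candidate
    return None


def build_system_prompt(identity: str, memory_guidance: str, skills_index: str,
                        project_dir: Path, steering: str = "") -> str:
    """Assemble the prompt from its pieces (hypothetical signature).

    No state is kept between calls: the same arguments and the same files
    on disk always produce the same prompt.
    """
    parts = [identity, memory_guidance, skills_index]
    context_file = find_context_file(project_dir)
    if context_file is not None:
        parts.append(context_file.read_text(encoding="utf-8"))
    if steering:
        parts.append(steering)
    # Empty pieces are dropped so the prompt has no blank sections.
    return "\n\n".join(part for part in parts if part)
```

Because the builder holds no state, the main loop can call it again after every skill or memory change without worrying about stale pieces.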
The self-authoring skill system
The part everyone talks about is Hermes’ self-writing skills. It has become common knowledge by now, but when it was first announced I thought it was a pretty smart way of implementing some kind of continuous learning in the agent. It was a way around the hundreds of thousands of lines of code that OpenClaw had to implement to handle the integration of external systems.
Here’s what it actually looks like in code. The trigger is a single instruction in the system prompt. The exact text from prompt_builder.py:
“After completing a complex task (5+ tool calls), fixing a tricky error, or discovering a non-trivial workflow, save the approach as a skill with skill_manage so you can reuse it next time.”
No background daemon. No scheduler. The threshold is five tool calls, and the model decides when it has crossed it. When it does, skill_manager_tool.py runs a careful creation pipeline:
name validation →
frontmatter validation (YAML with name and description required) →
size check (max 100,000 chars) →
name collision check →
directory creation →
atomic write (temp file, then os.replace()) →
security scan
With full rollback if anything fails.
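The pipeline above can be sketched in a few lines of Python. This is an illustration under assumptions, not the code in skill_manager_tool.py: the naming rule, the tiny frontmatter reader, and the security_scan rule are all hypothetical stand-ins. What it does take from the post is the order of the steps, the 100,000 character cap, the temp file plus os.replace() write, and the rollback.

```python
import os
import re
import shutil
import tempfile
from pathlib import Path

MAX_SKILL_CHARS = 100_000                       # size cap quoted in the post
NAME_RE = re.compile(r"^[a-z0-9][a-z0-9-]*$")   # assumed naming rule


def parse_frontmatter(content: str) -> dict:
    """Tiny stand-in for a YAML parser: 'key: value' lines between '---' fences."""
    lines = content.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing frontmatter")
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    raise ValueError("unterminated frontmatter")


def security_scan(content: str) -> None:
    """Placeholder scan (hypothetical rule): reject one obviously destructive command."""
    if "rm -rf /" in content:
        raise ValueError("security scan failed")


def create_skill(skills_dir: Path, name: str, content: str) -> Path:
    """Validate, write atomically, scan, and roll back on any failure."""
    if not NAME_RE.match(name):
        raise ValueError(f"invalid skill name: {name!r}")
    fields = parse_frontmatter(content)
    for required in ("name", "description"):
        if not fields.get(required):
            raise ValueError(f"frontmatter is missing {required!r}")
    if len(content) > MAX_SKILL_CHARS:
        raise ValueError("skill is too large")
    skill_dir = skills_dir / name
    if skill_dir.exists():
        raise FileExistsError(f"skill {name!r} already exists")

    skill_dir.mkdir(parents=True)
    try:
        # Atomic write: the temp file lives in the target directory, so
        # os.replace() swaps it in without readers ever seeing a partial file.
        fd, tmp_path = tempfile.mkstemp(dir=skill_dir, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(content)
        target = skill_dir / "SKILL.md"
        os.replace(tmp_path, target)
        security_scan(content)
        return target
    except Exception:
        # Rollback: remove the directory created above, then re-raise.
        shutil.rmtree(skill_dir, ignore_errors=True)
        raise
```

The useful property is that a failed creation leaves no trace: either the skill directory exists with a complete SKILL.md, or it does not exist at all.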
The skill index is injected into every system prompt with a two-layer cache: an in-process LRU dict keyed by skills directory, toolset, platform, and disabled list, backed by a disk snapshot validated by mtime/size manifest for cold starts. When a new skill is created, the cache is immediately cleared so the next prompt rebuild sees it.
When loading skills, the header reads: “Err on the side of loading, it is always better to have context you don’t need than to miss critical steps, pitfalls, or established workflows.” And after difficult tasks: “If a skill you loaded was missing steps, had wrong commands, or needed pitfalls you discovered, update it before finishing.”
That last instruction, I think, is key for this system to actually work and scale. The agent doesn’t just create skills, it’s expected to maintain them. The patch action in skill_manage uses a fuzzy match engine (the same one used for file edits) that handles whitespace differences, indentation variance, and block-anchor matching. The agent can self-correct a skill it wrote three sessions ago without needing an exact string match. That’s beautiful, because it also copes with potential breaking changes in the tools the skill interacts with.
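To give a feel for what “fuzzy” means here, this is a minimal sketch of a whitespace-tolerant patch in Python. It is not Hermes’ match engine and it leaves out block-anchor matching entirely; fuzzy_replace and its behaviour are my own simplification, covering only whitespace and indentation differences.

```python
from typing import Optional


def _normalise(line: str) -> str:
    """Ignore indentation and collapse runs of whitespace to single spaces."""
    return " ".join(line.split())


def fuzzy_replace(text: str, old: str, new: str) -> Optional[str]:
    """Replace the first block of lines that matches `old` up to whitespace.

    Returns the patched text (without a trailing newline), or None when
    no block matches.
    """
    lines = text.splitlines()
    target = [_normalise(line) for line in old.splitlines()]
    if not target:
        return None
    for start in range(len(lines) - len(target) + 1):
        window = [_normalise(line) for line in lines[start:start + len(target)]]
        if window == target:
            patched = lines[:start] + new.splitlines() + lines[start + len(target):]
            return "\n".join(patched)
    return None
```

With this, a patch written as “run npm test” still lands on a line stored as “    run   npm test”, which is exactly the kind of drift you get when the model quotes a skill from memory.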
Hermes ships a few hundred SKILL.md files by default: for GitHub PR workflows, Obsidian, Linear, Google Workspace, MLOps, home automation. The long tail of “somebody already figured this out.” It felt a bit bloated to me when I first installed it, but with this the agent can write its own code and maintain it when it needs one of these features instead of having to figure out something from scratch, or having to explicitly ship code that performs that logic.
The learning from this implementation? Something that I already alluded to a few weeks back in this post. Instead of writing code with the specific logic required for some task, we can encode procedural knowledge as first-class, versioned, agent-editable artefacts. Skills that the agent itself can create, patch, and deprecate based on what it learns at runtime. We are making software adaptable and fungible (as we will discuss in future posts).
This feature has a clear trade-off that I’ve experienced myself when writing a similar feature for my agents’ harness: skill explosion. Without some system that identifies skills that are semantically equivalent, you run the risk of having several skills written by the agent that do the same thing. For this feature to work at scale, you need some kind of garbage collection process that does semantic deduplication and purges deprecated skills (again, from personal experience).
Some of these design decisions also feel quite inefficient in the use of tokens and the model’s underlying context, but I have to admit that I haven’t had the time to properly benchmark this (I run Hermes against a local model, so I am not that worried about token costs). I’ll let others that have gone through that exercise add to this.
Compressing context without losing the plot
As a conversation grows, tool outputs, file reads, and back-and-forth exchanges pile up until you hit the model’s token limit. When that happens, the agent would normally just fail or lose history (and from my experience this gets really nasty with local models).
context_compressor.py prevents that by shrinking the middle of the conversation, preserving the system prompt and recent work (the “head and tail”), and replacing everything in between with a structured summary that captures what matters: the current task, decisions made, files touched, what’s blocked, what’s still pending. In short: it’s the mechanism that lets Hermes keep running through a long task without losing its memory of what happened, while keeping the active context small and focused.
It runs in four phases.
Phase 1 is a cheap pre-pass with no LLM call. Old tool outputs are replaced with one-line summaries: [terminal] ran npm test → exit 0, 47 lines. Identical tool results from repeated reads of the same file are deduplicated. Tool arguments are truncated, but JSON-aware: the compressor parses the argument JSON, shrinks long string values to 200 characters, and re-serialises. An earlier implementation just sliced the raw string, which produced broken JSON and caused provider-specific 400 errors. The JSON-aware version came from running into that failure in production. You’ve probably seen these summaries in Hermes’ outputs, depending on the level of verbosity that you have configured in your agent.
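The JSON-aware truncation is easy to get wrong, so here is a small Python sketch of the idea. It is my own illustration, not the code in context_compressor.py: the function names are hypothetical, and only the 200 character limit and the parse, shrink, re-serialise approach come from the description above.

```python
import json
from typing import Any

MAX_STRING_CHARS = 200  # limit quoted in the post


def _shrink(value: Any, limit: int) -> Any:
    """Recursively cut long strings; leave numbers, booleans and None alone."""
    if isinstance(value, str):
        return value[:limit]
    if isinstance(value, dict):
        return {key: _shrink(item, limit) for key, item in value.items()}
    if isinstance(value, list):
        return [_shrink(item, limit) for item in value]
    return value


def truncate_tool_arguments(raw: str, limit: int = MAX_STRING_CHARS) -> str:
    """Shrink string values inside a JSON arguments payload.

    The result is always valid JSON when the input was. Input that does
    not parse is returned unchanged instead of being sliced.
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return raw
    return json.dumps(_shrink(parsed, limit))
```

Slicing the raw string at a fixed offset can cut through a quote or a brace; shrinking the values after parsing cannot, which is the whole point of the fix.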
Phase 2 is boundary detection. The head (system prompt + first few exchanges) and a token-budget tail (around 20% of context, containing the most recent work) are protected. Everything in between is the compression target. The boundary is aligned to avoid splitting tool call/result pairs: a split there would leave orphaned IDs that break downstream execution.
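A rough Python sketch of this boundary logic follows. It makes several assumptions that are mine, not Hermes’: messages are dicts with a "role" key, tool results have role "tool" and directly follow the assistant message that issued the call, and tokens are estimated at about four characters each. The alignment rule shown (never let a protected region start or end in the middle of a pair) is one plausible way to do it.

```python
from typing import Dict, List, Tuple

Message = Dict[str, object]


def estimate_tokens(message: Message) -> int:
    """Crude estimate (assumption): roughly four characters per token."""
    return len(str(message.get("content", ""))) // 4 + 1


def find_compression_window(messages: List[Message], head_size: int,
                            tail_budget: int) -> Tuple[int, int]:
    """Return (start, end) so that messages[start:end] is the compression target."""
    start = min(head_size, len(messages))
    end = len(messages)
    # If the head ends between a call and its results, keep the results in the head.
    while start < end and messages[start].get("role") == "tool":
        start += 1
    # Grow the tail backwards until the token budget is spent.
    used = 0
    while end > start:
        cost = estimate_tokens(messages[end - 1])
        if used + cost > tail_budget:
            break
        used += cost
        end -= 1
    # Never start the tail on a tool result: pull in the assistant message
    # that issued the call, even if that goes slightly over budget.
    while start < end < len(messages) and messages[end].get("role") == "tool":
        end -= 1
    return start, end
```

Everything in messages[:start] and messages[end:] stays untouched, and every call/result pair ends up entirely inside or entirely outside the compressed middle.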
Phase 3 is the LLM summarisation with a structured template: Active Task, Goal, Constraints, Completed Actions, Active State, In Progress, Blocked, Key Decisions, Resolved Questions, Pending User Asks, Relevant Files, Remaining Work, Critical Context. Not a freeform summary. A structured handoff document.
Phase 4 re-inserts the summary and cleans up orphaned pairs. On subsequent compression, the system does iterative updates, patching the existing summary rather than re-summarising from scratch. There’s an anti-thrashing guard: if the last two passes each save less than 10% of tokens, compression is skipped.
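The anti-thrashing guard is small enough to sketch in full. This is a Python illustration with names of my own choosing (savings_ratio, should_skip_compression); the only facts taken from the post are the 10% threshold and the “last two passes” window.

```python
from typing import List

MIN_SAVINGS = 0.10  # threshold quoted in the post


def savings_ratio(tokens_before: int, tokens_after: int) -> float:
    """Fraction of tokens a compression pass removed."""
    if tokens_before <= 0:
        return 0.0
    return (tokens_before - tokens_after) / tokens_before


def should_skip_compression(recent_savings: List[float],
                            min_savings: float = MIN_SAVINGS) -> bool:
    """Skip when each of the last two passes saved less than min_savings."""
    if len(recent_savings) < 2:
        return False
    return all(saving < min_savings for saving in recent_savings[-2:])
```

Without a guard like this, a conversation that is already mostly summary can trigger an LLM call on every turn while freeing almost nothing.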
Compaction techniques are a topic in themselves that I’ll cover in future posts (as always, drop me an email if you want me to prioritise it). In the meantime, I’ll leave you here with a nice write-up on the topic from the factory.ai team.
Subagent isolation
delegate_tool.py is the piece responsible for isolating agents so they don’t pollute each other when they are doing their thing. That way the main agent’s loop can trigger isolated research agents in the background that will (eventually) return their output to the parent.
Each child agent gets a fresh conversation with no parent history, its own terminal session, and a toolset that is the intersection of the parent’s enabled tools and the explicitly requested tools. Children cannot gain capabilities the parent doesn’t have. They are created with skip_memory=True and skip_context_files=True, i.e. children start clean. The tools that are hard-blocked for children: delegate_task (no recursive delegation beyond depth 2), clarify (children cannot ask the user questions), memory (no writes to the shared MEMORY.md), send_message (no cross-platform side effects), execute_code.
The recursion depth is hardcoded at MAX_DEPTH = 2 deliberately to keep the execution graph tractable.
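The capability rule for children boils down to set arithmetic. Here is a minimal Python sketch: the blocked tool names come from the list above, while the function name child_toolset and the example tool names in the usage note are hypothetical.

```python
from typing import FrozenSet, Iterable

# Tools the post lists as hard-blocked for child agents.
BLOCKED_FOR_CHILDREN: FrozenSet[str] = frozenset(
    {"delegate_task", "clarify", "memory", "send_message", "execute_code"}
)


def child_toolset(parent_tools: Iterable[str],
                  requested_tools: Iterable[str]) -> FrozenSet[str]:
    """Intersection of parent and requested tools, minus the blocked ones.

    A child can therefore never hold a tool its parent lacks, and never
    one of the blocked tools even when the parent has it.
    """
    return (frozenset(parent_tools) & frozenset(requested_tools)) - BLOCKED_FOR_CHILDREN
```

For example, a parent holding terminal, read_file and memory that requests terminal, memory and browser for its child ends up handing over terminal only: browser is not in the parent’s set, and memory is blocked.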
Parallel batch mode uses a ThreadPoolExecutor with a configurable cap (default 3). So we have a structured concurrency architecture. The interrupt propagation is explicit: if the parent is interrupted, the executor stops waiting and collects whatever children have finished. The tool name global is saved before child construction (which overwrites a process-global variable) and restored after. A clean isolation detail that prevents a subtle bug where the parent’s available tools appear changed after delegation returns.
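Below is a hedged Python sketch of the batch mode with interrupt propagation. It is not the code in delegate_tool.py: run_children, the polling interval, and the use of a threading.Event as the interrupt signal are assumptions of mine; the default cap of 3 is the one quoted above. It needs Python 3.9 or later for the cancel_futures argument.

```python
import threading
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Callable, Dict, List

MAX_PARALLEL_CHILDREN = 3  # default cap quoted in the post


def run_children(tasks: List[Callable[[], str]], interrupted: threading.Event,
                 max_workers: int = MAX_PARALLEL_CHILDREN,
                 poll_seconds: float = 0.05) -> Dict[int, str]:
    """Run child tasks in parallel; on interrupt, return what has finished.

    Results are keyed by the task's position in `tasks`. A child that
    raised will re-raise here; real code would record the error instead.
    """
    executor = ThreadPoolExecutor(max_workers=max_workers)
    futures = {executor.submit(task): index for index, task in enumerate(tasks)}
    pending = set(futures)
    try:
        # Wake up regularly so a parent interrupt is noticed quickly.
        while pending and not interrupted.is_set():
            _, pending = wait(pending, timeout=poll_seconds,
                              return_when=FIRST_COMPLETED)
    finally:
        # Stop waiting: cancel children that never started and do not
        # block on the ones still running.
        executor.shutdown(wait=False, cancel_futures=True)
    return {index: future.result() for future, index in futures.items()
            if future.done() and not future.cancelled()}
```

Note that threads already running are not killed by this sketch, they are simply no longer waited for, so a real implementation also needs a way to tell the child itself to stop.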
The heartbeat is the detail I wasn’t expecting. A daemon thread runs every 30 seconds, touching the parent agent’s last-activity timestamp so the gateway’s inactivity timeout doesn’t fire while a subagent is doing real work. The parent appears idle during delegation (no messages, no tool calls), and without the heartbeat, long-running subagents would get their sessions killed.
Many of you may have realised by now that a lot of the techniques from parallel programming are directly applicable to parallel agent architectures (structured concurrency, process isolation, etc.).
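The heartbeat pattern fits in a small context manager. This is a sketch under assumptions: the name heartbeat and the touch callback are mine, and how Hermes actually stores the last-activity timestamp is not something I am reproducing here. The 30 second interval is the one from the post.

```python
import threading
from contextlib import contextmanager
from typing import Callable, Iterator

HEARTBEAT_SECONDS = 30.0  # interval quoted in the post


@contextmanager
def heartbeat(touch: Callable[[], None],
              interval: float = HEARTBEAT_SECONDS) -> Iterator[None]:
    """Call touch() every `interval` seconds until the with-block exits."""
    stop = threading.Event()

    def beat() -> None:
        # Event.wait returns False on timeout and True once stop is set,
        # so the loop ends promptly when the block exits.
        while not stop.wait(interval):
            touch()

    thread = threading.Thread(target=beat, daemon=True)
    thread.start()
    try:
        yield
    finally:
        stop.set()
        thread.join()
```

The parent would wrap the delegation call in `with heartbeat(update_last_activity):`, where update_last_activity is whatever function refreshes the timestamp the gateway checks (a hypothetical name here).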
Model-aware prompt steering
One thing Hermes does that I haven’t seen written about elsewhere but that I guess we are all doing (at least we use a similar approach for the multi-model support of Baselight AI): it serves different system prompt instructions depending on which model is being used.
GPT and Codex models get XML blocks called <tool_persistence>, <mandatory_tool_use>, <act_dont_ask>, <prerequisite_checks>, <verification>, and <missing_context>. These are explicit behaviour constraints: “never answer arithmetic from memory, always use a tool,” “if a question has an obvious default interpretation, act on it rather than asking.” Gemini and Gemma models get different guidance: absolute paths always, dependency checks before importing, parallel tool calls where possible, non-interactive CLI flags. GPT-5 and Codex get the developer role instead of system, because newer OpenAI models give stronger instruction-following weight to that role.
Unlike Claude Code, where all the engineering behind it was adapted to Anthropic’s models, these multi-model tools need to adapt their operation to the models supported under the hood if they want to perform well with all of them.
The code comment on this section: “Inspired by patterns from OpenAI’s GPT-5.4 prompting guide & OpenClaw PR #38953.” The framework is actively borrowing from other codebases and naming the source. This is a field converging on shared solutions in public, in code.
If you’ve tried to implement a multi-model agent this may be obvious to you, but for those of you that haven’t, the same agent instruction doesn’t work equally well across all models. If you’re building a multi-model agent, the steering layer needs to be model-aware if you want good performance with all of them. What reads as “obvious” to Claude requires an explicit XML constraint for GPT.
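In code, a model-aware steering layer can be as simple as a lookup on the model name. The Python sketch below is illustrative only: the substring matching, the steering_for function, and the placeholder block bodies are my assumptions, while the block names, the Gemini/Gemma guidance topics, and the developer role for GPT-5 and Codex come from the description above.

```python
from typing import Tuple

# Block names taken from the post; the block bodies are placeholders.
OPENAI_BLOCKS = ("tool_persistence", "mandatory_tool_use", "act_dont_ask",
                 "prerequisite_checks", "verification", "missing_context")

# Guidance topics taken from the post, phrased here as placeholder sentences.
GOOGLE_GUIDANCE = (
    "Always use absolute paths.",
    "Check that a dependency is installed before importing it.",
    "Issue tool calls in parallel where possible.",
    "Use non-interactive CLI flags.",
)


def steering_for(model: str) -> Tuple[str, str]:
    """Return (message_role, steering_text) for a model name.

    Matching on substrings of the name is an assumption of this sketch.
    """
    name = model.lower()
    if "gpt" in name or "codex" in name:
        role = "developer" if ("gpt-5" in name or "codex" in name) else "system"
        blocks = "\n".join(f"<{tag}>...</{tag}>" for tag in OPENAI_BLOCKS)
        return role, blocks
    if "gemini" in name or "gemma" in name:
        return "system", "\n".join(GOOGLE_GUIDANCE)
    # Models with no known quirks get no extra steering.
    return "system", ""
```

Keeping this in one function means a new model family is a new branch, not a new prompt template scattered across the codebase.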
Memory: declarative over imperative
One design decision in prompt_builder.py that seems small but has large runtime consequences is the memory guidance:
“Write memories as declarative facts, not instructions to yourself. ‘User prefers concise responses’ ✓ — ‘Always respond concisely’ ✗. ‘Project uses pytest with xdist’ ✓ — ‘Run tests with pytest -n 4’ ✗. Imperative phrasing gets re-read as a directive in later sessions and can cause repeated work or override the user’s current request. Procedures and workflows belong in skills, not memory.”
The distinction prevents a real class of bugs. An imperative memory written in session one (“always run tests before committing”) gets re-read as a live instruction in session five and overrides what the user actually asked for. Declarative facts don’t have this problem because they describe state rather than commanding behaviour.
What we can learn from this is that one should always separate their agent’s persistent knowledge (declarative) from its persistent procedures (skills). Conflating them produces agents that become increasingly hard to steer as their memories accumulate.
pi: the extension event bus
As I mentioned above, to me pi is one of the simplest and most elegant implementations of an agent loop that I’ve seen. pi’s bet is that agent frameworks shouldn’t bake features into the core loop, they should expose lifecycle events and let extensions intercept them. The ExtensionAPI exposes 30+ typed events. I have to admit that my distributed systems background (and how much I love configuring my agents to my personal workflows) loved this. This is why it has become my go-to coding agent since a few weeks ago.
To really understand how powerful this is, you just have to follow a prompt through the system and watch the events that it fires and the side-effects triggered. Take memory management, for example. When a conversation gets too long, pi needs to compress it. But right before that happens, the session_before_compact hook fires. With a few lines of TypeScript, you can intercept this event and tell the framework to swap out your expensive, primary model for a cheaper, faster one just for the summarisation step. If anything fails, it gracefully falls back to the default. The framework didn’t need to build a “custom summarisation model” feature; it just provided the hook, letting you implement the policy.
That same philosophy applies to safety and context. Before the AI executes any command, the tool_call event fires, pausing execution and exposing a mutable input. This is how you build a bouncer for your terminal: if the AI tries to run a destructive command like rm -rf, your extension can read the command, block the execution, and throw up a confirmation dialogue. Because the event is mutable, earlier handlers can even patch arguments before later ones see them.
Beyond these hooks, pi extends its capabilities using Skills, which follow the open agentskills.io standard. These are formatted as XML in the system prompt, pinpointed by an absolute file path. But the smartest part of the skills system is its governance. Powerful skills usually eat up a massive amount of context, which can bog down your agent. pi solves this with a simple disable-model-invocation flag. This hides the skill from the prompt entirely, keeping your AI lightweight until you explicitly summon that specific tool by typing /skill:name. So if you want to increase the capabilities of your agent but don’t want to implement your own logic triggered by hooks, you can always create your own skills.
Ultimately, the overarching pattern here is what makes the project so elegant to me: hook-based extension over baked-in features. It keeps the core system incredibly small and fast (i.e. easy to read and understand) while leaving the surface area for customisation wide open. There is, of course, a trade-off. To wield this kind of power, you may need to understand the internals and the specific events triggered by the event loop, and configure it yourself to be on par in terms of features with other coding agents like OpenCode or agent harnesses like Hermes.
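pi itself is written in TypeScript, so the following is not its ExtensionAPI. It is a Python sketch of the underlying pattern, with hypothetical names throughout (HookBus, ToolCallEvent, block_destructive): handlers run in registration order, share one mutable event, and any of them can block the call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ToolCallEvent:
    """Mutable event handed to every handler before a tool runs."""
    tool: str
    args: Dict[str, str]
    blocked: bool = False
    reason: str = ""


Handler = Callable[[ToolCallEvent], None]


class HookBus:
    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = {}

    def on(self, event_name: str, handler: Handler) -> None:
        """Register a handler; handlers fire in registration order."""
        self._handlers.setdefault(event_name, []).append(handler)

    def emit(self, event_name: str, event: ToolCallEvent) -> ToolCallEvent:
        # All handlers see the same object, so an earlier handler can patch
        # the arguments before a later one reads them.
        for handler in self._handlers.get(event_name, []):
            handler(event)
            if event.blocked:
                break  # a blocked call is not offered to later handlers
        return event


def block_destructive(event: ToolCallEvent) -> None:
    """The 'bouncer': flag shell commands that contain rm -rf."""
    if event.tool == "bash" and "rm -rf" in event.args.get("command", ""):
        event.blocked = True
        event.reason = "destructive command needs confirmation"
```

The agent loop would call bus.emit("tool_call", event) before executing anything and only proceed when event.blocked is still False; the confirmation dialogue is then just another handler’s policy, not a feature of the core loop.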
But hey, to me it has been a charm to use and tinker with. Even its defaults are already yielding great results for me. Just give the pi-agent-core README a quick read if you want to immediately fall in love with this project :) (see image below).
Where this is heading
After reading the source code for all of these projects, a few things stand out to me. First, it has reinforced my idea that, one way or another, we are all facing the same kind of problems with agents, and with different approaches and trade-offs (in many cases depending on the use case) we are all arriving at similar solutions.
Even more, the shared substrate is also converging. The SKILL.md format, the agentskills.io open standard that pi implements, the <tool_persistence> XML blocks for GPT, the OpenAI interface that has become the de-facto API for models, these are becoming common across frameworks. A skill written for Hermes can be adapted for opencode or pi with minor changes. As the ecosystem matures there won’t be a single framework winning, but shared primitives that work across all of them. A lot of the underlying pieces for agent engineering are slowly being standardised.
Which links with another point I’ve been making for a while: that agents will make code and apps as we know them obsolete. A few markdown files (potentially in the form of skills and memory files), an agentic loop, an LLM as the reasoning machine (a.k.a. the processor), and a knowledge base (in the form of external integrations, memory, context, etc.) enable the creation of software that adapts to specific contexts and use cases.
There’s a project called Matrix OS that takes this trajectory to its logical conclusion. The framing: “The LLM is the CPU. The context window is RAM. Files are files.” The agent manages the filesystem, spawns sub-processes, and creates new capabilities by writing new agent definition files and tools at runtime. It can modify itself. All state persists as files, synced peer-to-peer via git. I feel this is the high-level architecture that we are all converging to, and the platform targeted by the “new software engineering” (more on this in a few weeks).
I would place Hermes, pi, and opencode into different categories of agents (while close I don’t think they address the same use case). But they all follow a similar direction, and they could all be adapted and configured to solve a lot of different use cases. We are moving into a world of “fungible software” (man, my backlog is exploding with ideas that I want to share in future posts…).
What does all of this mean for software engineering as a profession? That’s the question I want to address in the coming weeks. Specifically what happens to the people building software, not just the people building agents. My take right now: software engineering is changing A LOT (as we are all feeling), but I don’t think the jobs are going anywhere in the short-term.
It will involve a massive reskilling though. Engineers who understand these systems at the level we’ve been discussing this week are building leverage that compounds. Skill authoring, context governance, subagent isolation, hook-based extension, and behavioural prompt engineering are disciplines that will outlive any specific framework.
If you’re already building agents and have been thinking about how engineering patterns are evolving, I’d like to know what I missed. If you want to contribute to this discussion please drop me an email, leave a comment, or reach out directly.
Until next week!




