@adlrocha - The Eval Problem: How to Test AI Agents When They Never Give the Same Answer Twice
A practical two-layer approach, with lessons from Baselight AI and other agents
I wanted to close this series of posts from the past few weeks on agent engineering with the problem that is top of mind for everyone who starts building an agent for production, but that I don’t see nearly enough people talking about openly.
How do you test that an agent works as intended in the hands of users? From my conversations with other people who are running agents in production, it feels like we haven’t found the silver bullet yet (at least to my knowledge; if this is not the case, please shout!).
Two weeks ago I went through Claude Code. Last week I covered Hermes, opencode, pi, and how the open-source community was solving similar problems to the ones faced by the Claude Code team (and honestly, everyone building agents these days). In these posts I intentionally didn’t go deep into how these projects were testing themselves, because I felt this deserved a post of its own.
This topic has been nagging me since we released the first LLM-powered feature for Baselight, the SQL error assistant. How can you feel confident that you’ve tested the hell out of your system so that it will operate as intended in every possible case the moment you release it to users, considering the stochastic nature of LLMs? How do you catch regressions before your users do? How do you even define what “working as intended” means when the system’s outputs are stochastic?
This post is an attempt to share everything we explored in the process of building Baselight, complemented with what I’ve learnt from the source code of the agents from the last few weeks, with the goal of shedding some light on the topic of agent evaluation (hopefully sparing you some research and a few experiments).
The problem is not just the stochasticity
When engineers try to test an AI agent for the first time, the stochasticity is what stops them (at least this happened to me). You write a test, it passes; you run it again, it fails. Same input, different output. The output is not necessarily wrong, it just doesn’t match 1:1 what the test’s assertion expected. Traditional assertion-based testing breaks immediately. You can’t assert character-by-character what the agent will output for a specific input.
The first time I faced this problem myself was when testing the error assistant. I built a testing harness where each test case consisted of an SQL query with errors, an error message, and the expected corrected SQL query. Given the broken query and the error message, the error assistant returned a corrected SQL query, which should match the expected one. Simple enough, right?
Not really. It was a fucking mess. For simple tests the correct result and the assertion to make were obvious, and the expected query always matched the one generated by the agent. But with more complex queries the error left room for ambiguity, which resulted in a lot of flaky tests where the generated query didn’t exactly match the expected one.
That’s bad, but it’s not the deepest problem.
The deeper one: before you can ask “did the agent succeed?”, you have to answer “what does success look like for this task?” For narrow, well-defined tasks like the case of the error assistant that’s tractable. A SQL query either runs without errors and the result matches the original user intent, or it doesn’t. A response either cites a number that appears in the data or it doesn’t. But as tasks get broader and the catalogue larger, “success” starts to blur. How can we, for instance, assess objectively if the Baselight agent succeeded in performing a data analysis for a specific topic? We are back to a similar problem to the one I presented in my auto-research post. It all boils down to choosing the right evaluation metric.
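For the narrow SQL case, one way to make "success" checkable is to compare what the queries return instead of how they are spelled. The sketch below is an illustration of that idea, not the harness we built: it assumes an in-memory SQLite fixture, and same_result, the sales table, and both queries are hypothetical names made up for the example.

```python
import sqlite3
from collections import Counter


def same_result(conn, expected_sql, generated_sql):
    """True when both queries run and return the same rows, ignoring row order."""
    try:
        expected_rows = conn.execute(expected_sql).fetchall()
        generated_rows = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return False  # if either query fails to run, treat it as a mismatch
    return Counter(expected_rows) == Counter(generated_rows)


# Hypothetical fixture, only here to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)", [("eu", 10), ("us", 20), ("eu", 5)]
)

expected = "SELECT region, SUM(amount) FROM sales GROUP BY region"
generated = (
    "SELECT s.region, SUM(s.amount) AS total "
    "FROM sales AS s GROUP BY s.region ORDER BY total DESC"
)

# The two strings differ, but the rows they return do not.
print(same_result(conn, expected, generated))
```

Ignoring row order is itself a design choice: it removes one source of flakiness, but it would be the wrong check for a question that explicitly asks for a ranking.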
At Baselight, we have an AI agent that lets users query and explore large datasets through natural language. A response can leverage the right datasets for the analysis, perform the right set of queries, but draw the wrong conclusions from it. A response can correctly identify a relevant dataset but miss a more relevant one three entries down in the catalogue. A response can produce a valid chain of SQL queries that answers a slightly different question from the one the user actually asked. None of these are errors in the traditional sense. All of them are failures from the user’s perspective.
Defining what “good” means, precisely enough to check it automatically, is most of the work. Everything else that you can think of (tooling, frameworks, running the eval harness) is in service of that definition. You can leverage an off-the-shelf tool or build it yourself.
The two layers you need
From what I’ve seen in the open-source codebases and from building our own eval infrastructure, there are two distinct layers to testing an agentic system. Each one catches things the other misses.
Layer 1: Scaffolding unit tests. The parts of your agent system that are deterministic like prompt strings, compression logic, tool dispatch and their underlying logic, permission checks, CLIs, etc. These should be tested aggressively as ordinary software with the tools and techniques we are used to (from unit testing to end-to-end integrations).
Hermes’s most distinctive test file is test_prompt_builder.py, which imports guidance constants and asserts on their content directly:
def test_memory_guidance_discourages_task_logs(self):
    assert "durable facts" in MEMORY_GUIDANCE
    assert "Do NOT save task progress" in MEMORY_GUIDANCE

This tests a prompt string the same way you’d test a function contract. If someone edits MEMORY_GUIDANCE and weakens the instruction, CI fails. The invariant, which we discussed last week as a real class of agent bugs, is now enforced by the test suite, not by convention. The compressor logic, the injection scanner, the deduplication rules: all of these are deterministic given their inputs. The fact that they serve an LLM doesn’t make them untestable. They can be tested like any other piece of software from the pre-LLM era, because they are just deterministic code.
Something that I saw in Hermes’s source code and that I personally liked to do before LLMs (but that I feel has become more important considering their stochastic nature) is issue-tagged regression tests. When a production failure happens, you write a test named after the issue before closing it. The test suite becomes a map of failures your system should never repeat.
A good example of this is Hermes’s tests/run_agent/ directory, which reads like a bug tracker: test_413_compression.py, test_1630_context_overflow_loop.py, test_860_dedup.py. Each file is a specific production failure, diagnosed, fixed, and locked in. test_413_compression.py mocks an OpenAI client returning HTTP 413, then asserts the agent compresses and retries rather than aborting. The whole test runs in milliseconds but encodes an invariant that future refactors can’t quietly break.
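To make the pattern concrete, here is a minimal sketch of what an issue-tagged test of this kind could look like. It is not Hermes’s code: PayloadTooLarge, complete_with_compression, and the client.complete method are hypothetical stand-ins, and the only real API used is unittest.mock from the standard library.

```python
from unittest.mock import MagicMock


class PayloadTooLarge(Exception):
    """Hypothetical stand-in for the client's HTTP 413 error."""


def complete_with_compression(client, messages, compress, max_retries=1):
    """Call the model; on a 413, compress the history and retry instead of aborting."""
    for attempt in range(max_retries + 1):
        try:
            return client.complete(messages)
        except PayloadTooLarge:
            if attempt == max_retries:
                raise  # out of retries, surface the error
            messages = compress(messages)


def test_413_compresses_and_retries():
    client = MagicMock()
    # First call raises the 413 stand-in, second call succeeds.
    client.complete.side_effect = [PayloadTooLarge(), "ok"]
    compress = MagicMock(return_value=["summary"])

    result = complete_with_compression(client, ["a", "b", "c"], compress)

    assert result == "ok"
    compress.assert_called_once_with(["a", "b", "c"])
    # The retry must use the compressed history, not the original one.
    assert client.complete.call_args_list[-1].args == (["summary"],)
```

No model is called and nothing touches the network, which is why a test like this can run on every commit.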
Layer 2: End-to-end evals with real model calls. This is where the hard problem lives. You run real prompts through the real pipeline, against real data, and score the outputs. No mocking the model: if there’s a bug in your context management or your tool composition, this is where it shows up.
The cost problem: you can’t run everything on every commit
Layer 2 is where the real cost hits. Running end-to-end evals means making real model calls, to the same model your agent runs in production, with the same token budget per task. A suite of 50 eval cases, each taking 5-10 model calls to complete, adds up to 250-500 model calls per run, with the token bill to match. Run that on every commit in CI and on every developer’s machine during local iteration and the bill adds up fast, before you’ve shipped a single feature (been there, done that).
The practical answer is the same one traditional software uses for slow integration tests: don’t run everything everywhere. Layer 1 scaffolding tests are fast and free, run those on every commit and in every dev environment. Layer 2 end-to-end evals are expensive and slow, so they need a different trigger. What we’ve landed on ourselves: Layer 2 runs on PRs that touch the agent’s code surface, e.g. the system prompt, tool definitions, context management, retrieval logic, etc.
And if the change only impacts a specific narrow agent with a specific surface, we try to limit the testing surface to the minimum. A commit that only touches the UI or a background job doesn’t need to burn tokens on a full eval run. The CI pipeline needs to know which files map to which test surfaces, which is a small upfront investment that pays for itself quickly.
The corollary is that your eval suite needs to be stratified. Some cases are fast and cheap enough to run broadly; others, like the long-horizon tasks with 10+ tool calls and a 15-minute timeout, should only run pre-release, not on every PR. Tag your test cases by cost and surface, and let the CI configuration decide which tier to run based on what changed.
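A rough sketch of that selection logic, under heavy assumptions: the file patterns, surface names, case names, and the two tiers ("pr" and "release") are all hypothetical, and a real pipeline would read the changed files from the CI system.

```python
from fnmatch import fnmatchcase

# Hypothetical mapping from file patterns to eval surfaces; adjust to your repo.
SURFACES = {
    "search": ["agent/retrieval/*", "agent/prompts/search*"],
    "sql": ["agent/tools/sql*", "agent/prompts/sql*"],
}

# Hypothetical eval cases, tagged by surface and by cost tier.
CASES = [
    {"name": "find_gdp_dataset", "surface": "search", "tier": "pr"},
    {"name": "fix_missing_group_by", "surface": "sql", "tier": "pr"},
    {"name": "long_multi_step_analysis", "surface": "sql", "tier": "release"},
]


def touched_surfaces(changed_files):
    """Surfaces with at least one pattern matching a changed file."""
    return {
        surface
        for surface, patterns in SURFACES.items()
        for path in changed_files
        if any(fnmatchcase(path, pattern) for pattern in patterns)
    }


def select_cases(changed_files, tier="pr"):
    """Names of the eval cases to run for this change at the given tier."""
    surfaces = touched_surfaces(changed_files)
    allowed_tiers = {"pr"} if tier == "pr" else {"pr", "release"}
    return [
        case["name"]
        for case in CASES
        if case["surface"] in surfaces and case["tier"] in allowed_tiers
    ]


print(select_cases(["web/ui/button.tsx"]))          # UI-only change: nothing to run
print(select_cases(["agent/tools/sql_runner.py"]))  # only the cheap SQL cases
```

Passing tier="release" for the same change would also pick up the long-running case, which is the whole point of tagging by cost.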
Defining success: what we actually measure in Baselight
Layer 2 is only as good as your success definition. At Baselight we broke this into three dimensions, each of which required a different measurement approach.
Search quality. The agent’s first job is finding the right dataset for the user’s question. This sounds like it should be easy to measure, either it found the right dataset or it didn’t, but in practice the catalogue is large, datasets overlap in what they cover, there are datasets with similar data, and “right” depends on the user’s intent in ways that aren’t always explicit in the query. We ended up with a catalogQuality scorer: an LLM judge that takes the user’s question, the datasets the agent surfaced, and a reference set of expected datasets, and scores whether the agent’s catalogue search was on target.
This is imperfect. The judge can be wrong. But the alternative, not measuring this at all, is worse, because for us search quality is key to getting a great output. A good analysis with the wrong dataset (e.g. due to outdated data or from a non-authoritative source) is way worse than a response that admits it couldn’t find relevant data. When you lead with 70 thousand different datasets on various topics with community contributions you operate at a scale where search is a great percentage of the success of the result.
Query success rate. Does the SQL the agent generates actually run without errors, and does it return data? This is the most tractable of the three, it’s nearly deterministic (and one of the value-adds of Baselight, its ability to audit all the side-effects of the agent’s chain-of-thought). We track errorRate (fraction of tool calls that failed) and queryQuality (did the SQL actually answer the question?). The latter is still LLM-judged, but the former is a simple counter. In practice, errorRate going from 2% to 8% is often the first signal that something broke (a schema change, a model regression in SQL generation, a context window issue causing the agent to lose track of which table it was querying, etc.).
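The counter side of this is simple enough to sketch. The shape of a tool call record (a dict with an "error" key) and the 3-point alert threshold are assumptions made for the example, not our actual schema or bound.

```python
def error_rate(tool_calls):
    """Fraction of tool calls that failed; 0.0 when no tools were called."""
    if not tool_calls:
        return 0.0
    failed = sum(1 for call in tool_calls if call["error"] is not None)
    return failed / len(tool_calls)


def regressed(current, baseline, max_increase=0.03):
    """Flag a run whose error rate rose by more than max_increase (absolute)."""
    return current - baseline > max_increase


# Hypothetical trace: four tool calls, one of which failed.
calls = [
    {"tool": "run_sql", "error": None},
    {"tool": "run_sql", "error": "no such column: regoin"},
    {"tool": "search_catalog", "error": None},
    {"tool": "run_sql", "error": None},
]
print(error_rate(calls))      # 0.25
print(regressed(0.08, 0.02))  # True: a jump from 2% to 8% trips the alert
```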
Factfulness. Can every claim in the response be traced back to data that actually appeared in the tool call results, rather than being generated from the model’s training weights? This is the hardest to measure. We have a dataQuality scorer that checks whether specific numbers and facts in the response appear in the tool results. It’s imperfect, the model can paraphrase or aggregate in ways that make tracing difficult, but it catches the obvious failure mode: the agent making up a statistic rather than fetching it.
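A deliberately naive sketch of the number-tracing idea follows. It is not our dataQuality scorer: ungrounded_numbers, the regular expression, and the sample strings are assumptions for illustration, and it only catches numbers quoted literally.

```python
import re

# Digits with optional thousands or decimal separators, e.g. 1,234 or 4.2
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def normalise(token):
    """Drop thousands separators so 1,234 and 1234 compare equal."""
    return token.replace(",", "")


def ungrounded_numbers(response, tool_results):
    """Numbers quoted in the response that appear in none of the tool results."""
    seen = {
        normalise(number)
        for result in tool_results
        for number in NUMBER.findall(result)
    }
    return [n for n in NUMBER.findall(response) if normalise(n) not in seen]


# Hypothetical response and tool output.
response = "Revenue grew 4.2% to 1,234 units in 2023, up from 980."
tool_results = ["year=2023 units=1234 growth=4.2"]
print(ungrounded_numbers(response, tool_results))  # ['980']
```

A check like this would flag a legitimate aggregate the model computed itself (a sum or an average that never appears verbatim in the tool output), which is exactly the paraphrase-and-aggregate limitation described above.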
None of these metrics are perfect. The useful ones catch the obvious failures reliably and the subtle ones directionally.
Existing Tooling
Once you have a success definition, you need infrastructure to run evals at scale. A few options worth knowing about.
Evalite is what we built on, a vitest-based framework by Matt Pocock that runs .eval.ts files the same way vitest runs .test.ts files. The model is clean: a data function providing test cases, a task function running the actual pipeline, and a scorers array scoring the output. Results persist to SQLite, a web UI runs at localhost:3006, and the whole thing integrates into a standard CI pipeline. The framework ships with built-in scorers (exactMatch, answerSimilarity, faithfulness, toolCallAccuracy) plus hooks for custom LLM-judge scorers. For us, the key constraint was running evals sequentially (maxConcurrency: 1) against a shared database, and the 15-minute per-test timeout to accommodate long agentic tasks. Off-the-shelf frameworks often assume fast, independent test cases, and that assumption breaks for agents (we may need to rethink our current testing and CI infrastructure, but as I tend to do lately, that’s a topic for another day).
Braintrust is the most complete commercial option for teams that want offline eval plus production observability in a single platform. It connects the evaluation loop directly to production traces, you can spot a failure in a real user session and turn it into an eval case without leaving the tool. The pricing model is opaque, but for teams that want a “batteries-included” solution without building the infrastructure themselves, it’s the most serious option right now.
Langfuse is the open-source alternative, self-hostable, OpenTelemetry-compatible, with a strong tracing and monitoring story. If you’re already running PostHog for product analytics and want a tool focused specifically on LLM traces, Langfuse fills that gap cleanly. It doesn’t have Braintrust’s integrated eval loop, but it gives you full visibility into what the agent is doing in production.
For Baselight, we already have PostHog LLM Analytics in place with generation tracking, latency, cost, and trace visualisation. That covers the observability layer. What it doesn’t do is run systematic evals or score outputs; that’s what evalite handles. The two tools cover different parts of the problem.
There’s also DeepEval, which has thought carefully about the two-layer structure of agentic systems: the reasoning layer (the LLM’s planning and decision-making) and the action layer (tool execution). Their framework distinguishes between ToolCorrectnessMetric (did the agent select the right tool?) and ArgumentCorrectnessMetric (did it pass the right parameters? “Calling the right tool with wrong arguments is just as problematic as calling the wrong tool entirely.”) That framing is useful even if you’re not using their framework, because it forces you to attribute failures correctly: a bad SQL query might mean the model reasoned incorrectly, or it might mean it called the right tool with the wrong schema.
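The attribution idea can be illustrated without any framework. The sketch below is not DeepEval’s API: ToolCall, classify, and the tool and argument names are hypothetical, and real scorers would need fuzzier argument matching than strict equality.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)


def classify(expected: ToolCall, actual: ToolCall) -> str:
    """Attribute a mismatch to tool selection or to the arguments passed."""
    if actual.name != expected.name:
        return "wrong_tool"
    if actual.args != expected.args:
        return "wrong_arguments"
    return "ok"


expected = ToolCall("run_sql", {"table": "sales_2023"})
print(classify(expected, ToolCall("search_catalog", {"query": "sales"})))  # wrong_tool
print(classify(expected, ToolCall("run_sql", {"table": "sales"})))         # wrong_arguments
```

Counting the two labels separately over an eval run is what tells you whether to look at the tool descriptions or at the schema context the model was given.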
One thing that I think is key to understand, and the reason why you need a good production observability system is that your production conversations are your best source of new test cases. Every real user session is a data point about what queries your agent actually faces, which tool call sequences it takes, and where it goes wrong in ways your synthetic (or intuition) test cases didn’t anticipate.
This is what we used to grow from a few dozen tests to a more robust harness. It is important to have a systematic way of pulling failures and interesting edge cases from production traces into the eval suite. A query that broke in production, once understood and fixed, should become a regression test. A pattern of queries where the agent consistently takes four tool calls to do something that should take two should become a new scorer. Doing this at scale is a complete pain (and not a solved problem), and for us there’s still a lot of manual work, but we will get there.
This is where the observability layer connects back to the eval layer. PostHog traces tell you what happened. The eval suite tells you whether it was good. The loop between them is what keeps the eval suite from going stale. Without it, you’re testing against the problems you anticipated when you wrote the cases, not the problems your users are actually hitting.
Handling the stochasticity
Back to the problem that stops teams first. If you run the same prompt twice and get different outputs, what does “pass” mean?
Anthropic’s engineering blog on agent evaluation offers the clearest framework I’ve seen. They distinguish between two metrics:
pass@k: at least one of k attempts succeeds. Useful for capability questions, i.e. “can the agent do this at all?”
pass^k: all k attempts succeed. Useful for reliability questions, i.e. “can I deploy this to production and trust it won’t fail 40% of the time?”
These are different questions and they’re often conflated. A 70% per-trial success rate gives you a pass@3 of 97%, so the agent almost certainly succeeds if you give it three tries, but a pass^3 of only 34%, which means that about two thirds of the time at least one of the three attempts fails. Whether that’s acceptable depends entirely on whether your product gives users a retry button or treats the first response as final.
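The arithmetic behind those two numbers is short enough to write down. The sketch assumes the k attempts are independent and share the same per-trial success rate p, which is a simplification; the function names are mine.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1 - (1 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent attempts succeed."""
    return p ** k


p, k = 0.7, 3
print(f"pass@{k} = {pass_at_k(p, k):.1%}")   # pass@3 = 97.3%
print(f"pass^{k} = {pass_hat_k(p, k):.1%}")  # pass^3 = 34.3%
```

The two metrics move in opposite directions as k grows: pass@k climbs towards 100% while pass^k decays towards zero, so the same agent can look excellent or unusable depending on which one you report.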
For broad analyses with a large dataset catalogue, we find that larger tasks also reveal model quality differences more clearly. With a narrow, well-specified task, weaker models still complete it most of the time. With a complex analysis requiring multiple tool calls, search across a large catalogue, and synthesised reasoning, the gap between a stronger and weaker model becomes visible quickly, and the degradation compounds step by step.
The practical upshot: run your eval suite 3-5 times per release and look at the distribution, not the point estimate. A 3% mean improvement that’s significant in a t-test is probably real. A 3% improvement on a single run is probably noise. Then the key question gets back to defining success: can your product actually afford a pass^k lower than 100%? That’s a hard bar to clear, but at least for us in Baselight, a data analysis can run for several minutes, include large queries, and require several interactions. We can’t afford to require users to make several attempts to solve their problem.
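As a sketch of what "look at the distribution" can mean with only five runs per version: the code below uses an exact permutation test in place of the t-test mentioned above, purely because it needs nothing beyond the standard library. The scores are made-up numbers and permutation_p_value is a hypothetical helper.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(baseline, candidate):
    """One-sided exact permutation test on the difference of means.

    Returns the share of relabellings of the pooled runs whose mean gap is
    at least as large as the observed (candidate - baseline) gap.
    """
    observed = mean(candidate) - mean(baseline)
    pooled = list(baseline) + list(candidate)
    indices = range(len(pooled))
    hits = total = 0
    for chosen in combinations(indices, len(candidate)):
        chosen_set = set(chosen)
        cand = [pooled[i] for i in chosen]
        base = [pooled[i] for i in indices if i not in chosen_set]
        if mean(cand) - mean(base) >= observed - 1e-12:
            hits += 1
        total += 1
    return hits / total


# Made-up suite scores from five runs of each version.
baseline_runs = [0.70, 0.71, 0.69, 0.72, 0.70]
candidate_runs = [0.74, 0.73, 0.75, 0.735, 0.74]
print(permutation_p_value(baseline_runs, candidate_runs))
```

With five runs per side there are only 252 possible relabellings, so the smallest p-value this test can ever report is 1/252; that floor is a useful reminder of how little a handful of runs can prove.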
The open problem: the vibes test
Here’s the honest state of things.
We have layer 1 covered. Layer 2 is running and catching real regressions. But there’s still a stage in our release process that I call the vibes test: someone deploys to staging, runs a handful of queries manually (thank you, Michal :) ), and says “this feels right” or “something’s off.” It’s manual, it’s subjective, and it’s the only thing catching certain regressions that the automated suite misses.
I don’t love this. It doesn’t scale, it’s dependent on having someone with a good intuition for the product (like Michal), and it catches things inconsistently. But I haven’t found a way to fully automate it away for now.
What the vibes test is catching, I think, is the thing that Anthropic’s guidance calls “transcript review”, reading actual agent conversations to build intuition for failure modes. The automated scorers measure specific, pre-defined dimensions. The vibes test catches things outside those dimensions: a response that technically scores well on all five scorers but has a weird hedging pattern that suggests the model didn’t really understand the question. A tool call sequence that’s technically correct but takes twice as many steps as it should. A tone shift that suggests the system prompt got confused somewhere.
Making the vibes test smaller by surfacing these patterns through automated scoring is still a work in progress for us. The most promising partial answer I’ve found comes from search systems, and I came across it in a post when this one was already drafted.
Mercadona (the Spanish supermarket chain), which processes 4.4 million queries a week through a hybrid search pipeline, maintains what they call a golden set: 500 manually annotated queries with correct answers, kept immutable and never updated from model outputs. The reason: if you refresh your golden set from production data, and that production data reflects what your current model returns, you eventually train the evaluator to rate your model’s style as correct rather than rating actual correctness. This is the eval contamination problem. Their deploy pipeline rejects models that degrade any of four metrics by more than 2%, with a one-hour hold before activation that allows a human abort.
The practical upshot for agent evals: keep a frozen core of carefully curated cases that never gets touched, and add new cases from production to a separate, growing set. The frozen core is your regression baseline. The growing set is where you discover new failure modes. On top of that, run automated scoring against the frozen golden set, with a threshold gate that blocks the release and pages someone when any scorer degrades beyond a defined bound. The person doing the vibes check then has a concrete baseline against which to identify regressions objectively.
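A threshold gate of that kind fits in a few lines. This is a sketch under assumptions, not Mercadona’s pipeline or ours: I am reading the bound as a relative drop of 2% per scorer, the scores are invented, and degraded_scorers is a hypothetical helper.

```python
def degraded_scorers(golden_baseline, candidate, max_relative_drop=0.02):
    """Scorers whose score on the frozen golden set fell by more than the bound."""
    failures = {}
    for scorer, base in golden_baseline.items():
        new = candidate[scorer]
        drop = (base - new) / base if base else 0.0
        if drop > max_relative_drop:
            failures[scorer] = round(drop, 4)
    return failures


# Invented scores on the frozen golden set, before and after a change.
baseline_scores = {"catalogQuality": 0.80, "queryQuality": 0.90, "dataQuality": 0.85}
candidate_scores = {"catalogQuality": 0.81, "queryQuality": 0.86, "dataQuality": 0.84}

failures = degraded_scorers(baseline_scores, candidate_scores)
print(failures)  # {'queryQuality': 0.0444}
if failures:
    print("release blocked, page a human")
```

Gating on each scorer separately matters: an average over all scorers would let a gain in one dimension hide a regression in another.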
For Baselight we were using a similar approach of having a small number of critical tests as the golden rule, but I think the approach from Mercadona is brilliant. I highly recommend everyone read the post where they explain how they completely rewrote their search engine from scratch leveraging coding agents (it’s in Spanish, but in the age of LLMs languages are no longer a concern).
Where this leaves us
The two layers, scaffolding unit tests with issue-tagged regressions and end-to-end evals, are the two dimensions of properly testing agentic products. I think we are going to see more and more systems like this in CI, and more tooling emerging to support them in production.
We can already draw a lot of inspiration from what open-source agents and the research community are doing, but to me, the part that’s hardest to hand down is the success definition for your specific use cases (coming up with one for Baselight that we thought suited the use case well was a heated discussion).
I think we are all still learning, and vibe-testing your agents will stick around as a practice for a while. According to benchmarks, Claude Opus 4.7 surpasses the capabilities of Opus 4.6, but “the vibes” from the community haven’t been so positive, to the point that many are defaulting to 4.6 still.
And with this I close this “improvised” three-part series on agent engineering. Next week I want to write about what all of this means for the engineers building these systems, and the future of our discipline, because I really need to put some order into all of the ideas around the topic that I’ve been having lately, but that’ll have to wait.
PS: I also wanted to end this post with a special thank you note to Jonathan Tavares and the team at Singular for their invaluable support building Baselight AI and its evaluation harness (they definitely did all of the heavy-lifting here). Thank you!
Until next week!


