The Measurement Problem

On building evaluation and measurement infrastructure for agents that call external services.

Asymmetry of verification

Tool-use or function calling[1][2] guides routing decisions for Alexa+[3]. When a user makes a request, the agentic system must determine which domain to activate, which capabilities to invoke, which APIs to select, and how to fill the parameters. Evaluating this is easy. There is a ground truth: a correct tool and a correct set of parameters. Each call can be binary-scored against a reference[4][5].
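Binary scoring against a reference can be sketched in a few lines. The trace and reference shapes below are illustrative, not a real routing schema:

```python
# Minimal sketch: a tool call passes only if both the tool name and
# every parameter match the gold reference exactly.
def score_tool_call(predicted: dict, reference: dict) -> bool:
    return (
        predicted["tool"] == reference["tool"]
        and predicted["params"] == reference["params"]
    )

gold = {"tool": "setTimer", "params": {"duration_s": 600}}
assert score_tool_call({"tool": "setTimer", "params": {"duration_s": 600}}, gold)
assert not score_tool_call({"tool": "setAlarm", "params": {"duration_s": 600}}, gold)
```

Exact match is all the evaluator needs here; no model, no retrieval, no judgment.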

Free-form string parameters break this. When parameter types can be constrained to strict ENUMs, integers, or booleans, they should be. It makes validation deterministic and kills entire categories of evaluation problems. But search queries are unavoidably strings, and a string cannot be scored by comparing it to a reference. Its correctness depends on what the search engine returns at the exact moment of execution.
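To make the contrast concrete, here is a hand-rolled validator for constrained parameter types. The schema format is an illustrative stand-in, not a real tool-definition spec:

```python
# Constrained types make validation deterministic: a parameter either
# satisfies its spec or it does not. No retrieval pipeline required.
ALLOWED = {
    "size": {"type": "enum", "values": {"S", "M", "L", "XL"}},
    "quantity": {"type": "int"},
    "gift_wrap": {"type": "bool"},
}

def validate_params(params: dict) -> list[str]:
    """Return a list of violations; an empty list means the call is valid."""
    errors = []
    for name, value in params.items():
        spec = ALLOWED.get(name)
        if spec is None:
            errors.append(f"unknown parameter: {name}")
        elif spec["type"] == "enum" and value not in spec["values"]:
            errors.append(f"{name}: {value!r} not in {sorted(spec['values'])}")
        elif spec["type"] == "int" and (isinstance(value, bool) or not isinstance(value, int)):
            errors.append(f"{name}: expected int")
        elif spec["type"] == "bool" and not isinstance(value, bool):
            errors.append(f"{name}: expected bool")
    return errors

assert validate_params({"size": "M", "quantity": 2, "gift_wrap": True}) == []
assert validate_params({"size": "XXL"}) != []
```

No such check exists for a free-form query string; `"black running shoes"` is neither valid nor invalid until you see what it retrieves.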

Most useful tasks have a favorable asymmetry: generation is hard, verification is cheap[6]. Search inverts this. Generating a query is trivial: one LLM inference to reformulate natural language into a string. Verifying whether that query returned good results means executing the full retrieval pipeline and inspecting what comes back. All the cost is on the verification side.

The data flywheel that drives improvement in standard tool routing grinds to a halt.

Traces as the source of truth

Instead of scoring the action taken (the search query itself), you have to start scoring the state of the world after the action is taken.

A typical shopping journey is not a zero-shot retrieval task. Users in discovery mode figure out what they want by looking at what they don't want, refining over multiple turns. A user asks for black shoes, gets dress shoes, then clarifies "for running."

If you evaluate that refinement turn in isolation, it looks like a failure. The user issued a correction, which usually implies the agent messed up. But it didn't. The first intent was executed correctly. The reformulation was a natural step in discovery.

The question then becomes how you evaluate a single decision point within a multi-turn trace without re-running the entire session. Consider what a trace actually looks like.

Example trace payload (turn t):

{
  "turn": 2,
  "user_utterance": "No, I meant for running.",
  "agent_action": "searchApi",
  "generated_query": "black running shoes",
  "a9_raw_payload": [
    {"asin": "B08FX...", "title": "ASICS Gel-Venture", "category": "running"},
    {"asin": "B07XY...", "title": "Nike Revolution", "category": "running"}
  ],
  "agent_response_state": "surfaced_items_1_and_2"
}

The evaluator looks at this and asks: given the turn 1 context and the exact payload at turn 2, did the agent refine the constraint accurately, or did it hallucinate?

In standard software, you debug by reading the code. In agentic workflows, the code only holds the prompt and the tool definitions[7]. The execution trace is your only source of truth; you cannot evaluate a decision without the full multi-turn context around it.

Evaluating the traces

LLM judges are the obvious tool for evaluating traces at scale. And their known problems (positional bias[8], shared blind spots across model families[9], hallucinated relevance judgments[10]) have standard solutions: ensembles across model families, majority voting, positional shuffling with confidence thresholds[11].
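Those mitigations compose into a small harness. This is a sketch under assumptions: `judge` callables stand in for real LLM-judge clients, and verdict labels and the threshold are illustrative:

```python
import random
from collections import Counter

def ensemble_verdict(judges, context, results, threshold=0.6):
    """Shuffle result positions per judge call, collect verdicts across
    judge models, and require a majority above a confidence threshold."""
    votes = []
    for judge in judges:
        shuffled = results[:]              # positional shuffling per call
        random.shuffle(shuffled)
        votes.append(judge(context, shuffled))  # "pass" | "fail" | "ambiguous"
    label, count = Counter(votes).most_common(1)[0]
    confidence = count / len(votes)
    if confidence < threshold:
        return "escalate", confidence      # persistent splits go to a human
    return label, confidence
```

The `escalate` branch matters: when the ensemble cannot reach a confident majority, the harness should surface the case rather than average it away.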

Here is an example:

User: "I need some fall boots"
→ searchApi("fall boots")
← [Timberland Waterproof Ankle Sneaker, Columbia Rain Shield Low Boot]

Judge A: pass (waterproof, appropriate for fall)
Judge B: fail (these are sneakers, not boots)
Judge C: ambiguous

When judges persistently split, the problem is almost never model bias or a mistake in the prompt. It is an ontological dispute in the product catalog. There is no ground truth for whether waterproof sneakers count as fall boots, so a better judge or a more precise rubric is not the right fix.

Evaluation must look at the user's behavior downstream: Did they engage with the results? Did they abandon the session? Did it lead to a purchase?

The ontology can be ambiguous. The user's actions are not. Outcome signals are user-specific and moment-specific, which shapes what kind of evaluation datasets you can build.
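One way to operationalize this: map downstream session events to a coarse label for the preceding search turn. The event names and label scheme are illustrative assumptions:

```python
def outcome_score(session_events: list[str]) -> str:
    """Score a search turn by what the user did next, not by ontology."""
    if "purchase" in session_events:
        return "strong_pass"
    if "click_result" in session_events or "add_to_cart" in session_events:
        return "pass"
    if "session_abandoned" in session_events:
        return "fail"
    return "no_signal"  # e.g. the user reformulated: ambiguous on its own

assert outcome_score(["click_result", "purchase"]) == "strong_pass"
assert outcome_score(["session_abandoned"]) == "fail"
```

A reformulation deliberately maps to `no_signal` rather than `fail`, for exactly the reason the refinement-turn example showed.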

There is no universal golden dataset

Building a universal dataset is impossible because the tools your agent calls are not static oracles.

When the agent fires a query into an external service like A9[12], it acts on behalf of a logged-in user. A9 optimizes results based on credentials, purchase history, locale, time of day. The "best" results for black running shoes look completely different for a frequent marathon runner than for someone who previously bought casual sneakers.

There is no golden result for any given search tool call. Only results-for-this-user-at-this-moment.

What is the agent's job once A9 returns those personalized results? Pass them straight through? If your trace shows A9 returned ten items but the agent surfaced three, your evaluator needs to know why.

When the model's constraints conflict with what A9 returned: should the LLM trust personalization even when it contradicts the conversation? Should it filter aggressively or surface everything and let the user decide? When a user asks for "running shoes under $80" and A9 returns a $95 pair ranked first because of purchase history, does the agent suppress it or show it with a note?

These are design choices that must be settled before you can write evaluation criteria, because the evaluator needs to know what "correct" looks like.
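One way to settle them is to encode each choice explicitly as a policy object the evaluator can read. The field names and result shape here are assumptions, not a real Alexa+ interface:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SurfacingPolicy:
    trust_personalization_over_dialogue: bool = False  # conversation wins on conflict
    hard_filter_price: bool = False   # drop over-budget items vs. annotate them
    max_items_surfaced: int = 3

def apply_policy(items: list[dict], budget: float, policy: SurfacingPolicy) -> list[dict]:
    """Filter or annotate personalized search results per the agreed policy."""
    kept = []
    for item in items:
        over = item["price"] > budget
        if over and policy.hard_filter_price:
            continue                  # suppress the $95 pair entirely
        kept.append({**item, "over_budget_note": over})
    return kept[: policy.max_items_surfaced]
```

With the policy written down, "did the agent surface three of ten items correctly" becomes a checkable question instead of a judgment call made fresh by every evaluator.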

Synthesize people instead of queries

Production data is red: traces may contain PII, payment context, exact locations, deeply private behavioral signals. You cannot pipe raw production logs into an evaluation dataset. You also have a cold-start problem: how do you evaluate a new tool before it goes to production if you have no traces for it?

As you will need complex multi-turn traces and cannot use the real ones, synthetic data generation is the only way out[13][14].

Ask an LLM to "generate 100 queries for shoes" and the result is generic, grammatically perfect, and tests nothing, because real users are fragmented, impatient, and contradictory. Stop synthesizing queries. Synthesize people[15].

Here's an example:

"You are a 28-year-old amateur badminton player living in Seattle. You play weekly on indoor courts, you have a strict budget of $120, and you heavily prefer Yonex or ASICS. You are impatient and use short, fragmented sentences."

The agent processes the simulated user's requests, makes routing decisions, and calls a shadow search endpoint (a staging replica returning real product data outside production). Seed the synthetic profiles with fabricated order histories and simulated clickstream sessions to trigger the personalization layer.

The simulated user applies persona-specific constraints ("these are over $120") and reformulates. This plays out over several turns: a clean multi-turn trace, completely sterile from a privacy perspective, structurally complex enough to mimic a real shopping journey.

Synthetic regression benchmarks rot the moment you deploy against them. You need to force entropy by programmatically sampling from the edges of a behavioral taxonomy covering budget, linguistic patterns, browsing style, and domain expertise. To close the loop, monitor anonymized traffic distributions and feed that signal back into the generation parameters. A regression set is a living system.
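Taxonomy sampling can be sketched directly; the axes and values below are illustrative, not a real persona schema:

```python
import random

# Behavioral taxonomy: sampling uniformly over the full cross-product keeps
# rare corners (e.g. a typo-heavy expert with a strict budget) in the set,
# instead of letting an LLM drift toward the generic mean.
TAXONOMY = {
    "budget": ["no_limit", "strict_low", "strict_mid", "price_insensitive"],
    "linguistic_pattern": ["terse_fragments", "verbose", "typo_heavy", "non_native"],
    "browsing_style": ["decisive", "exhaustive_comparer", "serial_refiner"],
    "domain_expertise": ["novice", "enthusiast", "expert"],
}

def sample_persona(rng: random.Random) -> dict:
    return {axis: rng.choice(values) for axis, values in TAXONOMY.items()}

rng = random.Random(7)  # seeded for reproducible regression sets
personas = [sample_persona(rng) for _ in range(100)]
```

Anonymized traffic distributions can then re-weight the axes, so the sampler tracks how real users actually talk rather than how they talked at design time.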

The loop

Something breaks in production. Root-cause it, synthesize data that replicates the failure pattern, iterate on the agent while the judge ensemble scores each variant, then validate against the full regression set accumulated from every previous cycle. Track defect rates per turn and per conversation to learn where in a session things go wrong, not just how often.
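The two defect metrics measure different things: the per-turn rate tells you how often individual decisions fail, while the per-conversation rate tells you how many sessions a user experiences as broken. A sketch, with an illustrative trace shape:

```python
def defect_rates(traces: list[list[bool]]) -> tuple[float, float]:
    """Each trace is a list of per-turn defect flags (True = defect)."""
    turns = [flag for trace in traces for flag in trace]
    per_turn = sum(turns) / len(turns)
    per_conversation = sum(any(trace) for trace in traces) / len(traces)
    return per_turn, per_conversation

# Ten turns, two defects, both concentrated in one session:
assert defect_rates([[False, True, True, False, False], [False] * 5]) == (0.2, 0.5)
```

The same 20% turn-level defect rate looks very different spread across every session than concentrated in half of them, which is why both numbers get tracked.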

The hard part is the thing we build to tell us whether the loop is working.

If agents run millions of times a day, each turn produces a decision that no simple comparison can verify. So we build an evaluation system to call models, interpret traces, and make judgment calls under ambiguity. It is itself agentic, and everything downstream trusts it. Deployments get gated on the judge consensus score, which is a proxy for the production defect rate. The proxy breaks quietly when judges drift, and regressions ship disguised as improvements.

The monitor that watches the evaluator for drift faces exactly the same questions we built the evaluator to answer. How do we know it is right?

The altitude of human judgment

All of the infrastructure, the traces, the synthetic generators, the judge ensemble, the regression pipeline, exists to compress one question into something answerable: did this change make things better or worse? And at every level where the system cannot decide for itself, the answer comes from someone who understands the product well enough to say what "correct" means. Which judge disagreements reveal a real gap in the spec and which ones are noise; how the behavioral taxonomies should evolve so the synthetic data does not collapse into a narrow distribution; whether the numbers moved because something actually changed or because the evaluation apparatus shifted under our feet.

Human judgment has moved from evaluating every query to designing the system that evaluates, calibrating the system that judges, and interpreting the signals that the whole apparatus produces. Every layer of automation we add just changes the altitude at which we, humans, have to think.

References

  1. AI Agents That Matter
    Kapoor et al. · arXiv:2407.01502 · 2024
  2. Asymmetry of verification and verifier's law
    Jason Wei · jasonwei.net · 2025
  3. A Survey on LLM-as-a-Judge
    Gu et al. · arXiv:2411.15594 · 2024
  4. Amazon Search: The Joy of Ranking Products
    Sorokina & Cantu-Paz · SIGIR 2016 Industry Track
  5. Scaling Synthetic Data Creation with 1,000,000,000 Personas
    Chan et al. · Tencent AI Lab · arXiv:2406.20094 · 2024