The Measurement Problem

On building evaluation and measurement infrastructure for agents that call external services.

Asymmetry of verification

Tool-use or function calling[1][2] guides routing decisions for Alexa+[3]. When a user makes a request, the agentic system must determine which domain to activate, which capabilities to invoke, which APIs to select, and how to fill the parameters. Evaluating this is easy. There is a ground truth: a correct tool and a correct set of parameters. Each call can be binary-scored against a reference[4][5].
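Binary scoring against a reference can be sketched in a few lines. The trace and reference shapes below are illustrative, not a real routing schema:

```python
# Minimal sketch: a tool call passes only if both the tool name and
# every parameter match the gold reference exactly.
def score_tool_call(predicted: dict, reference: dict) -> bool:
    return (
        predicted["tool"] == reference["tool"]
        and predicted["params"] == reference["params"]
    )

gold = {"tool": "setTimer", "params": {"duration_s": 600}}
assert score_tool_call({"tool": "setTimer", "params": {"duration_s": 600}}, gold)
assert not score_tool_call({"tool": "setAlarm", "params": {"duration_s": 600}}, gold)
```

Exact match is all the evaluator needs here; no model, no retrieval, no judgment.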

Free-form string parameters break this. When parameter types can be constrained to strict ENUMs, integers, or booleans, they should be. It makes validation deterministic and kills entire categories of evaluation problems. But search queries are unavoidably strings, and a string cannot be scored by comparing it to a reference. Its correctness depends on what the search engine returns at the exact moment of execution.
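To make the contrast concrete, here is a hand-rolled validator for constrained parameter types. The schema format is an illustrative stand-in, not a real tool-definition spec:

```python
# Constrained types make validation deterministic: a parameter either
# satisfies its spec or it does not. No retrieval pipeline required.
ALLOWED = {
    "size": {"type": "enum", "values": {"S", "M", "L", "XL"}},
    "quantity": {"type": "int"},
    "gift_wrap": {"type": "bool"},
}

def validate_params(params: dict) -> list[str]:
    """Return a list of violations; an empty list means the call is valid."""
    errors = []
    for name, value in params.items():
        spec = ALLOWED.get(name)
        if spec is None:
            errors.append(f"unknown parameter: {name}")
        elif spec["type"] == "enum" and value not in spec["values"]:
            errors.append(f"{name}: {value!r} not in {sorted(spec['values'])}")
        elif spec["type"] == "int" and (isinstance(value, bool) or not isinstance(value, int)):
            errors.append(f"{name}: expected int")
        elif spec["type"] == "bool" and not isinstance(value, bool):
            errors.append(f"{name}: expected bool")
    return errors

assert validate_params({"size": "M", "quantity": 2, "gift_wrap": True}) == []
assert validate_params({"size": "XXL"}) != []
```

No such check exists for a free-form query string; `"black running shoes"` is neither valid nor invalid until you see what it retrieves.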

Most useful tasks have a favorable asymmetry: generation is hard, verification is cheap[6]. Search inverts this. Generating a query is trivial: one LLM inference to reformulate natural language into a string. Verifying whether that query returned good results means executing the full retrieval pipeline and inspecting what comes back. All the cost is on the verification side.

The data flywheel that drives improvement in standard tool routing grinds to a halt.

Traces as the source of truth

Instead of scoring the action taken (the search query itself), you have to start scoring the state of the world after the action is taken.

A typical shopping journey is not a zero-shot retrieval task. Users in discovery mode figure out what they want by looking at what they don't want, refining over multiple turns. A user asks for black shoes, gets dress shoes, then clarifies "for running."

If you evaluate that refinement turn in isolation, it looks like a failure. The user issued a correction, which usually implies the agent messed up. But it didn't. The first intent was executed correctly. The reformulation was a natural step in discovery.

The question then becomes how you evaluate a single decision point within a multi-turn trace without re-running the entire session. Consider what a trace actually looks like.

Example trace payload (turn t):

{
  "turn": 2,
  "user_utterance": "No, I meant for running.",
  "agent_action": "searchApi",
  "generated_query": "black running shoes",
  "a9_raw_payload": [
    {"asin": "B08FX...", "title": "ASICS Gel-Venture", "category": "running"},
    {"asin": "B07XY...", "title": "Nike Revolution", "category": "running"}
  ],
  "agent_response_state": "surfaced_items_1_and_2"
}

The evaluator looks at this and asks: given the turn 1 context and the exact payload at turn 2, did the agent refine the constraint accurately, or did it hallucinate?

In standard software, you debug by reading the code. In agentic workflows, the code only holds the prompt and the tool definitions[7]. The execution trace is your only source of truth; you cannot evaluate a decision without the full multi-turn context around it.

Evaluating the traces

LLM judges are the obvious tool for evaluating traces at scale. And their known problems (positional bias[8], shared blind spots across model families[9], hallucinated relevance judgments[10]) have standard solutions: ensembles across model families, majority voting, positional shuffling with confidence thresholds[11].
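Those mitigations compose into a small harness. This is a sketch under assumptions: `judge` callables stand in for real LLM-judge clients, and verdict labels and the threshold are illustrative:

```python
import random
from collections import Counter

def ensemble_verdict(judges, context, results, threshold=0.6):
    """Shuffle result positions per judge call, collect verdicts across
    judge models, and require a majority above a confidence threshold."""
    votes = []
    for judge in judges:
        shuffled = results[:]              # positional shuffling per call
        random.shuffle(shuffled)
        votes.append(judge(context, shuffled))  # "pass" | "fail" | "ambiguous"
    label, count = Counter(votes).most_common(1)[0]
    confidence = count / len(votes)
    if confidence < threshold:
        return "escalate", confidence      # persistent splits go to a human
    return label, confidence
```

The `escalate` branch matters: when the ensemble cannot reach a confident majority, the harness should surface the case rather than average it away.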

Here is an example:

User: "I need some fall boots"
→ searchApi("fall boots")
← [Timberland Waterproof Ankle Sneaker, Columbia Rain Shield Low Boot]

Judge A: pass (waterproof, appropriate for fall)
Judge B: fail (these are sneakers, not boots)
Judge C: ambiguous

When judges persistently split, the problem is almost never model bias or a mistake in the prompt. It is an ontological dispute in the product catalog. There is no ground truth for whether waterproof sneakers count as fall boots, so a better judge or a more precise rubric is not the right fix.

Evaluation must look at the user's behavior downstream: Did they engage with the results? Did they abandon the session? Did it lead to a purchase?

The ontology can be ambiguous. The user's actions are not. Outcome signals are user-specific and moment-specific, which shapes what kind of evaluation datasets you can build.
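One way to operationalize this: map downstream session events to a coarse label for the preceding search turn. The event names and label scheme are illustrative assumptions:

```python
def outcome_score(session_events: list[str]) -> str:
    """Score a search turn by what the user did next, not by ontology."""
    if "purchase" in session_events:
        return "strong_pass"
    if "click_result" in session_events or "add_to_cart" in session_events:
        return "pass"
    if "session_abandoned" in session_events:
        return "fail"
    return "no_signal"  # e.g. the user reformulated: ambiguous on its own

assert outcome_score(["click_result", "purchase"]) == "strong_pass"
assert outcome_score(["session_abandoned"]) == "fail"
```

A reformulation deliberately maps to `no_signal` rather than `fail`, for exactly the reason the refinement-turn example showed.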

There is no universal golden dataset

Building a universal dataset is impossible because the tools your agent calls are not static oracles.

When the agent fires a query into an external service like A9[12], it acts on behalf of a logged-in user. A9 optimizes results based on credentials, purchase history, locale, time of day. The "best" results for black running shoes look completely different for a frequent marathon runner than for someone who previously bought casual sneakers.

There is no golden result for any given search tool call. Only results-for-this-user-at-this-moment.

What is the agent's job once A9 returns those personalized results? Pass them straight through? If your trace shows A9 returned ten items but the agent surfaced three, your evaluator needs to know why.

When the model's constraints conflict with what A9 returned: should the LLM trust personalization even when it contradicts the conversation? Should it filter aggressively or surface everything and let the user decide? When a user asks for "running shoes under $80" and A9 returns a $95 pair ranked first because of purchase history, does the agent suppress it or show it with a note?

These are design choices that must be settled before you can write evaluation criteria, because the evaluator needs to know what "correct" looks like.
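One way to settle them is to encode each choice explicitly as a policy object the evaluator can read. The field names and result shape here are assumptions, not a real Alexa+ interface:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SurfacingPolicy:
    trust_personalization_over_dialogue: bool = False  # conversation wins on conflict
    hard_filter_price: bool = False   # drop over-budget items vs. annotate them
    max_items_surfaced: int = 3

def apply_policy(items: list[dict], budget: float, policy: SurfacingPolicy) -> list[dict]:
    """Filter or annotate personalized search results per the agreed policy."""
    kept = []
    for item in items:
        over = item["price"] > budget
        if over and policy.hard_filter_price:
            continue                  # suppress the $95 pair entirely
        kept.append({**item, "over_budget_note": over})
    return kept[: policy.max_items_surfaced]
```

With the policy written down, "did the agent surface three of ten items correctly" becomes a checkable question instead of a judgment call made fresh by every evaluator.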

Synthesize people instead of queries

Production data is red: traces may contain PII, payment context, exact locations, deeply private behavioral signals. You cannot pipe raw production logs into an evaluation dataset. You also have a cold-start problem: how do you evaluate a new tool before it goes to production if you have no traces for it?

As you will need complex multi-turn traces and cannot use the real ones, synthetic data generation is the only way out[13][14].

Ask an LLM to "generate 100 queries for shoes" and the result is generic, grammatically perfect, and tests nothing, because real users are fragmented, impatient, and contradictory. Stop synthesizing queries. Synthesize people[15].

Here's an example:

"You are a 28-year-old amateur badminton player living in Seattle. You play weekly on indoor courts, you have a strict budget of $120, and you heavily prefer Yonex or ASICS. You are impatient and use short, fragmented sentences."

The agent processes the simulated user's requests, makes routing decisions, and calls a shadow search endpoint (a staging replica returning real product data outside production). Seed the synthetic profiles with fabricated order histories and simulated clickstream sessions to trigger the personalization layer.

The simulated user applies persona-specific constraints ("these are over $120") and reformulates. This plays out over several turns: a clean multi-turn trace, completely sterile from a privacy perspective, structurally complex enough to mimic a real shopping journey.

Synthetic regression benchmarks rot the moment you deploy against them. You need to force entropy by programmatically sampling from the edges of a behavioral taxonomy covering budget, linguistic patterns, browsing style, and domain expertise. To close the loop, monitor anonymized traffic distributions and feed that signal back into the generation parameters. A regression set is a living system.
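Taxonomy sampling can be sketched directly; the axes and values below are illustrative, not a real persona schema:

```python
import random

# Behavioral taxonomy: sampling uniformly over the full cross-product keeps
# rare corners (e.g. a typo-heavy expert with a strict budget) in the set,
# instead of letting an LLM drift toward the generic mean.
TAXONOMY = {
    "budget": ["no_limit", "strict_low", "strict_mid", "price_insensitive"],
    "linguistic_pattern": ["terse_fragments", "verbose", "typo_heavy", "non_native"],
    "browsing_style": ["decisive", "exhaustive_comparer", "serial_refiner"],
    "domain_expertise": ["novice", "enthusiast", "expert"],
}

def sample_persona(rng: random.Random) -> dict:
    return {axis: rng.choice(values) for axis, values in TAXONOMY.items()}

rng = random.Random(7)  # seeded for reproducible regression sets
personas = [sample_persona(rng) for _ in range(100)]
```

Anonymized traffic distributions can then re-weight the axes, so the sampler tracks how real users actually talk rather than how they talked at design time.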

The loop

Something breaks in production. Root-cause it, synthesize data that replicates the failure pattern, iterate on the agent while the judge ensemble scores each variant, then validate against the full regression set accumulated from every previous cycle. Track defect rates per turn and per conversation to learn where in a session things go wrong, not just how often.
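The two defect metrics measure different things: the per-turn rate tells you how often individual decisions fail, while the per-conversation rate tells you how many sessions a user experiences as broken. A sketch, with an illustrative trace shape:

```python
def defect_rates(traces: list[list[bool]]) -> tuple[float, float]:
    """Each trace is a list of per-turn defect flags (True = defect)."""
    turns = [flag for trace in traces for flag in trace]
    per_turn = sum(turns) / len(turns)
    per_conversation = sum(any(trace) for trace in traces) / len(traces)
    return per_turn, per_conversation

# Ten turns, two defects, both concentrated in one session:
assert defect_rates([[False, True, True, False, False], [False] * 5]) == (0.2, 0.5)
```

The same 20% turn-level defect rate looks very different spread across every session than concentrated in half of them, which is why both numbers get tracked.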

The hard part is the thing we build to tell us whether the loop is working.

If agents run millions of times a day, each turn produces a decision that no simple comparison can verify. So we build an evaluation system to call models, interpret traces, and make judgment calls under ambiguity. It is itself agentic, and everything downstream trusts it. Deployments get gated on the judge consensus score, which is a proxy for the production defect rate. The proxy breaks quietly when judges drift, and regressions ship disguised as improvements.

The monitor that watches the evaluator for drift faces exactly the same questions we built the evaluator to answer. How do we know it is right?

The altitude of human judgment

All of the infrastructure, the traces, the synthetic generators, the judge ensemble, the regression pipeline, exists to compress one question into something answerable: did this change make things better or worse? And at every level where the system cannot decide for itself, the answer comes from someone who understands the product well enough to say what "correct" means. Which judge disagreements reveal a real gap in the spec and which ones are noise; how the behavioral taxonomies should evolve so the synthetic data does not collapse into a narrow distribution; whether the numbers moved because something actually changed or because the evaluation apparatus shifted under our feet.

Human judgment has moved from evaluating every query to designing the system that evaluates, calibrating the system that judges, and interpreting the signals that the whole apparatus produces. Every layer of automation we add just changes the altitude at which we, humans, have to think.

References

  1. AI Agents That Matter
    Kapoor et al. · arXiv:2407.01502 · 2024
  2. Asymmetry of verification and verifier's law
    Jason Wei · jasonwei.net · 2025
  3. A Survey on LLM-as-a-Judge
    Gu et al. · arXiv:2411.15594 · 2024
  4. Amazon Search: The Joy of Ranking Products
    Sorokina & Cantu-Paz · SIGIR 2016 Industry Track
  5. Scaling Synthetic Data Creation with 1,000,000,000 Personas
    Chan et al. · Tencent AI Lab · arXiv:2406.20094 · 2024