Why LLM evaluation metrics look stable while customers are unhappy

LLM systems can pass their regular evaluation checks while degrading in ways users immediately notice. Aggregate accuracy and judge scores remain stable because they measure isolated outputs, not the effort required to complete a task.

The gap is structural: classic metrics evaluate answers, while users experience interactions.

What stable metrics hide

Three mechanisms let traditional evaluation report green while production degrades: benchmark saturation, judge bias, and averaging effects.

Benchmark saturation

When top models cluster at 90–95% on a benchmark, the leaderboard measures who optimized harder for the test distribution, not who reasons better. Models can memorize test sets that leak into pretraining data: for example, GPT-4o drops from 88% on Massive Multitask Language Understanding (MMLU) to 73% on a contamination-free variant of the benchmark. The gap suggests how much reported accuracy reflects memorization rather than reasoning.

In production, this matters because agent workflows chain model calls. If each step succeeds 95% of the time and failures are independent, a 10-step chain succeeds only 0.95^10 ≈ 0.60 of the time. The system passes every component test but fails the end-to-end task in roughly 4 of every 10 runs.
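The compounding arithmetic can be checked with a few lines of Python. This is a minimal sketch that assumes every step succeeds independently with the same probability; the function name `chain_success_rate` is hypothetical, and real agent chains may have correlated or recoverable errors.

```python
def chain_success_rate(step_accuracy: float, num_steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return step_accuracy ** num_steps


# End-to-end success for a model that is 95% accurate per step.
for steps in (1, 5, 10, 20):
    print(f"{steps:>2} steps: {chain_success_rate(0.95, steps):.2f}")
# prints:
#  1 steps: 0.95
#  5 steps: 0.77
# 10 steps: 0.60
# 20 steps: 0.36
```

Under the same assumption, doubling the chain length from 10 to 20 steps drops end-to-end success from about 0.60 to about 0.36, which is why per-step accuracy alone says little about long workflows.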

LLM-as-Judge bias

Automated judges exhibit systematic biases: longer responses tend to score higher regardless of accuracy, and judge models tend to favor outputs from their own model family. The risk: if these judges are used in training loops, models may optimize for scoring preferences rather than user needs. High automated-eval scores then stop predicting user satisfaction.

Averaging effects

Aggregate metrics dilute signal. Perplexity averages likelihood across tokens, so easily predicted non-critical tokens can dominate the score while errors on answer-bearing tokens remain underweighted. Benchmarks likewise average over query difficulty, so easy cases hide collapse on the complex edge cases users actually encounter.

High aggregate scores coexist with catastrophic failures on the subset of queries that define user trust.
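One way to see the averaging effect is to break a single accuracy number down by difficulty slice. The sketch below uses made-up data (90 easy queries, 10 hard ones) and a hypothetical helper `accuracy_by_slice`; the slice labels are assumptions, not something a benchmark necessarily provides.

```python
from collections import defaultdict


def accuracy_by_slice(results):
    """results: iterable of (slice_name, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for slice_name, is_correct in results:
        totals[slice_name] += 1
        correct[slice_name] += int(is_correct)
    return {name: correct[name] / totals[name] for name in totals}


# Illustrative data only: 88 of 90 easy queries pass, 3 of 10 hard queries pass.
results = (
    [("easy", True)] * 88 + [("easy", False)] * 2
    + [("hard", True)] * 3 + [("hard", False)] * 7
)

overall = sum(ok for _, ok in results) / len(results)
per_slice = accuracy_by_slice(results)

print(f"overall: {overall:.2f}")          # overall: 0.91
print(f"easy: {per_slice['easy']:.2f}")   # easy: 0.98
print(f"hard: {per_slice['hard']:.2f}")   # hard: 0.30
```

In this toy example the aggregate stays above 90% while the hard slice sits at 30%, so reporting the worst slice next to the mean makes the collapse visible.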

What production systems measure instead

Production evaluation replaces output scoring with three signal categories: behavioral friction, system-level drift, and agent trajectory completion.

Behavioral metrics — friction signals

User actions often reveal dissatisfaction earlier than explicit ratings.

Edit rate: how often users modify generated outputs; high rates signal missed intent even when the text is grammatically correct
Regeneration frequency: how often users request a new response; a proxy for the first-attempt failure rate
Turn count: how many exchanges a task takes; more back-and-forth to complete the same task signals degradation

Monitoring these signals reduces the need to score every output. Signal-based sampling — classifying interactions by misalignment, stagnation, and disengagement — surfaces the sessions that need human review without running LLM-as-judge on every request.
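The friction signals above could be computed from session logs roughly as follows. This is a sketch under stated assumptions: the `Session` fields, the thresholds (0.5 and 2x the baseline turn count), and the mapping of edits and regenerations to "misalignment", long sessions to "stagnation", and abandoned tasks to "disengagement" are all illustrative choices, not a standard definition.

```python
from dataclasses import dataclass


@dataclass
class Session:
    session_id: str
    responses: int     # model responses shown to the user
    edited: int        # responses the user modified
    regenerated: int   # responses the user asked to regenerate
    turns: int         # user turns in the session
    completed: bool    # whether the user finished the task


def edit_rate(sessions):
    """Share of all responses that users edited."""
    return sum(s.edited for s in sessions) / sum(s.responses for s in sessions)


def friction_signals(s, baseline_turns):
    """Label a session with the friction categories it triggers (assumed thresholds)."""
    signals = set()
    if s.responses and (s.edited + s.regenerated) / s.responses > 0.5:
        signals.add("misalignment")
    if s.turns > 2 * baseline_turns:
        signals.add("stagnation")
    if not s.completed:
        signals.add("disengagement")
    return signals


def sessions_for_review(sessions, baseline_turns):
    """Only sessions with at least one friction signal go to human review."""
    return [s.session_id for s in sessions if friction_signals(s, baseline_turns)]


sessions = [
    Session("a", responses=4, edited=0, regenerated=0, turns=4, completed=True),
    Session("b", responses=4, edited=2, regenerated=1, turns=5, completed=True),
    Session("c", responses=10, edited=1, regenerated=0, turns=10, completed=False),
]
print(sessions_for_review(sessions, baseline_turns=4))  # ['b', 'c']
```

Because the flags come from logged user actions, no judge model has to run on session "a"; review effort goes to the sessions where users visibly struggled.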

System-level reliability — drift signals

Provider updates, prompt changes, or traffic shifts can degrade systems even when component metrics stay stable. Models change tone, add preambles, or shift output format — breaking downstream parsers, prompts that expect specific structure, or agent chains tuned to previous behavior. Production systems track output length variance, format compliance, and semantic drift, alerting when distribution shifts exceed baseline thresholds.
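A drift check of this kind might look like the sketch below, which compares a current window of outputs against a baseline window. It assumes the expected format is JSON and uses a crude z-score on mean output length; the function names and thresholds are hypothetical, and semantic drift is left out because it would need an embedding model.

```python
import json
from statistics import mean, pstdev


def is_valid_json(text: str) -> bool:
    """Stand-in for whatever structure downstream parsers expect."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def compliance_rate(outputs):
    """Fraction of outputs that pass the format check."""
    return sum(is_valid_json(o) for o in outputs) / len(outputs)


def drift_report(baseline, current, z_threshold=3.0, max_compliance_drop=0.05):
    """Flag shifts in output length and format compliance relative to a baseline."""
    base_lengths = [len(o) for o in baseline]
    base_std = pstdev(base_lengths) or 1.0  # avoid dividing by zero on constant lengths
    length_shift = abs(mean(len(o) for o in current) - mean(base_lengths)) / base_std
    compliance_drop = compliance_rate(baseline) - compliance_rate(current)
    return {
        "length_shift": length_shift,
        "compliance_drop": compliance_drop,
        "length_alert": length_shift > z_threshold,
        "format_alert": compliance_drop > max_compliance_drop,
    }


baseline = ['{"a": 1}', '{"a": 22}', '{"a": 333}']
# The first output gained a preamble, which breaks a strict JSON parser.
current = ['Sure! Here is the JSON: {"a": 1}', '{"a": 2}']
report = drift_report(baseline, current)
print(report["length_alert"], report["format_alert"])  # True True
```

Neither check looks at answer quality, which is the point: a preamble added by a provider update can trip both alerts while accuracy metrics stay flat.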

Agent trajectory evaluation — task completion signals

Component accuracy misleads in multi-step settings. Errors compound across steps — a model that passes each skill test in isolation can fail substantially when autonomously chaining those skills end-to-end.

Success metrics must track trajectory cost: an agent taking 6 steps and 5 tool calls for a 4-step task reveals brute-force trial-and-error, not fluent reasoning. Eval datasets that mirror production noise — incomplete logs, ambiguous alerts, missing context — produce lower pass rates but better predict real-world failures.
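Trajectory cost can be tracked next to pass/fail with a simple overhead ratio. The sketch below assumes each task has a known reference step count; the `Trajectory` type, the `max_overhead` threshold of 1.25, and the "efficient" label are illustrative assumptions rather than an established metric.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    steps_taken: int
    tool_calls: int
    succeeded: bool


def evaluate_trajectory(traj: Trajectory, reference_steps: int, max_overhead: float = 1.25):
    """Report success together with how much extra work the agent needed."""
    if reference_steps <= 0:
        raise ValueError("reference_steps must be positive")
    overhead = traj.steps_taken / reference_steps
    return {
        "succeeded": traj.succeeded,
        "step_overhead": overhead,
        "tool_calls": traj.tool_calls,
        # A run counts as efficient only if it succeeded within the step budget.
        "efficient": traj.succeeded and overhead <= max_overhead,
    }


# The example from the text: 6 steps and 5 tool calls for a 4-step task.
result = evaluate_trajectory(Trajectory(steps_taken=6, tool_calls=5, succeeded=True), reference_steps=4)
print(result["step_overhead"], result["efficient"])  # 1.5 False
```

Here the run still counts as a success, but the 1.5x step overhead marks it as inefficient, which a plain pass rate would not show.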

Evaluation is not about scoring outputs. It is about detecting when the model stops solving user problems — even if it still passes tests.