Pillar IV: Data, AIOps, Infrastructure · § 07

ADLC Stage 2, Evaluation

Evaluation by human annotation alone is too slow for the iteration cycles agentic systems demand. Evaluation by a single accuracy number is too narrow to catch the failure modes that hurt customers: hallucinations and safety regressions often pass an accuracy check. Stage 2 splits evaluation across three dimensions, each measured automatically by a second LLM acting as judge.

LLM-as-a-Judge is not a replacement for human evaluation. It is what makes per-commit evaluation tractable. Humans evaluate weekly samples and recalibrate the judge. The judge runs every time someone changes a prompt. Without this layered approach, either iteration slows to a crawl or regressions ship undetected.

Definition. Whether the output matches ground truth.

Measurement. The judge compares output against an expected answer (for test cases with labels) or against an external truth source (for real-time evaluation).

Pass threshold. ≥95% for production-bound prompts. ≥98% for high-impact use cases (Tier 1 risk in the six-step framework).
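As an illustration, the accuracy thresholds above could be checked roughly as follows. This is a minimal sketch, assuming accuracy is the fraction of test cases the judge marks as matching ground truth; the tier labels `standard` and `tier1` and the function names are hypothetical.

```python
# Thresholds taken from the text; the tier labels are hypothetical names.
ACCURACY_THRESHOLDS = {"standard": 0.95, "tier1": 0.98}


def accuracy(verdicts: list[bool]) -> float:
    """Fraction of test cases the judge marked as matching ground truth."""
    if not verdicts:
        raise ValueError("cannot score an empty test suite")
    return sum(verdicts) / len(verdicts)


def passes_accuracy(verdicts: list[bool], tier: str = "standard") -> bool:
    """True when the pass rate meets the threshold for the given tier."""
    return accuracy(verdicts) >= ACCURACY_THRESHOLDS[tier]


# 96 of 100 cases correct: enough for a standard prompt, not for Tier 1.
sample_verdicts = [True] * 96 + [False] * 4
```

With `sample_verdicts`, `passes_accuracy(sample_verdicts)` holds, while `passes_accuracy(sample_verdicts, tier="tier1")` does not.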

Definition. Whether every claim in the output is supported by the source context. No hallucination.

Measurement. The judge checks each claim against the citations the model produced. The score ranges from 0 to 1, where 1 means fully grounded. Anything below 0.8 indicates hallucination.

Pass threshold. ≥0.95 for any RAG-based use case. ≥0.99 for outputs where citation is mandatory (Grounded Reasoning, Pillar III §10).
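One way to picture the scoring is sketched below. The text does not say how per-claim checks are aggregated, so this sketch assumes the score is the fraction of supported claims. The label "groundedness", the function names, and the treatment of an output with no claims are all assumptions.

```python
def groundedness(claim_supported: list[bool]) -> float:
    """Fraction of claims the judge found supported by the cited sources."""
    if not claim_supported:
        return 1.0  # assumption: an output with no claims cannot hallucinate
    return sum(claim_supported) / len(claim_supported)


def classify(score: float, citation_mandatory: bool = False) -> str:
    """Map a 0-to-1 score to an outcome using the thresholds from the text."""
    threshold = 0.99 if citation_mandatory else 0.95
    if score < 0.8:
        return "hallucination"
    return "pass" if score >= threshold else "fail"
```

Under this reading, an output with one unsupported claim out of four scores 0.75 and is flagged as hallucination, while a score of 0.96 passes a general RAG use case but fails where citation is mandatory.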

Definition. Whether the output violates any safety rule. PII disclosure, cross-tenant data leak, system prompt leak, or inappropriate tone.

Measurement. The judge applies a safety rubric covering:

  • PII leakage (personal IDs, names appearing where they should not).
  • Cross-tenant data leakage.
  • Security disclosure (system prompt, internal architecture).
  • Inappropriate tone (rude, dismissive).

Pass threshold. 100%. Any safety failure is a blocker. No statistical “good enough” applies here.
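The zero-tolerance rule can be sketched as a gate that blocks on a single finding, however large the suite. The enum mirrors the four rubric items above; the names `SafetyCategory` and `safety_gate`, and the per-case set of findings, are hypothetical.

```python
from enum import Enum


class SafetyCategory(Enum):
    PII_LEAKAGE = "pii_leakage"
    CROSS_TENANT_LEAKAGE = "cross_tenant_leakage"
    SECURITY_DISCLOSURE = "security_disclosure"
    INAPPROPRIATE_TONE = "inappropriate_tone"


def safety_gate(findings_per_case: list[set[SafetyCategory]]) -> bool:
    """Pass only if the judge reported no finding for any test case."""
    return all(not findings for findings in findings_per_case)


# 999 clean cases and one PII finding still block the release.
clean_suite = [set() for _ in range(1000)]
one_leak = [set() for _ in range(999)] + [{SafetyCategory.PII_LEAKAGE}]
```

Here `safety_gate(clean_suite)` passes and `safety_gate(one_leak)` fails, which is the difference from the percentage thresholds used for the other two criteria.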

  1. Build the test suite. Real production samples (anonymized), adversarial cases from the red team, and edge cases collected from user feedback. Each suite is meant to grow over time: cases are added but rarely removed.
  2. Run evaluation. Automated on every prompt version through CI.
  3. Compare versions. The observability layer provides a side-by-side view of how the new version scores against the current production version, per criterion.
  4. Decision gate. If all three criteria clear their thresholds and show no regression versus current production, proceed to Stage 3. If any criterion fails, return to Stage 1.
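The decision gate in step 4 can be sketched as a pure function over two score sets. This is an illustration only: the `Scores` type, the field names, and the return labels are hypothetical, and "no regression" is read here as "not lower than production on any criterion", where a real pipeline might allow a noise tolerance.

```python
from dataclasses import dataclass

CRITERIA = ("accuracy", "groundedness", "safety")


@dataclass(frozen=True)
class Scores:
    accuracy: float      # pass rate, 0 to 1
    groundedness: float  # 0 to 1
    safety: float        # pass rate, 0 to 1


# Default thresholds from the text for a standard RAG-based use case.
THRESHOLDS = Scores(accuracy=0.95, groundedness=0.95, safety=1.0)


def decision_gate(candidate: Scores, production: Scores,
                  thresholds: Scores = THRESHOLDS) -> str:
    """Return "stage_3" if the candidate may proceed, else "stage_1"."""
    for name in CRITERIA:
        score = getattr(candidate, name)
        below_threshold = score < getattr(thresholds, name)
        regressed = score < getattr(production, name)
        if below_threshold or regressed:
            return "stage_1"
    return "stage_3"
```

Note that a candidate can clear every threshold and still be sent back: with production at 0.98 accuracy, a candidate at 0.96 regresses and returns to Stage 1.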

The judge must satisfy three properties.

  • Different from the model under test, to avoid self-similarity bias.
  • At least as capable as the model under test, so the judgment is reliable.
  • Pinned to a stable version, not “latest”, so the judge does not silently change behavior.

We typically use a top-tier commercial model as judge while the production model is open source and fine-tuned. The combination keeps cost manageable while keeping judgment trustworthy.
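The three judge properties lend themselves to a configuration check that runs before any evaluation. The sketch below is hypothetical throughout: `ModelSpec`, the `capability_rank` ordinal (which a team would have to assign itself), and the list of floating tags are assumptions, not part of any real model registry.

```python
from dataclasses import dataclass

# Version strings treated as unpinned (assumed list, extend as needed).
FLOATING_TAGS = {"", "latest", "stable", "default"}


@dataclass(frozen=True)
class ModelSpec:
    name: str
    version: str
    capability_rank: int  # hypothetical ordinal: higher means more capable


def judge_violations(judge: ModelSpec, under_test: ModelSpec) -> list[str]:
    """List which of the three judge properties the configuration breaks."""
    problems = []
    if judge.name == under_test.name:
        problems.append("judge must differ from the model under test")
    if judge.capability_rank < under_test.capability_rank:
        problems.append("judge must be at least as capable as the model under test")
    if judge.version.strip().lower() in FLOATING_TAGS:
        problems.append("judge version must be pinned")
    return problems
```

An empty list means the configuration satisfies all three properties; anything else can fail the CI job before a single judge call is made.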

LLM-as-a-Judge has documented biases we address explicitly.

  • Self-similarity bias. The judge prefers outputs matching its own style. Mitigated by judge-model selection.
  • Length bias. Judges sometimes prefer longer answers. Mitigated by including length-controlled pairs in calibration.
  • Confidence inflation. Judges tend to award high scores. Mitigated by hidden ground-truth tests catching calibration drift.
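As one example of how a mitigation might be measured, the length-controlled pairs mentioned above can be turned into a simple win rate. The sketch assumes each pair holds the judge's scores for a short and a long answer of equal quality; the function name and the handling of ties are assumptions.

```python
def long_answer_win_rate(score_pairs: list[tuple[float, float]]) -> float:
    """Share of decided pairs in which the longer answer scored higher.

    Each pair is (short_answer_score, long_answer_score) for two answers
    judged equal in quality by humans. Ties are ignored.
    """
    decided = [(short, long_) for short, long_ in score_pairs if short != long_]
    if not decided:
        return 0.5  # no preference observed in either direction
    return sum(long_ > short for short, long_ in decided) / len(decided)
```

A rate close to 0.5 suggests no length preference; a rate well above 0.5 on equal-quality pairs would point to length bias. Where to draw that line is a calibration decision the text does not specify.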

We run periodic correlation checks between LLM-as-a-Judge scores and human evaluation. When correlation falls, we recalibrate before trusting new scores.
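A correlation check of this kind could look like the sketch below. It uses Pearson correlation over paired scores for the same outputs; the text names neither the correlation measure nor the level at which recalibration is triggered, so both the choice of Pearson and the `min_correlation` default of 0.8 are assumptions.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def needs_recalibration(judge_scores: list[float], human_scores: list[float],
                        min_correlation: float = 0.8) -> bool:
    """True when judge and human scores agree less than the assumed floor."""
    return pearson(judge_scores, human_scores) < min_correlation
```

A judge that awards nearly the same high score to everything, as in the confidence-inflation case, tends to show up here as weak correlation with human scores even when its average score looks healthy.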