Pillar I: AI Organization · § 09

Continuous monitoring and evaluation

AI models do not degrade linearly. They degrade in steps: a small distribution shift in production data can trigger a sudden drop in accuracy. An OCR model at 99% accuracy today can fall to 92% within two weeks if a new invoice template gains adoption. Monitoring is therefore not an auxiliary activity. It is part of the core governance loop.

We run two monitoring modes in parallel. A real-time stream watches every call. A scheduled set of audits compares today to a stable baseline. Together they catch sudden failures and slow drift before either reaches customers.

Real-time stream, running 24/7:

  • Every LLM call is logged through the observability layer: prompt, model version, temperature, latency, token usage, output, and reasoning trace.
  • Automatic alerts fire when latency exceeds the P99 baseline, when error rate exceeds 2%, when token usage spikes, or when the confidence distribution shifts more than 20%.
  • 1% of outputs are sampled through LLM-as-a-Judge and scored live for Accuracy, Groundedness, and Safety.

Scheduled audits:

  • Daily. KPI scan: STP (straight-through processing) rate, extraction accuracy, hallucination rate, cost per transaction.
  • Weekly. Drift review: PSI (Population Stability Index) of input features against baseline. PSI > 0.1 is a drift warning. PSI > 0.25 is severe and triggers retraining.
  • Monthly. Fairness audit: a fixed test set runs across SME and Enterprise cohorts to verify bias is not creeping in.
  • Quarterly. End-to-end risk re-classification: every production AI feature returns to Step 1 of the six-step risk framework.
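
The weekly drift review can be sketched in a few lines. PSI is the sum, over bins, of (current share - baseline share) × ln(current share / baseline share). The version below is a minimal illustration, not the production implementation: the equal-width binning over the baseline range, the bin count of 10, the `eps` floor for empty bins, and the function names `psi` and `classify_drift` are all assumptions. Only the 0.1 and 0.25 thresholds come from the text above.

```python
import math
from typing import Sequence


def psi(baseline: Sequence[float], current: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of one numeric feature (illustrative sketch)."""
    lo, hi = min(baseline), max(baseline)
    # Assumption: equal-width bins over the baseline range; a constant
    # baseline falls back to a width of 1.0 to avoid dividing by zero.
    width = (hi - lo) / bins or 1.0

    def shares(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the baseline range are clamped to the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each share at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    p, q = shares(baseline), shares(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def classify_drift(score: float) -> str:
    """Map a PSI score to the review outcome, using the thresholds in the text."""
    if score > 0.25:
        return "severe"   # triggers retraining
    if score > 0.1:
        return "warning"
    return "stable"


if __name__ == "__main__":
    baseline = [float(x) for x in range(100)]
    shifted = [x + 50.0 for x in baseline]   # hypothetical shifted feature
    print(classify_drift(psi(baseline, baseline)))
    print(classify_drift(psi(baseline, shifted)))
```

Binning on the baseline keeps the bin edges fixed from week to week, so scores from different reviews stay comparable. Quantile bins are a common alternative when the feature is heavily skewed.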

Every alert has an owner and a response SLA:

  • Squad Steward. Technical alerts (latency, error rate, drift). 1-hour SLA during business hours.
  • CoE on-call. Alerts crossing squad boundaries or carrying security suspicion. 30-minute SLA, 24/7.
  • DevSecOps. Security alerts (prompt injection patterns, denial-of-wallet). 15-minute SLA, 24/7.
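
The alert rules and the ownership table above can be expressed as a small routing sketch. Everything named here (`Route`, `ROUTES`, `ALERT_CATEGORY`, `fired_alerts`, `route`) is hypothetical and only illustrates the mapping. The thresholds, owners, and SLAs are taken from the text. How the confidence shift is measured is not specified, so it is passed in as a precomputed fraction, and the token-usage spike is left out because no threshold is given.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    owner: str
    sla_minutes: int
    coverage: str


# Owners and SLAs as listed in the text.
ROUTES = {
    "technical": Route("Squad Steward", 60, "business hours"),
    "cross_squad": Route("CoE on-call", 30, "24/7"),
    "security": Route("DevSecOps", 15, "24/7"),
}

# Assumption: a fixed mapping from alert name to category.
ALERT_CATEGORY = {
    "latency": "technical",
    "error_rate": "technical",
    "drift": "technical",
    "confidence_shift": "technical",
    "prompt_injection": "security",
    "denial_of_wallet": "security",
}


def fired_alerts(latency_ms: float, p99_baseline_ms: float,
                 error_rate: float, confidence_shift: float) -> list[str]:
    """Return the names of the real-time alerts that fire for one window."""
    alerts = []
    if latency_ms > p99_baseline_ms:
        alerts.append("latency")
    if error_rate > 0.02:            # error rate above 2%
        alerts.append("error_rate")
    if confidence_shift > 0.20:      # shift above 20%, measured upstream
        alerts.append("confidence_shift")
    return alerts


def route(alert: str, crosses_squads: bool = False) -> Route:
    """Pick the owner and SLA for an alert."""
    category = ALERT_CATEGORY[alert]
    # Assumption: a technical alert that spans squads escalates to the CoE.
    if category == "technical" and crosses_squads:
        category = "cross_squad"
    return ROUTES[category]


if __name__ == "__main__":
    for name in fired_alerts(900.0, 800.0, 0.03, 0.05):
        r = route(name)
        print(f"{name}: {r.owner}, {r.sla_minutes} min, {r.coverage}")
```

Keeping the routing table as data rather than branching logic means an owner or SLA change is a one-line edit that can be reviewed on its own.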

Monitoring data rolls up into three reports:

  • Squad weekly review. Operational KPIs, drift score, top errors.
  • CoE monthly review. Cross-squad trends, drift across model families, fairness audit results.
  • Board quarterly review. Strategic patterns, top risks, KPI versus target.

Enterprise customers receive a quarterly transparency report with platform-wide operational KPIs and an incident summary.