Continuous monitoring and evaluation
AI models do not degrade linearly. They degrade in steps. A small distribution shift in production data can trigger a sudden drop in accuracy. For example, an OCR model at 99% accuracy today can sit at 92% in two weeks if a new invoice template gains adoption. Monitoring is therefore not an auxiliary activity. It is part of the core governance loop.
Context
We run two monitoring modes in parallel. A real-time stream watches every call. A scheduled set of audits compares today to a stable baseline. Together they catch sudden failures and slow drift before either reaches customers.
Real-time monitoring
Runs 24/7:
- Every LLM call is logged through the observability layer. Each record captures the prompt, model version, temperature, latency, token usage, output, and reasoning trace.
- Automatic alerts fire when latency exceeds the P99 baseline, when the error rate exceeds 2%, when token usage spikes, or when the confidence distribution shifts by more than 20%.
- 1% of outputs are sampled through LLM-as-a-Judge and scored live for Accuracy, Groundedness, and Safety.
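The alert conditions above can be sketched as a single threshold check. This is a minimal illustration, not the production rule set: the `WindowStats` and `Baseline` types, the `check_alerts` function, and the 2x multiplier used to define a token "spike" are all assumptions, and the way the confidence shift is measured is left to the caller because the text does not specify it.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Aggregates for one monitoring window (hypothetical shape)."""
    latency_ms: float        # observed latency for the window
    error_rate: float        # fraction of failed calls, 0.0 to 1.0
    tokens_per_call: float   # mean token usage per call
    confidence_shift: float  # relative shift in the confidence distribution


@dataclass
class Baseline:
    """Stable reference values (hypothetical shape)."""
    p99_latency_ms: float
    tokens_per_call: float


ERROR_RATE_LIMIT = 0.02        # from the text: error rate exceeds 2%
CONFIDENCE_SHIFT_LIMIT = 0.20  # from the text: shift of more than 20%
TOKEN_SPIKE_FACTOR = 2.0       # assumption: the text does not define a "spike"


def check_alerts(stats: WindowStats, baseline: Baseline) -> list[str]:
    """Return the names of every alert condition the window violates."""
    alerts = []
    if stats.latency_ms > baseline.p99_latency_ms:
        alerts.append("latency")
    if stats.error_rate > ERROR_RATE_LIMIT:
        alerts.append("error_rate")
    if stats.tokens_per_call > TOKEN_SPIKE_FACTOR * baseline.tokens_per_call:
        alerts.append("token_usage")
    if stats.confidence_shift > CONFIDENCE_SHIFT_LIMIT:
        alerts.append("confidence_shift")
    return alerts
```

A window with 1200 ms latency against a 1000 ms P99 baseline and a 3% error rate would raise the `latency` and `error_rate` alerts, while token usage and confidence stay within bounds.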
Scheduled monitoring
- Daily. KPI scan: STP rate, extraction accuracy, hallucination rate, cost per transaction.
- Weekly. Drift review: PSI (Population Stability Index) of input features against baseline. PSI > 0.1 is a drift warning. PSI > 0.25 is severe and triggers retraining.
- Monthly. Fairness audit: a fixed test set runs across SME and Enterprise cohorts to verify bias is not creeping in.
- Quarterly. End-to-end risk re-classification: every production AI feature returns to Step 1 of the six-step risk framework.
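To make the weekly drift review concrete, here is a minimal sketch of a PSI calculation over pre-binned counts, using the standard definition: the sum over bins of (actual share minus expected share) times the natural log of their ratio. The function names, the `eps` floor for empty bins, and the assumption that both inputs share the same bin edges are illustrative choices, not details taken from the pipeline described here. The thresholds are the ones stated in the list above.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between baseline and current bin counts.

    Both lists must use the same bin edges. `eps` is an assumed floor that
    keeps empty bins from producing log(0) or division by zero.
    """
    if len(expected_counts) != len(actual_counts):
        raise ValueError("baseline and current data must use the same bins")
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    score = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        expected_share = max(expected / expected_total, eps)
        actual_share = max(actual / actual_total, eps)
        score += (actual_share - expected_share) * math.log(actual_share / expected_share)
    return score


def classify_drift(score):
    """Map a PSI score to the drift levels named in the weekly review."""
    if score > 0.25:
        return "severe"   # triggers retraining
    if score > 0.1:
        return "warning"
    return "stable"
```

For instance, a feature split evenly across two bins in the baseline that moves to an 80/20 split in production scores roughly 0.416, which falls in the severe band.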
Alert ownership
Every alert has an owner and a response SLA:
- Squad Steward. Technical alerts (latency, error rate, drift). 1-hour SLA during business hours.
- CoE on-call. Alerts crossing squad boundaries or carrying security suspicion. 30-minute SLA, 24/7.
- DevSecOps. Security alerts (prompt injection patterns, denial-of-wallet). 15-minute SLA, 24/7.
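The ownership rules above could be expressed as a small routing table. The sketch below is illustrative only: the `Route` type, the alert kind strings, and the `route_alert` function are hypothetical, and the precedence (security alerts first, then cross-squad or suspicious alerts, then technical alerts) is an assumption, since the text does not say how overlapping cases are resolved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    owner: str
    sla_minutes: int
    coverage: str


SQUAD_STEWARD = Route("Squad Steward", 60, "business hours")
COE_ON_CALL = Route("CoE on-call", 30, "24/7")
DEVSECOPS = Route("DevSecOps", 15, "24/7")

# Hypothetical alert kind names, taken from the examples in the list above.
SECURITY_KINDS = {"prompt_injection", "denial_of_wallet"}
TECHNICAL_KINDS = {"latency", "error_rate", "drift"}


def route_alert(kind, cross_squad=False, security_suspicion=False):
    """Pick an owner and SLA for an alert (assumed precedence order)."""
    if kind in SECURITY_KINDS:
        return DEVSECOPS
    if cross_squad or security_suspicion:
        return COE_ON_CALL
    if kind in TECHNICAL_KINDS:
        return SQUAD_STEWARD
    # Assumption: unrecognized alert kinds escalate to the CoE on-call.
    return COE_ON_CALL
```

Under these assumptions, a drift alert confined to one squad goes to the Squad Steward with a 60-minute SLA, while the same alert flagged as crossing squad boundaries goes to the CoE on-call with a 30-minute SLA.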
Reporting
Monitoring data rolls up into three reports:
- Squad weekly review. Operational KPIs, drift score, top errors.
- CoE monthly review. Cross-squad trends, drift across model families, fairness audit results.
- Board quarterly review. Strategic patterns, top risks, KPI versus target.
Enterprise customers receive a quarterly transparency report with platform-wide operational KPIs and an incident summary.