KPIs and monitoring
We measure AI success on four mandatory KPIs and three operational metrics. Every one is computed continuously through the observability layer. Reported weekly to the CoE and quarterly to the Board. An unmeasured KPI is an uncommitted KPI.
Context
Section titled “Context”A KPI list exists to make trade-offs explicit. STP rate and Extraction Accuracy push us to ship more automation. Hallucination Rate and Cost per Transaction push us back when automation gets unsafe or uneconomic. The four together keep the system honest.
Four mandatory KPIs
Section titled “Four mandatory KPIs”Straight-Through Processing (STP) rate. Share of documents AI processes end-to-end with no human touch. The leading indicator for business value. Computed on a rolling 30-day window.
Extraction accuracy. Measured per field. Critical fields (tax ID, total, partner code, invoice date) exceed 99%. Secondary fields exceed 95%.
Hallucination rate. How often AI generates content not supported by the source. Measured by sampling 1% of output through LLM-as-a-Judge plus a weekly human-graded ground-truth set.
Cost per transaction. API token cost plus infrastructure cost per processed document. This KPI trends down as models are optimized and volume scales.
Three operational KPIs
Section titled “Three operational KPIs”Reporting cadence
Section titled “Reporting cadence”KPIs roll up at three levels:
- Squad weekly review. Feature-level KPIs owned by the squad.
- CoE monthly review. Cross-squad trends, platform-level KPIs.
- Board quarterly review. Strategic KPIs and material movements.
Enterprise customers receive a quarterly transparency report with platform KPIs and an incident summary. See §13 Reporting.
Alert thresholds and escalation
Section titled “Alert thresholds and escalation”| KPI | Warning (yellow) | Critical (red) | Action |
|---|---|---|---|
| STP rate | < 80% over 7 days | < 75% over 7 days | Squad investigates. CoE escalation if red |
| Extraction accuracy | < 98% on critical fields | < 95% | Pause new releases. Trigger model retrain |
| Hallucination rate | > 3% sampled | > 5% | Raise sampling to 5%. Consider model rollback |
| Cost per transaction | Up > 20% QoQ | Up > 40% QoQ | Engineering review. Prompt and model optimization |
| Uptime | < 99% in 24h | < 95% in 1h | Trigger SEV2 or SEV1 incident |
| Data drift | PSI > 0.1 | PSI > 0.25 | Schedule retrain in current sprint |
| HITL latency | > 60s median | > 120s median | UX review. Likely a UI issue |