AI incident management
When an AI incident hits, two things determine the outcome: how fast we contain it and how clearly we explain it afterwards. Bizzi runs a four-tier severity model, a four-phase playbook, and a fixed PIR template, so the response is deterministic instead of improvised.
Context
The riskiest moment in any incident is the first hour, when the team is still deciding what happened. We pre-decide that. Severity rules are written down. The kill-switch is pre-wired. The disclosure SLA is fixed. The team’s job is to execute the playbook, not to invent it under pressure.
Severity tiers
| Tier | Definition | AP-automation example | Response SLA | Communication |
|---|---|---|---|---|
| SEV1 | Core function lost or sensitive data leaked | AI Chat returns tenant A’s data to tenant B. All AI features down | < 15 min | Customer Success immediately. Board within 1 hour |
| SEV2 | Major degradation or risk of spread | Extraction accuracy drops below 90% on a common template. Kill-switch triggered | < 1 hour | Affected customers within 4 hours |
| SEV3 | Localized fault with workaround | Hallucination rate up on one use case. Latency spike in one region | < 1 business day | Squad lead handles. Included in weekly report |
| SEV4 | Cosmetic | Confidence color renders wrong on one browser | < 1 week | Backlog |
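
The table above can also be held as data so that tooling reads the same SLAs the team does. The following is a minimal sketch, not Bizzi's implementation: `Severity`, `TierPolicy`, `TIER_POLICIES`, and `response_deadline` are hypothetical names, and "1 business day" and "1 week" are approximated as calendar durations.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class TierPolicy:
    definition: str
    response_sla: timedelta
    communication: str


# Values copied from the severity table. Assumption: "1 business day" and
# "1 week" are modelled as plain calendar durations.
TIER_POLICIES = {
    Severity.SEV1: TierPolicy(
        "Core function lost or sensitive data leaked",
        timedelta(minutes=15),
        "Customer Success immediately. Board within 1 hour",
    ),
    Severity.SEV2: TierPolicy(
        "Major degradation or risk of spread",
        timedelta(hours=1),
        "Affected customers within 4 hours",
    ),
    Severity.SEV3: TierPolicy(
        "Localized fault with workaround",
        timedelta(days=1),
        "Squad lead handles. Included in weekly report",
    ),
    Severity.SEV4: TierPolicy("Cosmetic", timedelta(weeks=1), "Backlog"),
}


def response_deadline(severity: Severity, detected_at: datetime) -> datetime:
    """Return the latest time by which the response must have started."""
    return detected_at + TIER_POLICIES[severity].response_sla
```

For example, `response_deadline(Severity.SEV1, detected_at)` yields a deadline 15 minutes after detection.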
Four-phase playbook
Phase 1. Detect
- Automatic alert (see §9) or customer report.
- On-call engineer confirms, classifies severity, opens an incident ticket.
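
The classification step can be sketched as a small rule function built from the examples in the severity table. This is an illustration under assumptions: the `IncidentSignals` fields and the `classify` function are hypothetical, and the real decision stays with the on-call engineer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncidentSignals:
    """Hypothetical signals an on-call engineer might confirm."""
    cross_tenant_leak: bool = False          # e.g. tenant A's data shown to tenant B
    all_ai_features_down: bool = False
    extraction_accuracy: Optional[float] = None  # 0.0-1.0 on a common template
    kill_switch_triggered: bool = False
    localized_fault: bool = False            # e.g. latency spike in one region


def classify(signals: IncidentSignals) -> str:
    """Map confirmed signals to a severity tier, checking the worst tier first."""
    if signals.cross_tenant_leak or signals.all_ai_features_down:
        return "SEV1"
    accuracy_breach = (
        signals.extraction_accuracy is not None
        and signals.extraction_accuracy < 0.90  # threshold from the SEV2 example
    )
    if accuracy_breach or signals.kill_switch_triggered:
        return "SEV2"
    if signals.localized_fault:
        return "SEV3"
    return "SEV4"
```

Checking the most severe tier first means an incident that matches several rows is always filed at the highest applicable severity.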
Phase 2. Contain
- For SEV1 and SEV2, trigger the kill-switch if needed. Disable the AI feature while keeping ERP Sync running.
- Isolate the affected tenant or feature. Tighten rate limits. Block source IPs if it is an attack.
- Take a forensic snapshot: state, logs, and prompt history from the observability layer.
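
One way to picture the kill-switch is as a feature-flag toggle that refuses to touch protected services. The sketch below assumes an in-memory flag store. `FeatureFlags`, `PROTECTED_FEATURES`, and the flag names are hypothetical. Only AI Chat and ERP Sync are named in this document.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List

# Assumption: ERP Sync must never be disabled by the AI kill-switch.
PROTECTED_FEATURES = frozenset({"erp_sync"})


def _default_flags() -> Dict[str, bool]:
    # Hypothetical flag names for illustration only.
    return {"ai_chat": True, "ai_extraction": True, "erp_sync": True}


@dataclass
class FeatureFlags:
    enabled: Dict[str, bool] = field(default_factory=_default_flags)

    def kill_switch(self, features: Iterable[str]) -> List[str]:
        """Disable the requested features and return the ones actually turned off.

        Protected features are skipped, so ERP Sync keeps running even if it
        is passed in by mistake.
        """
        disabled = []
        for name in features:
            if name in PROTECTED_FEATURES:
                continue
            if self.enabled.get(name):
                self.enabled[name] = False
                disabled.append(name)
        return disabled
```

Returning the list of features that were actually disabled gives the incident ticket an exact record of what containment changed.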
Phase 3. Recover
- Identify an initial root cause.
- Apply a temporary fix, typically a model rollback through the AI Gateway.
- Validate the fix in staging before production.
- Restore service. Notify affected customers.
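
The ordering of the recovery steps, staging first and production only after validation passes, can be sketched as a gate. This is illustrative only: `recover`, `apply_fix`, and `validate` are hypothetical callables, and no AI Gateway API is assumed.

```python
from typing import Callable, List


def recover(
    apply_fix: Callable[[str], None],
    validate: Callable[[str], bool],
) -> List[str]:
    """Apply a temporary fix to staging, validate it, then promote to production.

    Returns a log of the steps taken. Production is never touched if the
    staging validation fails.
    """
    steps = []
    apply_fix("staging")
    steps.append("fix applied to staging")
    if not validate("staging"):
        steps.append("staging validation failed, production untouched")
        return steps
    apply_fix("production")
    steps.append("fix applied to production")
    steps.append("notify affected customers")
    return steps
```

In practice `apply_fix` would wrap whatever rollback mechanism is in use, and `validate` would run the relevant regression checks.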
Phase 4. Post-Incident Review (PIR)
A PIR is mandatory for SEV1 and SEV2 and optional for SEV3. It follows a fixed template:
- Summary. What happened, who was affected, how long it lasted.
- Timeline. Exact sequence from detection to recovery.
- Root cause. 5-Whys or Fishbone. No personal blame.
- Action items. Short-term (1 week), mid-term (1 month), long-term (1 quarter). Each item has an owner and a deadline.
- Lessons. What worked, what did not, what surprised us.
- Customer disclosure. What we tell customers and when.
SEV1 PIRs go to the Board within 30 days. SEV2 PIRs go to the CoE within 14 days.
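
The template and its routing rules could be captured in a small data model. The sketch below is an assumption-laden illustration: the class and function names are hypothetical, and because the text does not say which event starts the 30-day and 14-day clocks, the start date is passed in explicitly.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional, Tuple


@dataclass
class ActionItem:
    description: str
    owner: str
    deadline: date
    horizon: str  # "short" (1 week), "mid" (1 month), or "long" (1 quarter)


@dataclass
class PIR:
    severity: str
    summary: str = ""
    timeline: List[str] = field(default_factory=list)
    root_cause: str = ""  # 5-Whys or Fishbone, no personal blame
    action_items: List[ActionItem] = field(default_factory=list)
    lessons: str = ""
    customer_disclosure: str = ""


# Audience and delivery window in days, taken from the text above.
PIR_ROUTING = {"SEV1": ("Board", 30), "SEV2": ("CoE", 14)}


def pir_required(severity: str) -> bool:
    """A PIR is mandatory for SEV1 and SEV2 only."""
    return severity in ("SEV1", "SEV2")


def pir_due(severity: str, clock_start: date) -> Optional[Tuple[str, date]]:
    """Return (audience, due date), or None when no delivery deadline applies.

    Assumption: clock_start is whichever date the organisation counts from.
    """
    routing = PIR_ROUTING.get(severity)
    if routing is None:
        return None
    audience, days = routing
    return audience, clock_start + timedelta(days=days)
```

Requiring an owner and a deadline on every `ActionItem` mirrors the template rule that each action item has both.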
Customer disclosure
We commit to specific disclosure SLAs:
- SEV1. Notify within 4 hours of confirmation, with a summary and an ETA for the fix.
- SEV2. Notify within 24 hours, with a summary and a workaround.
- SEV3-4. Aggregated in the quarterly transparency report.
When personal data is involved, we follow the notification obligations of Decree 13/2023: notice to the competent authority and to the data subjects within the statutory window.
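
The SEV1 and SEV2 disclosure SLAs reduce to a deadline calculation that an incident tool could check automatically. The following is a minimal sketch with hypothetical names. It assumes the SEV2 window also runs from confirmation, which the text states only for SEV1, and it deliberately does not model the Decree 13/2023 statutory window.

```python
from datetime import datetime, timedelta
from typing import Optional

# Customer notification windows from the disclosure list above.
# SEV3 and SEV4 are reported in aggregate, so they have no per-incident deadline.
DISCLOSURE_SLA = {
    "SEV1": timedelta(hours=4),
    "SEV2": timedelta(hours=24),
}


def disclosure_deadline(severity: str, confirmed_at: datetime) -> Optional[datetime]:
    """Return the customer notification deadline, or None for SEV3 and SEV4."""
    sla = DISCLOSURE_SLA.get(severity)
    if sla is None:
        return None
    return confirmed_at + sla


def is_overdue(severity: str, confirmed_at: datetime, now: datetime) -> bool:
    """True when a per-incident deadline exists and has already passed."""
    deadline = disclosure_deadline(severity, confirmed_at)
    return deadline is not None and now > deadline
```

A scheduler could call `is_overdue` on every open SEV1 and SEV2 ticket and escalate any that return `True`.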