ADLC Stage 3, Testing (Shadow + A/B)
Evaluation against a test suite is necessary but not sufficient. Real traffic carries distributions no test suite reproduces: peak load, off-hours patterns, and edge cases nobody thought to label. Stage 3 puts the new version against real traffic in two steps: first invisibly (Shadow), then with a measured slice of users (A/B).
Context
Most production regressions at Bizzi were not caught by evaluation. They were caught by Shadow mode comparing the new and current versions on the same live inputs, or by phased A/B rollout where a 1% slice revealed a failure mode the test suite did not contain. Shadow is cheap insurance. A/B is the controlled experiment turning “looks good” into a defensible decision.
Shadow mode
Shadow mode runs the new version alongside the current one without exposing its output to users.
Procedure.
- Deploy the new version to a staging slot in the AI Gateway.
- Mirror production traffic asynchronously to the new version while the current version keeps serving users, so both see the same inputs.
- Log the new version’s output. Do not return it to the user.
- Compare new versus current on agreement rate, latency, cost, and safety regression.
- Run for one to two weeks to cover peak load, edge cases, and off-hours patterns.
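The mirroring step can be sketched as follows. This is a minimal illustration, not the AI Gateway's actual mechanism: `call_current`, `call_candidate`, `shadow_log`, and `drain` are hypothetical names, and the two model clients are stubs. The point it shows is that the candidate call runs as a background task, its output is only logged, and a candidate failure never reaches the user.

```python
import asyncio
import time

shadow_log: list[dict] = []            # stand-in for a real log sink
_pending: set[asyncio.Task] = set()    # keeps references so tasks are not garbage-collected

async def call_current(request: str) -> str:
    """Hypothetical client for the version that serves users today."""
    await asyncio.sleep(0)
    return request.strip().upper()

async def call_candidate(request: str) -> str:
    """Hypothetical client for the new version in the staging slot."""
    await asyncio.sleep(0)
    return request.strip().upper()

async def _shadow(request: str, current_output: str) -> None:
    started = time.perf_counter()
    try:
        candidate_output, error = await call_candidate(request), None
    except Exception as exc:  # candidate failures are logged, never raised to the user
        candidate_output, error = None, repr(exc)
    shadow_log.append({
        "request": request,
        "current_output": current_output,
        "candidate_output": candidate_output,
        "agree": candidate_output == current_output,
        "candidate_ms": (time.perf_counter() - started) * 1000,
        "error": error,
    })

async def handle(request: str) -> str:
    current_output = await call_current(request)
    # Fire the shadow call without awaiting it, so user latency is unaffected.
    task = asyncio.create_task(_shadow(request, current_output))
    _pending.add(task)
    task.add_done_callback(_pending.discard)
    return current_output  # the user only ever sees the current version

async def drain() -> None:
    """Wait for outstanding shadow calls, for example at shutdown or in tests."""
    if _pending:
        await asyncio.gather(*_pending)
```

Holding the task in `_pending` matters: the event loop keeps only weak references to tasks, so a fire-and-forget task with no other reference can be collected before it finishes.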
Metrics compared.
- Agreement rate. Percentage of cases where the new and current versions agree.
- Disagreement analysis. On a sampled subset, which one is correct? Manual review by the squad’s Data/AI Steward.
- Latency delta. P50, P95, P99.
- Cost delta. Token usage difference per request.
- Safety regression. Any new safety failure the current version did not produce.
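As a rough sketch, the automated metrics in this list can be computed from paired log records like this. The record fields (`current_ms`, `candidate_tokens`, `candidate_unsafe`, and so on) are assumed names for illustration, and exact string equality is used as a stand-in for "agree"; a real comparison would likely need a task-specific notion of agreement. Disagreement analysis is left out because it is a manual review.

```python
from statistics import mean, quantiles

def percentile(values: list[float], p: int) -> float:
    # quantiles(n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    # Needs at least two data points.
    return quantiles(values, n=100, method="inclusive")[p - 1]

def summarize(records: list[dict]) -> dict:
    """Compare candidate and current versions over paired shadow records."""
    agreed = sum(r["current_output"] == r["candidate_output"] for r in records)
    summary = {"agreement_rate": agreed / len(records)}

    for p in (50, 95, 99):
        current = percentile([r["current_ms"] for r in records], p)
        candidate = percentile([r["candidate_ms"] for r in records], p)
        summary[f"p{p}_latency_delta_ms"] = candidate - current

    # Cost delta as mean extra tokens per request (candidate minus current).
    summary["mean_token_delta"] = mean(
        r["candidate_tokens"] - r["current_tokens"] for r in records
    )
    # A safety regression is a failure the current version did not produce.
    summary["new_safety_failures"] = sum(
        r["candidate_unsafe"] and not r["current_unsafe"] for r in records
    )
    return summary
```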
Graduation criteria. A new version graduates to A/B when agreement rate exceeds 95% on steady-state traffic, disagreement analysis shows the new version is at least as correct as the current one, latency and cost are within acceptable range, and there is zero safety regression.
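One way to make the graduation decision mechanical is a check that returns the list of unmet criteria. This is a sketch under assumptions: `ShadowResult` and its fields are hypothetical, and because the text does not define "acceptable range" for latency and cost, those limits are passed in by the caller rather than hard-coded.

```python
from dataclasses import dataclass

@dataclass
class ShadowResult:
    agreement_rate: float         # on steady-state traffic, 0.0 to 1.0
    candidate_correct: int        # reviewed disagreements where the new version was right
    current_correct: int          # reviewed disagreements where the current version was right
    p95_latency_delta_pct: float  # candidate vs current, in percent
    cost_delta_pct: float         # candidate vs current, in percent
    new_safety_failures: int

def graduation_blockers(
    r: ShadowResult, max_latency_delta_pct: float, max_cost_delta_pct: float
) -> list[str]:
    """Return the unmet criteria; an empty list means the version may enter A/B."""
    blockers = []
    if not r.agreement_rate > 0.95:  # must exceed 95%, not merely reach it
        blockers.append("agreement rate does not exceed 95%")
    if r.candidate_correct < r.current_correct:
        blockers.append("new version is less correct on reviewed disagreements")
    if r.p95_latency_delta_pct > max_latency_delta_pct:
        blockers.append("latency outside acceptable range")
    if r.cost_delta_pct > max_cost_delta_pct:
        blockers.append("cost outside acceptable range")
    if r.new_safety_failures > 0:
        blockers.append("safety regression")
    return blockers
```

Returning reasons instead of a bare boolean keeps the decision auditable when a version is held back.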
A/B testing, phased rollout
A new version clearing Shadow mode enters phased rollout to real users.
- Phase 1. 1% traffic. Random sample within opt-in cohorts. Monitor 48 to 72 hours.
- Phase 2. 10% traffic. Expand if Phase 1 is healthy.
- Phase 3. 50% traffic. Required for high-impact use cases as an additional checkpoint.
- Phase 4. 100% traffic. The previous version is retained as fallback for at least 30 days.
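A common way to implement this kind of phased split, shown here as an assumption rather than as Bizzi's actual mechanism, is deterministic hash bucketing. Each unit (a user or tenant ID) hashes to a stable bucket, and a phase admits every bucket below its threshold. The names `PHASE_PERCENT`, `bucket`, and `serves_candidate` are illustrative.

```python
import hashlib

# Traffic percentage per phase, taken from the list above.
PHASE_PERCENT = {1: 1, 2: 10, 3: 50, 4: 100}
BUCKETS = 10_000  # resolution of 0.01%

def bucket(experiment: str, unit_id: str) -> int:
    """Map a unit to a stable bucket in [0, BUCKETS)."""
    # Salting with the experiment name decorrelates assignments across tests.
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS

def serves_candidate(experiment: str, unit_id: str, phase: int) -> bool:
    """True if this unit receives the new version in the given phase."""
    threshold = PHASE_PERCENT[phase] * BUCKETS // 100
    return bucket(experiment, unit_id) < threshold
```

Because thresholds only grow, a unit admitted in Phase 1 stays in the treatment group in every later phase, so nobody flips back and forth between versions as the rollout expands. Opt-in filtering would be applied before this check.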
Kill criteria. At any phase, the rollout is killed and the version goes back to Stage 1 or 2 if any of the following holds.
- A business KPI (STP rate, accuracy) drops more than 5% versus baseline.
- Latency rises more than 50%.
- Any safety incident occurs.
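The kill criteria translate directly into a guard that a monitoring job could evaluate on each interval. In this sketch the function name and parameters are hypothetical, and both percentage thresholds are read as relative changes against baseline; the text does not say whether the 5% KPI drop is relative or in absolute percentage points, so that interpretation is an assumption.

```python
def kill_reasons(
    baseline_kpi: float,
    observed_kpi: float,
    baseline_latency_ms: float,
    observed_latency_ms: float,
    safety_incidents: int,
) -> list[str]:
    """Return the triggered kill criteria; an empty list means keep going."""
    reasons = []

    # Assumption: "drops more than 5%" is a relative drop versus baseline.
    kpi_drop = (baseline_kpi - observed_kpi) / baseline_kpi
    if kpi_drop > 0.05:
        reasons.append(f"KPI dropped {kpi_drop:.1%} versus baseline")

    latency_rise = (observed_latency_ms - baseline_latency_ms) / baseline_latency_ms
    if latency_rise > 0.50:
        reasons.append(f"latency rose {latency_rise:.1%} versus baseline")

    if safety_incidents > 0:
        reasons.append(f"{safety_incidents} safety incident(s)")

    return reasons
```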
Cohort selection
A/B is not pure random sampling. We weight cohort selection to manage customer risk.
- Enterprise and banking customers are opted out of early A/B by default. Their workloads are too critical for unreviewed experimentation.
- Sandbox tenants are opted in by default and provide a faster early signal.
- Stratified sampling ensures each customer segment (SMB, Mid, Enterprise) is represented in the eventual 50% phase.
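The defaults and the stratified draw could look like the sketch below. The tenant fields (`segment`, `industry`, `is_sandbox`) and both function names are assumptions for illustration. The text only states defaults for sandbox, Enterprise, and banking tenants, so the sketch returns `None` for everyone else and leaves that decision to the caller.

```python
import random
from collections import defaultdict
from typing import Optional

def default_early_opt_in(tenant: dict) -> Optional[bool]:
    """Default early-A/B status; None means the text above does not specify one."""
    if tenant["is_sandbox"]:
        return True
    if tenant["segment"] == "Enterprise" or tenant["industry"] == "banking":
        return False
    return None

def stratified_sample(tenants: list[dict], fraction: float, seed: int = 0) -> list[dict]:
    """Draw the same fraction from every segment so none is left out."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    by_segment: dict[str, list[dict]] = defaultdict(list)
    for tenant in tenants:
        by_segment[tenant["segment"]].append(tenant)

    chosen = []
    for segment in sorted(by_segment):
        members = by_segment[segment]
        # At least one tenant per segment, even when the segment is tiny.
        k = max(1, round(len(members) * fraction))
        chosen.extend(rng.sample(members, k))
    return chosen
```

Sampling within each segment, rather than across the whole tenant list, is what guarantees that a small segment cannot be missed by chance in the 50% phase.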
Documentation per test
Every A/B test ships with a written record.
- Hypothesis. What the new version is expected to improve.
- Required sample size. Calculated from statistical power for the target effect size.
- Decision criteria. Kill, ship, or extend, defined before the test starts.
- Final report. Outcome plus lessons, archived with the prompt version.
The “decided before” part matters. Picking criteria after seeing results is how teams convince themselves a regression is a wash.
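For the required sample size, one standard calculation applies when the KPI is a proportion, as STP rate and accuracy are: the two-sided two-proportion z-test. The sketch below is one such calculation and not necessarily the one used here; the significance level of 0.05 and power of 0.80 are conventional defaults chosen for illustration, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(
    p_baseline: float, p_target: float, alpha: float = 0.05, power: float = 0.80
) -> int:
    """Requests needed in each arm to detect p_baseline vs p_target.

    Uses the normal-approximation formula for a two-sided test of two
    proportions with equal arm sizes.
    """
    if p_baseline == p_target:
        raise ValueError("target effect size must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    p_bar = (p_baseline + p_target) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(
            p_baseline * (1 - p_baseline) + p_target * (1 - p_target)
        )
    ) ** 2
    return math.ceil(numerator / (p_baseline - p_target) ** 2)
```

Halving the target effect size roughly quadruples the required sample, which is why the effect size belongs in the written record before the test starts: it determines whether a 1% slice monitored for 48 to 72 hours can answer the question at all.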