Pillar IV: Data, AIOps, Infrastructure · § 09

ADLC Stage 4, Deployment (AI Gateway + Fallback)

Direct calls from application code to vendor APIs do not survive the failure modes Bizzi has to handle. A vendor outage at 3am. An unexpected rate limit. A sudden price change. The AI Gateway sits between every application and every LLM vendor, and it is the only thing standing between a vendor incident and a customer-facing outage.

Context

The AI Gateway started as a thin proxy and grew into the load-bearing layer for everything we want to do at runtime: routing, fallback, rate limiting, observability, and cost attribution. Centralizing these concerns in one layer means we can change vendors without touching application code and enforce per-tenant cost ceilings without each squad re-implementing the same logic.

What the AI Gateway does

Routing. Decides which model handles each request based on configuration, not hardcoded calls.
Automatic fallback. Switches to a backup model when the primary fails (see below).
API key management. Vendor secrets live in the gateway, never in application code or environment variables.
Rate limiting. Per-tenant and per-IP quotas (Pillar V §10 covers the Denial-of-Wallet defense rationale).
Logging. Every call is logged centrally through the observability layer.
Cost tracking. Token usage is attributed per tenant, per feature, per model.
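
As an illustration of the cost-tracking item, here is a minimal sketch of per-tenant, per-feature, per-model attribution with a ceiling check. The class name UsageLedger, its methods, and the tenant, feature, and model names are hypothetical, and the ceiling is expressed in tokens rather than currency purely to keep the example small. It is not the gateway's actual implementation.

```python
from collections import defaultdict


class UsageLedger:
    """Accumulates token usage keyed by (tenant, feature, model). Hypothetical sketch."""

    def __init__(self) -> None:
        self._tokens = defaultdict(int)

    def record(self, tenant: str, feature: str, model: str, tokens: int) -> None:
        # Attribute every call to the full (tenant, feature, model) key.
        self._tokens[(tenant, feature, model)] += tokens

    def tenant_total(self, tenant: str) -> int:
        # Sum across all features and models for one tenant.
        return sum(n for (t, _, _), n in self._tokens.items() if t == tenant)

    def over_ceiling(self, tenant: str, ceiling_tokens: int) -> bool:
        # Assumption: the ceiling is a token budget; a real ceiling may be in currency.
        return self.tenant_total(tenant) > ceiling_tokens


ledger = UsageLedger()
ledger.record("tenant-a", "feature-1", "model-x", 1200)
ledger.record("tenant-a", "feature-2", "model-y", 300)
ledger.record("tenant-b", "feature-2", "model-y", 50)
```

Because the key carries all three dimensions, the same ledger can answer per-feature or per-model questions by filtering on a different position of the key.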

Fallback routing

Every production model has at least one tested fallback. Fallback triggers when the primary:

Returns a 5xx error.
Exceeds its latency threshold (typically 2× the P99 baseline).
Hits a vendor rate limit.
Becomes unreachable due to vendor region outage.
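
The four trigger conditions can be sketched as a single classification function. This is a hedged illustration: the names CallOutcome and fallback_reason are hypothetical, and it assumes vendors signal rate limiting with HTTP 429 and that an unreachable vendor shows up as a call with no status code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CallOutcome:
    status_code: Optional[int]  # None when the vendor could not be reached at all
    latency_ms: float


def fallback_reason(
    outcome: CallOutcome,
    p99_baseline_ms: float,
    latency_multiplier: float = 2.0,  # "typically P99 baseline x 2"
) -> Optional[str]:
    """Return why fallback should trigger, or None if the primary result stands."""
    if outcome.status_code is None:
        return "unreachable"
    if outcome.status_code == 429:  # assumption: vendor uses 429 for rate limits
        return "rate_limited"
    if 500 <= outcome.status_code <= 599:
        return "server_error"
    if outcome.latency_ms > p99_baseline_ms * latency_multiplier:
        return "latency"
    return None
```

Returning the reason rather than a bare boolean keeps the trigger visible in logs and alerts.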

The fallback model must have passed Stage 2 evaluation independently. Fallback is not “any other model we have lying around.”

Routing logic:

Request → AI Gateway
  ├── Primary model (try)
  │   └── Success → return
  ├── Error / Timeout
  │   └── Fallback model 1 (try)
  │       └── Success → return + alert
  ├── Error / Timeout
  │   └── Fallback model 2 (try)
  │       └── Success → return + escalate alert
  └── All fail → return error to app + page DevSecOps
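
The chain above can be sketched as a loop over an ordered list of models. This is a minimal illustration under stated assumptions: call_with_fallback, AllModelsFailed, and the alert callback are hypothetical names, each model is represented by a plain callable, and any exception from that callable stands in for "Error / Timeout".

```python
from typing import Callable, List, Sequence, Tuple

ModelCall = Callable[[str], str]


class AllModelsFailed(Exception):
    """Raised when the primary and every fallback have failed."""


def call_with_fallback(
    prompt: str,
    models: Sequence[Tuple[str, ModelCall]],  # ordered: primary first, then fallbacks
    alert: Callable[[str, str], None],        # hypothetical hook: (level, message)
) -> str:
    errors: List[str] = []
    for position, (name, call) in enumerate(models):
        try:
            result = call(prompt)
        except Exception as exc:  # any failure moves on to the next model
            errors.append(f"{name}: {exc}")
            continue
        if position == 1:
            alert("alert", f"served by fallback {name}")
        elif position >= 2:
            alert("escalate", f"served by fallback {name}")
        return result
    # Nothing succeeded: page on-call and surface the error to the application.
    alert("page", "all models failed")
    raise AllModelsFailed("; ".join(errors))
```

A success on the primary returns silently, while a success further down the chain still returns a result but raises the alert level, matching the diagram.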

Vendor concentration limit

No single LLM vendor may exceed 70% of total inference volume. This is a hard architectural constraint, not a preference.

Primary and fallback are deliberately from different vendors.
All prompts are portable across vendors (see §6 Design).
At least one open-source model is deployed in our own infrastructure as a fallback for the most critical use cases.
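
A concentration check along these lines could run against gateway usage data. The sketch below assumes "inference volume" is measured in request counts per vendor. The source does not say whether requests, tokens, or spend is the unit, and the function names are hypothetical.

```python
from typing import Dict, List


def vendor_shares(request_counts: Dict[str, int]) -> Dict[str, float]:
    """Fraction of total volume handled by each vendor."""
    total = sum(request_counts.values())
    if total == 0:
        return {}
    return {vendor: count / total for vendor, count in request_counts.items()}


def vendors_over_limit(request_counts: Dict[str, int], limit: float = 0.70) -> List[str]:
    """Vendors whose share strictly exceeds the limit (70% by default)."""
    shares = vendor_shares(request_counts)
    return sorted(v for v, share in shares.items() if share > limit)
```

For example, counts of 75 and 25 across two vendors flag the first vendor, while 70 and 30 sit exactly at the limit and do not.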

The goal is concrete. If a vendor unexpectedly raises prices, shuts down API access, or is compelled to disclose data under a government order, we route around them within hours, not weeks.

Promotion process

When a new version clears Stage 3 and is ready for 100% traffic:

Promote. The new version becomes Primary in the AI Gateway routing config.
Retain the old version as fallback for at least 30 days. This is what makes one-click rollback possible.
Update documentation. Model Card updated, release notes published.
Notify customers if behavior changes materially. Release notes go to the customer portal.
Audit. The promotion event is recorded in the immutable audit log.
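
The configuration side of promotion (promote, retain, audit) might look like the sketch below. RouteConfig, promote, and the audit event fields are hypothetical, and placing the previous primary at the head of the fallback list is an assumption of this sketch, since the source does not specify its order relative to cross-vendor fallbacks. The documentation and customer-notification steps are outside what code like this would cover.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class RouteConfig:
    primary: str
    fallbacks: Tuple[str, ...]
    keep_previous_until: Optional[date] = None  # earliest date the old version may be dropped


def promote(
    config: RouteConfig,
    new_version: str,
    today: date,
    retention_days: int = 30,  # "at least 30 days"
) -> Tuple[RouteConfig, Dict[str, str]]:
    previous = config.primary
    # Keep existing fallbacks, without duplicating the old or new primary.
    remaining = tuple(f for f in config.fallbacks if f not in (new_version, previous))
    new_config = RouteConfig(
        primary=new_version,
        fallbacks=(previous,) + remaining,  # assumption: old version goes first
        keep_previous_until=today + timedelta(days=retention_days),
    )
    # Event to append to the audit log; writing it immutably is out of scope here.
    audit_event = {
        "event": "promotion",
        "previous_primary": previous,
        "new_primary": new_version,
        "date": today.isoformat(),
    }
    return new_config, audit_event
```

Returning a new frozen config instead of mutating the old one keeps the previous state available for comparison and for the audit record.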

Rollback

Rollback is triggered manually by the CoE Lead, by a KPI degradation rule from Stage 5 monitoring, or by a SEV1 or SEV2 incident. The mechanism itself is one configuration change in the AI Gateway: swap primary and fallback. Target rollback time is under five minutes from decision to traffic shifted.
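
The swap itself is small enough to show in full. This is a minimal sketch with hypothetical names (Route, rollback), assuming the routing entry holds exactly one primary and one retained fallback.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Route:
    primary: str
    fallback: str


def rollback(route: Route) -> Route:
    """Swap primary and fallback; the previous primary stays available as fallback."""
    return replace(route, primary=route.fallback, fallback=route.primary)
```

Because the operation is a pure swap, applying it twice restores the original routing, which makes it safe to reverse a rollback that turns out to be unnecessary.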