Pillar V: AI Security · § 10

Rate limiting and denial of wallet (LLM04)

LLM04 is the resource exhaustion risk. Classical denial of service tries to take a service down. The LLM-specific variant (denial of wallet) does not need the service to go down at all. It only needs Bizzi (or one of our customers) to pay an enormous vendor bill for traffic with no legitimate business value. The defense has the same shape as classical rate limiting, with one new layer: cost.

The two failure modes

The threat has two faces, and our control surface addresses both at once.

Denial of service. Concurrent expensive requests starve legitimate users of capacity.
Denial of wallet. Token-priced API calls accumulate a vendor bill that Bizzi or the customer pays. The attacker does not need the service to fall over. They only need our finance team to receive the invoice.

Multi-layer limits

Five layers operate together. No single layer is sufficient, because each one catches a different abuse shape.

Layer	Default limit	What it catches
Per-IP	100 req/min unauthenticated; 1,000 req/min authenticated	Bot floods, credential stuffing
Per-user	60 LLM calls/hour	Compromised account, internal misuse
Per-tenant	Plan-dependent. Default 10,000 LLM calls/day	Tenant-wide attack, runaway integration
Per-feature	Each feature has its own quota	Noisy feature starving critical paths
Cost ceiling	Daily dollar cap, plan-dependent	The denial-of-wallet hard stop
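The request-count layers in the table can be combined into a single admission check. What follows is a minimal Python sketch under stated assumptions, not the gateway's actual implementation: it keeps sliding windows in memory, and the names SlidingWindowLimiter, LIMITS, and the layer keys are hypothetical. Per-feature quotas and the cost ceiling are left out because the table gives no numeric defaults for them.

```python
import time
from collections import defaultdict, deque

# (window in seconds, max calls per window), taken from the default limits table.
# The layer names are hypothetical labels chosen for this sketch.
LIMITS = {
    "ip_unauth": (60, 100),
    "ip_auth": (60, 1_000),
    "user": (3_600, 60),
    "tenant": (86_400, 10_000),
}


class SlidingWindowLimiter:
    """In-memory sliding-window counters, one per (layer, key) pair."""

    def __init__(self, limits=LIMITS):
        self.limits = limits
        self.events = defaultdict(deque)  # (layer, key) -> timestamps of accepted calls

    def _count(self, layer, key, now):
        window, _ = self.limits[layer]
        q = self.events[(layer, key)]
        # Drop timestamps that have aged out of the window.
        while q and q[0] <= now - window:
            q.popleft()
        return len(q)

    def allow(self, checks, now=None):
        """checks is a list of (layer, key). Returns (allowed, blocking_layer)."""
        now = time.monotonic() if now is None else now
        # Every layer must have headroom before the call is admitted.
        for layer, key in checks:
            if self._count(layer, key, now) >= self.limits[layer][1]:
                return False, layer
        # Record the call against every layer only once it is admitted.
        for layer, key in checks:
            self.events[(layer, key)].append(now)
        return True, None
```

A caller passes every layer that applies to the request, for example [("ip_auth", ip), ("user", user_id), ("tenant", tenant_id)]. The first layer without headroom is returned, so the rejection can be logged against the layer that caught it. A production deployment would normally keep these counters in a shared store rather than in process memory.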

Adaptive limits

Static limits are a first approximation. The production limits adapt to three signals: the tenant’s historical traffic pattern, the time of day (peak vs off-hours), and detected anomalies in burst shape. A tenant that suddenly emits ten times its baseline at 3 a.m. local time gets throttled before it crosses the static cap.
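One way to picture the adaptive behaviour is as a limit derived from the tenant's hourly baseline and capped by the static limit. The sketch below is illustrative only: TenantProfile, burst_tolerance, and floor are hypothetical names and values, and burst-shape anomaly detection is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class TenantProfile:
    static_cap_per_hour: int  # the plan's static limit
    # Hypothetical baseline: local hour of day (0-23) -> typical calls in that hour.
    baseline_per_hour: dict = field(default_factory=dict)


def adaptive_limit(profile, local_hour, burst_tolerance=3.0, floor=10):
    """Return the effective hourly limit for this tenant at this local hour.

    burst_tolerance (how far above baseline traffic may go) and floor
    (a minimum so quiet tenants are never locked out) are assumed values.
    """
    baseline = profile.baseline_per_hour.get(local_hour, 0.0)
    adaptive = max(floor, int(baseline * burst_tolerance))
    # The adaptive limit can only tighten the static cap, never loosen it.
    return min(profile.static_cap_per_hour, adaptive)
```

With a baseline of 20 calls at 3 a.m. and a static cap of 1,000, the effective limit is 60. A burst of 200 calls (ten times the baseline) is throttled even though it is well under the static cap.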

Token complexity check

A single request can be cheap by count but ruinous by cost. A prompt with max_tokens=8K and deep agent recursion is one example. The gateway estimates the token cost of each call before it is sent to the vendor. If the estimate exceeds the threshold for the user’s tier, the call is rejected. If it is borderline, the call is throttled or downgraded to a cheaper model.
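The pre-flight check can be sketched as a worst-case cost estimate followed by a three-way decision. Everything specific here is an assumption made for illustration: the prices, the model names "large" and "small", the four-characters-per-token heuristic, and the borderline_ratio that defines "borderline". A real gateway would use the vendor's tokenizer and price list.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    DOWNGRADE = "downgrade"  # throttle or route to a cheaper model
    REJECT = "reject"


# Hypothetical USD prices per 1,000 tokens; not real vendor pricing.
PRICE_PER_1K = {
    "large": {"input": 0.01, "output": 0.03},
    "small": {"input": 0.001, "output": 0.002},
}


def estimate_cost_usd(prompt, max_tokens, model="large", max_agent_steps=1):
    """Worst-case estimate: full max_tokens output on every agent step."""
    prompt_tokens = max(1, len(prompt) // 4)  # crude heuristic, not a tokenizer
    price = PRICE_PER_1K[model]
    per_step = (prompt_tokens / 1000) * price["input"] \
        + (max_tokens / 1000) * price["output"]
    return per_step * max_agent_steps


def check_request(prompt, max_tokens, max_agent_steps, tier_threshold_usd,
                  borderline_ratio=0.8):
    """Reject above the tier threshold; downgrade when close to it."""
    cost = estimate_cost_usd(prompt, max_tokens, "large", max_agent_steps)
    if cost > tier_threshold_usd:
        return Decision.REJECT
    if cost >= borderline_ratio * tier_threshold_usd:
        return Decision.DOWNGRADE
    return Decision.ALLOW
```

Multiplying by max_agent_steps is what makes deep agent recursion visible to the check: a call that looks affordable as a single step can exceed the threshold once the recursion depth is taken into account.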

Cost ceiling enforcement

Every tenant has daily and monthly cost caps (Pillar IV §13). Enforcement has three steps.

At 80% of the cap. The tenant administrator receives an email alert.
At 95%. An in-app banner appears and, where possible, AI calls automatically downgrade to a cheaper model.
At 100%. AI features pause for that tenant. A manual override by a Bizzi operator is required to re-enable them.

This protects Bizzi. It also protects the customer. If an attacker compromises a customer account and tries to burn through their budget, the ceiling stops the bleed for both sides at once.
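The three enforcement steps map directly onto thresholds over the ratio of spend to cap. A minimal sketch, in which the function name and the action labels are hypothetical:

```python
def ceiling_action(spend_usd, cap_usd):
    """Map current spend against a cap to the enforcement step that applies."""
    if cap_usd <= 0:
        raise ValueError("cap_usd must be positive")
    ratio = spend_usd / cap_usd
    if ratio >= 1.0:
        return "pause"                 # AI features pause until an operator override
    if ratio >= 0.95:
        return "banner_and_downgrade"  # in-app banner, cheaper model where possible
    if ratio >= 0.80:
        return "email_admin"           # alert the tenant administrator
    return "none"
```

The function is stateless, so the same check can be run against both the daily and the monthly cap. In a real system the paused state would need to be stored, since it is meant to hold until a Bizzi operator lifts it.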

Detection

Anomaly patterns are detected in real time and alert DevSecOps with full context: tenant, IP, feature, and projected cost. The patterns we monitor for include:

Bursts of requests from a single IP.
Sudden shifts in a tenant’s request pattern.
Token usage spikes concentrated on a single feature.
Long-running queries with high wall-clock time.
Geographic anomalies, such as requests from regions the tenant has never used before.

The alert pipeline targets Slack so the on-call engineer investigates in the same channel where the rest of the incident lives.
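As an illustration of one pattern from the list, a token usage spike on a single feature can be flagged with a simple z-score against recent history, and the alert context assembled before it is posted. This is a sketch under assumptions: the z-score test stands in for whatever detector is actually deployed, and the function names, field names, and price parameter are hypothetical. Posting to Slack is omitted.

```python
from statistics import mean, pstdev


def is_spike(history, current, z_threshold=3.0):
    """Flag current if it sits more than z_threshold deviations above the mean."""
    if len(history) < 2:
        return False  # not enough data to judge
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return current > mu
    return (current - mu) / sigma > z_threshold


def build_alert(tenant, ip, feature, tokens_last_hour, usd_per_1k_tokens,
                hours_remaining):
    """Assemble the context fields the alert carries."""
    # Naive projection: assume the last hour's rate continues unchanged.
    projected = (tokens_last_hour / 1000) * usd_per_1k_tokens * hours_remaining
    return {
        "tenant": tenant,
        "ip": ip,
        "feature": feature,
        "projected_cost_usd": round(projected, 2),
    }
```

Putting the projected cost in the alert lets the on-call engineer judge urgency without first running a separate billing query.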

Response playbook

When a denial-of-wallet attack is suspected, the response is fast and contained.

Investigate. Is this an attack or a legitimate usage spike?
If attack. Block the source IP, tighten the tenant’s rate limit aggressively, revoke any compromised credentials.
If legitimate. Raise limits temporarily, capacity-plan, and notify the customer success owner.
Document. Post-incident review for any SEV1 or SEV2 case, per §12.