Rate limiting and denial of wallet (LLM04)
LLM04 is the resource exhaustion risk. Classical denial of service tries to take a service down. The LLM-specific variant, denial of wallet, does not need the service to go down at all. It only needs Bizzi (or one of our customers) to pay an enormous vendor bill for traffic with no legitimate business value. The defense has the same shape as classical rate limiting, with one new layer: cost.
The two failure modes
The threat has two faces, and our control surface addresses both at once.
- Denial of service. Concurrent expensive requests starve legitimate users of capacity.
- Denial of wallet. Token-priced API calls accumulate a vendor bill that Bizzi or the customer pays. The attacker does not need the service to fall over. They only need our finance team to receive the invoice.
Multi-layer limits
Five layers operate together. No single layer is sufficient, because each catches a different shape of abuse.
| Layer | Default limit | What it catches |
|---|---|---|
| Per-IP | 100 req/min unauthenticated; 1,000 req/min authenticated | Bot floods, credential stuffing |
| Per-user | 60 LLM calls/hour | Compromised account, internal misuse |
| Per-tenant | Plan-dependent; default 10,000 LLM calls/day | Tenant-wide attack, runaway integration |
| Per-feature | Each feature has its own quota | Noisy feature starving critical paths |
| Cost ceiling | Daily dollar cap, plan-dependent | The denial-of-wallet hard stop |
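As a rough illustration of how the count-based layers in the table could compose, here is a minimal sketch using fixed-window counters. The class and key names (`LayeredLimiter`, `ip_auth`, and so on) are hypothetical, every call is treated as an LLM call, and the per-feature and cost-ceiling layers are left out for brevity. It is not a description of the production gateway.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    max_calls: int
    window_seconds: int


# Defaults taken from the table above. The layer names are illustrative.
LIMITS = {
    "ip_unauth": Limit(100, 60),
    "ip_auth": Limit(1_000, 60),
    "user": Limit(60, 3_600),
    "tenant": Limit(10_000, 86_400),
}


class LayeredLimiter:
    """Fixed-window counters, one per (layer, key, window)."""

    def __init__(self, limits=LIMITS):
        self.limits = limits
        self.counts = defaultdict(int)

    def check(self, now, ip, user=None, tenant=None):
        """Return the first exhausted layer, or None if the call is allowed."""
        layers = [("ip_auth" if user else "ip_unauth", ip)]
        if user:
            layers.append(("user", user))
        if tenant:
            layers.append(("tenant", tenant))

        buckets = []
        for layer, key in layers:
            limit = self.limits[layer]
            bucket = (layer, key, int(now // limit.window_seconds))
            if self.counts[bucket] >= limit.max_calls:
                return layer  # rejected calls do not consume quota
            buckets.append(bucket)

        # Every layer has headroom, so record the call against all of them.
        for bucket in buckets:
            self.counts[bucket] += 1
        return None
```

Checking every layer before incrementing any counter means a call rejected by one layer does not eat into the quota of the others.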
Adaptive limits
Static limits are a first approximation. The production limits adapt to three signals: the tenant’s historical traffic pattern, the time of day (peak vs off-hours), and detected anomalies in burst shape. A tenant that suddenly emits ten times its baseline at 3 a.m. local time gets throttled before it crosses the static cap.
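One way to picture the first two signals is a limit that scales with the tenant's baseline and tightens outside peak hours. The sketch below is an assumption-laden illustration: the multipliers, the 08:00 to 20:00 peak window, and the function names are hypothetical, and burst-shape anomaly detection is not modelled.

```python
def adaptive_limit(baseline_per_hour, local_hour, static_cap_per_hour,
                   peak_multiplier=5.0, off_hours_multiplier=2.0,
                   peak_hours=range(8, 20)):
    """Allowed calls/hour: a multiple of the baseline, never above the static cap.

    The multipliers and peak window are hypothetical tuning values.
    """
    if local_hour in peak_hours:
        multiplier = peak_multiplier
    else:
        multiplier = off_hours_multiplier
    return min(static_cap_per_hour, baseline_per_hour * multiplier)


def should_throttle(current_per_hour, baseline_per_hour, local_hour,
                    static_cap_per_hour):
    """True when current traffic exceeds the adaptive limit."""
    limit = adaptive_limit(baseline_per_hour, local_hour, static_cap_per_hour)
    return current_per_hour > limit
```

With a baseline of 20 calls/hour and a static cap of 1,000, a burst of 200 calls/hour at 3 a.m. exceeds the off-hours limit of 40 and is throttled well before the static cap is reached.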
Token complexity check
A single request can be cheap by count but ruinous by cost; a prompt with max_tokens=8K and deep agent recursion is one example. The gateway estimates the token cost of each call before it is sent to the vendor. If the estimate exceeds the threshold for the user’s tier, the call is rejected. If it is borderline, the call is throttled or downgraded to a cheaper model.
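The pre-flight check could look roughly like the sketch below. Everything specific here is an assumption for illustration: the four-characters-per-token heuristic stands in for a real tokenizer, `agent_steps` is a hypothetical bound on recursion depth, the price is a placeholder, and "borderline" is taken to mean within 80% of the tier threshold.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    DOWNGRADE = "downgrade"  # or throttle; both are borderline outcomes
    REJECT = "reject"


def estimate_cost_usd(prompt, max_tokens, agent_steps, usd_per_1k_tokens):
    """Worst-case cost estimate for one call, before it reaches the vendor."""
    # Rough heuristic only: about 4 characters per token for English text.
    prompt_tokens = len(prompt) / 4
    # Assume each agent step may resend the prompt and use the full completion budget.
    worst_case_tokens = (prompt_tokens + max_tokens) * agent_steps
    return worst_case_tokens / 1000 * usd_per_1k_tokens


def decide(estimated_usd, tier_threshold_usd, borderline_fraction=0.8):
    """Map an estimate to allow, downgrade, or reject for the user's tier."""
    if estimated_usd > tier_threshold_usd:
        return Decision.REJECT
    if estimated_usd >= borderline_fraction * tier_threshold_usd:
        return Decision.DOWNGRADE
    return Decision.ALLOW
```

Estimating the worst case rather than the expected case is deliberately conservative: it is the upper bound that matters for a denial-of-wallet attempt.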
Cost ceiling enforcement
Every tenant has a daily and monthly cost cap (Pillar IV §13). The enforcement has three steps.
- At 80% of the cap. The tenant administrator receives an email alert.
- At 95%. An in-app banner appears and, where possible, AI calls automatically downgrade to a cheaper model.
- At 100%. AI features pause for that tenant. A manual override by a Bizzi operator is required to re-enable them.
This protects Bizzi. It also protects the customer. If an attacker compromises a customer account and tries to burn through their budget, the ceiling stops the bleed for both sides at once.
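The three thresholds can be expressed as a small state function. This is a minimal sketch under stated assumptions: the `Stage` names and function signatures are hypothetical, and taking the more severe of the daily and monthly stages is one plausible way to combine the two caps, not a confirmed rule.

```python
from enum import IntEnum


class Stage(IntEnum):
    NORMAL = 0
    EMAIL_ALERT = 1           # 80%: email the tenant administrator
    BANNER_AND_DOWNGRADE = 2  # 95%: in-app banner, cheaper model where possible
    PAUSED = 3                # 100%: AI features pause until an operator overrides


def stage_for(spent_usd, cap_usd):
    """Return the enforcement stage for one cap (daily or monthly)."""
    if cap_usd <= 0:
        raise ValueError("cap_usd must be positive")
    used = spent_usd / cap_usd
    if used >= 1.0:
        return Stage.PAUSED
    if used >= 0.95:
        return Stage.BANNER_AND_DOWNGRADE
    if used >= 0.80:
        return Stage.EMAIL_ALERT
    return Stage.NORMAL


def tenant_stage(daily_spent, daily_cap, monthly_spent, monthly_cap):
    """Assumption: the more severe of the two caps wins."""
    return max(stage_for(daily_spent, daily_cap),
               stage_for(monthly_spent, monthly_cap))
```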
Detection
Anomaly patterns are detected in real time and alert DevSecOps with full context: tenant, IP, feature, and projected cost. The patterns we monitor for include:
- Bursts of requests from a single IP.
- Sudden shifts in a tenant’s request pattern.
- Token usage spikes concentrated on a single feature.
- Long-running queries with high wall-clock time.
- Geographic anomalies. Requests from regions the tenant has never used before.
The alert pipeline targets Slack so the on-call engineer investigates in the same channel where the rest of the incident lives.
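To make the first pattern concrete, the sketch below detects a burst from a single IP with a sliding window and formats an alert carrying the four context fields. The thresholds, the `BurstDetector` class, and the message layout are hypothetical; delivery to the alert channel is out of scope here.

```python
from collections import defaultdict, deque


class BurstDetector:
    """Flags an IP that exceeds max_requests within a sliding time window."""

    def __init__(self, max_requests=50, window_seconds=10.0):
        # Both thresholds are illustrative defaults, not production values.
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.seen = defaultdict(deque)  # ip -> timestamps inside the window

    def observe(self, ts, ip):
        """Record one request; return True if this IP is now bursting."""
        window = self.seen[ip]
        window.append(ts)
        # Drop timestamps that have aged out of the window.
        while window and window[0] <= ts - self.window_seconds:
            window.popleft()
        return len(window) > self.max_requests


def format_alert(tenant, ip, feature, projected_cost_usd):
    """Build the alert text with the context the on-call engineer needs."""
    return (
        f"Possible denial of wallet: tenant={tenant} ip={ip} "
        f"feature={feature} projected_cost=${projected_cost_usd:.2f}"
    )
```

The other patterns in the list (tenant-level shifts, per-feature token spikes, long-running queries, new regions) would need their own detectors; this sketch covers only per-IP bursts.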
Response playbook
When a denial-of-wallet attack is suspected, the response is fast and contained.
- Investigate. Is this an attack or a legitimate usage spike?
- If attack. Block the source IP, tighten the tenant’s rate limit aggressively, revoke any compromised credentials.
- If legitimate. Raise limits temporarily, capacity-plan, and notify the customer success owner.
- Document. Post-incident review for any SEV1 or SEV2 case, per §12.