Pillar V: AI Security · § 10

Rate limiting and denial of wallet (LLM04)

LLM04 is the resource exhaustion risk. Classical denial of service tries to take a service down. The LLM-specific variant, denial of wallet, does not need the service to go down at all. It only needs Bizzi (or one of our customers) to pay an enormous vendor bill for traffic with no legitimate business value. The defense has the same shape as classical rate limiting, with one new layer: cost.

The threat has two faces, and our control surface addresses both at once.

  • Denial of service. Concurrent expensive requests starve legitimate users of capacity.
  • Denial of wallet. Token-priced API calls accumulate a vendor bill that Bizzi or the customer pays. The attacker does not need the service to fall over. They only need our finance team to receive the invoice.

Five layers operate together. No single layer is sufficient, because each one catches a different abuse shape.

Layer          Default limit                                              What it catches
-------------  ---------------------------------------------------------  -------------------------------------
Per-IP         100 req/min unauthenticated; 1,000 req/min authenticated   Bot floods, credential stuffing
Per-user       60 LLM calls/hour                                          Compromised account, internal misuse
Per-tenant     Plan-dependent; default 10,000 LLM calls/day               Tenant-wide attack, runaway integration
Per-feature    Each feature has its own quota                             Noisy feature starving critical paths
Cost ceiling   Daily dollar cap, plan-dependent                           The denial-of-wallet hard stop
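How the count-based layers compose can be sketched as follows. This is a minimal illustration, not the gateway's actual implementation: the `FixedWindowLimiter` class, the `first_rejecting_layer` helper, and the fixed-window algorithm are all assumptions made for the example, and the sketch treats every request as an LLM call. Only the numeric limits come from the table above. The per-feature quota and the cost ceiling are left out.

```python
class FixedWindowLimiter:
    """Counts events per key in fixed windows of `window_s` seconds (hypothetical helper)."""

    def __init__(self, limit: int, window_s: int) -> None:
        self.limit = limit
        self.window_s = window_s
        self._counts: dict[tuple[str, int], int] = {}

    def _bucket(self, key: str, now: float) -> tuple[str, int]:
        # All timestamps inside the same window map to the same bucket.
        return (key, int(now // self.window_s))

    def has_room(self, key: str, now: float) -> bool:
        return self._counts.get(self._bucket(key, now), 0) < self.limit

    def record(self, key: str, now: float) -> None:
        bucket = self._bucket(key, now)
        self._counts[bucket] = self._counts.get(bucket, 0) + 1


def build_limiters() -> dict[str, FixedWindowLimiter]:
    # Default limits from the table; an unauthenticated caller is assumed for per-IP.
    return {
        "per-ip": FixedWindowLimiter(limit=100, window_s=60),
        "per-user": FixedWindowLimiter(limit=60, window_s=3600),
        "per-tenant": FixedWindowLimiter(limit=10_000, window_s=86_400),
    }


def first_rejecting_layer(limiters, keys, now):
    """Return the name of the first layer that rejects, or None if all layers allow.

    Counters are only incremented when every layer has room, so a request
    rejected by one layer does not consume quota in the others.
    """
    for name, limiter in limiters.items():
        if not limiter.has_room(keys[name], now):
            return name
    for name, limiter in limiters.items():
        limiter.record(keys[name], now)
    return None
```

Checking all layers before recording anything is a deliberate choice in this sketch: it keeps a blocked user from eating into the tenant-wide quota. A production version would also need to expire old buckets and share counters across gateway instances.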

Static limits are a first approximation. The production limits adapt to three signals: the tenant’s historical traffic pattern, the time of day (peak vs off-hours), and detected anomalies in burst shape. A tenant that suddenly emits ten times its baseline at 3 a.m. local time gets throttled before it crosses the static cap.
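One way to picture an adaptive limit is as a baseline-relative ceiling that never exceeds the static cap. The sketch below is an assumption-laden illustration: the function name, the burst multiplier, the off-hours window, and the off-hours factor are all hypothetical values, and burst-shape anomaly detection is not modeled.

```python
def adaptive_limit(
    static_cap: int,
    baseline_per_hour: float,
    hour_local: int,
    burst_multiplier: float = 5.0,   # hypothetical: allowed headroom over baseline
    off_hours_factor: float = 0.5,   # hypothetical: tighter limit off-hours
    off_hours: range = range(0, 6),  # hypothetical: 00:00-05:59 tenant-local time
) -> int:
    """Return the effective hourly limit for a tenant."""
    limit = baseline_per_hour * burst_multiplier
    if hour_local in off_hours:
        limit *= off_hours_factor
    # The adaptive limit can only tighten the static cap, never loosen it.
    return min(static_cap, int(limit))


# A tenant with a baseline of 40 calls/hour and a static cap of 1,000 calls/hour:
# at 3 a.m. the effective limit is 100, so a 10x burst (400 calls) is throttled
# well before the static cap is reached.
limit_3am = adaptive_limit(static_cap=1000, baseline_per_hour=40, hour_local=3)
```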

A single request can be cheap by count but ruinous by cost. A prompt with max_tokens=8K and deep agent recursion is one example. The gateway therefore estimates the token cost of each call before it is sent to the vendor. If the estimate exceeds the threshold for the user’s tier, the call is rejected. If it is borderline, the call is throttled or downgraded to a cheaper model.
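A pre-flight estimate of this kind might look like the sketch below. Everything numeric here is an assumption made for illustration: the per-1K-token prices, the four-characters-per-token heuristic, and the tier thresholds are hypothetical, as are the names `ModelPrice`, `estimate_cost`, and `preflight`. The estimate is a worst case, since it assumes every agent step consumes the full max_tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    input_per_1k: float   # USD per 1,000 prompt tokens (hypothetical)
    output_per_1k: float  # USD per 1,000 completion tokens (hypothetical)


LARGE_MODEL = ModelPrice(input_per_1k=0.01, output_per_1k=0.03)


def estimate_cost(prompt_chars: int, max_tokens: int, agent_steps: int,
                  price: ModelPrice) -> float:
    """Worst-case USD cost: every agent step uses the full max_tokens."""
    prompt_tokens = prompt_chars / 4  # rough heuristic, not a real tokenizer
    per_step = (prompt_tokens / 1000) * price.input_per_1k \
        + (max_tokens / 1000) * price.output_per_1k
    return per_step * agent_steps


def preflight(prompt_chars: int, max_tokens: int, agent_steps: int,
              reject_above: float, downgrade_above: float) -> str:
    """Return 'reject', 'downgrade', or 'allow' for a call on the large model."""
    cost = estimate_cost(prompt_chars, max_tokens, agent_steps, LARGE_MODEL)
    if cost > reject_above:
        return "reject"
    if cost > downgrade_above:
        return "downgrade"  # borderline: route to a cheaper model or throttle
    return "allow"
```

With these hypothetical prices, a 4,000-character prompt with max_tokens=8000 costs about $0.25 per step, so a ten-step agent loop is estimated at about $2.50 for what counts as a single request.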

Every tenant has a daily and a monthly cost cap (Pillar IV §13). Enforcement has three steps.

  • At 80% of the cap. The tenant administrator receives an email alert.
  • At 95%. An in-app banner appears and, where possible, AI calls automatically downgrade to a cheaper model.
  • At 100%. AI features pause for that tenant. A manual override by a Bizzi operator is required to re-enable them.
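The three thresholds above map naturally onto a small decision function. The sketch below is illustrative only: the `CapAction` enum and `cap_action` function are hypothetical names, and the manual operator override is not modeled.

```python
from enum import Enum


class CapAction(Enum):
    OK = "ok"
    EMAIL_ALERT = "email_alert"                     # at 80% of the cap
    BANNER_AND_DOWNGRADE = "banner_and_downgrade"   # at 95%
    PAUSE = "pause"                                 # at 100%


def cap_action(spend_usd: float, cap_usd: float) -> CapAction:
    """Return the enforcement step for a tenant's current spend against its cap."""
    if cap_usd <= 0:
        raise ValueError("cap_usd must be positive")
    ratio = spend_usd / cap_usd
    # Check the strictest threshold first so only one action is returned.
    if ratio >= 1.0:
        return CapAction.PAUSE
    if ratio >= 0.95:
        return CapAction.BANNER_AND_DOWNGRADE
    if ratio >= 0.80:
        return CapAction.EMAIL_ALERT
    return CapAction.OK
```

For example, `cap_action(97, 100)` returns `CapAction.BANNER_AND_DOWNGRADE`. A real enforcement path would also need to remember that a tenant is paused until an operator clears it, rather than recomputing the state from spend alone.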

This protects Bizzi. It also protects the customer. If an attacker compromises a customer account and tries to burn through their budget, the ceiling stops the bleed for both sides at once.

Anomaly patterns are detected in real time and alert DevSecOps with full context: tenant, IP, feature, and projected cost. The patterns we monitor for include:

  • Bursts of requests from a single IP.
  • Sudden shifts in a tenant’s request pattern.
  • Token usage spikes concentrated on a single feature.
  • Long-running queries with high wall-clock time.
  • Geographic anomalies. Requests from regions the tenant has never used before.
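Two of these patterns, a per-feature token spike and a never-before-seen region, can be sketched with very simple checks. This is an assumption for illustration rather than the detector in production: the z-score method, the threshold of 3.0, and the names `is_token_spike`, `is_new_region`, and `build_alert` are all hypothetical. The alert dictionary carries the four context fields named above.

```python
from statistics import mean, pstdev


def is_token_spike(history: list[float], current: float,
                   z_threshold: float = 3.0) -> bool:
    """Flag `current` if it sits more than `z_threshold` deviations above the history."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return current > mu  # flat history: any increase stands out
    return (current - mu) / sigma > z_threshold


def is_new_region(region: str, seen_regions: set[str]) -> bool:
    """True when the tenant has never sent traffic from this region before."""
    return region not in seen_regions


def build_alert(tenant: str, ip: str, feature: str,
                projected_cost_usd: float) -> dict:
    """Assemble the context that accompanies an anomaly alert."""
    return {
        "tenant": tenant,
        "ip": ip,
        "feature": feature,
        "projected_cost_usd": round(projected_cost_usd, 2),
    }
```

A z-score against a short history is crude and would misfire on tenants with naturally bursty traffic, which is one reason several independent patterns are monitored rather than one.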

The alert pipeline targets Slack so the on-call engineer investigates in the same channel where the rest of the incident lives.

When a denial-of-wallet attack is suspected, the response is fast and contained.

  1. Investigate. Is this an attack or a legitimate usage spike?
  2. If attack. Block the source IP, tighten the tenant’s rate limit aggressively, revoke any compromised credentials.
  3. If legitimate. Raise limits temporarily, capacity-plan, and notify the customer success owner.
  4. Document. Post-incident review for any SEV1 or SEV2 case, per §12.