Pillar IV: Data, AIOps, Infrastructure · § 13

Agentic state management

A multi-agent workflow without hard caps is one bad plan away from an unbounded API loop draining tokens, blocking resources, and billing the customer for nothing. Hard caps are not a polish item. They are the difference between a multi-agent system in production and a multi-agent system taking down a tenant.

Context

Every team shipping agents will, sooner or later, observe a workflow loop in production. The planner does not converge. Agent A keeps calling agent B, which keeps calling agent A. A tool call keeps failing and being retried. The question is not whether it happens but whether the caps catch it before the cost explodes or the user gives up. We treat state management as a first-class design concern because the consequences scale faster than the team’s reaction time.

The loop modes we have to defend against

Direct recursion. Agent A calls itself.
Indirect recursion. Agent A calls B, B calls C, C calls A.
Plan-refinement loop. The planner keeps re-planning without converging on an executable plan.
Tool retry loop. An agent retries a failing tool call indefinitely.

When any of these happens, costs explode (every loop iteration burns tokens), latency balloons (the user waits for a response that will never come), and downstream resources (memory, connections, vendor rate limits) are exhausted.

The three hard caps

`max_recursion_depth`

The maximum number of agent steps (LLM calls) for a single user query. Default is 10. Tunable per use case.

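class WorkflowTerminationError(Exception):
    """Raised when a workflow hits a hard cap."""

config = {
    "max_recursion_depth": 10,  # default; tunable per use case
    "current_depth": 0,
}

def enter_agent_step(config):
    # Call once before every agent step (LLM call).
    if config["current_depth"] >= config["max_recursion_depth"]:
        raise WorkflowTerminationError("Max depth exceeded")
    config["current_depth"] += 1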

`max_tokens_per_workflow`

Total token budget for a workflow. Default is 50K tokens. Exceeding it terminates the workflow.

This exists because depth alone is not a sufficient cap. One agent returning a massive output blows the budget even at shallow depth.
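A token budget can be tracked with a small counter that is charged after every LLM call. The following is a minimal sketch, not the actual implementation: the class `TokenBudget`, the method `charge`, and the error `TokenBudgetExceeded` are hypothetical names, and it assumes the caller can read prompt and completion token counts from each model response.

```python
from dataclasses import dataclass


class TokenBudgetExceeded(Exception):
    """Hypothetical error raised when the workflow token budget is spent."""


@dataclass
class TokenBudget:
    limit: int = 50_000  # default budget stated in the text
    used: int = 0

    def charge(self, prompt_tokens: int, completion_tokens: int) -> int:
        """Record one LLM call and return the tokens remaining."""
        self.used += prompt_tokens + completion_tokens
        if self.used > self.limit:
            raise TokenBudgetExceeded(
                f"used {self.used} of {self.limit} tokens"
            )
        return self.limit - self.used
```

Because the check runs after each call, a single oversized output is caught on the step that produced it, regardless of how shallow the workflow is at that point.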

`max_wallclock_per_workflow`

Maximum wall-clock duration. Default is 60 seconds for interactive workflows, 300 seconds for async. Exceeding it terminates and returns the best partial result.
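One way to enforce the wall-clock cap is to check a monotonic deadline before each step and hand back whatever has been produced so far. This is a sketch under assumptions: `run_with_deadline` and `DEFAULT_LIMITS` are hypothetical names, steps are modelled as plain callables, and the check happens only between steps, so it does not interrupt a step that is already running.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple

# Defaults from the text, in seconds.
DEFAULT_LIMITS = {"interactive": 60.0, "async": 300.0}


def run_with_deadline(
    steps: Iterable[Callable[[], str]],
    mode: str = "interactive",
    limit_s: Optional[float] = None,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[List[str], bool]:
    """Run steps until they finish or the deadline passes.

    Returns (partial_results, timed_out).
    """
    budget = DEFAULT_LIMITS[mode] if limit_s is None else limit_s
    deadline = clock() + budget
    results: List[str] = []
    for step in steps:
        if clock() >= deadline:
            return results, True  # best partial result so far
        results.append(step())
    return results, False
```

The injectable `clock` parameter exists only to make the deadline testable without sleeping. `time.monotonic` is used as the default because it is not affected by system clock adjustments.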

Break conditions beyond hard caps

The caps are the floor. We layer conditional break conditions on top.

Convergence detection. If the planner outputs the same plan three times in a row, it is looping. Terminate.
No-progress detection. If the workflow state has not changed across N consecutive steps, terminate.
Error threshold. If tool failures exceed a configured threshold, degrade gracefully rather than retrying forever.
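The three break conditions above can be sketched as a small tracker that fingerprints plans and states and counts tool failures. Everything here is illustrative: `BreakConditions` and `fingerprint` are hypothetical names, plans and states are assumed to be JSON-serialisable, and the defaults for N (5) and the failure threshold (3) are placeholders, since the text does not specify them.

```python
import hashlib
import json
from collections import deque


def fingerprint(obj) -> str:
    """Stable hash of a JSON-serialisable plan or state."""
    blob = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


class BreakConditions:
    def __init__(self, plan_repeats=3, no_progress_steps=5, max_tool_failures=3):
        # Keep only as many fingerprints as are needed for each check.
        self.plans = deque(maxlen=plan_repeats)
        self.states = deque(maxlen=no_progress_steps + 1)
        self.max_tool_failures = max_tool_failures
        self.tool_failures = 0

    def observe_plan(self, plan) -> bool:
        """True if the last `plan_repeats` plans are identical."""
        self.plans.append(fingerprint(plan))
        return len(self.plans) == self.plans.maxlen and len(set(self.plans)) == 1

    def observe_state(self, state) -> bool:
        """True if the state is unchanged across `no_progress_steps` steps."""
        self.states.append(fingerprint(state))
        return len(self.states) == self.states.maxlen and len(set(self.states)) == 1

    def observe_tool_failure(self) -> bool:
        """True once the failure count exceeds the threshold."""
        self.tool_failures += 1
        return self.tool_failures > self.max_tool_failures
```

Each `observe_*` method returns `True` when the orchestrator should stop or degrade. Comparing exact fingerprints only catches literal repeats; plans that differ in wording but not in substance would need a looser comparison.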

Cost ceiling per query

A single query does not exceed $X in cost, where X is configured per tenant tier. The ceiling has two thresholds.

Soft (80%). The orchestrator switches to cheaper models for the remaining steps and the agent is informed.
Hard (100%). The workflow terminates and returns the best partial result.

Cost tracking runs through the AI Gateway in real time, so the orchestrator applies these thresholds mid-workflow.
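The two thresholds reduce to a single decision function that the orchestrator could call after every step. This is a minimal sketch: `CostAction` and `cost_action` are hypothetical names, and it assumes the running spend for the query is available as a dollar figure.

```python
from enum import Enum


class CostAction(Enum):
    CONTINUE = "continue"
    DOWNGRADE = "downgrade"  # soft threshold: switch to cheaper models
    TERMINATE = "terminate"  # hard threshold: return best partial result


def cost_action(
    spent_usd: float, ceiling_usd: float, soft_ratio: float = 0.8
) -> CostAction:
    """Map current spend against the per-query ceiling to an action."""
    if ceiling_usd <= 0:
        raise ValueError("ceiling_usd must be positive")
    ratio = spent_usd / ceiling_usd
    if ratio >= 1.0:
        return CostAction.TERMINATE
    if ratio >= soft_ratio:
        return CostAction.DOWNGRADE
    return CostAction.CONTINUE
```

For example, with a ceiling of 1.00, a spend of 0.50 continues, 0.85 downgrades, and 1.00 terminates.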

Cost ceiling per tenant

Each tenant has a daily and monthly cost ceiling.

As consumption approaches the ceiling, the tenant administrator is alerted.
When the ceiling is hit, further requests are rate-limited (Pillar V §10, Denial-of-Wallet defense).

The tenant ceiling protects both Bizzi (against cost overrun) and the tenant (against a Denial-of-Wallet attack where an external actor floods the system with expensive queries against the tenant’s account).
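The tenant-level check differs from the per-query check in that it evaluates two windows at once and acts on whichever is closer to its ceiling. The sketch below is illustrative only: `TenantLimits` and `tenant_decision` are hypothetical names, both ceilings are assumed to be positive, and the 80% alert ratio is an assumption, since the text says only that alerts fire as consumption "approaches" the ceiling.

```python
from dataclasses import dataclass


@dataclass
class TenantLimits:
    daily_usd: float
    monthly_usd: float
    alert_ratio: float = 0.8  # assumed value, not specified in the text


def tenant_decision(
    spent_today: float, spent_month: float, limits: TenantLimits
) -> str:
    """Return 'rate_limit', 'alert', or 'allow' for the next request."""
    daily = spent_today / limits.daily_usd
    monthly = spent_month / limits.monthly_usd
    worst = max(daily, monthly)  # the tighter window decides
    if worst >= 1.0:
        return "rate_limit"
    if worst >= limits.alert_ratio:
        return "alert"
    return "allow"
```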

State persistence

Workflow state is persisted in an in-memory store or an equivalent that lives outside the orchestrator process, so it survives an orchestrator restart. This buys us three things.

Resume on restart. A workflow continues if the orchestrator restarts mid-execution.
Async multi-step workflows. Long-running workflows do not hold an open connection for the full duration.
Post-mortem debugging. When a workflow fails, the state is still there for inspection.

State carries a TTL. It cleans up automatically once the workflow completes or times out.
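The save, load, and expire behaviour can be illustrated with a toy store. This is a sketch of the interface only: `TTLStateStore` is a hypothetical class backed by a process-local dictionary, which would not itself survive a restart, so a real deployment would put the same operations in front of an external store. State is assumed to be JSON-serialisable.

```python
import json
import time
from typing import Callable, Dict, Optional, Tuple


class TTLStateStore:
    """Toy stand-in for the persistence store; not a production store."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        # workflow_id -> (expiry timestamp, serialised state)
        self._data: Dict[str, Tuple[float, str]] = {}

    def save(self, workflow_id: str, state: dict, ttl_s: float) -> None:
        """Store a snapshot of the state and (re)start its TTL."""
        self._data[workflow_id] = (self._clock() + ttl_s, json.dumps(state))

    def load(self, workflow_id: str) -> Optional[dict]:
        """Return the saved state, or None if missing or expired."""
        entry = self._data.get(workflow_id)
        if entry is None:
            return None
        expires_at, blob = entry
        if self._clock() >= expires_at:
            del self._data[workflow_id]  # lazy expiry on read
            return None
        return json.loads(blob)
```

A resumed orchestrator would call `load` with the workflow ID and continue from the returned step. The TTL needs to be long enough to cover the post-mortem window after a failure.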

Monitoring the caps themselves

The observability layer tracks the cap distributions.

Recursion depth distribution. If many queries run close to `max_recursion_depth`, we either raise the cap or fix the planner.
Token spend per workflow type. Anomaly detection catches a single workflow class drifting up.
Latency outliers. The slow tail is investigated and fed back into Stage 1.
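As an illustration of the recursion depth signal, the share of workflows finishing near the cap can be computed from recorded final depths. The function `near_cap_fraction` and its `margin` parameter are hypothetical, and what counts as "close" to the cap is an assumption to be tuned.

```python
from typing import Sequence


def near_cap_fraction(
    depths: Sequence[int], max_depth: int = 10, margin: int = 2
) -> float:
    """Fraction of workflows whose final depth is within `margin` of the cap."""
    if not depths:
        return 0.0
    near = sum(1 for d in depths if d >= max_depth - margin)
    return near / len(depths)
```

A sustained rise in this fraction suggests that either the cap is too tight for the workload or the planner is taking more steps than it should.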