Pillar IV: Data, AIOps, Infrastructure · § 14

Agentic observability

A traditional service maps one request to one response. An agentic workflow fans out into nested LLM calls with shared state, so the answer the user sees may be the product of ten model invocations. Standard observability tooling does not represent this structure. Agentic observability is what we built, and bought, to make the runtime visible.

Context

We started with conventional APM and quickly found that “function call” and “HTTP request” were not the right primitives. An agent’s reasoning is a tree, not a flat sequence, and the cost and latency of an agent invocation depend on choices made inside that tree. Without a tree-structured trace, debugging a wrong AI assistant answer took hours of log archaeology. With it, the same investigation runs in minutes.

Trace structure

A workflow forms one trace. Inside the trace are spans:

Each LLM call is a span.
Each tool call (database, external API) is a span.
Each agent invocation creates a sub-trace.

The trace structure lets us navigate from “user asked a question” down to the specific LLM call that misbehaved, without crawling through unrelated log lines.
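A minimal sketch of that navigation, assuming a simple in-memory tree. The `Span` class, its field names, and the example span names are hypothetical and are not the schema of our tracing backend. The search walks the tree depth-first and returns the path from the root to the first span whose status is not success.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    name: str                 # e.g. "planner", "run_query" (hypothetical names)
    kind: str                 # "llm", "tool", or "agent"
    status: str = "success"   # "success", "error", or "timeout"
    children: List["Span"] = field(default_factory=list)


def path_to_first_failure(span: Span) -> Optional[List[str]]:
    """Return the root-to-span path of the first failing span, searching children first."""
    for child in span.children:
        found = path_to_first_failure(child)
        if found is not None:
            return [span.name] + found
    if span.status != "success":
        return [span.name]
    return None


# One workflow = one trace; the nested agent invocation forms a sub-trace.
trace = Span("user_question", "agent", children=[
    Span("planner", "llm"),
    Span("sql_agent", "agent", children=[
        Span("generate_sql", "llm"),
        Span("run_query", "tool", status="timeout"),
    ]),
    Span("aggregator", "llm"),
])

print(path_to_first_failure(trace))
# ['user_question', 'sql_agent', 'run_query']
```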

What we capture per span

Input. The prompt sent (after PII redaction).
Output. The model response.
Metadata. Model version, temperature, top_p, and any other inference parameter.
Latency. Start time, end time, duration.
Token usage. Input tokens, output tokens, computed cost.
Tags. User, tenant, workflow type, feature flag.
Status. Success, error, timeout.
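The fields above could be modelled roughly as follows. This is a sketch under stated assumptions: the class name, field names, and per-token prices are hypothetical, and real cost computation depends on the model and provider in use.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical USD prices per 1K tokens; not real provider pricing.
PRICE_PER_1K = {"input": 0.001, "output": 0.002}


@dataclass
class SpanRecord:
    input: str                     # prompt sent, after PII redaction
    output: str                    # model response
    metadata: Dict[str, object]    # model version, temperature, top_p, ...
    start_time: float              # seconds since epoch
    end_time: float
    input_tokens: int
    output_tokens: int
    tags: Dict[str, str] = field(default_factory=dict)  # user, tenant, workflow type, feature flag
    status: str = "success"        # "success", "error", or "timeout"

    @property
    def duration(self) -> float:
        """Latency in seconds, derived from start and end time."""
        return self.end_time - self.start_time

    @property
    def cost(self) -> float:
        """Cost computed from token usage and the assumed price table."""
        return (self.input_tokens / 1000 * PRICE_PER_1K["input"]
                + self.output_tokens / 1000 * PRICE_PER_1K["output"])


record = SpanRecord(
    input="Summarize invoices for [EMAIL]",
    output="Three invoices are overdue.",
    metadata={"model_version": "example-model-v1", "temperature": 0.2, "top_p": 0.9},
    start_time=100.0,
    end_time=101.5,
    input_tokens=2000,
    output_tokens=500,
    tags={"tenant": "tenant-a", "workflow_type": "invoice_qa"},
)
print(record.duration, round(record.cost, 6))
```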

Prompt registry

The observability layer also acts as the prompt registry.

Every prompt has a version number, and a published version is never edited in place.
A/B tests serve different versions to different cohorts.
Production deployments reference a specific version, never “latest”.

This is what makes Stage 1 design (§6) auditable and Stage 4 rollback (§9) one click instead of a redeploy.
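A minimal sketch of these registry rules, assuming an in-memory store. `PromptRegistry`, its methods, and the prompt texts are hypothetical names chosen for illustration. Publishing always creates a new version, lookups require an explicit integer version, and rollback is a change to the pinned version rather than a redeploy.

```python
from typing import Dict, Tuple


class PromptRegistry:
    def __init__(self) -> None:
        self._prompts: Dict[Tuple[str, int], str] = {}
        self._latest: Dict[str, int] = {}

    def publish(self, name: str, text: str) -> int:
        """Store text as a new version; existing versions are never overwritten."""
        version = self._latest.get(name, 0) + 1
        self._prompts[(name, version)] = text
        self._latest[name] = version
        return version

    def get(self, name: str, version: int) -> str:
        """Require an explicit integer version; there is no 'latest' lookup."""
        if isinstance(version, bool) or not isinstance(version, int):
            raise TypeError("version must be an explicit integer, not 'latest'")
        return self._prompts[(name, version)]


registry = PromptRegistry()
v1 = registry.publish("planner", "Plan the steps for: {question}")
v2 = registry.publish("planner", "List the steps needed to answer: {question}")

# A deployment pins a version; rollback is a pointer change, not a redeploy.
deployed = {"planner": v2}
deployed["planner"] = v1
print(registry.get("planner", deployed["planner"]))
```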

How we use it

Debug live issues

When a user reports “the AI assistant gave me the wrong answer,” DevSecOps will:

Find the trace by user ID and timestamp.
View the entire workflow: planner, SQL, aggregator, guardrail.
Identify which span misbehaved.
Inspect prompts, model version, and intermediate outputs.

Time from user report to root cause: minutes, not hours.
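The first steps of that procedure might look like the sketch below. The record layout, the `find_traces` and `misbehaving_spans` helpers, and the 30-minute window are all assumptions made for illustration; a real trace store would be queried rather than scanned. Status only surfaces hard failures, so a span that succeeded but produced a wrong output still needs its prompt and output inspected by hand.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Hypothetical flattened trace records.
TRACES: List[Dict] = [
    {"trace_id": "t1", "user_id": "u42", "started_at": datetime(2024, 5, 1, 9, 0),
     "spans": [("planner", "success"), ("sql", "success"),
               ("aggregator", "error"), ("guardrail", "success")]},
    {"trace_id": "t2", "user_id": "u42", "started_at": datetime(2024, 5, 1, 15, 0),
     "spans": [("planner", "success")]},
]


def find_traces(user_id, reported_at, window=timedelta(minutes=30)):
    """Return traces for the user that started within `window` of the report time."""
    return [t for t in TRACES
            if t["user_id"] == user_id
            and abs(t["started_at"] - reported_at) <= window]


def misbehaving_spans(trace):
    """Names of spans whose status is not 'success', in workflow order."""
    return [name for name, status in trace["spans"] if status != "success"]


hits = find_traces("u42", datetime(2024, 5, 1, 9, 10))
for t in hits:
    print(t["trace_id"], misbehaving_spans(t))
```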

Performance optimization

Token-usage tracking exposes:

Expensive workflows due for a redesign.
Prompts longer than they need to be.
Steps where a smaller model would do equally well.

Cost per Transaction is one of the KPIs Bizzi publishes. Bringing it down over time is the visible output of observability-driven optimization.
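As an illustration of how token usage can be rolled up to find expensive workflows, the sketch below sums span costs per workflow type. The workflow names, model tiers, token counts, and prices are hypothetical, and this is not the published Cost per Transaction calculation.

```python
from collections import defaultdict

# Hypothetical span rows: (workflow type, model tier, input tokens, output tokens).
SPANS = [
    ("invoice_qa", "large", 3000, 400),
    ("invoice_qa", "large", 2500, 300),
    ("expense_summary", "small", 800, 200),
]

# Hypothetical USD prices per 1K tokens as (input, output).
PRICES = {"large": (0.010, 0.030), "small": (0.001, 0.002)}


def cost_by_workflow(spans):
    """Sum the token cost of every span, grouped by workflow type."""
    totals = defaultdict(float)
    for workflow, model, tokens_in, tokens_out in spans:
        price_in, price_out = PRICES[model]
        totals[workflow] += tokens_in / 1000 * price_in + tokens_out / 1000 * price_out
    return dict(totals)


costs = cost_by_workflow(SPANS)
# Most expensive workflows first: these are the redesign candidates.
ranked = sorted(costs, key=costs.get, reverse=True)
print(ranked)
```

Grouping the same rows by prompt version or by model tier instead of workflow type would surface the other two cases in the list: overlong prompts and steps where a smaller model might suffice.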

Compliance evidence

Stored reasoning traces are evidence for:

The audit trail (Pillar III §11).
DSAR fulfillment. When a data subject asks what an AI decided about them, we have the trace.
Customer Explanation Requests.
Internal red team analysis.

Quality monitoring

LLM-as-a-Judge runs on a 1% sample of production traffic in real time, scoring against Accuracy, Groundedness, and Safety. Aggregate scores track quality drift. Outlier traces are flagged for manual review. This closes the loop between Stage 2 evaluation (which scores prompt versions in CI) and Stage 5 monitoring (which scores live behavior).
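One way to draw a stable 1% sample and flag outliers is sketched below. Hashing the trace ID is an assumption about how sampling could be done, not a description of the production sampler, and the judge itself is not shown: `scores` stands in for its output, with hypothetical values in the range 0 to 1 and a hypothetical threshold.

```python
import hashlib


def in_sample(trace_id: str, rate: float = 0.01) -> bool:
    """Deterministically select roughly `rate` of traces by hashing the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # value in [0, 1)
    return bucket < rate


def flag_outliers(scores, threshold=0.5):
    """Return trace IDs whose lowest judge score falls below `threshold`."""
    return [trace_id for trace_id, s in scores.items()
            if min(s["accuracy"], s["groundedness"], s["safety"]) < threshold]


scores = {
    "t1": {"accuracy": 0.9, "groundedness": 0.8, "safety": 1.0},
    "t2": {"accuracy": 0.9, "groundedness": 0.3, "safety": 1.0},
}
print(flag_outliers(scores))
```

Because the decision depends only on the trace ID, the same trace is always in or out of the sample, which keeps re-scoring and manual review reproducible.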

Privacy

Reasoning traces contain sensitive content. We apply:

PII redaction before logging. The trace stores the redacted prompt, not the raw one.
Tenant isolation. Observability data is partitioned by tenant. There is no cross-tenant query path.
Retention. Same as audit trail policies, with cold storage at seven years.
Access control. Engineers access tenant traces only with case-by-case approval, logged.
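The first two controls can be sketched as follows. The regular expressions are illustrative only and would miss many kinds of PII; the `TraceStore` class and its methods are hypothetical. The point shown is ordering and partitioning: redaction happens before the write, and every read is scoped to a single tenant.

```python
import re
from collections import defaultdict

# Illustrative patterns only; production redaction needs far broader coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


class TraceStore:
    """In-memory store partitioned by tenant; every read requires a tenant ID."""

    def __init__(self) -> None:
        self._by_tenant = defaultdict(list)

    def log(self, tenant_id: str, prompt: str) -> None:
        # Only the redacted prompt is stored; the raw prompt is never written.
        self._by_tenant[tenant_id].append(redact(prompt))

    def read(self, tenant_id: str):
        # There is no method that reads across tenants.
        return list(self._by_tenant.get(tenant_id, []))


store = TraceStore()
store.log("tenant-a", "Contact jane.doe@example.com or +84 912 345 678")
print(store.read("tenant-a"))
print(store.read("tenant-b"))
```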

Customer access

Enterprise customers can request three kinds of access: their tenant’s traces through the Explanation API (Pillar III §8), aggregate metrics through the Customer Portal, and audit exports for their own compliance team. The trace layer is not only internal. It is the substrate of the explainability the customer pays for.