Pillar III: Ethics, Transparency, Interpretability · § 11

Traceability and audit trail

The engineering team can trace any past AI output back to the exact model version, training data, and parameters (prompt, temperature) which produced it. Reproducibility is the foundation of trust. A decision which cannot be reproduced cannot be audited. One which cannot be audited cannot be trusted with B2B finance.

Context

When a customer disputes an AI decision six months after it shipped, three questions must be answerable in minutes, not weeks. What model version produced it? What prompt and data was it given? What hyperparameters were in effect? If any of those answers is “we are not sure”, the decision is indefensible, even if the decision was correct. Reproducibility is not a bonus property. It is what turns a model output into governed evidence.

Triple reproducibility

1. Model version

Model identifier and version (for example bizzi-ocr-v3.2.1 or gpt-4o-2024-08-06).
Hash of the model artefact for self-hosted models.
Vendor version ID for commercial LLMs.
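
A minimal sketch of how the artefact hash for a self-hosted model might be computed, assuming SHA-256 over the serialized model file. The function name artefact_sha256 and the chunk size are illustrative choices, not part of the source.

```python
import hashlib
from pathlib import Path


def artefact_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a model artefact file.

    Hypothetical helper: the source only states that a hash is recorded,
    not which algorithm is used.
    """
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large weight files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Storing the digest next to the model identifier lets an auditor confirm that the file served at decision time is byte-for-byte the file under review.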

2. Data and prompt version

System prompt. The exact version from the prompt registry maintained by the observability layer.
User input. Full input after PII redaction.
RAG context. The list of chunks retrieved and the version of each chunk (for example, policy v1.2.3).

3. Hyperparameters

Temperature.
Top-p and top-k sampling parameters.
Max tokens.
Seed, when the model supports reproducible inference.
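
The three groups above could be captured in one record per AI call. The following is a sketch only: the class names, field names, and JSON serialization are assumptions made for illustration, and the real schema may differ.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ModelVersion:
    identifier: str                          # e.g. "bizzi-ocr-v3.2.1"
    artefact_sha256: Optional[str] = None    # self-hosted models
    vendor_version_id: Optional[str] = None  # commercial LLMs


@dataclass(frozen=True)
class PromptAndData:
    system_prompt_version: str               # version from the prompt registry
    redacted_user_input: str                 # full input after PII redaction
    rag_chunks: Tuple[Tuple[str, str], ...]  # (chunk id, chunk version) pairs


@dataclass(frozen=True)
class Hyperparameters:
    temperature: float
    top_p: Optional[float]
    top_k: Optional[int]
    max_tokens: int
    seed: Optional[int]                      # None when the model has no seed


@dataclass(frozen=True)
class AuditRecord:
    transaction_id: str                      # hypothetical field name
    model: ModelVersion
    inputs: PromptAndData
    params: Hyperparameters

    def to_json(self) -> str:
        # Sorted keys give a stable serialization for hashing or diffing.
        return json.dumps(asdict(self), sort_keys=True)


record = AuditRecord(
    transaction_id="tx-0001",
    model=ModelVersion(identifier="gpt-4o-2024-08-06",
                       vendor_version_id="gpt-4o-2024-08-06"),
    inputs=PromptAndData(system_prompt_version="invoice-extract-v7",
                         redacted_user_input="Invoice from [REDACTED]",
                         rag_chunks=(("policy", "v1.2.3"),)),
    params=Hyperparameters(temperature=0.0, top_p=1.0, top_k=None,
                           max_tokens=512, seed=42),
)
```

The dataclasses are frozen so that a record cannot be modified in memory after it is built, in the same spirit as the immutable trail it feeds.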

Storage tiers and retention

The audit trail is stored across three tiers driven by retrieval latency and regulatory retention:

Hot storage. Last 90 days, sub-second retrieval, every field complete.
Warm storage. From 90 days up to 1 year, minutes to retrieve, every field complete.
Cold storage. From 1 year up to 7 years (per the Vietnamese Accounting Law for accounting records), hours to retrieve, compressed and possibly aggregated.
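
A sketch of how a record's tier might be derived from its age, using the 90-day, 1-year, and 7-year boundaries stated above. The names Tier and tier_for are hypothetical, years are approximated as 365 days, and the BEYOND_RETENTION value is an assumption: the source does not say what happens after 7 years.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class Tier(Enum):
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"
    BEYOND_RETENTION = "beyond_retention"  # assumption, not in the source


HOT_WINDOW = timedelta(days=90)
WARM_WINDOW = timedelta(days=365)        # 1 year, ignoring leap days
COLD_WINDOW = timedelta(days=7 * 365)    # 7 years, ignoring leap days


def tier_for(created_at: datetime, now: datetime) -> Tier:
    """Map a record's age to the storage tier it should live in."""
    age = now - created_at
    if age <= HOT_WINDOW:
        return Tier.HOT
    if age <= WARM_WINDOW:
        return Tier.WARM
    if age <= COLD_WINDOW:
        return Tier.COLD
    return Tier.BEYOND_RETENTION


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
recent = tier_for(now - timedelta(days=10), now)    # Tier.HOT
older = tier_for(now - timedelta(days=200), now)    # Tier.WARM
```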

The observability layer

A central observability layer captures everything observable about every AI call:

Every LLM call with input, output, latency, and token usage.
Reasoning traces for multi-agent workflows.
Sampling for LLM-as-a-Judge evaluation.
Prompt versioning. Every prompt change is a new version, not an edit-in-place.
Comparison view for A/B testing.
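
The prompt versioning rule (every change is a new version, never an edit-in-place) can be illustrated with an append-only registry. This is a sketch under assumptions: PromptRegistry, publish, get, and latest are hypothetical names, and the real registry lives in the observability layer rather than in memory.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: int
    text: str


class PromptRegistry:
    """In-memory, append-only registry (illustrative only)."""

    def __init__(self) -> None:
        self._history: Dict[str, List[PromptVersion]] = {}

    def publish(self, name: str, text: str) -> PromptVersion:
        # A change never overwrites: it appends the next version number.
        history = self._history.setdefault(name, [])
        entry = PromptVersion(name=name, version=len(history) + 1, text=text)
        history.append(entry)
        return entry

    def get(self, name: str, version: int) -> PromptVersion:
        # Versions are 1-based; any past version stays retrievable.
        if version < 1:
            raise KeyError(version)
        return self._history[name][version - 1]

    def latest(self, name: str) -> PromptVersion:
        return self._history[name][-1]


registry = PromptRegistry()
registry.publish("invoice-extract", "Extract the invoice total.")
registry.publish("invoice-extract", "Extract the invoice total and currency.")
```

Because version 1 remains retrievable after version 2 is published, an audit record that cites version 1 still resolves to the exact text that was sent.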

API access

Customers query their tenant’s audit trail through the Explanation API (see §8). Internal engineering teams query the observability UI for debugging and post-incident investigation. The two surfaces hit the same store. The difference is scope (tenant vs system) and the level of implementation detail exposed.
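
A sketch of two views over one store, differing only in scope and detail. The entry fields and the set of internal-only fields are assumptions chosen for illustration. The source does not list which implementation details are hidden from tenants.

```python
from typing import Any, Dict, List

Entry = Dict[str, Any]

# Hypothetical: fields treated as implementation detail for tenants.
INTERNAL_ONLY_FIELDS = {"latency_ms", "token_usage", "reasoning_trace"}


def tenant_view(store: List[Entry], tenant_id: str) -> List[Entry]:
    """Entries for one tenant, with internal-only fields stripped."""
    return [
        {k: v for k, v in entry.items() if k not in INTERNAL_ONLY_FIELDS}
        for entry in store
        if entry["tenant_id"] == tenant_id
    ]


def system_view(store: List[Entry]) -> List[Entry]:
    """Every entry with every field, for internal debugging."""
    return [dict(entry) for entry in store]


store = [
    {"tenant_id": "acme", "transaction_id": "tx-1",
     "model": "bizzi-ocr-v3.2.1", "latency_ms": 420, "token_usage": 913},
    {"tenant_id": "globex", "transaction_id": "tx-2",
     "model": "bizzi-ocr-v3.2.1", "latency_ms": 388, "token_usage": 870},
]
acme_entries = tenant_view(store, "acme")
```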

Immutability

The audit trail is immutable:

No user, including administrators, edits or deletes an entry.
Every storage operation is itself logged in a tamper-evident log.
A periodic hash chain verifies integrity end-to-end.
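
A minimal sketch of a hash chain over log entries, assuming SHA-256 and a canonical JSON serialization of each entry. The function names and the all-zero genesis value are illustrative. The source does not specify the chain construction.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # assumed starting value for the first link


def entry_hash(prev_hash: str, payload: Dict[str, Any]) -> str:
    """Hash the previous link's hash together with this entry's payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain: List[Dict[str, Any]], payload: Dict[str, Any]) -> None:
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev_hash": prev,
                  "hash": entry_hash(prev, payload)})


def verify(chain: List[Dict[str, Any]]) -> bool:
    """Recompute every link; any edited or removed entry breaks the chain."""
    prev = GENESIS
    for link in chain:
        if link["prev_hash"] != prev:
            return False
        if link["hash"] != entry_hash(prev, link["payload"]):
            return False
        prev = link["hash"]
    return True


chain: List[Dict[str, Any]] = []
append(chain, {"transaction_id": "tx-1", "op": "write"})
append(chain, {"transaction_id": "tx-2", "op": "write"})
append(chain, {"transaction_id": "tx-1", "op": "read"})
```

Each link commits to everything before it, so changing an early entry invalidates every later hash. A periodic job only needs to run verify and compare the final hash with a value stored elsewhere.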

Erasure requests from data subjects (DSAR) are handled by replacing PII inside the trail with placeholders, never by deleting the entry. The decision provenance survives the erasure.
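
A sketch of placeholder-based redaction, assuming the PII values for the data subject are already known and that a fixed placeholder string is acceptable. The names redact and PLACEHOLDER are hypothetical. How a redaction is reconciled with integrity verification is not specified in the source and is outside this sketch.

```python
from typing import Any, Sequence

PLACEHOLDER = "[REDACTED]"  # assumed placeholder format


def redact(value: Any, pii_values: Sequence[str]) -> Any:
    """Return a copy of value with every PII string replaced.

    Walks nested dicts and lists. Keys, identifiers, and non-string
    values are left as they are, so the decision provenance survives.
    """
    if isinstance(value, str):
        for pii in pii_values:
            if pii:  # skip empty strings, which would match everywhere
                value = value.replace(pii, PLACEHOLDER)
        return value
    if isinstance(value, dict):
        return {key: redact(item, pii_values) for key, item in value.items()}
    if isinstance(value, list):
        return [redact(item, pii_values) for item in value]
    return value


entry = {
    "transaction_id": "tx-1",
    "model": "bizzi-ocr-v3.2.1",
    "input": "Approve expense for Nguyen Van A, card 4111",
    "rag_chunks": [{"id": "policy", "version": "v1.2.3"}],
    "temperature": 0.0,
}
redacted = redact(entry, ["Nguyen Van A", "4111"])
```

The entry keeps its transaction ID, model version, chunk versions, and hyperparameters. Only the personal data inside the free-text input is replaced.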

Multi-agent traces

When several agents collaborate, the audit trail captures each agent invocation separately, every inter-agent message, state transitions in the state graph, and the termination condition (for example, max_recursion_depth). A full reasoning trace for a multi-agent task typically contains 20 to 50 LLM calls. Every call links back to a single parent transaction ID so the entire chain is queryable as one unit.
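
A sketch of how a trace might be reassembled from individual call records that share a parent transaction ID. The field names parent_transaction_id, sequence, and agent are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, List

Call = Dict[str, Any]


def group_by_transaction(calls: List[Call]) -> Dict[str, List[Call]]:
    """Group call records by parent transaction ID, ordered by sequence."""
    traces: Dict[str, List[Call]] = defaultdict(list)
    for call in calls:
        traces[call["parent_transaction_id"]].append(call)
    for trace in traces.values():
        # Restore invocation order inside each trace.
        trace.sort(key=lambda call: call["sequence"])
    return dict(traces)


calls = [
    {"parent_transaction_id": "tx-1", "sequence": 2, "agent": "validator"},
    {"parent_transaction_id": "tx-2", "sequence": 1, "agent": "extractor"},
    {"parent_transaction_id": "tx-1", "sequence": 1, "agent": "extractor"},
    {"parent_transaction_id": "tx-1", "sequence": 3, "agent": "approver"},
]
traces = group_by_transaction(calls)
tx1_agents = [call["agent"] for call in traces["tx-1"]]
```

Querying by the parent ID returns the whole chain in order, which is what makes a 20 to 50 call trace reviewable as a single unit.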