Pillar V: AI Security · § 05

Data catalog and lineage

When a model misbehaves in production, the first question is always the same: what data did this model see? Without a catalog and a lineage graph, that question has no ready answer, and an unanswered question during a security incident turns into weeks of slow investigation. The catalog is the operational tool that makes the answer cheap.

Context

A data catalog and lineage graph serve two audiences. For engineers, they make datasets discoverable for new use cases. For security and compliance, they let you reconstruct what fed a given model output: which dataset version was used, which sources contributed, which transformations were applied, and which labelers touched the samples. The catalog is also where Pillar II’s ROPA evidence and Pillar III’s Data Cards live.

What the catalog records

Each dataset entry carries the metadata an auditor or incident responder needs to answer questions without paging an engineer.

Name and version. Versioning is semantic. Major bumps signal a change of source mix or schema.
Classification. Public, Internal, Confidential, or PII, per Pillar IV §2.
Source provenance. Where the data came from, when, and by whom it was contributed.
Schema and sample rows. Included so engineers can evaluate fit without pulling the full dataset.
Size and distribution statistics. Volume, class balance, language mix.
License. For any data derived from open-source corpora.
Retention policy. Inherited from the data classification it carries.
Data Card. Linked to Pillar III §6 (Transparency in Development).
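
The fields above could be modelled as a single record. The following is a minimal sketch, not the schema of any particular catalog tool. The names DatasetEntry, Classification, is_major_bump and the "retention/..." policy identifiers are hypothetical, chosen only to illustrate how retention can be derived from classification and how a major version bump can be detected.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Classification(Enum):
    """The four classification levels named in the list above."""
    PUBLIC = "Public"
    INTERNAL = "Internal"
    CONFIDENTIAL = "Confidential"
    PII = "PII"


@dataclass(frozen=True)
class DatasetEntry:
    """One catalog entry (hypothetical shape, for illustration only)."""
    name: str
    version: str                     # semantic version, e.g. "v1.5"
    classification: Classification
    source_provenance: str           # where from, when, contributed by whom
    schema: dict[str, str]           # column name -> type
    sample_rows: list[dict]
    stats: dict[str, float]          # volume, class balance, language mix
    license: Optional[str] = None    # set for data derived from open-source corpora
    data_card_ref: Optional[str] = None

    @property
    def retention_policy(self) -> str:
        # Retention is inherited from the classification, never set by hand.
        # The identifier format is an assumption, not a real policy name.
        return f"retention/{self.classification.value.lower()}"


def major_of(version: str) -> int:
    """Extract the major component from a version such as 'v1.5'."""
    return int(version.lstrip("v").split(".")[0])


def is_major_bump(old: str, new: str) -> bool:
    """True when the change signals a new source mix or schema."""
    return major_of(new) > major_of(old)
```

Deriving retention_policy as a property rather than storing it keeps the entry from drifting out of step with its classification.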

Lineage as a graph

Lineage is not a column on a dataset row. It is a graph of relationships:

Raw source -> Cleaning pipeline v2 -> Curated dataset v1.5
                                          |
                                          +-- Augmented v1.5.aug -> Training set v1.6
                                          |
                                          +-- Filtered v1.5.filt -> Eval set v1.6
                                                                          |
                                                                          v
                                                               Model v3.2 (trained on)
                                                                          |
                                                                          v
                                                            Production deployment Q2 2026

The value of the graph is reverse traversal. From a suspect production output, you walk back to the model version that produced it, to the training set, to the curated dataset, to the raw source. You stop where the anomaly first appears. That walk turns an open-ended incident into a contained one.
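
The reverse walk can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the catalog's actual query API: the edge list is transcribed from the diagram above, the model is assumed to be fed by the training set only, and is_anomalous stands in for whatever check the responder applies at each node. The function names build_parents and find_anomaly_origins are hypothetical.

```python
from collections import deque

# Edges point downstream (producer -> consumer), as in the diagram.
EDGES = [
    ("Raw source", "Cleaning pipeline v2"),
    ("Cleaning pipeline v2", "Curated dataset v1.5"),
    ("Curated dataset v1.5", "Augmented v1.5.aug"),
    ("Curated dataset v1.5", "Filtered v1.5.filt"),
    ("Augmented v1.5.aug", "Training set v1.6"),
    ("Filtered v1.5.filt", "Eval set v1.6"),
    ("Training set v1.6", "Model v3.2"),
    ("Model v3.2", "Production deployment Q2 2026"),
]


def build_parents(edges):
    """Invert the graph: map each node to the nodes directly upstream of it."""
    parents = {}
    for upstream, downstream in edges:
        parents.setdefault(downstream, []).append(upstream)
    return parents


def find_anomaly_origins(start, parents, is_anomalous):
    """Walk upstream from an anomalous node and return where the anomaly
    first appears: anomalous nodes with no anomalous parent."""
    origins = []
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        bad_upstream = [p for p in parents.get(node, []) if is_anomalous(p)]
        if not bad_upstream:
            # Everything upstream is clean (or absent): the walk stops here.
            origins.append(node)
            continue
        for parent in bad_upstream:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return origins
```

If the responder finds the anomaly in the deployment, the model, the training set and the augmented dataset, but not in the curated dataset, the walk returns the augmented dataset as the point of origin and the investigation narrows to the augmentation step.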

Tooling

We do not build catalog software from scratch. Bizzi adopts an open-source catalog (OpenMetadata or equivalent) customised for AI assets. The same tool serves the Data team for analytics provenance and the CoE for AI provenance, so the catalog is not a parallel inventory that drifts from reality.

Lineage for the RAG corpus

Retrieval-augmented features carry their own lineage. Every chunk in the vector store points back to a source document and a document version. When the source document is updated, re-embedding produces a new chunk version, and the old version is retained for audit. Each user query that triggers retrieval is logged with the chunks it retrieved, so a citation in a model response can be walked back to the exact source. This is what makes Grounded Reasoning (Pillar III §10) verifiable rather than merely claimed.
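
A minimal in-memory sketch of that bookkeeping follows. It is not a vector store: embeddings and similarity search are omitted, and the names Chunk, ChunkStore, upsert, log_retrieval and trace_citation are hypothetical. It only illustrates the three properties described above: chunks carry a source document and version, old chunk versions are retained, and each retrieval is logged against the chunk version that was current at query time.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    chunk_version: int
    source_doc_id: str
    source_doc_version: str
    text: str


class ChunkStore:
    def __init__(self):
        self._history = {}        # chunk_id -> list of Chunk, oldest first
        self.retrieval_log = []   # append-only record of what each query saw

    def upsert(self, chunk_id, source_doc_id, source_doc_version, text):
        """Re-embedding adds a new version; earlier versions are never deleted."""
        history = self._history.setdefault(chunk_id, [])
        chunk = Chunk(chunk_id, len(history) + 1,
                      source_doc_id, source_doc_version, text)
        history.append(chunk)
        return chunk

    def current(self, chunk_id):
        return self._history[chunk_id][-1]

    def get(self, chunk_id, chunk_version):
        return self._history[chunk_id][chunk_version - 1]

    def log_retrieval(self, query_id, chunk_ids):
        """Record the exact chunk versions served for a query."""
        entry = {
            "query_id": query_id,
            "at": datetime.now(timezone.utc).isoformat(),
            "chunks": [(cid, self.current(cid).chunk_version) for cid in chunk_ids],
        }
        self.retrieval_log.append(entry)
        return entry

    def trace_citation(self, query_id, chunk_id):
        """Walk a citation back to (source_doc_id, source_doc_version)."""
        for entry in self.retrieval_log:
            if entry["query_id"] != query_id:
                continue
            for cid, version in entry["chunks"]:
                if cid == chunk_id:
                    chunk = self.get(cid, version)
                    return chunk.source_doc_id, chunk.source_doc_version
        raise KeyError(f"no retrieval of {chunk_id!r} logged for query {query_id!r}")
```

Logging the chunk version, not just the chunk id, is what keeps an old citation traceable after the source document has changed.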

Integration with the six-step risk method

Lineage is not a museum exhibit. It feeds the risk method directly. Step 2 (Identify threats) flags datasets with an uncertain source as a risk. Step 3 (Risk scoring) penalises unknown provenance. Step 5 (Red-teaming) plans tests against each source variant. Without lineage, those steps degrade into generalities.
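
As a rough illustration of how catalog metadata could feed those three steps, consider the sketch below. The scoring scale and the penalty value are assumptions made for the example, since the risk method defines its own weights, and the function names and dictionary keys are hypothetical.

```python
# Hypothetical penalty; the real weighting belongs to the six-step risk method.
UNKNOWN_PROVENANCE_PENALTY = 2


def flag_uncertain_sources(datasets):
    """Step 2: list datasets whose source provenance is missing or empty."""
    return [d["name"] for d in datasets if not d.get("source_provenance")]


def risk_score(base_score, dataset):
    """Step 3: add a penalty when provenance is unknown."""
    if not dataset.get("source_provenance"):
        return base_score + UNKNOWN_PROVENANCE_PENALTY
    return base_score


def red_team_targets(datasets):
    """Step 5: one test target per distinct known source."""
    return sorted({d["source_provenance"] for d in datasets
                   if d.get("source_provenance")})
```

Each function reads only fields the catalog already records, which is the point: without those fields there is nothing concrete for the steps to act on.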