Pillar V: AI Security · § 04

Training data poisoning (LLM03)

Training data poisoning is LLM03. It is the attack that hides in production the longest. Once a poisoned sample is baked into the model weights, no amount of output filtering removes it, and the backdoor sits dormant until the attacker triggers it. The only effective defence is upstream: gate what enters the corpus, attest its source, and keep a lineage record so that if something behaves strangely later, you can trace what fed it.

The pattern is straightforward and well-documented. An attacker submits crafted samples to a training set so the model learns an association the operator did not intend. For Bizzi the realistic shapes are:

  • Crafted training invoices. An attacker submits many synthetic invoices with pattern X labelled approved, so the model learns to associate pattern X with approval even when X is a fraud indicator.
  • Adversarial samples in public corpora. OCR fine-tunes that pull from publicly scraped data absorb steganographic samples designed to flip specific classifier outputs.
  • Compromised vendor labeller. A third-party labelling vendor with weak controls becomes an injection vector for adversarial labels.

The consequence is the same in all three cases. A backdoor sleeps in the weights until production conditions activate it.

Every training dataset carries an attestation that names where it came from. Four source classes are accepted, and no others.

  • Internal data. Anonymised customer data with consent. Full audit trail, owner named.
  • Public corpora. Only from reputable sources (Common Crawl, Wikipedia, Vietnam government open data). License verified against the OSS matrix.
  • Synthetic data. The generation pipeline is reviewed and monitored. The pipeline’s outputs are sampled.
  • Third-party labelled data. Vendor has cleared security review. Random freelance labelling is not used.

New sources go through security review before they are added to the pipeline. The matrix is owned by the CoE Data Lead.
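The attestation rule can be sketched as a closed enumeration plus a parser that refuses anything outside it. This is a minimal illustration, not our pipeline code: the names `SourceClass`, `Attestation`, and `parse_attestation`, and the field names in the input dictionary, are all hypothetical, and requiring a named owner for every class is an assumption (the policy above states it only for internal data).

```python
from dataclasses import dataclass
from enum import Enum


class SourceClass(Enum):
    """The four accepted source classes. Any other value is rejected."""
    INTERNAL = "internal"
    PUBLIC_CORPUS = "public_corpus"
    SYNTHETIC = "synthetic"
    THIRD_PARTY_LABELLED = "third_party_labelled"


@dataclass(frozen=True)
class Attestation:
    dataset_id: str
    source_class: SourceClass
    owner: str               # assumption: every dataset names an owner
    security_reviewed: bool  # the source has cleared security review


def parse_attestation(raw: dict) -> Attestation:
    """Build an Attestation, refusing unknown classes and unreviewed sources."""
    try:
        source_class = SourceClass(raw["source_class"])
    except ValueError:
        raise ValueError(
            f"unaccepted source class: {raw['source_class']!r}"
        ) from None
    if not raw.get("owner"):
        raise ValueError("attestation must name an owner")
    if not raw.get("security_reviewed", False):
        raise ValueError("source has not cleared security review")
    return Attestation(raw["dataset_id"], source_class, raw["owner"], True)
```

Because the enumeration is closed, adding a fifth source class requires a code change, which gives the security review a natural place to happen.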

The phrase is literal. Every sample is reviewed before it joins the training corpus:

  1. Dataset passes automated checks. Size, format, language, schema conformance.
  2. Senior QA labels a sample manually.
  3. A second QA spot-checks for inter-rater agreement.
  4. Outlier detection flags samples far from the distribution median for manual review.
  5. Adversarial pattern detection flags samples matching known attack signatures.
  6. Steward and CoE Lead sign off before the dataset is approved.

A dataset containing even one unresolved suspicious sample does not pass. This is not a guideline. It is a gate.
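Steps 4 and 6 can be illustrated with a small sketch. It assumes each sample has been reduced to a single numeric feature (for example, an invoice total) and uses the modified z-score based on the median absolute deviation, which is one common choice rather than necessarily the detector we run. The function names and the 3.5 threshold are illustrative assumptions.

```python
from statistics import median


def flag_outliers(values: list[float], threshold: float = 3.5) -> list[int]:
    """Return indices of values far from the median (modified z-score)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Degenerate case: most values are identical, so flag anything that differs.
        return [i for i, v in enumerate(values) if v != med]
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


def dataset_passes(flagged: set[int], resolved: set[int]) -> bool:
    """The gate: every flagged sample must have been resolved by a reviewer."""
    return flagged <= resolved


totals = [100.0, 102.0, 98.0, 101.0, 99.0, 5000.0]
flagged = set(flag_outliers(totals))   # only the 5000.0 entry is flagged
print(dataset_passes(flagged, resolved=set()))    # False: one unresolved sample
print(dataset_passes(flagged, resolved=flagged))  # True: reviewer cleared it
```

The median-based score is used here because a mean-based one can be dragged toward the poisoned samples it is meant to detect.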

When we augment (rotation, noise, paraphrase), the augmentation pipeline runs in a sandboxed environment, outputs are sample-checked, and augmentation rules go through code review. Arbitrary code execution during augmentation is not permitted.
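One way to rule out arbitrary code execution is to express augmentation rules as names that are looked up in an allowlist of reviewed functions, so a rule file can never carry code of its own. The sketch below shows only that idea, on a toy greyscale image represented as nested lists. The operation names, `ALLOWED_OPS`, and `augment` are hypothetical, and sandboxing itself is a separate control that is not shown.

```python
import random
from typing import Callable

Image = list[list[int]]  # toy greyscale image: rows of 0-255 pixel values


def rotate90(img: Image, rng: random.Random) -> Image:
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def add_noise(img: Image, rng: random.Random, amplitude: int = 5) -> Image:
    """Add bounded integer noise to each pixel, clamped to 0-255."""
    return [
        [min(255, max(0, p + rng.randint(-amplitude, amplitude))) for p in row]
        for row in img
    ]


# Allowlist of code-reviewed operations. Rules name operations; they never carry code.
ALLOWED_OPS: dict[str, Callable[[Image, random.Random], Image]] = {
    "rotate90": rotate90,
    "noise": add_noise,
}


def augment(img: Image, rules: list[str], seed: int = 0) -> Image:
    """Apply the named rules in order, rejecting anything not on the allowlist."""
    unknown = [name for name in rules if name not in ALLOWED_OPS]
    if unknown:
        raise ValueError(f"augmentation rules not on the allowlist: {unknown}")
    rng = random.Random(seed)  # seeded so outputs can be reproduced for sample checks
    for name in rules:
        img = ALLOWED_OPS[name](img, rng)
    return img
```

Seeding the generator keeps augmentation reproducible, which makes the sample check on outputs meaningful: a reviewer can regenerate exactly what the pipeline produced.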

Every training dataset version carries a lineage document covering source composition by percentage, collection dates, the labellers involved (named or pseudonymised), the QA verification log, and a SHA-256 hash of the final dataset for tamper detection. When a model behaves unexpectedly in production, lineage is what lets you find which dataset version it was trained on and which source introduced the behaviour. The full lineage architecture is described in §5.
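A lineage document of this shape could be sketched as follows. The field names, the `LineageRecord` class, and the helper functions are illustrative assumptions, not the schema described in §5. Only the use of SHA-256 comes from the text above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Iterable


def dataset_sha256(chunks: Iterable[bytes]) -> str:
    """Hash the final dataset incrementally so large files need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


@dataclass(frozen=True)
class LineageRecord:
    dataset_version: str
    source_composition: dict[str, float]  # percentage per source class
    collection_dates: list[str]
    labellers: list[str]                  # named or pseudonymised
    qa_log_ref: str                       # pointer to the QA verification log
    sha256: str                           # hash of the final dataset

    def __post_init__(self) -> None:
        total = sum(self.source_composition.values())
        if abs(total - 100.0) > 1e-6:
            raise ValueError(f"source composition sums to {total}, expected 100")

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True, indent=2)


def is_untampered(chunks: Iterable[bytes], record: LineageRecord) -> bool:
    """Recompute the dataset hash and compare it with the recorded one."""
    return dataset_sha256(chunks) == record.sha256
```

A hash detects tampering with the dataset only if the lineage record itself is stored where an attacker who can alter the dataset cannot also alter the record.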

Model weights are signed after training, and the loader verifies the signature over the weight hash before the model serves traffic. Periodic regression checks against a fixed set of canary inputs flag drift that may indicate unauthorised modification.
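Both checks can be sketched with the standard library. Note the substitution: the sketch uses an HMAC-SHA256 tag as a stand-in for a signature, because it needs no third-party package. A real deployment would more likely use an asymmetric signature so that the loader holds only a public key. The function names and the shape of the canary set are assumptions.

```python
import hashlib
import hmac
from typing import Callable


def sign_weights(weights: bytes, key: bytes) -> str:
    """Produce an HMAC-SHA256 tag over the serialised weights (signature stand-in)."""
    return hmac.new(key, weights, hashlib.sha256).hexdigest()


def verify_weights(weights: bytes, key: bytes, expected_tag: str) -> bool:
    """Loader-side check, run before the model is allowed to serve traffic."""
    return hmac.compare_digest(sign_weights(weights, key), expected_tag)


def canary_drift(model: Callable[[str], str], canaries: dict[str, str]) -> list[str]:
    """Return the canary inputs whose output no longer matches the recorded one."""
    return [inp for inp, expected in canaries.items() if model(inp) != expected]


key = b"demo-key-not-for-production"
weights = b"\x00\x01\x02\x03"
tag = sign_weights(weights, key)

print(verify_weights(weights, key, tag))            # True
print(verify_weights(weights + b"\xff", key, tag))  # False: weights were modified
```

The canary check complements the signature check: the signature catches modified files at load time, while canaries catch behavioural change whatever its cause.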

For commercial vendor LLMs we do not control the training pipeline. Three mitigations apply. The vendor’s data hygiene practices are assessed as part of vendor risk (Pillar II §6). Output is spot-checked for anomalous behaviour via LLM-as-a-Judge sampling. The architecture preserves our ability to swap vendors quickly if an issue is found (Pillar IV §6).
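The second and third mitigations can be illustrated together. In this sketch, application code depends only on a small interface (so the vendor behind it can be replaced), and a deterministic fraction of responses is routed to a judge. Everything here is hypothetical: the `LLMClient` protocol, the `judge` callable, and the 5% rate are illustrative assumptions, not our actual sampling policy or any vendor's API.

```python
import hashlib
from typing import Callable, Optional, Protocol


class LLMClient(Protocol):
    """Vendor-agnostic interface: swapping vendors means swapping one adapter."""
    def complete(self, prompt: str) -> str: ...


def selected_for_review(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests by hashing the request ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)


def answer(
    client: LLMClient,
    prompt: str,
    request_id: str,
    judge: Callable[[str, str], bool],
    rate: float = 0.05,
) -> tuple[str, Optional[bool]]:
    """Return the output and, for sampled requests, the judge's verdict."""
    output = client.complete(prompt)
    verdict = judge(prompt, output) if selected_for_review(request_id, rate) else None
    return output, verdict
```

Hashing the request ID rather than drawing a random number makes the selection reproducible, so an auditor can confirm after the fact which responses should have been judged.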