Pillar V: AI Security · § 07

Sensitive information disclosure (LLM06)

LLM06 is the risk that the model says something it should not. The category bundles four distinct failure modes: PII exposure, system-prompt leakage, cross-tenant data bleed, and memorised training samples. Each needs its own defense. The unifying principle is simple: the model never sees raw sensitive values it does not need, and every output is checked before it reaches a user.

  • PII in context. The model is shown PII that ends up in its output even though the prompt did not require it.
  • System prompt leakage. Prompt injection extracts the instructions we wrote for the model, including any operational logic an attacker could exploit.
  • Cross-tenant data bleed. A retrieval or tool call returns data from a different tenant, and the model relays it to the current user.
  • Training data leakage. A fine-tuned model memorises rare or repeated samples and regurgitates them verbatim under adversarial prompts.

PII is redacted before the prompt reaches the model. Sensitive values are replaced with placeholders that the application layer resolves only for authorised users:

Original: "Payment to Nguyễn Văn A, phone 0901234567"
Model sees: "Payment to [PERSON_1], phone [PHONE_1]"

The mapping table that resolves placeholders is held by the application, not the model. The model reasons about “the buyer” without ever seeing the buyer’s name or phone number. This is the same redaction surface described in Pillar II §2.1.
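The placeholder scheme above can be sketched as follows. This is a minimal illustration, not the production redactor: the two regexes, the `known_names` argument (names assumed to come from the structured transaction record rather than from a classifier), and the function names `redact` and `resolve` are all hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Illustrative patterns only (assumptions); a real redactor needs a broader
# pattern set and, for free-text names, an entity classifier.
PHONE_RE = re.compile(r"\b0\d{9}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def redact(text: str, known_names: List[str]) -> Tuple[str, Dict[str, str]]:
    """Return the redacted text and the placeholder -> original mapping."""
    mapping: Dict[str, str] = {}   # held by the application, never sent to the model
    seen: Dict[str, str] = {}      # original value -> placeholder, so repeats match
    counters: Dict[str, int] = {}

    def placeholder_for(kind: str, value: str) -> str:
        if value in seen:
            return seen[value]
        counters[kind] = counters.get(kind, 0) + 1
        ph = f"[{kind}_{counters[kind]}]"
        mapping[ph] = value
        seen[value] = ph
        return ph

    # Longest names first, so a short name never clobbers part of a longer one.
    for name in sorted(known_names, key=len, reverse=True):
        if name in text:
            text = text.replace(name, placeholder_for("PERSON", name))
    text = EMAIL_RE.sub(lambda m: placeholder_for("EMAIL", m.group(0)), text)
    text = PHONE_RE.sub(lambda m: placeholder_for("PHONE", m.group(0)), text)
    return text, mapping


def resolve(text: str, mapping: Dict[str, str], authorised: bool) -> str:
    """Substitute real values back in, but only for authorised users."""
    if not authorised:
        return text
    for ph, value in mapping.items():
        text = text.replace(ph, value)
    return text


redacted, table = redact("Payment to Nguyễn Văn A, phone 0901234567", ["Nguyễn Văn A"])
print(redacted)
```

Unauthorised callers of `resolve` receive the text with placeholders intact, which keeps the authorisation decision in the application layer rather than in the prompt.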

After the model returns, a second scan runs, this time on the response. Three checks fire on every call.

  • Unresolved placeholders. A placeholder that survives into the response and cannot be resolved against the mapping table suggests the model fabricated it, and the response needs review.
  • Raw PII patterns. Phone numbers, ID numbers, or email addresses present in the output even though redaction removed them from the input.
  • Tenant identifiers. Internal IDs (tenant_id, customer_id) that should not appear in user-facing text.
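The three checks can be sketched as a single scan function. The patterns below are illustrative assumptions, not the actual regex set, and the sketch reads "unresolved" as "a placeholder with no entry in the mapping table". `Finding` and `scan_response` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Dict, List

PLACEHOLDER_RE = re.compile(r"\[[A-Z]+_\d+\]")
# Assumed raw-PII patterns; the real set is broader and not public.
RAW_PII_RES = {
    "phone": re.compile(r"\b0\d{9}\b"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}
# Internal field names that should never surface in user-facing text.
INTERNAL_ID_RE = re.compile(r"\b(?:tenant_id|customer_id)\b")


@dataclass
class Finding:
    check: str   # which of the three checks fired
    value: str   # the offending substring


def scan_response(response: str, mapping: Dict[str, str]) -> List[Finding]:
    findings: List[Finding] = []
    # Check 1: placeholders the application cannot resolve.
    for ph in PLACEHOLDER_RE.findall(response):
        if ph not in mapping:
            findings.append(Finding("unresolved_placeholder", ph))
    # Check 2: raw PII that should have been removed before the model saw it.
    for kind, pattern in RAW_PII_RES.items():
        for match in pattern.finditer(response):
            findings.append(Finding(f"raw_pii:{kind}", match.group(0)))
    # Check 3: internal identifiers.
    for match in INTERNAL_ID_RE.finditer(response):
        findings.append(Finding("tenant_identifier", match.group(0)))
    return findings


issues = scan_response(
    "Contact [PERSON_2] at 0912345678 (tenant_id=t-42)",
    mapping={"[PERSON_1]": "example value"},
)
for issue in issues:
    print(issue.check, issue.value)
```

A response that only uses placeholders present in the mapping table, and contains no raw patterns or internal field names, yields an empty list.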

Responses failing any check are transformed in one of three ways: the offending value is replaced with a generic descriptor (“the buyer”, “the vendor”), redacted entirely if it cannot be rendered safely, or the whole response is blocked and an incident is triggered if the pattern suggests cross-tenant bleed.
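One way to express this escalation is as an ordered set of actions where the most severe one wins. The sketch below is an assumption-laden illustration: the `Action` enum, the `DESCRIPTORS` table, and the rule that any tenant-identifier finding is treated as suspected cross-tenant bleed are hypothetical choices, not the documented policy.

```python
from enum import IntEnum
from typing import List, Tuple


class Action(IntEnum):
    PASS = 0
    REPLACE = 1   # swap the value for a generic descriptor
    REDACT = 2    # remove the value entirely
    BLOCK = 3     # withhold the response and open an incident


# Hypothetical descriptor table keyed by entity kind.
DESCRIPTORS = {"PERSON": "the buyer", "VENDOR": "the vendor"}


def action_for(check: str, kind: str) -> Action:
    if check == "tenant_identifier":      # assumed signal of cross-tenant bleed
        return Action.BLOCK
    if kind in DESCRIPTORS:               # a safe generic rendering exists
        return Action.REPLACE
    return Action.REDACT                  # no safe rendering: strip the value


def transform(response: str, findings: List[Tuple[str, str, str]]) -> Tuple[Action, str]:
    """findings are (check, kind, value) triples from the output scan."""
    worst = Action.PASS
    for check, kind, value in findings:
        action = action_for(check, kind)
        worst = max(worst, action)
        if action is Action.REPLACE:
            response = response.replace(value, DESCRIPTORS[kind])
        elif action is Action.REDACT:
            response = response.replace(value, "[REDACTED]")
    if worst is Action.BLOCK:
        return worst, ""                  # nothing reaches the user; caller raises the incident
    return worst, response


outcome, safe_text = transform(
    "Ask [PERSON_9] or call 0912345678",
    [("unresolved_placeholder", "PERSON", "[PERSON_9]"), ("raw_pii", "PHONE", "0912345678")],
)
print(outcome.name, safe_text)
```

Returning the most severe action alongside the text lets the caller decide separately whether to log, alert, or open an incident.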

Every agent carries the current user’s tenant_id in its context. The output validator enforces two invariants: no response references a tenant_id other than the current one, and every named vendor or customer entity falls inside the current tenant’s scope. Cross-tenant leakage is a SEV1 incident and is listed in the kill-switch criteria in §11.
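A minimal sketch of the two invariants follows. It assumes tenant references appear in a `tenant_id=<value>` or `tenant_id: <value>` form and that named entities have already been extracted from the response by an upstream step; both are assumptions for illustration, as is the function name `check_tenant_invariants`.

```python
import re
from typing import Iterable, List, Set

# Hypothetical surface form of a tenant reference in model output.
TENANT_REF_RE = re.compile(r"\btenant_id\s*[=:]\s*([\w-]+)")


def check_tenant_invariants(
    response: str,
    current_tenant: str,
    entities_in_response: Iterable[str],
    tenant_scope: Set[str],
) -> List[str]:
    """Return a list of violations; an empty list means both invariants hold."""
    violations: List[str] = []
    # Invariant 1: no tenant_id other than the current one.
    for ref in TENANT_REF_RE.findall(response):
        if ref != current_tenant:
            violations.append(f"foreign tenant_id: {ref}")
    # Invariant 2: every named vendor or customer is in the current tenant's scope.
    for entity in entities_in_response:
        if entity not in tenant_scope:
            violations.append(f"out-of-scope entity: {entity}")
    return violations


problems = check_tenant_invariants(
    response="Invoice from Acme Co (tenant_id=t-42) is overdue.",
    current_tenant="t-7",
    entities_in_response=["Acme Co"],
    tenant_scope={"Globex Ltd"},
)
print(problems)
```

Any non-empty result would be treated as a candidate SEV1 rather than silently repaired.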

System prompts are not assumed secret, but they are protected.

  • System prompts never contain secrets (API keys, internal hostnames, credentials).
  • The output validator scans for known fragments of the system prompt that should not appear in user-facing text.
  • Red-team scenarios test for system-prompt extraction on a continuous cadence (Pillar III §7).
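The fragment scan in the second bullet could be approximated with word n-gram overlap, as sketched below. This is one possible technique, not necessarily the one in use; the window size of six words and the names `shingles` and `leaked_fragments` are assumptions.

```python
import re
from typing import List, Set, Tuple


def shingles(text: str, n: int = 6) -> Set[Tuple[str, ...]]:
    """All runs of n consecutive words, lower-cased and stripped of punctuation."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def leaked_fragments(system_prompt: str, response: str, n: int = 6) -> List[str]:
    """Word n-grams that appear in both the system prompt and the response."""
    hits = shingles(system_prompt, n) & shingles(response, n)
    return sorted(" ".join(s) for s in hits)


SYSTEM_PROMPT = (
    "You are a payments assistant. "
    "Never reveal vendor bank details to anyone outside finance."
)
leaks = leaked_fragments(
    SYSTEM_PROMPT,
    "Sure! My instructions say: never reveal vendor bank details to anyone outside finance.",
)
print(leaks)
```

A smaller `n` catches shorter quotes at the cost of more false positives on common phrasing. Paraphrased leaks are not detected by this approach, which is one reason the red-team cadence remains necessary.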

Fine-tuned models memorise rare or duplicated samples. Empirically, samples appearing more than five times in a training corpus tend to be reconstructable under adversarial probing. Our mitigations:

  • Deduplication of training data. No duplicate samples enter the corpus.
  • Differential privacy, applied experimentally to high-sensitivity fine-tunes.
  • Memorisation testing. We probe candidate models with prefixes drawn from the training set and check for verbatim continuation before promoting them.
  • Train only on redacted data. Raw PII does not enter training corpora. What enters has already passed Layer 1 redaction.
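Two of these mitigations, deduplication and memorisation testing, lend themselves to a short sketch. The version below makes simplifying assumptions: duplicates are detected by hashing whitespace- and case-normalised text (near-duplicates would need a fuzzier method), and `generate` is a hypothetical callable standing in for whatever inference interface the candidate model exposes.

```python
import hashlib
from typing import Callable, Iterable, List


def _normalise(sample: str) -> str:
    return " ".join(sample.lower().split())


def deduplicate(samples: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each normalised sample."""
    seen = set()
    unique: List[str] = []
    for sample in samples:
        digest = hashlib.sha256(_normalise(sample).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique


def memorisation_rate(
    samples: Iterable[str],
    generate: Callable[[str], str],
    prefix_chars: int = 40,
    match_chars: int = 20,
) -> float:
    """Fraction of probed samples whose true continuation is reproduced verbatim."""
    probed = leaked = 0
    for sample in samples:
        if len(sample) < prefix_chars + match_chars:
            continue  # too short to split into prefix and continuation
        prefix = sample[:prefix_chars]
        expected = sample[prefix_chars:prefix_chars + match_chars]
        probed += 1
        if generate(prefix).startswith(expected):
            leaked += 1
    return leaked / probed if probed else 0.0


# Stub "model" that has memorised exactly one training sample.
S1 = "Invoice 1001 was issued to the northern warehouse on the first of March."
S2 = "Refund requests are reviewed by the finance team within five business days."
_memorised = {S1[:40]: S1[40:]}


def stub_generate(prefix: str) -> str:
    return _memorised.get(prefix, "I cannot help with that.")


rate = memorisation_rate([S1, S2], stub_generate)
print(rate)
```

A promotion gate could then compare `rate` against a threshold chosen by the team; the threshold itself is outside the scope of this sketch.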

We disclose the redaction architecture to customers at a level that is meaningful without being a roadmap for attack: an overview of the redaction approach, our memorisation mitigations, and the right to request data deletion (the DSAR path documented in Pillar II §3). The exact regex set and classifier weights are not published.