Pillar IV: Data, AIOps, Infrastructure · § 04

Data quality

A strong model does not compensate for weak input. In AP automation, our input is invoices in every format on the market, from e-invoices issued through the tax authority’s API at one end to blurry phone photos of photocopies at the other. The data quality pipeline is not a nice-to-have. It sits between the customer and the model, and what it lets through decides what the model is judged on.

Context

Two failure modes cost us most. The first is characters that look right but are encoded wrong, which breaks downstream matching against vendor master data. The second is confidence that looks high but reflects pattern memorization rather than real extraction. Both are invisible until a customer’s accountant reconciles against their ERP and the numbers do not tie out. The pipeline exists to catch them before they reach the customer.
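The first failure mode is easy to reproduce. The sketch below uses Python's standard unicodedata module to show two spellings of the same Vietnamese name that render identically, compare unequal, and match again after NFC normalization. The helper name canonical and the choice of NFC as the canonical form are assumptions for illustration, not a description of the production pipeline.

```python
import unicodedata


def canonical(text: str) -> str:
    """Return the NFC (precomposed) form of a string. Hypothetical helper."""
    return unicodedata.normalize("NFC", text)


# The same name twice: once with a single precomposed code point,
# once built from a base letter plus two combining marks.
precomposed = "Nguy\u1ec5n"        # U+1EC5, e with circumflex and tilde
decomposed = "Nguye\u0302\u0303n"  # e + combining circumflex + combining tilde

# Raw comparison fails even though both strings render the same.
looks_equal_but_differs = precomposed != decomposed

# After normalization both sides are the same sequence of code points.
matches_after_nfc = canonical(precomposed) == canonical(decomposed)
```

Normalizing both the extracted text and the vendor master data to the same form before comparison is what makes the match reliable. Normalizing only one side leaves the problem in place.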

The four-stage pipeline

Input validation. Format check (PDF, image type). Size check (small files often indicate blur). Resolution check for images. We reject or warn rather than silently proceed.
OCR and text extraction. The OCR engine extracts text with bounding boxes and per-character confidence scores. Page rotation is detected and corrected. Language detection handles Vietnamese, English, and bilingual invoices.
Cleaning. Remove control characters and encoding artifacts. Normalize whitespace. Normalize Vietnamese diacritics, including vowel marks, tone marks, and precomposed versus combining forms that look identical but break string matching. Standardize number formats (1.000 vs 1,000 vs 1 000) and date formats (DD/MM/YYYY vs MM/DD/YYYY).
Field validation. Tax ID (Mã số thuế) check-digit validation. Line-item-to-total consistency. Date sanity (invoices dated in the future are flagged). Vendor cross-check against tenant master data.
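Two of the field validation rules above can be sketched in a few lines. This is a minimal illustration, not the production validator. The Invoice class, its field names, the rule names, and the 0.01 tolerance are all hypothetical, and line_amounts is assumed to already include any tax lines. The Tax ID check-digit rule and the vendor cross-check are left out because they depend on details not given here.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass
class Invoice:
    """Hypothetical minimal invoice record used only for this sketch."""
    issue_date: date
    line_amounts: list[Decimal]
    stated_total: Decimal


def validate_fields(inv: Invoice, today: date,
                    tolerance: Decimal = Decimal("0.01")) -> list[str]:
    """Return the names of the rules the invoice fails (empty list = pass)."""
    failures = []

    # Line-item-to-total consistency: the lines must sum to the stated total.
    # Decimal avoids the rounding noise that float arithmetic would add.
    line_sum = sum(inv.line_amounts, Decimal("0"))
    if abs(line_sum - inv.stated_total) > tolerance:
        failures.append("line_total_mismatch")

    # Date sanity: an invoice dated after today is flagged, not rejected.
    if inv.issue_date > today:
        failures.append("future_date")

    return failures


today = date(2024, 1, 15)
good = Invoice(date(2024, 1, 10), [Decimal("100000"), Decimal("10000")], Decimal("110000"))
bad = Invoice(date(2024, 2, 1), [Decimal("100000")], Decimal("110000"))

good_result = validate_fields(good, today)
bad_result = validate_fields(bad, today)
```

Returning a list of failed rule names, rather than a single boolean, keeps every failure visible to whoever handles the escalation.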

Quality KPIs we track

OCR error rate per field. The percentage of fields where extraction does not match a human-verified ground truth on our continuous sample.
Validation failure rate. The percentage of invoices failing at least one field validation rule.
Field coverage. The percentage of invoices with every required field populated.
Confidence distribution. The shape of the confidence histogram. A sudden skew is an early signal of drift (§10).

All four run through the observability stack in real time and alert when they cross their thresholds.
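As a rough illustration of how three of these KPIs could be computed over a batch, the sketch below assumes each invoice arrives as a dict with "fields", "failed_rules", and "confidence" keys. That record shape, the REQUIRED_FIELDS names, and the ten-bin histogram are assumptions made for the example. OCR error rate is omitted because it needs the human-verified ground truth sample.

```python
from collections import Counter

# Hypothetical required field names, for illustration only.
REQUIRED_FIELDS = ("tax_id", "invoice_date", "total")


def batch_kpis(invoices: list[dict]) -> dict:
    """Compute validation failure rate, field coverage, and a confidence histogram."""
    n = len(invoices)
    if n == 0:
        raise ValueError("cannot compute KPIs for an empty batch")

    # Invoices failing at least one field validation rule.
    failing = sum(1 for inv in invoices if inv["failed_rules"])

    # Invoices with every required field populated (None and "" count as missing).
    covered = sum(
        1 for inv in invoices
        if all(inv["fields"].get(name) not in (None, "") for name in REQUIRED_FIELDS)
    )

    # Ten equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
    bins = Counter(min(int(inv["confidence"] * 10), 9) for inv in invoices)

    return {
        "validation_failure_rate": failing / n,
        "field_coverage": covered / n,
        "confidence_histogram": [bins.get(b, 0) for b in range(10)],
    }


full = {"tax_id": "x", "invoice_date": "2024-01-10", "total": "110000"}
batch = [
    {"fields": full, "failed_rules": [], "confidence": 0.95},
    {"fields": full, "failed_rules": ["future_date"], "confidence": 0.55},
    {"fields": {**full, "tax_id": None}, "failed_rules": [], "confidence": 1.0},
    {"fields": full, "failed_rules": [], "confidence": 0.25},
]
kpis = batch_kpis(batch)
```

Comparing the histogram of the current window against a recent baseline is one way to turn "a sudden skew" into a concrete alert condition.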

When quality is poor, we do not auto-fix

Bizzi does not let the model “best-guess” a field with low confidence and ship it. There are three escalation paths.

Confidence below threshold. Trigger a Human-in-the-Loop review (Pillar III §1).
Validation failure. Flag the field for manual entry by the customer’s accountant.
Unusual pattern. Escalate to the tenant administrator. The cause might be a new vendor onboarding, or it might be fraud.

This honesty matters. Guessing to hide poor input would claim a level of accuracy we did not reach.
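The three escalation paths amount to a small routing decision per field. The sketch below is one way to express it. The Route names, the 0.90 threshold, and the precedence order when several conditions hold at once (unusual pattern, then validation failure, then low confidence) are assumptions, since the text does not specify them.

```python
from enum import Enum


class Route(Enum):
    AUTO_ACCEPT = "auto_accept"
    HUMAN_REVIEW = "human_in_the_loop_review"
    MANUAL_ENTRY = "manual_entry_by_accountant"
    TENANT_ADMIN = "escalate_to_tenant_admin"


# Hypothetical value; the real threshold is not stated in this section.
CONFIDENCE_THRESHOLD = 0.90


def route_field(confidence: float, validation_failed: bool,
                unusual_pattern: bool) -> Route:
    """Pick an escalation path for one extracted field. Never guesses a value."""
    # Assumed precedence: possible fraud outranks everything else.
    if unusual_pattern:
        return Route.TENANT_ADMIN
    # A failed rule means the value is known to be wrong or inconsistent.
    if validation_failed:
        return Route.MANUAL_ENTRY
    # The value passed validation but the model is not sure of it.
    if confidence < CONFIDENCE_THRESHOLD:
        return Route.HUMAN_REVIEW
    return Route.AUTO_ACCEPT
```

Note that no branch replaces the extracted value. Every path either accepts it as is or hands it to a person.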

Training data quality

The same discipline applies to anything we train or fine-tune on.

Source verification. Only approved sources, documented in the Data Card.
Manual labeling. Senior QA labels the sample. Junior staff label at scale.
Inter-rater agreement. Measure agreement between labelers. Gaps trigger re-labeling or guideline clarification.
Adversarial samples. Proactively include edge cases (rotated, low-resolution, multi-language) so the model stays robust to them.
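The inter-rater agreement step above does not name a metric. One common choice for two labelers is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a minimal implementation under that assumption. The function name and the example labels are illustrative only.

```python
from collections import Counter


def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two labelers who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: share of items where both labelers chose the same class.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each labeler's class frequencies, summed over classes.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:
        # Both labelers used one identical class throughout; kappa is undefined,
        # so report full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


senior = ["vendor", "vendor", "total", "total"]
junior = ["vendor", "total", "total", "total"]
kappa = cohen_kappa(senior, junior)
```

In this example the labelers agree on three of four items, but half of that agreement is expected by chance, so kappa comes out at 0.5. The cut-off below which a gap triggers re-labeling is a policy decision and is not implied by the metric.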