Fairness and non-discrimination
When training data skews toward one customer segment, the model skews with it. An OCR engine trained mostly on large-enterprise invoices reads SME invoices worse. Non-standard fonts. Irregular layouts. More handwriting. The performance gap is not announced. It hides in the average. Our job is to find it and close it.
Context
SME and enterprise customers pay the same per-document fee. They should therefore receive equivalent AI quality. A model reaching 99% extraction accuracy in aggregate but 92% on SME invoices is not a 99% model. In effect it is two different models, and the customers on the weaker one are under-served.
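A small worked example shows how an aggregate can hide such a gap. Only the 99% and 92% figures come from the text above; the volume shares and the perfect non-SME accuracy are illustrative assumptions, not measured values.

```python
# Illustrative sketch: shares and the non-SME accuracy are assumed, not measured.
def aggregate_accuracy(cohorts):
    """Volume-weighted mean accuracy; cohorts maps name -> (share, accuracy)."""
    total_share = sum(share for share, _ in cohorts.values())
    return sum(share * acc for share, acc in cohorts.values()) / total_share

cohorts = {
    "sme": (0.125, 0.92),         # hypothetical share of evaluated documents
    "enterprise": (0.875, 1.00),  # hypothetical: every other document is read perfectly
}
print(round(aggregate_accuracy(cohorts), 4))
```

Under these assumed numbers the aggregate lands at roughly 0.99 even though one document in eight belongs to a cohort at 0.92, which is why the aggregate alone cannot be used as the quality signal.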
How we implement
- Balanced test cohort. The held-out evaluation set is sized in equal proportions across SME, mid-market, enterprise, and banking customers. Aggregate metrics are computed only after per-cohort metrics pass.
- Per-cohort reporting. Extraction accuracy and straight-through processing (STP) rate are reported separately for each cohort on every model release. If any cohort’s accuracy drops below 95% of the best-performing cohort’s accuracy, the release is blocked pending investigation.
- Monthly bias audit. A fixed test set is run through the current production model every month. Drift in any cohort triggers a ticket.
- Quarterly refresh. The test set is refreshed each quarter with anonymized samples from production so the audit reflects the current input distribution.
- Annual external review. An outside reviewer audits the bias-audit methodology itself once a year. We do not grade our own paper indefinitely.
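The per-cohort release gate described in the list above can be sketched as follows. This is a minimal illustration, not the production check: it assumes the 95% rule is relative to the best cohort's accuracy, and the function names, cohort keys, and sample numbers are hypothetical.

```python
RELATIVE_FLOOR = 0.95  # the rule above: 95% of the best-performing cohort

def blocked_cohorts(accuracy_by_cohort, relative_floor=RELATIVE_FLOOR):
    """Return the cohorts whose accuracy is below relative_floor * best accuracy."""
    if not accuracy_by_cohort:
        raise ValueError("no cohort metrics supplied")
    best = max(accuracy_by_cohort.values())
    threshold = relative_floor * best
    return sorted(name for name, acc in accuracy_by_cohort.items() if acc < threshold)

def release_allowed(accuracy_by_cohort):
    """A release passes only when no cohort falls under the relative floor."""
    return not blocked_cohorts(accuracy_by_cohort)

# Hypothetical per-cohort results for one release candidate.
results = {"sme": 0.92, "mid_market": 0.97, "enterprise": 0.99, "banking": 0.985}
print(blocked_cohorts(results))
print(release_allowed(results))
```

With these sample numbers the threshold is 0.95 × 0.99 = 0.9405, so the SME cohort at 0.92 blocks the release while the other three pass.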
When a gap is found
Remediation follows a fixed sequence. Document the gap in the bias-audit report. Investigate root cause (training data composition, model architecture, prompt). Plan remediation (typically data augmentation or fine-tuning). Rerun the audit to verify the gap closed. Write up the result in a post-incident review if the gap was material.
Other dimensions we audit
Beyond company size, the bias audit also covers:
- Industry. Construction, IT, and F&B invoices differ in layout and in typical totals.
- Region. North, Central, and South Vietnam differ in number formatting, common names, and tax ID conventions.
- Document age. Invoices older than three years use different formats and are read less accurately by OCR.
- Language mix. Vietnamese-only, English-only, and bilingual invoices each have their own failure modes.
Not every disparity is bias
We acknowledge a category distinction. A cohort having a lower STP rate because its invoices systematically lack optional fields is not a bias to fix. It reflects the input distribution. A cohort having lower extraction accuracy on fields which are present is a bias. The audit separates speed and coverage gaps (acceptable, disclosed) from accuracy gaps (not acceptable, blocking).
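One way to keep the two kinds of gap apart is to score accuracy only over fields that actually appear on the invoice and to report coverage as a separate number. The sketch below assumes ground truth marks an absent field as `None`; the function name, record shape, and field names are hypothetical.

```python
def field_metrics(records):
    """Score one cohort.

    records: list of (expected, extracted) dict pairs, one pair per invoice.
    A field counts as present when its ground-truth value is not None.
    Returns (accuracy_on_present_fields, field_coverage).
    """
    present = correct = total = 0
    for expected, extracted in records:
        for field, truth in expected.items():
            total += 1
            if truth is None:
                # Field absent from the invoice: affects coverage, not accuracy.
                continue
            present += 1
            if extracted.get(field) == truth:
                correct += 1
    accuracy = correct / present if present else None
    coverage = present / total if total else None
    return accuracy, coverage

# Two hypothetical invoices: one lacks a PO number, one has it misread.
records = [
    ({"total": "100", "po_number": None}, {"total": "100"}),
    ({"total": "250", "po_number": "PO-7"}, {"total": "250", "po_number": "PO-1"}),
]
accuracy, coverage = field_metrics(records)
print(accuracy, coverage)
```

Here the missing PO number lowers coverage to 3 of 4 fields without touching accuracy, while the misread PO number lowers accuracy to 2 of 3 present fields. Only the second effect would count as a blocking gap under the distinction above.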