# Planted Flaws — AuditForge Test Corpus
This corpus is synthetic. It simulates a defense contractor's task-order package, with deliberate flaws planted across the document set so we can score AuditForge's recall objectively.
Do NOT distribute this directory as a customer-facing demo — these documents are a scoring rig, not real audit content.
## Domain framing
Northstar Defense Inc. — a notional mid-tier defense contractor. The corpus represents documents Northstar would assemble for a CMMC L2 pre-assessment audit:
- 1 master contract (prime contract with the federal agency)
- 2 task orders under the master
- 2 subcontracts (one to each of 2 lower-tier vendors)
- 4 internal policies / SOPs
- 2 training records / attestations
- 1 quality assurance plan, deliberately absent (planted gap; the missing document is not counted in the total below)
- 1 incident response plan, present but marked DRAFT (planted gap-with-caveat)
Total: 12 documents.
## Planted flaws (ground truth for scoring)

### Conflict findings (conflict_check)

P-1: Cybersecurity training cadence conflict
- Master contract clause 3.2: "Contractor shall provide annual cybersecurity awareness training to all personnel."
- Subcontract Alpha §5.4: "Subcontractor shall complete cybersecurity training every 90 days."
- Subcontract Bravo §5.4: "Subcontractor shall provide annual training, with refresher every 6 months."
- Expected finding: Three different cadences for the same requirement. HIGH severity.

P-2: Key Personnel substitution authority conflict
- Master contract §7.1: "No substitution of Key Personnel without 30-day prior written notice and Contracting Officer approval."
- Project Management SOP §4.2: "Key Personnel substitutions may be made by the Program Manager with internal sign-off."
- Expected finding: Internal SOP authorizes substitutions the master forbids. CRITICAL severity.
### Coverage gaps (coverage_check)

P-3: Quality Assurance Plan absent
- Master contract §6.1 explicitly requires "a Quality Assurance Plan submitted within 30 days of contract award."
- No QAP document is in the corpus.
- Expected finding: Required deliverable missing. CRITICAL severity.
P-4: Incident Response Plan in draft status
- Master contract §8.2 requires "an approved Incident Response Plan."
- The IR Plan in the corpus is marked [DRAFT - NOT APPROVED] in its header.
- Expected finding: Required element present but not in approved state. HIGH severity.
P-5: Subcontractor cybersecurity attestation missing for Subcontract Bravo
- Master contract §3.4 requires each subcontractor to provide a cybersecurity self-attestation.
- Subcontract Alpha has its attestation file. Subcontract Bravo does not.
- Expected finding: Coverage gap on one subcontractor. HIGH severity.
### Currency / supersession (currency_check)

P-6: Superseded NIST publication reference
- Cybersecurity Policy v3 cites "NIST SP 800-171 r2" as the basis for control mappings.
- The corpus's most recent docs (post-2024) should reference r3.
- Expected finding: Outdated standard reference. MEDIUM severity.

P-7: Stale export-control (EAR) clause language
- Subcontract Alpha quotes EAR Part 744 language from a 2021 revision.
- The corpus operates in a 2025+ context where Part 744 was amended.
- Expected finding: Stale regulatory text. MEDIUM severity.
### Consistency / definitional drift (consistency_check)

P-8: "Controlled Unclassified Information" defined inconsistently
- Master contract §2.0: "CUI means information requiring safeguarding under 32 CFR 2002, including For Official Use Only and Sensitive But Unclassified categories."
- Cybersecurity Policy v3 §1.1: "CUI is any document marked CONFIDENTIAL or higher."
- Expected finding: Two materially different definitions. HIGH severity.

P-9: "Effective Date" defined inconsistently
- Master contract §1.0: "Effective Date means the date of last signature on this contract."
- Subcontract Alpha §1.0: "Effective Date means the date Subcontractor begins performance, which may differ from contract execution."
- Expected finding: Definition drift; downstream date references are ambiguous. MEDIUM severity.
### Flow-down failures (flow_down_check)

P-10: Cybersecurity audit-rights clause not flowed down to Subcontract Alpha
- Master contract §4.1 grants the Contracting Officer audit rights over cybersecurity practices.
- Subcontract Alpha contains no parallel clause.
- Expected finding: Master flow-down clause absent in subcontract. HIGH severity.

P-11: Personnel security clearance flow-down absent in Subcontract Bravo
- Master contract §5.0 requires all personnel to hold a SECRET clearance for CUI handling.
- Subcontract Bravo §5.0 has no clearance requirement.
- Expected finding: Personnel security flow-down absent. HIGH severity.
### Citation integrity (citation_integrity_check)

P-12: Misrepresented FAR clause
- Project Management SOP §3.1 states: "Per FAR 52.204-21, contractors are required to provide quarterly cybersecurity reports."
- FAR 52.204-21 is the Basic Safeguarding clause; it does NOT specify quarterly reporting.
- Expected finding: Citation misrepresented. MEDIUM severity.

P-13: Misidentified NIST publication
- Cybersecurity Policy v3 cites "NIST SP 800-53 r5" as the basis for FedRAMP control selection.
- For CMMC L2 / NIST 800-171 environments, 800-53 is not the operative standard (800-171 is).
- Expected finding: Wrong standard cited for the context. LOW severity.
## Total flaws by severity
| Severity | Count |
|---|---|
| CRITICAL | 2 (P-2, P-3) |
| HIGH | 6 (P-1, P-4, P-5, P-8, P-10, P-11) |
| MEDIUM | 4 (P-6, P-7, P-9, P-12) |
| LOW | 1 (P-13) |
| Total | 13 |
## Total flaws by primitive

| Primitive | Count | Flaws |
|---|---|---|
| conflict_check | 2 | P-1, P-2 |
| coverage_check | 3 | P-3, P-4, P-5 |
| currency_check | 2 | P-6, P-7 |
| consistency_check | 2 | P-8, P-9 |
| flow_down_check | 2 | P-10, P-11 |
| citation_integrity_check | 2 | P-12, P-13 |
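For scoring automation, the ground truth above can be encoded as a small lookup table. A minimal sketch, in Python; the `GROUND_TRUTH` name and tuple layout are illustrative, not part of AuditForge:

```python
from collections import Counter

# Ground truth for the 13 planted flaws: flaw ID -> (primitive, severity).
# Transcribed from the flaw descriptions above.
GROUND_TRUTH = {
    "P-1":  ("conflict_check",           "HIGH"),
    "P-2":  ("conflict_check",           "CRITICAL"),
    "P-3":  ("coverage_check",           "CRITICAL"),
    "P-4":  ("coverage_check",           "HIGH"),
    "P-5":  ("coverage_check",           "HIGH"),
    "P-6":  ("currency_check",           "MEDIUM"),
    "P-7":  ("currency_check",           "MEDIUM"),
    "P-8":  ("consistency_check",        "HIGH"),
    "P-9":  ("consistency_check",        "MEDIUM"),
    "P-10": ("flow_down_check",          "HIGH"),
    "P-11": ("flow_down_check",          "HIGH"),
    "P-12": ("citation_integrity_check", "MEDIUM"),
    "P-13": ("citation_integrity_check", "LOW"),
}

def tallies():
    """Recompute the two summary tables from the flaw list."""
    by_severity = Counter(sev for _, sev in GROUND_TRUTH.values())
    by_primitive = Counter(prim for prim, _ in GROUND_TRUTH.values())
    return by_severity, by_primitive
```

Recomputing the tallies from the flaw list keeps the summary tables honest if flaws are later added or re-severitized.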
## Recall scoring
After running an audit, compare findings to this ground truth:
- Recall = (planted flaws found) / 13
- Precision = (planted flaws found) / (total findings produced)
- False positives = findings that don't correspond to any planted flaw
Target for first dogfood: ≥70% recall on HIGH+CRITICAL, ≥40% recall overall. Precision is harder to measure (some "false positives" may be real findings the synthetic corpus accidentally introduced); manual review needed.
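Once each audit finding has been manually matched (or not) to a planted flaw ID, the metrics above can be computed mechanically. A minimal sketch, assuming findings arrive as a list where each entry is a flaw ID ("P-1" through "P-13") or `None` for an unmatched finding; the function names are hypothetical:

```python
# The 8 HIGH + CRITICAL flaws, per the severity table above.
HIGH_PLUS = {"P-1", "P-2", "P-3", "P-4", "P-5", "P-8", "P-10", "P-11"}

def score(findings, total_planted=13):
    """Compute recall, precision, and false-positive count.

    `findings` holds one entry per audit finding: either a matched
    planted-flaw ID, or None for a finding that matched nothing.
    """
    matched = {f for f in findings if f is not None}
    recall = len(matched) / total_planted
    precision = len(matched) / len(findings) if findings else 0.0
    false_positives = sum(1 for f in findings if f is None)
    return recall, precision, false_positives

def high_critical_recall(findings):
    """Recall restricted to the HIGH + CRITICAL subset (the ≥70% target)."""
    matched = {f for f in findings if f in HIGH_PLUS}
    return len(matched) / len(HIGH_PLUS)
```

For example, `score(["P-2", "P-3", None, "P-8"])` gives recall 3/13, precision 0.75, and 1 false positive. As the note above says, a `None` here may still be a real finding, so the false-positive count is an upper bound pending manual review.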
## Notes
- The corpus is intentionally lightweight — short, clearly-structured documents. Real-world audits will have far more noise. Performance here is an upper bound; real-corpus performance will be lower.
- Flaws are deliberately diverse — at least one per primitive — so every primitive gets exercised at least once per audit run.
- Some flaws cross-cut (e.g., P-1 and P-2 are both 'conflict' but about different topics) so cluster behavior in Stage F gets tested.
- Citation flaws (P-12, P-13) test the LLM's training-data knowledge. External citation verification (Phase 2) would catch these more reliably.