Planted Flaws — AuditForge Test Corpus

This corpus is synthetic. It simulates a defense contractor's task-order package, with deliberate flaws planted across the document set so we can score AuditForge's recall objectively.

Do NOT distribute this directory as a customer-facing demo — these documents are a scoring rig, not real audit content.

Domain framing

Northstar Defense Inc. — a notional mid-tier defense contractor. The corpus represents documents Northstar would assemble for a CMMC L2 pre-assessment audit:

  • 1 master contract (prime contract with the federal agency)
  • 2 task orders under the master
  • 2 subcontracts (one to each of 2 lower-tier vendors)
  • 4 internal policies / SOPs
  • 2 training records / attestations
  • 1 quality assurance plan — actually missing (planted gap)
  • 1 incident response plan — actually present but in DRAFT status (planted gap-with-caveat)

Total: 12 documents (the Quality Assurance Plan is deliberately absent, so only 12 of the 13 listed items exist as files).
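
For scripting against the corpus, the expected document set can also be written down as a small manifest. A minimal Python sketch; the filenames are hypothetical placeholders, not the corpus's real file names:

    # Hypothetical manifest of the document set (filenames are placeholders, not real corpus paths).
    EXPECTED_DOCUMENTS = {
        "master_contract.txt":         "prime contract with the federal agency",
        "task_order_01.txt":           "task order under the master",
        "task_order_02.txt":           "task order under the master",
        "subcontract_alpha.txt":       "subcontract to lower-tier vendor Alpha",
        "subcontract_bravo.txt":       "subcontract to lower-tier vendor Bravo",
        "cybersecurity_policy_v3.txt": "internal policy",
        "project_management_sop.txt":  "internal SOP",
        "internal_policy_03.txt":      "internal policy / SOP",
        "internal_policy_04.txt":      "internal policy / SOP",
        "training_record_01.txt":      "training record / attestation",
        "training_record_02.txt":      "training record / attestation",
        "incident_response_plan.txt":  "incident response plan (present, marked DRAFT)",
        # The Quality Assurance Plan is deliberately absent from the corpus (see P-3).
    }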

Planted flaws (ground truth for scoring)

Conflict findings (conflict_check)

P-1: Cybersecurity training cadence conflict
  • Master contract clause 3.2: "Contractor shall provide annual cybersecurity awareness training to all personnel."
  • Subcontract Alpha §5.4: "Subcontractor shall complete cybersecurity training every 90 days."
  • Subcontract Bravo §5.4: "Subcontractor shall provide annual training, with refresher every 6 months."
  • Expected finding: Three different cadences for the same requirement. HIGH severity.

P-2: Key Personnel substitution authority conflict
  • Master contract §7.1: "No substitution of Key Personnel without 30-day prior written notice and Contracting Officer approval."
  • Project Management SOP §4.2: "Key Personnel substitutions may be made by the Program Manager with internal sign-off."
  • Expected finding: Internal SOP authorizes substitutions the master forbids. CRITICAL severity.
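
As a rough illustration of what the conflict check has to catch for P-1, the Python sketch below extracts training-cadence statements and flags disagreement. It is not AuditForge's implementation; the document excerpts and regex are assumptions made for the example:

    import re

    # Hypothetical excerpts; the real corpus documents would be loaded from disk.
    DOCS = {
        "master_contract": "Contractor shall provide annual cybersecurity awareness training to all personnel.",
        "subcontract_alpha": "Subcontractor shall complete cybersecurity training every 90 days.",
        "subcontract_bravo": "Subcontractor shall provide annual training, with refresher every 6 months.",
    }

    # Very rough cadence extractor: looks for "annual" or "every N days/months".
    CADENCE = re.compile(r"annual|every\s+\d+\s+(?:days|months)", re.IGNORECASE)

    def training_cadences(docs):
        found = {}
        for name, text in docs.items():
            if "training" in text.lower():
                found[name] = tuple(m.group(0).lower() for m in CADENCE.finditer(text))
        return found

    cadences = training_cadences(DOCS)
    if len(set(cadences.values())) > 1:
        print("CONFLICT (P-1): training cadence differs across documents:")
        for name, vals in cadences.items():
            print(f"  {name}: {vals}")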

Coverage gaps (coverage_check)

P-3: Quality Assurance Plan absent
  • Master contract §6.1 explicitly requires "a Quality Assurance Plan submitted within 30 days of contract award."
  • No QAP document is in the corpus.
  • Expected finding: Required deliverable missing. CRITICAL severity.

P-4: Incident Response Plan in draft status
  • Master contract §8.2 requires "an approved Incident Response Plan."
  • The IR Plan in the corpus is marked [DRAFT - NOT APPROVED] in its header.
  • Expected finding: Required element present but not in approved state. HIGH severity.

P-5: Subcontractor cybersecurity attestation missing for Subcontract Bravo
  • Master contract §3.4 requires each subcontractor to provide a cybersecurity self-attestation.
  • Subcontract Alpha has its attestation file. Subcontract Bravo does not.
  • Expected finding: Coverage gap on one subcontractor. HIGH severity.
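
Coverage checks reduce to comparing a required-deliverables list against what exists on disk and in what state. A minimal sketch, assuming a flat corpus directory and the filenames shown (both are placeholders, not the tool's real configuration):

    from pathlib import Path

    # Required deliverables per the master contract (assumed mapping for illustration).
    REQUIRED = {
        "Quality Assurance Plan": "quality_assurance_plan.txt",       # P-3: absent
        "Incident Response Plan": "incident_response_plan.txt",       # P-4: present but DRAFT
        "Bravo cybersecurity attestation": "attestation_bravo.txt",   # P-5: absent
    }

    def coverage_findings(corpus_dir="corpus"):
        findings = []
        for label, filename in REQUIRED.items():
            path = Path(corpus_dir) / filename
            if not path.exists():
                findings.append(f"MISSING: {label} ({filename})")
            elif "[DRAFT" in path.read_text(errors="ignore"):
                findings.append(f"PRESENT BUT DRAFT: {label} ({filename})")
        return findings

    for finding in coverage_findings():
        print(finding)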

Currency / supersession (currency_check)

P-6: Superseded NIST publication reference
  • Cybersecurity Policy v3 cites "NIST SP 800-171 r2" as the basis for control mappings.
  • The corpus's most recent docs (post-2024) should reference r3.
  • Expected finding: Outdated standard reference. MEDIUM severity.

P-7: Stale export-control (EAR) clause language
  • Subcontract Alpha quotes EAR Part 744 language from a 2021 revision.
  • The corpus operates in a 2025+ context where Part 744 was amended.
  • Expected finding: Stale regulatory text. MEDIUM severity.
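
Currency flaws like P-6 and P-7 can be approximated with a supersession table mapping known-outdated references to a note about their current status. The table below covers only the two planted cases and is illustrative, not authoritative:

    # Known-outdated references relevant to this corpus (illustrative, not exhaustive).
    SUPERSEDED = {
        "NIST SP 800-171 r2": "superseded by NIST SP 800-171 r3 for post-2024 documents",
        "EAR Part 744 (2021 rev.)": "Part 744 was amended; 2021-era quoted text is stale in a 2025+ context",
    }

    def currency_findings(doc_name, text):
        findings = []
        for stale, note in SUPERSEDED.items():
            if stale.lower() in text.lower():
                findings.append(f"{doc_name}: cites '{stale}' -- {note}")
        return findings

    # Hypothetical excerpt from Cybersecurity Policy v3.
    sample = "Control mappings in this policy are based on NIST SP 800-171 r2."
    print(currency_findings("cybersecurity_policy_v3", sample))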

Consistency / definitional drift (consistency_check)

P-8: "Controlled Unclassified Information" defined inconsistently - Master contract §2.0: "CUI means information requiring safeguarding under 32 CFR 2002, including For Official Use Only and Sensitive But Unclassified categories." - Cybersecurity Policy v3 §1.1: "CUI is any document marked CONFIDENTIAL or higher." - Expected finding: Two materially different definitions. HIGH severity.

P-9: "Effective Date" defined inconsistently - Master contract §1.0: "Effective Date means the date of last signature on this contract." - Subcontract Alpha §1.0: "Effective Date means the date Subcontractor begins performance, which may differ from contract execution." - Expected finding: Definition drift; downstream date references are ambiguous. MEDIUM severity.

Flow-down failures (flow_down_check)

P-10: Cybersecurity audit-rights clause not flowed down to Subcontract Alpha
  • Master contract §4.1 grants the Contracting Officer audit rights over cybersecurity practices.
  • Subcontract Alpha contains no parallel clause.
  • Expected finding: Master flow-down clause absent in subcontract. HIGH severity.

P-11: Personnel security clearance flow-down absent in Subcontract Bravo
  • Master contract §5.0 requires all personnel to hold a SECRET clearance for CUI handling.
  • Subcontract Bravo §5.0 has no clearance requirement.
  • Expected finding: Personnel security flow-down absent. HIGH severity.
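
Flow-down checks ask whether each obligation in the master appears in some form in every subcontract. A keyword-level sketch (keyword lists and subcontract excerpts are assumptions; real matching would need semantic rather than keyword comparison):

    # Master-contract obligations that must flow down, with crude keyword signatures.
    FLOW_DOWN = {
        "cybersecurity audit rights (master §4.1)": ["audit", "cybersecurity"],
        "personnel clearance requirement (master §5.0)": ["clearance"],
    }

    # Hypothetical subcontract texts.
    SUBCONTRACTS = {
        "subcontract_alpha": "Subcontractor shall maintain SECRET clearances for all personnel handling CUI.",
        "subcontract_bravo": "Subcontractor shall grant the Contracting Officer audit rights over cybersecurity practices.",
    }

    for obligation, keywords in FLOW_DOWN.items():
        for sub, text in SUBCONTRACTS.items():
            if not all(kw in text.lower() for kw in keywords):
                print(f"FLOW-DOWN GAP: {obligation} not found in {sub}")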

Citation integrity (citation_integrity_check)

P-12: Misrepresented FAR clause
  • Project Management SOP §3.1 states: "Per FAR 52.204-21, contractors are required to provide quarterly cybersecurity reports."
  • FAR 52.204-21 is the Basic Safeguarding clause; it does NOT specify quarterly reporting.
  • Expected finding: Citation misrepresented. MEDIUM severity.

P-13: Misidentified NIST publication
  • Cybersecurity Policy v3 cites "NIST SP 800-53 r5" as the basis for FedRAMP control selection.
  • For CMMC L2 / NIST 800-171 environments, 800-53 is not the operative standard. (800-171 is.)
  • Expected finding: Wrong standard cited for the context. LOW severity.
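
Citation-integrity flaws are the hardest to check mechanically because the checker has to know what the cited authority actually says. The sketch below hard-codes that knowledge for the two planted cases and simply surfaces claim/fact pairs for review, which is roughly what external citation verification (Phase 2) would automate:

    # What the cited authorities actually cover, for the two planted cases.
    CITATION_FACTS = {
        "FAR 52.204-21": "Basic Safeguarding of Covered Contractor Information Systems; no quarterly reporting requirement",
        "NIST SP 800-53": "control catalog used by FedRAMP; NIST SP 800-171 is the operative standard for CMMC L2",
    }

    # Hypothetical claims extracted from the corpus, keyed by the authority they cite.
    CLAIMS = {
        "FAR 52.204-21": "contractors are required to provide quarterly cybersecurity reports",
        "NIST SP 800-53": "basis for control selection in this CMMC L2 environment",
    }

    for authority, claim in CLAIMS.items():
        fact = CITATION_FACTS.get(authority)
        if fact:
            # A human (or an LLM with retrieval) decides whether the claim matches the fact;
            # this sketch only pairs them up for review.
            print(f"REVIEW {authority}:\n  claim: {claim}\n  actual scope: {fact}\n")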

Total flaws by severity

Severity   Count
CRITICAL   2 (P-2, P-3)
HIGH       6 (P-1, P-4, P-5, P-8, P-10, P-11)
MEDIUM     4 (P-6, P-7, P-9, P-12)
LOW        1 (P-13)
Total      13

Total flaws by primitive

Primitive                  Count   Flaws
conflict_check             2       P-1, P-2
coverage_check             3       P-3, P-4, P-5
currency_check             2       P-6, P-7
consistency_check          2       P-8, P-9
flow_down_check            2       P-10, P-11
citation_integrity_check   2       P-12, P-13
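
For scoring scripts, the same ground truth can be kept in machine-readable form. A sketch in Python; the structure is an assumption, not an AuditForge schema:

    # Ground truth for recall scoring: flaw ID -> (primitive, severity).
    PLANTED_FLAWS = {
        "P-1":  ("conflict_check",           "HIGH"),
        "P-2":  ("conflict_check",           "CRITICAL"),
        "P-3":  ("coverage_check",           "CRITICAL"),
        "P-4":  ("coverage_check",           "HIGH"),
        "P-5":  ("coverage_check",           "HIGH"),
        "P-6":  ("currency_check",           "MEDIUM"),
        "P-7":  ("currency_check",           "MEDIUM"),
        "P-8":  ("consistency_check",        "HIGH"),
        "P-9":  ("consistency_check",        "MEDIUM"),
        "P-10": ("flow_down_check",          "HIGH"),
        "P-11": ("flow_down_check",          "HIGH"),
        "P-12": ("citation_integrity_check", "MEDIUM"),
        "P-13": ("citation_integrity_check", "LOW"),
    }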

Recall scoring

After running an audit, compare findings to this ground truth:

  • Recall = (planted flaws found) / 13
  • Precision = (planted flaws found) / (total findings produced)
  • False positives = findings that don't correspond to any planted flaw

Target for the first dogfood run: ≥70% recall on HIGH+CRITICAL flaws and ≥40% recall overall. Precision is harder to measure (some "false positives" may be real findings the synthetic corpus accidentally introduced), so manual review is needed.
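
A minimal scoring sketch against the severity data above, assuming findings have already been manually mapped to P-numbers; the example inputs at the end are placeholders, not real results:

    # Severity per planted flaw (copied from the tables above).
    SEVERITY = {
        "P-1": "HIGH", "P-2": "CRITICAL", "P-3": "CRITICAL", "P-4": "HIGH",
        "P-5": "HIGH", "P-6": "MEDIUM", "P-7": "MEDIUM", "P-8": "HIGH",
        "P-9": "MEDIUM", "P-10": "HIGH", "P-11": "HIGH", "P-12": "MEDIUM",
        "P-13": "LOW",
    }

    def score(matched_flaws, total_findings):
        """matched_flaws: planted-flaw IDs that some finding corresponds to.
        total_findings: total number of findings the audit produced."""
        matched = set(matched_flaws)
        high_crit = {f for f, s in SEVERITY.items() if s in ("CRITICAL", "HIGH")}
        return {
            "recall_overall": len(matched) / len(SEVERITY),
            "recall_high_critical": len(matched & high_crit) / len(high_crit),
            "precision": len(matched) / total_findings if total_findings else 0.0,
        }

    # Placeholder example: 7 planted flaws matched out of 11 findings produced.
    print(score(["P-1", "P-2", "P-3", "P-4", "P-5", "P-8", "P-10"], 11))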

Notes

  • The corpus is intentionally lightweight — short, clearly structured documents. Real-world audits will have far more noise. Performance here is an upper bound; real-corpus performance will be lower.
  • Flaws are deliberately diverse — at least one per primitive — so every primitive gets exercised at least once per audit run.
  • Some flaws cross-cut (e.g., P-1 and P-2 are both 'conflict' findings but about different topics), so cluster behavior in Stage F gets tested.
  • Citation flaws (P-12, P-13) test the LLM's training-data knowledge. External citation verification (Phase 2) would catch these more reliably.