Evidence-Quote Auto-Verification¶
Status: ✅ Complete
Spans: app/auditforge/orchestrator.py (verify_quote, parse_evidence,
build_finding), app/auditforge/findings.py (EvidenceCitation),
app/auditforge/report.py (markdown rendering)
Tests: tests/test_auditforge_evidence_verification.py — 20 cases passing
Why this exists¶
Every finding's defensibility rests on its evidence chain. If the LLM hallucinates a quote, the firm's partner unknowingly stakes their name on a fabrication. One bad quote in a deliverable can lose an account.
Evidence-quote auto-verification is a precision-side power move: every LLM-cited quote is substring-matched against the retrieved chunk text before delivery. Quotes that can't be verified flag the finding for auditor review and penalize the finding's confidence score.
This is near-free precision — pure-Python string matching, no LLM calls — but it shifts the failure mode from silent fabrication to flagged-for-review. That's the difference between malpractice risk and acceptable risk.
What gets verified¶
For each LLM-emitted finding, every EvidenceCitation's
verbatim_quote is checked against the question's retrieval_results
chunk text. The check uses verify_quote(quote, chunks):
- Normalize both quote and haystack (lowercase, whitespace-collapsed)
- Exact match: is the quote a substring of any chunk's text?
- Span match (paraphrase tolerance): does any 40-character (or 60% of quote, whichever is shorter) substring of the quote appear in any chunk's text?
The two-stage check tolerates minor LLM paraphrasing while still requiring real anchor text. A fully hallucinated quote — content not appearing in any retrieved chunk — fails both stages.
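The two-stage check can be sketched in pure Python. This is an illustrative sketch, not the actual `orchestrator.py` source — the helper name `_normalize` and the exact window-sliding loop are assumptions; the stages and thresholds mirror the prose above.

```python
import re

SPAN_LEN = 40        # minimum quote length for the paraphrase fallback
SPAN_FRACTION = 0.6  # span is 40 chars or 60% of the quote, whichever is shorter

def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace runs to single spaces."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def verify_quote(quote: str, chunks: list[str]) -> bool:
    q = _normalize(quote)
    if not q or not chunks:
        return False  # empty quote or no retrieval: unverifiable
    haystacks = [_normalize(c) for c in chunks]
    # Stage 1: exact match — is the quote a substring of any chunk?
    if any(q in h for h in haystacks):
        return True
    # Short quotes get no span fallback: exact match or nothing.
    if len(q) < SPAN_LEN:
        return False
    # Stage 2: span match — does any span-length substring of the
    # quote appear in any chunk? Tolerates minor paraphrasing.
    span = min(SPAN_LEN, int(len(q) * SPAN_FRACTION))
    return any(
        q[i:i + span] in h
        for h in haystacks
        for i in range(len(q) - span + 1)
    )
```

A fully hallucinated quote fails both stages; a long quote with a paraphrased tail still verifies via its anchor span.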
What happens when quotes don't verify¶
EvidenceCitation carries quote_verified: bool = True. The flag is
set to False at parse_evidence time when verification fails.
In build_finding, _apply_evidence_verification_penalty() applies a
confidence multiplier based on the unverified ratio:
| Unverified ratio | Penalty | Auditor flag |
|---|---|---|
| 0% | None | None |
| 1% – 49% | confidence × max(0.7, 1 − ratio × 0.4) | None |
| ≥ 50% | confidence × 0.5 | "Majority of cited quotes could not be located..." appended to auditor_notes |
The 50% threshold reflects auditor judgment: a single misquoted citation is recoverable; majority-unverified evidence is a defensibility crisis the partner needs to see explicitly.
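The schedule above reduces to a few lines of arithmetic. A sketch — the real `_apply_evidence_verification_penalty` operates on `Finding` objects rather than bare floats, and the full auditor-note wording is abbreviated here as in the table:

```python
# Abbreviated auditor note, as elided in the table above.
MAJORITY_UNVERIFIED_NOTE = "Majority of cited quotes could not be located..."

def apply_verification_penalty(confidence: float, unverified: int, total: int):
    """Return (adjusted_confidence, auditor_note_or_None)."""
    if total == 0 or unverified == 0:
        return confidence, None                              # 0% unverified: no penalty
    ratio = unverified / total
    if ratio >= 0.5:
        return confidence * 0.5, MAJORITY_UNVERIFIED_NOTE    # severe: halve + flag
    return confidence * max(0.7, 1.0 - ratio * 0.4), None    # mild, floored at 0.7x
```

Note the floor: the mild penalty never cuts confidence below 70% of its original value, so a single bad citation can't crater an otherwise well-evidenced finding.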
Where it surfaces¶
| Surface | Behavior |
|---|---|
| `Finding.confidence` | Reduced per the penalty schedule |
| `Finding.auditor_notes` | Verification flag appended when ≥ 50% unverified |
| `EvidenceCitation.quote_verified` | `False` on unverified quotes |
| Markdown deliverable | "(unverified)" marker after the doc/section anchor |
| JSON deliverable | `quote_verified: false` field |
| Engagement summary | Reduced confidence flows into severity rollup naturally |
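As an illustration of the markdown surface, a minimal rendering sketch — the `EvidenceCitation` fields echo `findings.py`, but `render_citation` and its output format are hypothetical, not the actual `report.py` code:

```python
from dataclasses import dataclass

@dataclass
class EvidenceCitation:
    doc: str
    section: str
    verbatim_quote: str
    quote_verified: bool = True  # set False at parse_evidence time

def render_citation(ev: EvidenceCitation) -> str:
    """Render a citation line, appending the marker after the doc/section anchor."""
    anchor = f"{ev.doc} §{ev.section}"
    if not ev.quote_verified:
        anchor += " (unverified)"
    return f'> "{ev.verbatim_quote}" — {anchor}'
```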
Why this is a precision move, not a recall move¶
Verification only flags / penalizes — it doesn't drop findings outright. A finding with unverified evidence still appears in the deliverable; the auditor decides whether to accept, refine, or reject.
This is intentional. False rejections are also costly. The auditor is the final arbiter; AuditForge's job is to surface the signal, not pre-adjudicate it. The verification system gives the auditor the data to make an informed decision.
Behavior on edge cases¶
- Empty quote: `verify_quote` returns False. `parse_evidence` already filters empty quotes out.
- Empty chunks (no retrieval): Returns False. Findings produced without retrieval (very rare — Stage E retrieves before the LLM call) get flagged as unverified across the board.
- Whitespace differences: Normalized. "annual   review required" (extra spaces) matches "annual review required."
- Case differences: Normalized. "ANNUAL TRAINING" matches "annual training."
- Short quotes (< 40 chars): Require exact substring match (no span fallback). A 10-character paraphrase that doesn't appear verbatim fails — desired behavior, since short paraphrases are most prone to hallucination.
Test coverage¶
| Area | Cases |
|---|---|
| `verify_quote` (pure) | 10 (exact, case-insensitive, whitespace-normalized, paraphrase span match, hallucination rejected, empty quote, empty chunks, short-quote exact, multi-chunk match, long-quote partial) |
| `parse_evidence` flag setting | 3 (verified, unverified, mixed) |
| `_apply_evidence_verification_penalty` | 5 (no penalty when all verified, severe at ≥ 50%, mild at < 50%, no-evidence no-op, all unverified) |
| `build_finding` integration | 2 (mixed evidence applies penalty + flag, all-verified no penalty/flag) |
All 20 cases passing. Full suite: 422 pass, no regressions.
Cost shape¶
Zero. Pure string matching, no LLM calls. This is the highest-ROI power move in the AuditForge architecture.
Public API¶
```python
from app.auditforge.orchestrator import verify_quote, parse_evidence

# Pure verification
is_real = verify_quote(quote_string, retrieved_chunks)

# Parse with verification (called inside build_finding)
evidence = parse_evidence(llm_evidence_array, retrieved_chunks)
for ev in evidence:
    print(ev.verbatim_quote, ev.quote_verified)
```
Known limits / future work¶
- No external-source verification. Quotes from cited standards (FAR/DFARS/NIST) aren't checked against eCFR or NIST publication databases. The `citation_integrity_check` primitive will close this gap when external verification ships.
- Span-match heuristic is fixed. The 40-char minimum / 60% threshold may need tuning per domain. Legal corpora with verbose statutory language behave differently from technical SOPs. Adaptive thresholds are a Phase 2 refinement.
- Verification doesn't catch out-of-context quotes. A quote may match a chunk verbatim but the LLM may have applied it to the wrong finding context. Catching this requires semantic verification (an adversarial pass — see next commit).
- No fuzzy-match (e.g., Levenshtein). Substring matching is exact; typos in either source or quote can defeat it. Acceptable trade-off given the cost vs. value.
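If fuzzy matching is ever added, a stdlib-only sliding-window fallback could look like the sketch below. This is not part of the current implementation — the function name and the 0.85 threshold are assumptions for illustration:

```python
from difflib import SequenceMatcher

def fuzzy_contains(quote: str, chunk: str, threshold: float = 0.85) -> bool:
    """True if some quote-length window of the chunk is similar enough
    to the quote, so a single typo no longer defeats verification."""
    q, c = quote.lower(), chunk.lower()
    if not q or len(c) < len(q):
        return False
    n = len(q)
    # Slide a quote-length window over the chunk and compare similarity.
    return any(
        SequenceMatcher(None, q, c[i:i + n]).ratio() >= threshold
        for i in range(len(c) - n + 1)
    )
```

The O(len(quote) × len(chunk)) cost is the trade-off; it would no longer be "near-free" at scale, which is part of why exact substring matching shipped first.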