Evidence-Quote Auto-Verification

Status: ✅ Complete
Spans: app/auditforge/orchestrator.py (verify_quote, parse_evidence, build_finding), app/auditforge/findings.py (EvidenceCitation), app/auditforge/report.py (markdown rendering)
Tests: tests/test_auditforge_evidence_verification.py — 20 cases passing

Why this exists

Every finding's defensibility rests on its evidence chain. If the LLM hallucinates a quote, the firm's partner unknowingly stakes their name on a fabrication. One bad quote in a deliverable can lose an account.

Evidence-quote auto-verification is a precision-side power move: every LLM-cited quote is substring-matched against the retrieved chunk text before delivery. Quotes that can't be verified flag the finding for auditor review and penalize its confidence score.

This is near-free precision — pure-Python string matching, no LLM calls — but it shifts the failure mode from silent fabrication to flagged-for-review. That's the difference between malpractice risk and acceptable risk.

What gets verified

For each LLM-emitted finding, every EvidenceCitation's verbatim_quote is checked against the question's retrieval_results chunk text. The check uses verify_quote(quote, chunks):

  1. Normalize both quote and haystack (lowercase, whitespace-collapsed)
  2. Exact match: is the quote a substring of any chunk's text?
  3. Span match (paraphrase tolerance): does any 40-character (or 60% of quote, whichever is shorter) substring of the quote appear in any chunk's text?

The two-stage check tolerates minor LLM paraphrasing while still requiring real anchor text. A fully hallucinated quote — content not appearing in any retrieved chunk — fails both stages.
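A minimal sketch of the check, assuming the normalization and thresholds described above (the real implementation lives in app/auditforge/orchestrator.py and may differ in detail):

```python
import re

def verify_quote(quote: str, chunks: list[str]) -> bool:
    """Sketch of the two-stage check; mirrors the documented behavior,
    not the exact production code."""
    def normalize(text: str) -> str:
        # Lowercase and collapse whitespace runs to single spaces.
        return re.sub(r"\s+", " ", text.lower()).strip()

    needle = normalize(quote)
    haystacks = [normalize(c) for c in chunks]
    if not needle or not haystacks:
        return False

    # Stage 1: exact match — is the quote a substring of any chunk?
    if any(needle in h for h in haystacks):
        return True

    # Stage 2: span match (paraphrase tolerance). Short quotes
    # (< 40 chars) get no fallback and must match exactly.
    if len(needle) < 40:
        return False
    span = min(40, int(len(needle) * 0.6))
    windows = (needle[i:i + span] for i in range(len(needle) - span + 1))
    return any(w in h for w in windows for h in haystacks)
```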

What happens when quotes don't verify

EvidenceCitation carries quote_verified: bool = True. The flag is set to False at parse_evidence time when verification fails.
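For orientation, a sketch of the citation shape (verbatim_quote and quote_verified come from this design; the anchor fields doc_id and section are illustrative stand-ins for whatever app/auditforge/findings.py actually defines):

```python
from dataclasses import dataclass

@dataclass
class EvidenceCitation:
    verbatim_quote: str
    doc_id: str        # illustrative anchor field
    section: str       # illustrative anchor field
    quote_verified: bool = True  # flipped to False by parse_evidence
```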

In build_finding, _apply_evidence_verification_penalty() applies a confidence multiplier based on the unverified ratio:

| Unverified ratio | Penalty | Auditor flag |
| --- | --- | --- |
| 0% | None | None |
| 1% – 49% | confidence × max(0.7, 1 − ratio × 0.4) | None |
| ≥ 50% | confidence × 0.5 | "Majority of cited quotes could not be located..." appended to auditor_notes |

The 50% threshold reflects auditor judgment: a single misquoted citation is recoverable; majority-unverified evidence is a defensibility crisis the partner needs to see explicitly.
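In code, the schedule looks roughly like this (the signature is assumed, as are the types of confidence and auditor_notes; the real implementation is in app/auditforge/orchestrator.py):

```python
def _apply_evidence_verification_penalty(finding, evidence) -> None:
    """Sketch of the schedule above; assumes finding.confidence is a
    float and finding.auditor_notes is a list of strings."""
    if not evidence:
        return  # no evidence -> no-op, nothing to verify
    unverified = sum(1 for ev in evidence if not ev.quote_verified)
    ratio = unverified / len(evidence)
    if ratio == 0:
        return  # all quotes verified -> no penalty
    if ratio < 0.5:
        # Mild penalty, floored at a 0.7 multiplier.
        finding.confidence *= max(0.7, 1 - ratio * 0.4)
    else:
        # Severe penalty plus an explicit flag for the auditor.
        # Note text abridged here exactly as in the table above.
        finding.confidence *= 0.5
        finding.auditor_notes.append(
            "Majority of cited quotes could not be located..."
        )
```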

Where it surfaces

| Surface | Behavior |
| --- | --- |
| Finding.confidence | Reduced per the penalty schedule |
| Finding.auditor_notes | Verification flag appended when ≥ 50% unverified |
| EvidenceCitation.quote_verified | Set to False on unverified quotes |
| Markdown deliverable | "(unverified)" marker after the doc/section anchor (sketched below) |
| JSON deliverable | quote_verified: false field |
| Engagement summary | Reduced confidence flows into the severity rollup naturally |
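For the markdown surface, a hypothetical rendering helper shows where the marker lands (the real logic lives in app/auditforge/report.py; the line format here is a guess, and doc_id/section are the stand-in anchor fields from the dataclass sketch above — only the "(unverified)" placement is confirmed):

```python
def render_citation(ev: EvidenceCitation) -> str:
    # Illustrative only; mirrors the "(unverified)" marker position.
    marker = "" if ev.quote_verified else " (unverified)"
    return f'- "{ev.verbatim_quote}" [{ev.doc_id} {ev.section}]{marker}'
```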

Why this is a precision move, not a recall move

Verification only flags and penalizes — it doesn't drop findings outright. A finding with unverified evidence still appears in the deliverable; the auditor decides whether to accept, refine, or reject it.

This is intentional. False rejections are also costly. The auditor is the final arbiter; AuditForge's job is to surface the signal, not pre-adjudicate it. The verification system gives the auditor the data to make an informed decision.

Behavior on edge cases

  • Empty quote: verify_quote returns False. parse_evidence already filters empty quotes out.
  • Empty chunks (no retrieval): Returns False. Findings produced without retrieval (very rare — Stage E retrieves before the LLM call) get flagged as unverified across the board.
  • Whitespace differences: Normalized. "annual review required" matches "annual review required."
  • Case differences: Normalized. "ANNUAL TRAINING" matches "annual training."
  • Short quotes (< 40 chars): Require exact substring match (no span fallback). A 10-character paraphrase that doesn't appear verbatim fails — desired behavior, since short paraphrases are most prone to hallucination (see the checks below).
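Expressed as checks against the verify_quote sketch above (illustrative; the real cases live in tests/test_auditforge_evidence_verification.py):

```python
assert not verify_quote("", ["any chunk text"])        # empty quote
assert not verify_quote("some quote", [])              # empty chunks / no retrieval
assert verify_quote("annual review required",
                    ["Annual   review\nrequired."])    # whitespace + case normalized
assert verify_quote("ANNUAL TRAINING", ["annual training is mandatory"])
assert not verify_quote("brief gloss", ["entirely unrelated text"])  # short paraphrase fails
```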

Test coverage

| Area | Cases |
| --- | --- |
| verify_quote (pure) | 10 (exact, case-insensitive, whitespace-normalized, paraphrase span match, hallucination rejected, empty quote, empty chunks, short-quote exact, multi-chunk match, long-quote partial) |
| parse_evidence flag setting | 3 (verified, unverified, mixed) |
| _apply_evidence_verification_penalty | 5 (no penalty when all verified, severe at ≥ 50%, mild at < 50%, no-evidence no-op, all unverified) |
| build_finding integration | 2 (mixed evidence applies penalty + flag, all-verified no penalty/flag) |

All 20 cases passing. Full suite: 422 pass, no regressions.

Cost shape

Zero. Pure string matching, no LLM calls. This is the highest-ROI power move in the AuditForge architecture.

Public API

```python
from app.auditforge.orchestrator import verify_quote, parse_evidence

# Pure verification
is_real = verify_quote(quote_string, retrieved_chunks)

# Parse with verification (called inside build_finding)
evidence = parse_evidence(llm_evidence_array, retrieved_chunks)
for ev in evidence:
    print(ev.verbatim_quote, ev.quote_verified)
```

Known limits / future work

  • No external-source verification. Quotes from cited standards (FAR/DFARS/NIST) aren't checked against eCFR or NIST publication databases. The citation_integrity_check primitive will close this gap when external verification ships.
  • Span-match heuristic is fixed. 40-char minimum / 60% threshold may need tuning per domain. Legal corpora with verbose statutory language behave differently from technical SOPs. Adaptive thresholds are a Phase 2 refinement.
  • Verification doesn't catch out-of-context quotes. A quote may match a chunk verbatim but the LLM may have applied it to the wrong finding context. Catching this requires semantic verification (an adversarial pass — see next commit).
  • No fuzzy matching (e.g., Levenshtein). Substring matching is exact, so typos in either the source or the quote can defeat it. An acceptable trade-off given the cost-versus-value profile.