Evidence-Quote Auto-Verification¶
Status: ✅ Complete
Spans: app/auditforge/orchestrator.py (verify_quote, parse_evidence,
build_finding), app/auditforge/findings.py (EvidenceCitation),
app/auditforge/report.py (markdown rendering)
Tests: tests/test_auditforge_evidence_verification.py — 20 cases passing
Why this exists¶
Every finding's defensibility rests on its evidence chain. If the LLM hallucinates a quote, the firm's partner unknowingly stakes their name on a fabrication. One bad quote in a deliverable can lose an account.
Evidence-quote auto-verification is a precision-side power move: every LLM-cited quote is substring-matched against the retrieved chunk text before delivery. Quotes that can't be verified flag the finding for auditor review and penalize the finding's confidence score.
This is near-free precision — pure-Python string matching, no LLM calls — but it shifts the failure mode from silent fabrication to flagged-for-review. That's the difference between malpractice risk and acceptable risk.
What gets verified¶
For each LLM-emitted finding, every EvidenceCitation's
verbatim_quote is checked against the question's retrieval_results
chunk text. The check uses verify_quote(quote, chunks):
- Normalize both quote and haystack (lowercase, whitespace-collapsed)
- Exact match: is the quote a substring of any chunk's text?
- Span match (paraphrase tolerance): does any 40-character (or 60% of quote, whichever is shorter) substring of the quote appear in any chunk's text?
The two-stage check tolerates minor LLM paraphrasing while still requiring real anchor text. A fully hallucinated quote — content not appearing in any retrieved chunk — fails both stages.
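The two-stage check can be sketched in pure Python. This is an illustrative sketch, not the actual `orchestrator.py` source — the helper name `_normalize` and the exact window-sliding loop are assumptions; the stages and thresholds mirror the prose above.

```python
import re

SPAN_LEN = 40        # minimum quote length for the paraphrase fallback
SPAN_FRACTION = 0.6  # span is 40 chars or 60% of the quote, whichever is shorter

def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace runs to single spaces."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def verify_quote(quote: str, chunks: list[str]) -> bool:
    q = _normalize(quote)
    if not q or not chunks:
        return False  # empty quote or no retrieval: unverifiable
    haystacks = [_normalize(c) for c in chunks]
    # Stage 1: exact match — is the quote a substring of any chunk?
    if any(q in h for h in haystacks):
        return True
    # Short quotes get no span fallback: exact match or nothing.
    if len(q) < SPAN_LEN:
        return False
    # Stage 2: span match — does any span-length substring of the
    # quote appear in any chunk? Tolerates minor paraphrasing.
    span = min(SPAN_LEN, int(len(q) * SPAN_FRACTION))
    return any(
        q[i:i + span] in h
        for h in haystacks
        for i in range(len(q) - span + 1)
    )
```

A fully hallucinated quote fails both stages; a long quote with a paraphrased tail still verifies via its anchor span.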
What happens when quotes don't verify¶
EvidenceCitation carries quote_verified: bool = True. The flag is
set to False at parse_evidence time when verification fails.
In build_finding, _apply_evidence_verification_penalty() applies a
confidence multiplier based on the unverified ratio:
| Unverified ratio | Penalty | Auditor flag |
|---|---|---|
| 0% | None | None |
| 1% – 49% | confidence × max(0.7, 1 − ratio × 0.4) | None |
| ≥ 50% | confidence × 0.5 | "Majority of cited quotes could not be located..." appended to auditor_notes |
The 50% threshold reflects auditor judgment: a single misquoted citation is recoverable; majority-unverified evidence is a defensibility crisis the partner needs to see explicitly.
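The schedule above reduces to a few lines of arithmetic. A sketch — the real `_apply_evidence_verification_penalty` operates on `Finding` objects rather than bare floats, and the full auditor-note wording is abbreviated here as in the table:

```python
# Abbreviated auditor note, as elided in the table above.
MAJORITY_UNVERIFIED_NOTE = "Majority of cited quotes could not be located..."

def apply_verification_penalty(confidence: float, unverified: int, total: int):
    """Return (adjusted_confidence, auditor_note_or_None)."""
    if total == 0 or unverified == 0:
        return confidence, None                              # 0% unverified: no penalty
    ratio = unverified / total
    if ratio >= 0.5:
        return confidence * 0.5, MAJORITY_UNVERIFIED_NOTE    # severe: halve + flag
    return confidence * max(0.7, 1.0 - ratio * 0.4), None    # mild, floored at 0.7x
```

Note the floor: the mild penalty never cuts confidence below 70% of its original value, so a single bad citation can't crater an otherwise well-evidenced finding.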
Where it surfaces¶
| Surface | Behavior |
|---|---|
| `Finding.confidence` | Reduced per the penalty schedule |
| `Finding.auditor_notes` | Verification flag appended when ≥ 50% unverified |
| `EvidenceCitation.quote_verified` | `False` on unverified quotes |
| Markdown deliverable | "(unverified)" marker after the doc/section anchor |
| JSON deliverable | `quote_verified: false` field |
| Engagement summary | Reduced confidence flows into severity rollup naturally |
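As an illustration of the markdown surface, a minimal rendering sketch — the `EvidenceCitation` fields echo `findings.py`, but `render_citation` and its output format are hypothetical, not the actual `report.py` code:

```python
from dataclasses import dataclass

@dataclass
class EvidenceCitation:
    doc: str
    section: str
    verbatim_quote: str
    quote_verified: bool = True  # set False at parse_evidence time

def render_citation(ev: EvidenceCitation) -> str:
    """Render a citation line, appending the marker after the doc/section anchor."""
    anchor = f"{ev.doc} §{ev.section}"
    if not ev.quote_verified:
        anchor += " (unverified)"
    return f'> "{ev.verbatim_quote}" — {anchor}'
```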
Why this is a precision move, not a recall move¶
Verification only flags / penalizes — it doesn't drop findings outright. A finding with unverified evidence still appears in the deliverable; the auditor decides whether to accept, refine, or reject.
This is intentional. False rejections are also costly. The auditor is the final arbiter; AuditForge's job is to surface the signal, not pre-adjudicate it. The verification system gives the auditor the data to make an informed decision.
Behavior on edge cases¶
- Empty quote: `verify_quote` returns False. `parse_evidence` already filters empty quotes out.
- Empty chunks (no retrieval): Returns False. Findings produced without retrieval (very rare — Stage E retrieves before the LLM call) get flagged as unverified across the board.
- Whitespace differences: Normalized. "annual   review required" (extra spaces) matches "annual review required."
- Case differences: Normalized. "ANNUAL TRAINING" matches "annual training."
- Short quotes (< 40 chars): Require exact substring match (no span fallback). A 10-character paraphrase that doesn't appear verbatim fails — desired behavior, since short paraphrases are most prone to hallucination.
Test coverage¶
| Area | Cases |
|---|---|
| `verify_quote` (pure) | 10 (exact, case-insensitive, whitespace-normalized, paraphrase span match, hallucination rejected, empty quote, empty chunks, short-quote exact, multi-chunk match, long-quote partial) |
| `parse_evidence` flag setting | 3 (verified, unverified, mixed) |
| `_apply_evidence_verification_penalty` | 5 (no penalty when all verified, severe at ≥ 50%, mild at < 50%, no-evidence no-op, all unverified) |
| `build_finding` integration | 2 (mixed evidence applies penalty + flag, all-verified no penalty/flag) |
All 20 cases passing. Full suite: 422 pass, no regressions.
Cost shape¶
Zero. Pure string matching, no LLM calls. This is the highest-ROI power move in the AuditForge architecture.
Public API¶
```python
from app.auditforge.orchestrator import verify_quote, parse_evidence

# Pure verification
is_real = verify_quote(quote_string, retrieved_chunks)

# Parse with verification (called inside build_finding)
evidence = parse_evidence(llm_evidence_array, retrieved_chunks)
for ev in evidence:
    print(ev.verbatim_quote, ev.quote_verified)
```
Known limits / future work¶
- No external-source verification. Quotes from cited standards (FAR/DFARS/NIST) aren't checked against eCFR or NIST publication databases. The `citation_integrity_check` primitive will close this gap when external verification ships.
- Span-match heuristic is fixed. The 40-char minimum / 60% threshold may need tuning per domain. Legal corpora with verbose statutory language behave differently from technical SOPs. Adaptive thresholds are a Phase 2 refinement.
- Verification doesn't catch out-of-context quotes. A quote may match a chunk verbatim but the LLM may have applied it to the wrong finding context. Catching this requires semantic verification (an adversarial pass — see next commit).
- No fuzzy-match (e.g., Levenshtein). Substring matching is exact; typos in either source or quote can defeat it. Acceptable trade-off given the cost vs. value.
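If fuzzy matching is ever added, a stdlib-only sliding-window fallback could look like the sketch below. This is not part of the current implementation — the function name and the 0.85 threshold are assumptions for illustration:

```python
from difflib import SequenceMatcher

def fuzzy_contains(quote: str, chunk: str, threshold: float = 0.85) -> bool:
    """True if some quote-length window of the chunk is similar enough
    to the quote, so a single typo no longer defeats verification."""
    q, c = quote.lower(), chunk.lower()
    if not q or len(c) < len(q):
        return False
    n = len(q)
    # Slide a quote-length window over the chunk and compare similarity.
    return any(
        SequenceMatcher(None, q, c[i:i + n]).ratio() >= threshold
        for i in range(len(c) - n + 1)
    )
```

The O(len(quote) × len(chunk)) cost is the trade-off; it would no longer be "near-free" at scale, which is part of why exact substring matching shipped first.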