# Adversarial Verification

**Status:** ✅ Complete
**Files:** `app/auditforge/verifier.py`, `app/auditforge/runner.py` (integration)
**Tests:** `tests/test_auditforge_verifier.py`, 21 cases passing
## Why this exists

This is the single biggest precision move in the AuditForge architecture. After Stage E produces findings, the verifier runs a second LLM pass (REASONING_HIGH, Opus 4.7) whose only job is to find reasons a finding is wrong. This is adversarial collaboration applied to audit findings: the same model family that generates findings is much better at spotting their flaws when explicitly asked to look for them.
False positives are existential for AuditForge — one bad finding loses an account. Evidence-quote auto-verification (the previous commit) catches fabricated quotes. Adversarial verification catches overreach: cases where the quote is real but the finding's interpretation, severity, or root-cause attribution is wrong.
Together they shift the failure mode from "silent error in the deliverable" to "flagged for auditor review."
## What gets reviewed

Findings whose severity is at or above the threshold (default MEDIUM). LOW-severity findings are skipped by default: adversarial review costs one full Opus call per finding, and LOW findings rarely warrant that extra scrutiny. Tunable via the `severity_threshold` parameter; a sketch of the threshold check follows.
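A minimal sketch of the threshold check. Only LOW, MEDIUM, and CRITICAL appear in this doc; the HIGH member and the `IntEnum` ordering mechanism are assumptions:

```python
from enum import IntEnum

# Hypothetical stand-in for the real Severity enum; HIGH and the
# integer ordering are assumptions, not confirmed by this doc.
class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4

def _severity_at_or_above(severity: Severity, threshold: Severity) -> bool:
    # With the default MEDIUM threshold: MEDIUM/HIGH/CRITICAL are reviewed,
    # LOW is skipped.
    return severity >= threshold
```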
Each finding at or above the threshold gets one Opus call. The model receives:

- The full finding (description, root_cause, severity, confidence)
- The cited evidence chain (with verification status surfaced)
- The proposed remediation

and produces a structured JSON verdict.
## Verdict shape

```python
from dataclasses import dataclass

@dataclass
class AdversarialVerdict:
    finding_id: str
    stands: bool               # does the finding survive scrutiny?
    revised_confidence: float  # may be lower than original
    weakening_evidence: str    # what argues against the finding
    caveats: list[str]         # qualifications
    recommendation: str        # "keep" | "refine" | "flag_for_review" | "reject"
    raw_response: str          # raw LLM text (for audit log)
    succeeded: bool            # False if LLM/parse failed
```
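For illustration, the JSON-to-verdict parse might look like the sketch below. The helper name and its failure defaults are assumptions; the `succeeded=False` path matches the documented behavior (a failed parse leaves the finding untouched), and the recommendation check mirrors the "invalid recommendation" test case:

```python
import json

# Hypothetical parse helper: any LLM/parse failure is captured as
# succeeded=False instead of raising.
def _parse_verdict(finding_id: str, raw: str) -> AdversarialVerdict:
    try:
        data = json.loads(raw)
        rec = data["recommendation"]
        if rec not in {"keep", "refine", "flag_for_review", "reject"}:
            raise ValueError(f"invalid recommendation: {rec}")
        return AdversarialVerdict(
            finding_id=finding_id,
            stands=bool(data["stands"]),
            revised_confidence=float(data["revised_confidence"]),
            weakening_evidence=str(data.get("weakening_evidence", "")),
            caveats=[str(c) for c in data.get("caveats", [])],
            recommendation=rec,
            raw_response=raw,
            succeeded=True,
        )
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Placeholder values: apply_verdict ignores them when succeeded=False.
        return AdversarialVerdict(
            finding_id=finding_id, stands=True, revised_confidence=0.0,
            weakening_evidence="", caveats=[], recommendation="keep",
            raw_response=raw, succeeded=False,
        )
```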
## How verdicts apply to findings (`apply_verdict`)

| Verdict | Confidence | Auditor notes |
|---|---|---|
| `succeeded=False` (LLM/parse fail) | unchanged | unchanged |
| `stands=True`, no caveats, recommendation `"keep"` | only down (never up) | unchanged |
| `stands=True`, with caveats, `"keep"` | unchanged | "[Adversarial review — keep, with caveats]" + caveats |
| `stands=False`, `"refine"`/`"flag_for_review"`/`"reject"` | revised down | "[Adversarial review — recommendation: X]" + weakening_evidence + caveats |
Two important rules, both reflected in the sketch after this list:
- Confidence never increases. Even if the adversarial pass returns a higher confidence, we keep the original. This avoids inflating weak findings via favorable adversarial reads.
- Verdicts never silently drop findings. The auditor remains the final arbiter. A "reject" recommendation appends a flag; the auditor decides whether to actually reject. False rejections are also costly.
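A minimal sketch of `apply_verdict` implementing the table and both rules. The `Finding` fields (`confidence`, `auditor_notes`) are hypothetical stand-ins for the real model:

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    # Hypothetical minimal Finding; real field names may differ.
    confidence: float
    auditor_notes: list[str] = field(default_factory=list)

def apply_verdict(finding: Finding, verdict: AdversarialVerdict) -> None:
    if not verdict.succeeded:
        return  # LLM/parse failure: leave the finding untouched
    if verdict.stands and verdict.recommendation == "keep":
        if verdict.caveats:
            # Confidence unchanged; surface the caveats to the auditor.
            finding.auditor_notes.append(
                "[Adversarial review — keep, with caveats] " + "; ".join(verdict.caveats)
            )
        else:
            # Rule 1: confidence may only move down, never up.
            finding.confidence = min(finding.confidence, verdict.revised_confidence)
        return
    # stands=False: revise confidence down and flag. Rule 2: the finding is
    # never dropped here; the auditor decides whether to reject.
    finding.confidence = min(finding.confidence, verdict.revised_confidence)
    note = f"[Adversarial review — recommendation: {verdict.recommendation}] {verdict.weakening_evidence}"
    if verdict.caveats:
        note += " | " + "; ".join(verdict.caveats)
    finding.auditor_notes.append(note)
```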
## Integration with the runner

The runner calls `verify_findings_adversarially()` inside each iteration's `_run_one_iteration`, after Stage E completes and before findings are persisted, so persisted state always reflects verdicts.
```python
# In _run_one_iteration:
run_result = await run_investigation(...)

if enable_adversarial_verification and run_result.findings:
    adversarial_result = await verify_findings_adversarially(
        run_result.findings, ...
    )

# Then persist
findings.extend(run_result.findings)
findings_store.replace_all(...)
```
Because verification happens before persistence, subsequent iterations adversarially review only NEW findings and never re-review ones already verified in earlier rounds. That keeps adversarial cost linear in the total finding count.

Failures in the adversarial pass are non-fatal: they are logged, and the iteration continues without verdicts.
## Runner flags

```python
async def run_audit(
    ...,
    enable_adversarial_verification: bool = True,
    adversarial_severity_threshold: Severity = Severity.MEDIUM,
) -> AuditResult
```

- Disable globally by passing `enable_adversarial_verification=False`.
- Loosen to verify all findings by passing `Severity.LOW`.
- Tighten to verify only CRITICAL findings by passing `Severity.CRITICAL`.
## Cost shape

Per finding above threshold: one REASONING_HIGH (Opus) call.

- Per call: ~3K input + ~1K output tokens at Opus pricing ≈ $0.05
- Typical 80-finding audit, ~70 above threshold: ~70 × $0.05 = $3.50
- Adds roughly 50% to total audit cost: from ~$7 to ~$10.50. Justified at the engagement price points we target; the precision gain is worth far more than $3.50 per audit.
Concurrency: calls run in parallel via `asyncio.gather`, bounded by `LLMClient`'s semaphore. 70 calls at 20 concurrent is ~4 batches of ~10 s each, so roughly 40 s added to total audit time, which is not significant against a multi-minute audit.
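A sketch of that fan-out under stated assumptions: in the real code the bound lives inside `LLMClient` rather than in the verifier, and `review_one`/`max_concurrent` are hypothetical names:

```python
import asyncio

async def review_all(findings, review_one, max_concurrent: int = 20):
    # Hypothetical fan-out: one adversarial call per finding, bounded by a
    # semaphore (the production bound lives inside LLMClient).
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(finding):
        async with sem:
            return await review_one(finding)

    # return_exceptions=True keeps one failed call from sinking the batch,
    # matching the "failures are non-fatal" behavior described above.
    return await asyncio.gather(
        *(bounded(f) for f in findings), return_exceptions=True
    )
```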
## Per-finding prompt strategy

The system prompt is unambiguously skeptical:

> You are a SKEPTICAL senior auditor reviewing a finding produced by an automated audit system. Your job is to FIND REASONS THE FINDING IS WRONG or overstated. Do not confirm — challenge.

Specific examination axes the model is asked to consider:

1. Does the description match what the cited evidence actually says? (Look for overreach.)
2. Are there alternative interpretations of the evidence?
3. Is the severity level appropriate, or is it overstated?
4. Are there context gaps the finding ignores?
5. Is the root cause supported by evidence, or speculative?
6. If any cited quote is marked unverified, treat the finding as suspect.
The model is also explicitly told: "If you have ANY material concern, set stands=false even if the finding is broadly correct — the auditor needs to see your concerns." This biases toward false-positive auditor flags rather than false-negative misses, which is the right tradeoff when the auditor is doing the final review.
## Why we use Opus, not Sonnet
REASONING_HIGH only. The whole point of adversarial verification is that the second pass needs to be at least as smart as the first to reliably catch overreach. Sonnet adversarial-reviewing Sonnet findings won't surface much that Sonnet didn't already consider; Opus adversarial-reviewing Sonnet findings adds genuine reasoning depth.
An even better future setup would be an ensemble: Sonnet primary + Opus adversarial + Opus refereeing on disagreement. Phase 2.
## Test coverage

| Area | Cases |
|---|---|
| `_severity_at_or_above` | 3 (above, below, threshold equal) |
| `_render_finding_for_review` | 3 (severity surfaced, unverified quote marked, remediation included) |
| `apply_verdict` | 7 (failed, lowers conf, never increases, doesn't-stand-flags, keep+caveats, keep+no-caveats, append-existing) |
| End-to-end with mocked LLM | 8 (skip LOW, REASONING_HIGH tier, falsification verdicts, LLM failure isolation, parse failure, invalid recommendation, empty findings, progress callback) |
All 21 cases passing. Full suite: 443 pass, no regressions.
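As an illustration, the "never increases" case could be exercised like this. This is a hypothetical test reusing the `Finding` and `apply_verdict` sketches from earlier; the real test file may differ:

```python
def test_confidence_never_increases():
    # Adversarial pass returns a HIGHER confidence; the original must win.
    finding = Finding(confidence=0.6)
    verdict = AdversarialVerdict(
        finding_id="F-1", stands=True, revised_confidence=0.9,
        weakening_evidence="", caveats=[], recommendation="keep",
        raw_response="{}", succeeded=True,
    )
    apply_verdict(finding, verdict)
    assert finding.confidence == 0.6   # higher revision ignored
    assert finding.auditor_notes == []  # keep + no caveats: notes unchanged
```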
## Public API

```python
from app.auditforge.verifier import (
    verify_findings_adversarially, AdversarialResult, Severity,
)

result = await verify_findings_adversarially(
    findings,
    llm=client,
    engagement_id="eng-X",
    severity_threshold=Severity.MEDIUM,  # skip LOW by default
    progress=progress_sink,              # optional
)

print(f"Examined: {result.findings_examined}")
print(f"Flagged for auditor: {result.findings_flagged}")
print(f"Cost: ${result.cost_cents / 100:.2f}")
```
## Known limits / future work
- Single-model adversarial. Same family as the generator. Future: ensemble with two providers (Anthropic Opus + a non-Anthropic competitor) for genuine diversity of priors.
- No re-retrieval. Adversarial sees only what the original finding cited. Could re-retrieve from the corpus for additional context (counter-evidence searches). Phase 2 hardening.
- No cross-finding adversarial. Each finding reviewed in isolation. Cross-finding adversarial (e.g., "do these two findings contradict?") would catch a different class of issues. Phase 2.
- No human-in-the-loop bias correction. When auditors consistently override the verifier in one direction, that's signal to retune. We don't yet capture that loop.
- Doesn't help on LOW findings by default. A LOW finding that's actually a CRITICAL finding (severity misread) won't be caught. Tunable but cost-prohibitive to verify everything.