Adversarial Verification

Status: ✅ Complete
Files: app/auditforge/verifier.py, app/auditforge/runner.py (integration)
Tests: tests/test_auditforge_verifier.py — 21 cases passing

Why this exists

This is the single biggest precision move in the AuditForge architecture. After Stage E produces findings, the verifier runs a second LLM pass (REASONING_HIGH, Opus 4.7) whose only job is to find reasons each finding is wrong. This is adversarial collaboration applied to audit findings: the same model family that generates findings is much better at spotting their flaws when explicitly asked to.

False positives are existential for AuditForge — one bad finding loses an account. Evidence-quote auto-verification (the previous commit) catches fabricated quotes. Adversarial verification catches overreach: cases where the quote is real but the finding's interpretation, severity, or root-cause attribution is wrong.

Together they shift the failure mode from "silent error in the deliverable" to "flagged for auditor review."

What gets reviewed

Findings whose severity is at or above the threshold (default MEDIUM). LOW-severity findings are skipped by default: adversarially reviewing every finding roughly doubles cost, while LOW findings rarely warrant the extra scrutiny. Tunable via the severity_threshold parameter.
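As a sketch of the gate (a hypothetical Severity ordering and helper, mirroring the `_severity_at_or_above` helper named in the test table; the real enum lives in app.auditforge.verifier):

```python
from enum import IntEnum

# Hypothetical stand-in for the real Severity enum; IntEnum gives us
# the ordering the threshold comparison needs.
class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4

def severity_at_or_above(finding_severity: Severity, threshold: Severity) -> bool:
    # Inclusive: a MEDIUM finding passes a MEDIUM threshold.
    return finding_severity >= threshold

findings = [("f1", Severity.LOW), ("f2", Severity.MEDIUM), ("f3", Severity.CRITICAL)]
to_review = [fid for fid, sev in findings
             if severity_at_or_above(sev, Severity.MEDIUM)]
# f1 is skipped; f2 and f3 go to the adversarial pass
```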

Each non-LOW finding gets one Opus call. The model receives:

  • The full finding (description, root_cause, severity, confidence)
  • The cited evidence chain (with verification status surfaced)
  • The proposed remediation

And produces a structured JSON verdict.

Verdict shape

from dataclasses import dataclass

@dataclass
class AdversarialVerdict:
    finding_id: str
    stands: bool                      # does the finding survive scrutiny?
    revised_confidence: float         # may be lower than original
    weakening_evidence: str           # what argues against the finding
    caveats: list[str]                # qualifications
    recommendation: str               # "keep" | "refine" | "flag_for_review" | "reject"
    raw_response: str                 # raw LLM text (for audit log)
    succeeded: bool                   # False if LLM/parse failed
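A minimal sketch of how such a verdict might be parsed defensively, returning succeeded=False on any malformed response (parse_verdict and VALID_RECOMMENDATIONS are illustrative names, and the dataclass is re-declared so the sketch runs standalone):

```python
import json
from dataclasses import dataclass

@dataclass
class AdversarialVerdict:  # re-declared for a standalone sketch
    finding_id: str
    stands: bool
    revised_confidence: float
    weakening_evidence: str
    caveats: list[str]
    recommendation: str
    raw_response: str
    succeeded: bool

VALID_RECOMMENDATIONS = {"keep", "refine", "flag_for_review", "reject"}

def parse_verdict(finding_id: str, raw: str) -> AdversarialVerdict:
    # Any parse problem yields succeeded=False, so the finding is left untouched.
    try:
        data = json.loads(raw)
        rec = data["recommendation"]
        if rec not in VALID_RECOMMENDATIONS:
            raise ValueError(f"invalid recommendation: {rec}")
        return AdversarialVerdict(
            finding_id=finding_id,
            stands=bool(data["stands"]),
            revised_confidence=float(data["revised_confidence"]),
            weakening_evidence=str(data.get("weakening_evidence", "")),
            caveats=list(data.get("caveats", [])),
            recommendation=rec,
            raw_response=raw,
            succeeded=True,
        )
    except (json.JSONDecodeError, KeyError, ValueError, TypeError):
        return AdversarialVerdict(
            finding_id=finding_id, stands=True, revised_confidence=0.0,
            weakening_evidence="", caveats=[], recommendation="keep",
            raw_response=raw, succeeded=False,
        )
```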

How verdicts apply to findings (apply_verdict)

| Verdict | Confidence | Auditor notes |
| --- | --- | --- |
| succeeded=False (LLM/parse fail) | unchanged | unchanged |
| stands=True, no caveats, recommendation="keep" | only down (never up) | unchanged |
| stands=True, with caveats, "keep" | unchanged | "[Adversarial review — keep, with caveats]" + caveats |
| stands=False, "refine"/"flag_for_review"/"reject" | revised down | "[Adversarial review — recommendation: X]" + weakening_evidence + caveats |

Two important rules:

  1. Confidence never increases. Even if the adversarial pass returns a higher confidence, we keep the original. This avoids inflating weak findings via favorable adversarial reads.
  2. Verdicts never silently drop findings. The auditor remains the final arbiter. A "reject" recommendation appends a flag; the auditor decides whether to actually reject. False rejections are also costly.
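The rules above can be sketched as follows (Finding and Verdict here are minimal stand-ins, not the real models; the real apply_verdict lives in verifier.py):

```python
from dataclasses import dataclass, field

@dataclass
class Finding:  # minimal stand-in for the real finding model
    confidence: float
    auditor_notes: str = ""

@dataclass
class Verdict:  # minimal stand-in for AdversarialVerdict
    succeeded: bool
    stands: bool
    revised_confidence: float
    recommendation: str
    weakening_evidence: str = ""
    caveats: list[str] = field(default_factory=list)

def apply_verdict(finding: Finding, v: Verdict) -> Finding:
    if not v.succeeded:
        return finding  # LLM/parse failure: leave the finding untouched
    if not v.stands:
        # Revise confidence down and flag for the auditor; never drop the finding.
        finding.confidence = min(finding.confidence, v.revised_confidence)
        note = " ".join(
            [f"[Adversarial review — recommendation: {v.recommendation}]",
             v.weakening_evidence, *v.caveats]
        ).strip()
        finding.auditor_notes = f"{finding.auditor_notes} {note}".strip()
    elif v.caveats:
        # Keep, with caveats: notes appended, confidence left unchanged.
        note = " ".join(["[Adversarial review — keep, with caveats]", *v.caveats])
        finding.auditor_notes = f"{finding.auditor_notes} {note}".strip()
    else:
        # Clean keep: confidence may only move down, never up.
        finding.confidence = min(finding.confidence, v.revised_confidence)
    return finding
```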

Integration with the runner

The runner calls verify_findings_adversarially() inside each iteration's _run_one_iteration, after Stage E completes and before findings are persisted. So persisted state always reflects verdicts.

# In _run_one_iteration:
run_result = await run_investigation(...)

if enable_adversarial_verification and run_result.findings:
    adversarial_result = await verify_findings_adversarially(
        run_result.findings, ...
    )

# Then persist
findings.extend(run_result.findings)
findings_store.replace_all(...)

This means subsequent iterations only adversarially review new findings; findings already verified in earlier rounds are never re-reviewed. That keeps adversarial cost linear in the total finding count.

Failures in the adversarial pass are non-fatal — logged but the iteration continues without verdicts.
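The non-fatal policy might look like this (maybe_verify is a hypothetical wrapper; the real runner inlines the try/except in _run_one_iteration):

```python
import asyncio
import logging

logger = logging.getLogger("auditforge.runner")

async def maybe_verify(findings, verify_fn):
    """Run the adversarial pass; on failure, log and continue without verdicts."""
    try:
        return await verify_fn(findings)
    except Exception:
        # Non-fatal: the iteration proceeds, findings just carry no verdicts.
        logger.exception("adversarial verification failed; continuing without verdicts")
        return None
```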

Runner flags

async def run_audit(
    ...,
    enable_adversarial_verification: bool = True,
    adversarial_severity_threshold: Severity = Severity.MEDIUM,
) -> AuditResult

Disable globally by passing enable_adversarial_verification=False. Loosen to verify all findings by passing Severity.LOW. Tighten to verify only critical findings by passing Severity.CRITICAL.

Cost shape

Per finding above threshold: one REASONING_HIGH (Opus) call.

  • Per call: ~3K input + ~1K output tokens at Opus pricing ≈ $0.05
  • Typical 80-finding audit, ~70 above threshold: ~70 × $0.05 = $3.50
  • Adds roughly 50% to total audit cost: from ~$7 to ~$10.50. Justified at the engagement price points we target — the precision gain is worth far more than $3.50 per audit.

Concurrency: parallel via asyncio.gather, bounded by LLMClient's semaphore. 70 calls / 20 concurrent = ~4 batches × ~10s each = ~40s added to total audit time. Not significant against a multi-minute audit.
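A generic sketch of the bounded fan-out (in AuditForge the bound comes from LLMClient's own semaphore, not a wrapper like this):

```python
import asyncio

async def bounded_gather(coros, limit: int = 20):
    # Cap the number of in-flight adversarial calls while still
    # running them concurrently; results come back in input order.
    sem = asyncio.Semaphore(limit)

    async def _run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(_run(c) for c in coros))
```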

Per-finding prompt strategy

The system prompt is unambiguously skeptical:

You are a SKEPTICAL senior auditor reviewing a finding produced by an automated audit system. Your job is to FIND REASONS THE FINDING IS WRONG or overstated. Do not confirm — challenge.

Specific examination axes the model is asked to consider:

  1. Does the description match what the cited evidence actually says? (Look for overreach.)
  2. Are there alternative interpretations of the evidence?
  3. Is the severity level appropriate, or is it overstated?
  4. Are there context gaps the finding ignores?
  5. Is the root cause supported by evidence, or speculative?
  6. If any cited quote is marked unverified, treat the finding as suspect.

The model is also explicitly told: "If you have ANY material concern, set stands=false even if the finding is broadly correct — the auditor needs to see your concerns." This biases toward false-positive auditor flags rather than false-negative misses, which is the right tradeoff when the auditor is doing the final review.
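An illustrative assembly of the adversarial prompt (build_user_prompt and AXES are hypothetical names; the exact wording in verifier.py may differ):

```python
SYSTEM_PROMPT = (
    "You are a SKEPTICAL senior auditor reviewing a finding produced by an "
    "automated audit system. Your job is to FIND REASONS THE FINDING IS WRONG "
    "or overstated. Do not confirm — challenge."
)

AXES = [
    "Does the description match what the cited evidence actually says? (Look for overreach.)",
    "Are there alternative interpretations of the evidence?",
    "Is the severity level appropriate, or is it overstated?",
    "Are there context gaps the finding ignores?",
    "Is the root cause supported by evidence, or speculative?",
    "If any cited quote is marked unverified, treat the finding as suspect.",
]

def build_user_prompt(finding_text: str) -> str:
    axes = "\n".join(f"{i}. {axis}" for i, axis in enumerate(AXES, 1))
    return (
        f"Finding under review:\n{finding_text}\n\n"
        f"Examine along these axes:\n{axes}\n\n"
        "If you have ANY material concern, set stands=false even if the "
        "finding is broadly correct — the auditor needs to see your concerns.\n"
        'Respond with JSON: {"stands": ..., "revised_confidence": ..., '
        '"weakening_evidence": ..., "caveats": [...], "recommendation": ...}'
    )
```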

Why we use Opus, not Sonnet

REASONING_HIGH only. The whole point of adversarial verification is that the second pass needs to be at least as smart as the first to reliably catch overreach. Sonnet adversarial-reviewing Sonnet findings won't surface much that Sonnet didn't already consider; Opus adversarial-reviewing Sonnet findings adds genuine reasoning depth.

Even better in future would be ensemble: Sonnet primary + Opus adversarial + Opus refereeing on disagreement. Phase 2.

Test coverage

| Area | Cases |
| --- | --- |
| _severity_at_or_above | 3 (above, below, threshold equal) |
| _render_finding_for_review | 3 (severity surfaced, unverified quote marked, remediation included) |
| apply_verdict | 7 (failed, lowers conf, never increases, doesn't-stand-flags, keep+caveats, keep+no-caveats, append-existing) |
| End-to-end with mocked LLM | 8 (skip LOW, REASONING_HIGH tier, falsification verdicts, LLM failure isolation, parse failure, invalid recommendation, empty findings, progress callback) |

All 21 cases passing. Full suite: 443 pass, no regressions.

Public API

from app.auditforge.verifier import (
    verify_findings_adversarially, AdversarialResult, Severity,
)

result = await verify_findings_adversarially(
    findings,
    llm=client,
    engagement_id="eng-X",
    severity_threshold=Severity.MEDIUM,    # skip LOW by default
    progress=progress_sink,                # optional
)
print(f"Examined: {result.findings_examined}")
print(f"Flagged for auditor: {result.findings_flagged}")
print(f"Cost: ${result.cost_cents/100:.2f}")

Known limits / future work

  • Single-model adversarial. Same family as the generator. Future: ensemble with two providers (Anthropic Opus + a non-Anthropic competitor) for genuine diversity of priors.
  • No re-retrieval. Adversarial sees only what the original finding cited. Could re-retrieve from the corpus for additional context (counter-evidence searches). Phase 2 hardening.
  • No cross-finding adversarial. Each finding reviewed in isolation. Cross-finding adversarial (e.g., "do these two findings contradict?") would catch a different class of issues. Phase 2.
  • No human-in-the-loop bias correction. When auditors consistently override the verifier in one direction, that's signal to retune. We don't yet capture that loop.
  • Doesn't help on LOW findings by default. A LOW finding that's actually a CRITICAL finding (severity misread) won't be caught. Tunable but cost-prohibitive to verify everything.