
AuditForge Methodology

A defensible, evidence-anchored methodology for LLM-driven deep audit of contract, policy, and compliance corpora.


Audience and purpose

This document is written for the procurement, compliance, and risk-review teams at the end client of an audit/advisory engagement. When a partner firm delivers an audit produced with AuditForge, this paper answers the question their client's reviewer asks: how did this tool actually arrive at its findings, and on what basis can we rely on them?

It is also the methodology reference for the auditing partner themselves — the answers they need on hand when their client asks the same question by phone.


What AuditForge does

AuditForge ingests a corpus of contract, policy, SOP, attestation, and procedural documents and produces a structured audit deliverable: a set of evidence-anchored findings, each tied to verbatim source quotes, with severity ratings, root-cause framing, and remediation scopes.

The system is corpus-agnostic. It is not preloaded with rules for any specific framework — NIST SP 800-171, SOC 2, HIPAA, FedRAMP, CMMC, FAR/DFARS, ABA Model Rules, ISO 27001 — and it does not attempt to encode auditor expertise. Domain knowledge enters the audit through three deliberate channels:

  1. The corpus itself. The documents being audited carry the regulatory frameworks they reference, the obligations they impose, and the language they use. The system extracts and reasons over this content.
  2. The engagement intake. Before the run starts, the auditing partner specifies the engagement domain, the frameworks in scope, focus areas, materiality, known concerns, and document hierarchy. This intake constrains and steers the run.
  3. The auditor's review. Every finding produced is reviewed and rated by a senior reviewer at the audit firm before the deliverable is finalized. AuditForge produces candidate findings; the auditor accepts, refines, or rejects them.

The system is explicitly not an autonomous decision-maker. It is a structured first-pass review tool that surfaces issues a partner-led review then validates. The deliverable that reaches the end client is the partner's deliverable, with their firm's name and reputation attached.


Pipeline overview

The audit follows a seven-stage pipeline. Stages run in sequence; within stages, work is parallelized across thousands of LLM calls with budget governance and rate limiting.

A. Profile  →  B. Catalog  →  C. Synthesize  →  D. Validate  →  E. Investigate
            →  E.5 Consolidate  →  F. Deepen  →  (loop)  →  F.5 Filter  →  G. Report

Stage F generates follow-up targets for the next iteration of the deepening loop; Stage E.5 runs between iterations. Stages F.5 and G run once the loop completes.

Each stage is described in detail below.


Stage A — Profile

The profiler builds a structural model of the corpus before any audit reasoning begins. It computes:

  • Metadata distribution. Document types (contract, subcontract, policy, SOP, memo, attestation, procedure, training record), jurisdictions (when present), authoring dates, and revision indicators.
  • Cluster topology. Documents are embedded and clustered. Clusters represent thematic regions of the corpus. The cluster shape informs which retrievals will be cheap (intra-cluster) and which require cross-cluster reasoning.
  • Citation graph. Every clause-level reference (FAR clause numbers, NIST SP/IR designations, regulatory citations, internal §-references) is extracted and normalized at ingest time. The graph identifies which documents cite which, which authorities are referenced, and which references appear without corresponding cited content.
  • Stratified sample. A small representative slice of the corpus, drawn from each cluster, used as concrete-detail context for downstream reasoning.

The profile is written once per engagement and reused across iterations. Its purpose is to give every downstream stage a corpus-aware foundation: the catalog stage knows what kinds of documents exist; the synthesizer knows which clusters to draw retrievals from; the report stage can reference document distributions in its metadata appendix.
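As a concrete illustration, here is a minimal sketch of the cluster-topology and stratified-sample steps, assuming sentence-transformers and k-means as stand-ins for the embedder and clusterer (the production choices are not specified in this document):

    # Illustrative sketch only: embedding model and clustering algorithm are
    # assumed stand-ins, not the production configuration.
    from collections import defaultdict

    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    def profile_corpus(doc_texts: list[str], n_clusters: int = 8, sample_per_cluster: int = 3):
        model = SentenceTransformer("all-MiniLM-L6-v2")      # assumed embedder
        embeddings = model.encode(doc_texts)                  # one vector per document
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)

        clusters = defaultdict(list)
        for doc_idx, label in enumerate(labels):
            clusters[int(label)].append(doc_idx)

        # Stratified sample: a few documents from every cluster, so downstream
        # prompts see concrete detail from each thematic region of the corpus.
        sample = {c: members[:sample_per_cluster] for c, members in clusters.items()}
        return {"cluster_members": dict(clusters), "stratified_sample": sample}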


Stage B — Catalog

The catalog stage produces ranked target lists per audit primitive. A primitive is an audit reasoning template — a structured way to interrogate a corpus. AuditForge currently implements ten primitives, described in the next section. For each primitive, the catalog stage uses an LLM call (Sonnet-class reasoning) over the profile output, the engagement intake, and any prior findings to generate ranked targets:

  • For conflict_check: a ranked concept inventory — concepts on which the corpus might contain contradictions.
  • For coverage_check: a checklist of required-element categories the corpus should contain given the frameworks in scope.
  • For currency_check: rules describing which references in the corpus are subject to supersession, with stale-after dates where applicable.
  • For flow_down_check: doc-type pairs where parent obligations should be reflected in child documents.
  • For consistency_check: defined terms that may be used differently across documents.
  • For citation_integrity_check: regex-extracted citation tuples for verification (FAR/DFARS/NIST/CFR/USC patterns).
  • For temporal_check: temporal relations the corpus expresses (X must precede Y).
  • For quantitative_check: quantitative facts the corpus asserts (deadlines, counts, percentages, dollar thresholds).
  • For obligation_check: obligation checkpoints with asymmetry hints (e.g., one party has audit rights but the other does not).
  • For ambiguity_check: ambiguity checkpoints (vague language likely to cause downstream disputes).

The catalog deliberately over-generates targets. Downstream stages filter; the cost of a missed target is far worse than the cost of an investigated and discarded target.
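A minimal sketch of what one catalog call could look like for a single primitive; the call_llm helper, the prompt wording, and the JSON output contract are illustrative assumptions:

    # Illustrative sketch: the real catalog stage runs one such call per
    # primitive over the profile output, intake, and prior findings.
    import json

    def catalog_targets(primitive: str, profile_summary: str, intake: dict, call_llm) -> list[dict]:
        prompt = (
            f"You are cataloging audit targets for the '{primitive}' primitive.\n"
            f"Engagement frameworks in scope: {intake.get('frameworks', [])}\n"
            f"Corpus profile:\n{profile_summary}\n\n"
            "Return a JSON list of candidate targets, each with 'target', "
            "'rationale', and 'priority' (0-1). Over-generate: include targets "
            "that may turn out not to exist in the corpus."
        )
        raw = call_llm(prompt)                      # hypothetical LLM wrapper
        targets = json.loads(raw)
        # Rank by the model's own priority estimate; downstream stages filter.
        return sorted(targets, key=lambda t: t.get("priority", 0.0), reverse=True)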


Stage C — Synthesize

The synthesizer turns each (primitive, target) pair into a concrete, scoped, auditable question. Each question carries:

  • A primitive identifying the reasoning template that produced it.
  • A scope — which document types, sections, frameworks, and date ranges to retrieve from.
  • A prompt_template — the LLM prompt to apply.
  • An expected_evidence_shape — what counts as a valid finding from this question.
  • A severity_weight and archetype_weight — how to prioritize this question relative to others.

Questions are sorted by combined priority. The Validate stage will further filter; the Investigate stage processes them in priority order so that even if budget runs out, the highest-priority work is done.
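A minimal sketch of the question record and its ordering; the field names mirror the list above, while the combined-priority formula (severity_weight multiplied by archetype_weight) is an illustrative assumption:

    from dataclasses import dataclass

    @dataclass
    class AuditQuestion:
        primitive: str                    # reasoning template that produced it
        question: str                     # the concrete, scoped question text
        scope: dict                       # doc types, sections, frameworks, date ranges
        prompt_template: str              # LLM prompt applied at investigate time
        expected_evidence_shape: str      # what counts as a valid finding
        severity_weight: float = 1.0
        archetype_weight: float = 1.0

        @property
        def combined_priority(self) -> float:
            # Assumed combination rule for illustration.
            return self.severity_weight * self.archetype_weight

    def order_questions(questions: list[AuditQuestion]) -> list[AuditQuestion]:
        # Highest priority first, so budget exhaustion trims the tail, not the head.
        return sorted(questions, key=lambda q: q.combined_priority, reverse=True)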


Stage D — Validate

Every synthesized question is validated before execution to avoid spending compute on questions the corpus cannot answer. Validation runs:

  • Cheap relevance check. A small retrieval pass against the corpus to verify the question has any plausible source content. Questions with no relevant chunks are dropped.
  • Near-duplicate dedupe. Questions that overlap in content with already-validated questions are dropped — the corpus only needs to be asked once.
  • Adaptive thresholds per archetype. Premium / Defensibility runs use tighter thresholds than Continuous Monitoring; the partner can override at engagement creation.

The validator's filter rate is reported with the deliverable. A high drop rate in a healthy run is normal: many catalog targets describe ideal-case content that the corpus does not actually contain, and the validator correctly screens those questions out rather than spending investigation budget where there is nothing to retrieve.
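A minimal sketch of the validation pass, assuming a cosine-similarity dedupe and illustrative threshold values (the production thresholds are archetype-specific and partner-overridable):

    import numpy as np

    ARCHETYPE_THRESHOLDS = {            # assumed values, configurable per engagement
        "premium_defensibility": {"relevance": 0.45, "duplicate": 0.90},
        "continuous_monitoring": {"relevance": 0.30, "duplicate": 0.95},
    }

    def validate_questions(questions, retriever, embed, archetype="premium_defensibility"):
        th = ARCHETYPE_THRESHOLDS[archetype]
        kept, kept_vecs, dropped = [], [], []
        for q in questions:
            chunks = retriever(q.question, top_k=5)           # cheap relevance retrieval
            if not chunks or max(c["score"] for c in chunks) < th["relevance"]:
                dropped.append((q, "no plausible source content"))
                continue
            vec = embed(q.question)
            is_duplicate = any(
                float(np.dot(vec, v) / (np.linalg.norm(vec) * np.linalg.norm(v))) > th["duplicate"]
                for v in kept_vecs
            )
            if is_duplicate:
                dropped.append((q, "near-duplicate of an already-validated question"))
                continue
            kept.append(q)
            kept_vecs.append(vec)
        return kept, dropped            # the drop log is reported with the deliverable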


Stage E — Investigate

Each validated question is executed asynchronously. Per question, the investigate stage:

  1. Retrieves relevant corpus chunks via the system's hybrid FAISS + BM25 retriever, with archetype-specific re-ranking.
  2. Reasons over the retrievals using the primitive's prompt template.
  3. Anchors the resulting finding to verbatim quotes from the retrieved chunks. Every claim must be traceable to one or more verbatim spans of source text.
  4. Optionally performs a follow-up retrieval round when the first pass surfaces a hint that a specific other document type would clarify or refute the finding.

Findings produced at this stage are raw. Each is a single primitive's view of a specific question's evidence. The same underlying issue may surface from multiple primitives and produce multiple raw findings. This redundancy is intentional and is resolved at the consolidation stage.

After a finding is produced, an adversarial verification pass runs: a separate LLM call (Opus-class reasoning) acts as a skeptical reviewer, examining the evidence and the claim. The verifier can refine the finding (adjust severity, sharpen the description), flag it for partner review, or recommend that the partner reconsider it. The verifier never silently rejects findings — every original finding survives to the partner's review queue, with the verifier's commentary attached.
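A minimal sketch of the hybrid retrieval step, using faiss and rank_bm25 to mirror the FAISS + BM25 description above; the score-fusion weighting is an illustrative assumption and archetype-specific re-ranking is omitted:

    import numpy as np
    import faiss
    from rank_bm25 import BM25Okapi

    def hybrid_retrieve(query: str, chunks: list[str], embed, top_k: int = 8, alpha: float = 0.5):
        # Dense index over normalized chunk embeddings (inner product = cosine).
        vecs = np.asarray([embed(c) for c in chunks], dtype="float32")
        faiss.normalize_L2(vecs)
        index = faiss.IndexFlatIP(vecs.shape[1])
        index.add(vecs)

        q = np.asarray([embed(query)], dtype="float32")
        faiss.normalize_L2(q)
        dense_scores, dense_idx = index.search(q, len(chunks))
        dense = np.zeros(len(chunks), dtype="float32")
        dense[dense_idx[0]] = dense_scores[0]                 # align scores to chunk order

        # Lexical scores from BM25 over whitespace-tokenized chunks.
        bm25 = BM25Okapi([c.split() for c in chunks])
        lexical = np.asarray(bm25.get_scores(query.split()))

        # Min-max normalize each signal, then blend (alpha is an assumed weight).
        def norm(x):
            return (x - x.min()) / (x.max() - x.min() + 1e-9)
        blended = alpha * norm(dense) + (1 - alpha) * norm(lexical)
        return [chunks[i] for i in np.argsort(blended)[::-1][:top_k]]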


Stage E.5 — Consolidate

Run between iterations of the deepening loop, the consolidator clusters raw findings by underlying root cause. Different primitives often produce different views of the same real-world issue (a NIST revision-version mismatch can surface from citation_integrity_check, currency_check, and consistency_check simultaneously). The consolidator:

  1. Asks an Opus-class LLM to cluster raw findings by shared root cause, ignoring superficial framing differences.
  2. Merges each cluster into a single canonical finding, preserving lineage by recording the merged-finding IDs.
  3. Reclassifies the canonical finding's primitive when the underlying root cause is better described by a different primitive than the one that produced the source findings.
  4. Applies a corroboration confidence boost: when N independent primitives produced findings about the same root cause, the canonical's confidence is boosted as max(individual) + 0.03 × (N - 1), capped at 1.0. Cross-primitive agreement is treated as a positive signal, not noise.

The raw findings are preserved on the engagement record for audit-trail purposes. The deliverable renders only canonical findings.
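The corroboration boost from step 4, expressed as a small worked example (the individual confidence values are illustrative):

    def corroboration_boost(confidences: list[float]) -> float:
        # max(individual) + 0.03 * (N - 1), capped at 1.0
        n = len(confidences)
        return min(1.0, max(confidences) + 0.03 * (n - 1))

    # e.g. three primitives independently surface the same root cause with
    # confidences 0.82, 0.78, 0.75: max = 0.82, boost = 0.03 * 2 = 0.06,
    # canonical confidence = 0.88.
    assert abs(corroboration_boost([0.82, 0.78, 0.75]) - 0.88) < 1e-9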


Stage F — Deepen

After consolidation, the deepening stage:

  1. Identifies systemic patterns across canonical findings — multiple findings that share a single structural root cause (e.g., "no contract-hierarchy reconciliation step exists, so subcontracts and policies drift independently from the master contract"). These patterns are written as cross-cutting narrative paragraphs in the deliverable.
  2. Generates follow-up targets for the next iteration when the current iteration's findings suggest under-explored areas.

The deepening loop runs for up to a configurable maximum iteration count, with a per-iteration budget cap that automatically downshifts to less expensive model tiers as the engagement budget approaches its limit.
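A minimal sketch of the budget governance around this loop; the tier names, downshift thresholds, and the run_iteration callable are illustrative assumptions:

    MODEL_TIERS = ["opus-class", "sonnet-class", "haiku-class"]   # most to least expensive

    def choose_tier(spent: float, budget: float) -> str:
        # Downshift to cheaper tiers as the remaining budget shrinks
        # (threshold fractions are assumed for illustration).
        remaining_fraction = max(0.0, (budget - spent) / budget)
        if remaining_fraction > 0.5:
            return MODEL_TIERS[0]
        if remaining_fraction > 0.2:
            return MODEL_TIERS[1]
        return MODEL_TIERS[2]

    def deepening_loop(run_iteration, budget: float, max_iterations: int = 3) -> float:
        spent = 0.0
        for i in range(max_iterations):
            tier = choose_tier(spent, budget)
            cost, follow_up_targets = run_iteration(iteration=i, model_tier=tier)
            spent += cost
            if spent >= budget or not follow_up_targets:
                break          # stop cleanly rather than aborting mid-run
        return spent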


Stage F.5 — Filter

Before the deliverable is rendered, every canonical finding passes through a two-pass filter:

  • Pass 1 (LLM classification). An Opus-class call rates each finding as definitive, likely, speculative, or rejected. The classification considers the strength of the evidence, the cleanness of the inference from evidence to claim, and the absence of counter-evidence in the retrieved context. The reject criterion is restrictive — uncertain findings default to speculative, never rejected.
  • Pass 2 (override ruleset, upgrade-only). A pure-code ruleset can upgrade a Pass-1 reject decision when the finding meets specific corroboration thresholds: cross-primitive agreement score ≥ 0.7, ≥ 3 verbatim quotes from ≥ 3 distinct documents, or a regulatory-pattern match (FAR / DFARS / NIST / CFR / HIPAA / ABA / SOC 2 / GDPR / SOX). The ruleset cannot downgrade — only protect against false negatives.

Findings rated rejected after both passes are placed in a "Considered but rejected" appendix to the deliverable, with the filter rationale shown. The partner reviewer can override any classification.
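A minimal sketch of the Pass-2 upgrade-only ruleset, using the thresholds stated above; the finding fields, the simplified regulatory pattern, and the choice to upgrade a protected reject to speculative are illustrative assumptions:

    import re

    REGULATORY_PATTERN = re.compile(
        r"\b(FAR|DFARS|NIST|CFR|HIPAA|ABA|SOC\s*2|GDPR|SOX)\b", re.IGNORECASE
    )

    def apply_override(finding: dict, pass1_label: str) -> str:
        if pass1_label != "rejected":
            return pass1_label                  # upgrade-only: never downgrade
        corroborated = finding.get("cross_primitive_agreement", 0.0) >= 0.7
        well_quoted = (
            len(finding.get("quotes", [])) >= 3
            and len({q["document_id"] for q in finding.get("quotes", [])}) >= 3
        )
        regulatory = bool(REGULATORY_PATTERN.search(finding.get("description", "")))
        if corroborated or well_quoted or regulatory:
            return "speculative"                # protected against a false-negative reject
        return "rejected"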


Stage G — Report

The report stage assembles the deliverable:

  • Cover page with the auditing firm's name, logo, tagline, confidentiality notice, and engagement metadata.
  • Executive summary — an LLM-generated synthesis of the audit's most material findings, the structural failures behind them, and the priority remediation actions.
  • At-a-glance severity counts.
  • Systemic patterns — the cross-cutting narrative paragraphs from Stage F.
  • Remediation roadmap — every finding's remediation framing aggregated into a roadmap table with hour estimates, ordered by severity, ready to scope a follow-up engagement.
  • Detailed findings — each finding's full description, root cause, evidence chain (verbatim quotes anchored to source documents), and remediation framing.
  • Considered-but-rejected appendix — findings that did not meet the filter's bar, with rationales.
  • Methodology — this document, plus any firm-specific methodology disclaimer.
  • Engagement metadata — engagement ID, firm, client, archetype, status, compute spend.

The deliverable is rendered in three formats: structured JSON (for downstream tooling), Markdown (for inline review), and DOCX (for editing in Word and delivery to the end client).
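A minimal sketch of the three-format rendering step, assuming python-docx for the DOCX output and a simplified deliverable shape:

    import json
    from pathlib import Path
    from docx import Document

    def render_deliverable(deliverable: dict, out_dir: str) -> None:
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)

        # Structured JSON for downstream tooling.
        (out / "deliverable.json").write_text(json.dumps(deliverable, indent=2))

        # Markdown for inline review.
        md_lines = [f"# {deliverable['title']}", "", deliverable["executive_summary"], ""]
        for f in deliverable["findings"]:
            md_lines += [f"## {f['title']} ({f['severity']})", f["description"], ""]
        (out / "deliverable.md").write_text("\n".join(md_lines))

        # DOCX for editing in Word and delivery to the end client.
        docx = Document()
        docx.add_heading(deliverable["title"], level=0)
        docx.add_paragraph(deliverable["executive_summary"])
        for f in deliverable["findings"]:
            docx.add_heading(f"{f['title']} ({f['severity']})", level=1)
            docx.add_paragraph(f["description"])
        docx.save(str(out / "deliverable.docx"))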


The ten primitives

  • conflict_check. What it looks for: two or more documents that take contradictory positions on the same concept (definitions, requirements, authorities). Evidence shape: quote A from doc 1, quote B from doc 2, narrative of the contradiction.
  • consistency_check. What it looks for: a defined term used with different meanings or scopes across documents. Evidence shape: quotes showing each definition variant; explanation of the inconsistency.
  • coverage_check. What it looks for: a required document, policy, or control category absent from the corpus given the frameworks in scope. Evidence shape: negative-evidence check (no relevant retrievals on the expected element); intake-frameworks list as basis.
  • currency_check. What it looks for: a reference to a superseded standard, stale clause language, or out-of-date authority that has been formally updated. Evidence shape: quote showing the stale reference; identification of the superseding version.
  • flow_down_check. What it looks for: an obligation in a parent contract that is required to be incorporated into subcontracts or implementing policies but is not. Evidence shape: parent obligation quote; child document content (or absence) that should reflect it.
  • citation_integrity_check. What it looks for: a document that cites a clause, regulation, or standard which does not say what the citing document claims. Evidence shape: cited authority quoted; citing document's claim quoted; discrepancy explained.
  • temporal_check. What it looks for: a required temporal precedence (X must complete before Y) that is violated or unconstrained in the corpus. Evidence shape: quotes establishing the temporal relation; evidence of violation or absent constraint.
  • quantitative_check. What it looks for: a quantitative fact (deadline, count, dollar threshold, percentage) that is internally inconsistent or violates an external standard. Evidence shape: quotes establishing the asserted quantity; explanation of the inconsistency.
  • obligation_check. What it looks for: an obligation that is one-sided when bilateral coverage is expected, or a missing checkpoint. Evidence shape: quotes showing the asymmetry; explanation of the gap.
  • ambiguity_check. What it looks for: language ambiguous enough to cause downstream dispute (mid-clause undefined-term substitutions, conditional triggers without defined conditions, vague "as appropriate" / "reasonable" qualifiers in obligation contexts). Evidence shape: ambiguous quote; explanation of how the ambiguity could be exploited or disputed.

New primitives can be added as composable building blocks. Existing primitives can be selectively disabled per engagement archetype.
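A minimal sketch of what a composable primitive interface could look like: a registry keyed by primitive name, with per-archetype disabling. The Protocol shape is an illustrative assumption.

    from typing import Protocol

    class AuditPrimitive(Protocol):
        name: str
        def catalog_prompt(self, profile_summary: str, intake: dict) -> str: ...
        def investigate_prompt(self, question: str, chunks: list[str]) -> str: ...

    PRIMITIVE_REGISTRY: dict[str, AuditPrimitive] = {}

    def register(primitive: AuditPrimitive) -> None:
        # Adding a new primitive means registering one more building block.
        PRIMITIVE_REGISTRY[primitive.name] = primitive

    def active_primitives(disabled_for_archetype: set[str]) -> list[AuditPrimitive]:
        # Archetypes can selectively disable existing primitives.
        return [p for name, p in PRIMITIVE_REGISTRY.items() if name not in disabled_for_archetype]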


Engagement archetypes

Four archetypes select different tunings of the same engine. The archetype is chosen at engagement creation:

  • Capability + Leverage. Intake emphasis: broad scope, junior-staff readability. Catalog priority weighting: balanced across primitives. Finding framing: comprehensive, with every primitive's findings represented. Validator strictness: standard.
  • Remediation Pipeline. Intake emphasis: client's known concerns, materiality. Catalog priority weighting: higher weight to coverage / flow-down / obligation primitives. Finding framing: each finding sized as a discrete remediation scope of work. Validator strictness: standard.
  • Premium / Defensibility. Intake emphasis: methodology transparency, regulatory framework, evidence-quality requirements. Catalog priority weighting: higher weight to citation-integrity / consistency / conflict primitives. Finding framing: evidence-rich, defensibility-anchored, every claim with three or more independent corroborations where possible. Validator strictness: tighter.
  • Continuous Monitoring. Intake emphasis: re-audit cadence, delta thresholds. Catalog priority weighting: higher weight to currency / temporal primitives. Finding framing: period-over-period delta findings against a prior audit baseline. Validator strictness: looser (favors recall over precision).

Archetypes are configurable, not hardcoded. A partner may further override individual settings at engagement creation.
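A minimal sketch of archetype selection with partner overrides at engagement creation; the keys and default values shown are illustrative, not the shipped configuration:

    ARCHETYPE_DEFAULTS = {
        "premium_defensibility": {
            "catalog_weights": {"citation_integrity_check": 1.5, "consistency_check": 1.3,
                                "conflict_check": 1.3},
            "validator_strictness": "tighter",
            "adversarial_verification": True,
        },
        "continuous_monitoring": {
            "catalog_weights": {"currency_check": 1.5, "temporal_check": 1.4},
            "validator_strictness": "looser",
            "adversarial_verification": False,
        },
    }

    def engagement_config(archetype: str, overrides: dict | None = None) -> dict:
        config = dict(ARCHETYPE_DEFAULTS[archetype])
        config.update(overrides or {})      # partner overrides at engagement creation
        return config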


Defensibility

Every finding the deliverable contains rests on three anchors:

  1. Verbatim source quotes. Every claim cites verbatim text from the corpus. Quotes are not paraphrased and are stored at ingest time in chunked form, so the source string at audit time is byte-for-byte identical to the original document.
  2. Source location anchors. Every quote carries the document title, section header, page number (when available), and the chunk identifier in the audit log. Reviewers can trace any quote to its location in the source PDF in seconds.
  3. Reasoning trails. Every finding records the parent question it came from, the primitive that produced it, the source raw findings (when consolidation merged them), and the auditor's accept/reject/refine decision. The audit log records every LLM call: model, prompt, response, token counts, cost, and timestamp. The full trail is preserved per engagement and can be exported on request.

When a client's reviewer disputes a finding, the auditor can produce the reasoning trail end-to-end. When a client's reviewer asks why a particular issue was not surfaced, the auditor can show the validator drop log and confirm whether the relevant question was asked. Nothing about the audit is opaque.
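A minimal sketch of the verbatim-quote check behind anchor 1; the chunk-store interface is an illustrative assumption:

    def verify_quote(quote: str, chunk_id: str, chunk_store: dict[str, str]) -> bool:
        # A quote is accepted only if it appears verbatim in the chunk it is
        # anchored to; paraphrases fail the check.
        return quote in chunk_store.get(chunk_id, "")

    def failed_quotes(finding: dict, chunk_store: dict[str, str]) -> list[dict]:
        # Returns the quotes that failed verification so the partner reviewer
        # can confirm them against the source PDF before accepting the finding.
        return [q for q in finding["quotes"]
                if not verify_quote(q["text"], q["chunk_id"], chunk_store)]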


What AuditForge does not do

  • It does not replace the senior partner. Every finding is reviewed by a partner-level auditor before the deliverable goes to the client. The partner accepts, refines, or rejects; AuditForge surfaces candidates.
  • It does not give legal advice. Findings describe what the corpus says and where its language has internal inconsistencies, gaps relative to its referenced frameworks, or stale references. Legal interpretation of consequences is the partner's responsibility.
  • It does not certify compliance. A clean run does not mean an organization is compliant with a framework. It means the documents in the corpus, as supplied, did not contain the kinds of textual issues this audit was scoped to find.
  • It does not handle classified material. All processing happens in the auditing firm's tenant within commercial cloud infrastructure. Engagements requiring SCIF-level handling are out of scope.
  • It does not replace primary research. When a finding turns on a fact not in the corpus (e.g., what NIST SP 800-171 Revision 3 actually contains), the partner verifies that fact against the authoritative source. AuditForge identifies the question; the partner confirms the answer.

Limitations of AI reasoning, transparently stated

LLM-based reasoning can fail in ways traditional rule engines cannot. Here is where this audit could plausibly be wrong:

  • Hallucinated quotes. Modern frontier models occasionally fabricate quotes. AuditForge mitigates this with quote_verified checks (every produced quote is matched against the retrieved chunks), and the partner reviewer is asked to confirm any high-severity finding's quote against the source PDF before accepting.
  • Context-window cliff. When the corpus exceeds the model's effective context, retrieved chunks are sampled and ranked, not exhaustively read. A relevant clause that doesn't surface in the top-k retrieval may be missed. The validator stage's relevance threshold and the deepening stage's follow-up targets reduce, but do not eliminate, this risk.
  • Domain-knowledge gaps. The model's knowledge of niche regulatory frameworks may be incomplete or out of date relative to the model's training cutoff. The intake stage and the partner's review compensate for this; novel or rapidly-evolving frameworks (e.g., a regulation changed within the last six months) warrant additional partner scrutiny.
  • Catalog blind spots. A primitive that does not exist cannot generate findings. The current ten primitives are believed to cover the major categories of textual audit finding, but a corpus may contain issues that fit none of them. The partner's review is the safety net.

The deliverable's executive summary explicitly notes when the audit's scope or budget caused specific kinds of findings to be limited.


Cost and scaling

A typical engagement against a corpus of 100–500 documents runs in 30–60 minutes of wall-clock time and consumes between $5 and $50 in compute, with the variance driven by:

  • Corpus size (linear in document count for retrieval; sub-linear for catalog and synthesis).
  • Iteration count (default 3; deeper engagements can extend to 5).
  • Adversarial verification (roughly doubles per-question cost; default on for Premium / Defensibility, optional otherwise).
  • Archetype's catalog priority weighting (some archetypes pull more LLM-driven targets than others).

Hard budget caps prevent runaway spend. The system gracefully downshifts model tiers as the budget approaches its cap rather than aborting mid-run.


Provenance & retention

The pipeline runs against the auditing firm's own AWS tenancy. No data leaves the firm's infrastructure for any purpose other than the LLM calls themselves. LLM API endpoints used (Anthropic, OpenAI) are documented in the engagement audit log. Per-engagement S3 buckets isolate every client's documents, indexes, findings, and audit logs from every other client's. Retention is configurable per engagement and defaults to the auditing firm's standard data-handling policy.


This methodology is part of the published audit deliverable. It is intended to be read by the end client's reviewer and shared with their compliance and procurement teams. Updates to the methodology are versioned; the version applied to any given audit is recorded in the engagement metadata at the bottom of the deliverable.