
Stage E.5 — Consolidate

What and why

Stages B–E are deliberately permissive: every primitive over-generates targets, every catalog over-produces, every question runs to completion, because a missed finding costs far more than an investigated false lead. As a result, the same underlying issue often surfaces from multiple primitives — the NIST SP 800-171 r2/r3 supersession, for example, can produce findings simultaneously from citation_integrity_check, currency_check, and consistency_check. The recall benefit is real (three angles all confirming the issue), but seven raw findings about one root cause make for bad UX, distort severity, and bloat the deliverable.

Stage E.5 fixes this. It runs once after all investigate iterations and does four things:

  1. Cluster raw findings by underlying root cause using an Opus call (not just evidence-overlap; same root cause may cite different documents).
  2. Merge each cluster into a single canonical Finding, preserving lineage via merged_finding_ids.
  3. Reclassify the canonical's primitive when the cluster's underlying issue fits a different primitive better than the one the source findings carry (citation_integrity_check findings about supersession get correctly retagged currency_check).
  4. Boost confidence when N independent primitive angles produced findings about the same root cause: max(per-finding confidence) + 0.03 × (N - 1), capped at 1.0. Cross-primitive corroboration is a signal, not noise.
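The merge-and-boost logic of steps 2–4 can be sketched as follows. This is a simplified illustration, not the actual model in app/auditforge/consolidation.py: only is_canonical and merged_finding_ids are named in this document, and the other field and function names are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    # Only is_canonical and merged_finding_ids mirror the real model;
    # the remaining fields are illustrative.
    id: str
    primitive: str
    severity: int
    confidence: float
    is_canonical: bool = False
    merged_finding_ids: list[str] = field(default_factory=list)

def merge_cluster(cluster: list[Finding], primitive: str) -> Finding:
    """Collapse one root-cause cluster into a single canonical Finding.

    Lineage is preserved in merged_finding_ids; confidence is
    max(per-finding confidence) + 0.03 * (N - 1), capped at 1.0.
    """
    n = len(cluster)
    return Finding(
        id="canon:" + "+".join(f.id for f in cluster),
        primitive=primitive,  # may be reclassified by the LLM (step 3)
        severity=max(f.severity for f in cluster),  # severity floor
        confidence=min(max(f.confidence for f in cluster) + 0.03 * (n - 1), 1.0),
        is_canonical=True,
        merged_finding_ids=[f.id for f in cluster],
    )
```

With three source findings at confidences 0.82, 0.70, and 0.75, the canonical lands at roughly 0.88: the per-finding max plus two 0.03 corroboration increments.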

Output

A list of canonical Finding objects (is_canonical=True). The raw findings are preserved on the engagement (audit log retains provenance) but the canonical set is what Stage F deepens, F.5 filters, and G renders.

Failure modes & fallbacks

  • Opus LLM call fails (rate limit, parse error, network) → Orphan path: each raw finding is wrapped as its own canonical (1:1, no merging, no boost); succeeded=False on the result.
  • LLM output truncated mid-array → Tolerant JSON parser closes the array at the last fully-formed element and salvages the partial output.
  • LLM proposes an invalid primitive → Filtered against the _VALID_PRIMITIVES set; falls back to the source finding's primitive.
  • LLM proposes a severity below the source max → Severity floor enforced: the canonical never drops below the highest source severity.
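The three output-sanitizing fallbacks can be sketched as below. This is a hedged sketch: the _VALID_PRIMITIVES contents are an illustrative subset, the function names are hypothetical, and the real tolerant parser in app/auditforge/consolidation.py is presumably more robust than this simplified version (which assumes array elements are non-nested objects).

```python
import json

# Illustrative subset; the real set lives in app/auditforge/consolidation.py.
_VALID_PRIMITIVES = {"citation_integrity_check", "currency_check", "consistency_check"}

def salvage_json_array(text: str) -> list:
    """Tolerant parse: if the array was truncated mid-element, close it
    after the last fully-formed element (simplified: non-nested objects)."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        end = text.rfind("}")  # last completed object
        if end != -1:
            try:
                return json.loads(text[: end + 1] + "]")
            except json.JSONDecodeError:
                pass
        return []  # nothing salvageable

def validated_primitive(proposed: str, source_primitive: str) -> str:
    """Reject unknown primitives; fall back to the source finding's."""
    return proposed if proposed in _VALID_PRIMITIVES else source_primitive

def floored_severity(proposed: int, source_severities: list[int]) -> int:
    """The canonical never drops below the highest source severity."""
    return max([proposed, *source_severities])
```

Each guard degrades gracefully on its own, so a single malformed field in the LLM output never forces the whole consolidation onto the orphan path.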

Cost

1 Opus call per audit. max_tokens was bumped from 4000 to 12000 so that canonical outputs for large finding sets fit without truncation.

Code

  • app/auditforge/consolidation.py — main implementation
  • tests/test_auditforge_consolidation.py — 14 tests covering happy path, orphan fallback, severity floor, evidence-union, primitive validation, confidence boost math

Architectural pivot story

Originally the pipeline was "discriminate at catalog time" — make the catalog stage extra-precise so each primitive only produces targets that genuinely fit it. Live testing revealed that this was the wrong tradeoff: the cost of false negatives (missing real findings) was far higher than the cost of duplicates. The user explicitly authorized aggressive over-querying ("we can over query the hell out of the thing — if it costs $100 to run a real audit, that's peanuts compared to what we'd be able to charge for it").

The pivot: permissive funnel + aggressive downstream filter. Stage E.5 is the consolidation half; Stage F.5 is the filter half. Cross-primitive agreement on the same root cause is treated as the strongest possible corroboration signal.

See also: 15-stage-f5-filter.md.