Stage F — Deepen¶
Status: ✅ Complete
File: app/auditforge/findings.py (Stage F + Finding data model)
Tests: tests/test_auditforge_findings.py — 16 cases passing
Purpose¶
Stage F is what turns N tactical findings into M strategic insights. After Stage E produces a list of findings, Stage F:
- Clusters findings by shared evidence chunks (deterministic, free)
- Refines clusters by root_cause embedding similarity (catches cases where two findings share semantic root cause but no overlapping chunks)
- Rolls up severity within clusters; escalates by one tier when ≥3 findings share a root cause
- Detects cross-finding patterns via LLM (REASONING_HIGH) — the one place we want maximum reasoning quality, because patterns are what get top billing in the deliverable
- Generates follow-up targets for the next iteration round (REASONING_MID)
The deepening loop is what makes AuditForge produce defensible audits at scale: one round finds tactical issues; subsequent rounds investigate adjacent areas suggested by what was already found. The loop stops at the iteration cap or the budget cap.
Output: DeepenResult¶
```python
@dataclass
class DeepenResult:
    clusters: list[FindingCluster]            # findings grouped by relation
    patterns: list[str]                       # cross-finding pattern descriptions
    follow_up_targets: list[FollowUpTarget]   # next-round catalog targets

@dataclass
class FindingCluster:
    cluster_id: str                           # cl-{8 hex chars}
    finding_ids: list[str]
    shared_chunk_ids: list[str]               # evidence overlap
    rolled_up_severity: Severity              # max of constituent severities, escalated at 3+
    pattern_description: str | None           # filled by LLM if a pattern matches
    pattern_remediation_focus: str | None

@dataclass
class FollowUpTarget:
    primitive: str                            # one of the six primitives
    description: str                          # natural-language target description
    parent_finding_ids: list[str]             # which findings spawned this target
    priority_hint: float                      # 0-1; informs Stage B priority
```
Pipeline¶
```
findings (from Stage E)
  │
  ▼
1. cluster_by_evidence (pure; sketched below)
   - Greedy grouping by shared chunk_ids ≥ min_shared_chunks (default 1)
   - Severity rollup: max(severities) within cluster
   - Escalate by one tier if cluster has ≥3 findings (systemic signal)
  │
  ▼
2. refine_clusters_by_similarity (pure + embeddings)
   - Embed root_cause text (fallback to description) per cluster representative
   - Pairwise cosine similarity ≥ similarity_threshold (default 0.85) → merge
   - Catches semantic root-cause alignment that didn't share chunks
  │
  ▼
3. annotate_related_findings
   - Populate Finding.related_finding_ids in-place from cluster membership
  │
  ▼
4. _detect_patterns (LLM, REASONING_HIGH)          ┐
                                                   │ asyncio.gather
4. _generate_followup_targets (LLM, REASONING_MID) ┘
  │
  ▼
5. Attach pattern descriptions to clusters by finding_id overlap
  │
  ▼
DeepenResult
```
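Step 1's greedy grouping is small enough to sketch here. Attribute names such as `Finding.chunk_ids` and the use of `uuid` for cluster IDs are assumptions for illustration; the real implementation lives in findings.py:

```python
import uuid

def cluster_by_evidence_sketch(findings, min_shared_chunks: int = 1):
    """Greedy grouping: each finding joins the first cluster with which it
    shares >= min_shared_chunks evidence chunks, otherwise it starts a new one."""
    clusters: list[dict] = []
    for f in findings:
        chunk_set = set(f.chunk_ids)                     # assumed attribute
        for c in clusters:
            if len(chunk_set & c["chunks"]) >= min_shared_chunks:
                c["finding_ids"].append(f.id)
                c["chunks"] |= chunk_set
                break
        else:
            clusters.append({
                "cluster_id": f"cl-{uuid.uuid4().hex[:8]}",  # cl-{8 hex chars}
                "finding_ids": [f.id],
                "chunks": chunk_set,
            })
    return clusters
```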
Severity rollup logic¶
```python
sev_rank = {LOW: 1, MEDIUM: 2, HIGH: 3, CRITICAL: 4}

rolled_severity = max(sev_rank[f.severity] for f in cluster_findings)
if len(cluster_findings) >= 3:
    rolled_severity = min(4, rolled_severity + 1)  # escalate one tier
```
Three MEDIUM findings sharing a root cause become a HIGH cluster; three HIGH findings become CRITICAL. The escalation reflects auditor judgment: a single medium issue is one ticket, but three occurrences of the same medium issue are a systemic problem deserving partner attention.
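For completeness, the same rollup as a standalone sketch that maps the rank back to a Severity value. The enum shape (members carrying the ranks shown above) is an assumption:

```python
from enum import Enum

class Severity(Enum):       # assumed shape; the real enum lives in the data model
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4

def roll_up_severity(cluster_findings) -> Severity:
    rank = max(f.severity.value for f in cluster_findings)
    if len(cluster_findings) >= 3:                      # systemic signal
        rank = min(Severity.CRITICAL.value, rank + 1)   # escalate, capped at CRITICAL
    return Severity(rank)
```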
LLM pattern detection¶
The pattern-detection prompt is the one place we use REASONING_HIGH (Opus 4.7). Patterns are what get top billing in the deliverable — the firm's partner reads them first, signs off on them, and uses them to drive the remediation conversation with their client.
The prompt:
```
You are a senior auditor reviewing findings from a deep audit. Identify
cross-cutting patterns — situations where multiple findings stem from a
single root cause or systemic issue. A pattern matters when N findings
reduce to 1 explanation.

Output STRICT JSON: {"patterns":[{"description","finding_ids","remediation_focus"}]}
Up to 8 patterns.
```
Input is the serialized findings + clusters (capped at 60 findings to bound prompt cost; for very large audits we'd need to do hierarchical synthesis — Phase 2 hardening).
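A hedged sketch of consuming that output and performing step 5 of the pipeline. The response keys follow the prompt's JSON spec; the helper names are illustrative, not the actual internals of findings.py:

```python
import json

def parse_patterns(raw: str) -> list[dict]:
    """Parse the STRICT JSON pattern response; any failure yields []."""
    try:
        data = json.loads(raw)
        return [
            p for p in data.get("patterns", [])[:8]          # prompt allows up to 8
            if p.get("description") and p.get("finding_ids")
        ]
    except (json.JSONDecodeError, TypeError, AttributeError):
        return []

def attach_patterns(clusters: list[FindingCluster], patterns: list[dict]) -> None:
    """Step 5: a cluster takes the first pattern whose finding_ids overlap its own."""
    for cluster in clusters:
        for p in patterns:
            if set(cluster.finding_ids) & set(p["finding_ids"]):
                cluster.pattern_description = p["description"]
                cluster.pattern_remediation_focus = p.get("remediation_focus")
                break
```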
Follow-up target generation¶
Same input, REASONING_MID tier (cheaper because the output is more mechanical). Generates up to 20 targets per primitive, each with a priority_hint that Stage B's catalog generator weighs in the next iteration.
Invalid primitive names (LLM hallucination) are dropped — only the six known primitives pass through.
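A minimal sketch of that filter, using the FollowUpTarget dataclass from above. The raw-output field defaults and the 0-1 clamp are assumptions about what the LLM returns; the set of six primitive names is passed in rather than invented here:

```python
def filter_followup_targets(
    raw_targets: list[dict], known_primitives: set[str]
) -> list[FollowUpTarget]:
    """Keep only targets whose primitive is one of the six known primitives."""
    kept = []
    for t in raw_targets:
        if t.get("primitive") not in known_primitives:
            continue                                   # hallucinated primitive: drop
        kept.append(FollowUpTarget(
            primitive=t["primitive"],
            description=t.get("description", ""),
            parent_finding_ids=t.get("parent_finding_ids", []),
            # clamp to 0-1 in case the model returns something out of range
            priority_hint=min(1.0, max(0.0, float(t.get("priority_hint", 0.5)))),
        ))
    return kept
```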
Cost shape¶
Per deepen pass:
- 1 REASONING_HIGH call (pattern detection): ~$0.10–0.30 depending on prompt size
- 1 REASONING_MID call (follow-up targets): ~$0.04
- Total per deepen pass: ~$0.15–0.35
3 iteration rounds → 2 deepen passes (no deepen after final round) → ~$0.30–0.70. Negligible compared to Stage E's investigation cost.
Failure isolation¶
Each LLM call is wrapped in try/except. Failures return empty lists rather than aborting the whole stage:
- Pattern detection fails → no patterns, but clustering is preserved
- Follow-up target generation fails → no follow-ups, but the iteration-loop caller can choose to retry or exit the loop
The clustering steps are pure (no LLM); they always succeed.
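A minimal sketch of that isolation. The wrapper and the idea of passing already-built coroutines are illustrative, not the actual internals of findings.py:

```python
import asyncio

async def _safely(coro, fallback):
    """Failure isolation: any exception from an LLM step returns the fallback
    (an empty list) instead of aborting Stage F."""
    try:
        return await coro
    except Exception:
        return fallback

async def _run_llm_steps(detect_coro, followup_coro):
    # Pattern detection and follow-up generation run concurrently; either can
    # fail on its own while the pure clustering results still stand.
    return await asyncio.gather(
        _safely(detect_coro, []),
        _safely(followup_coro, []),
    )
```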
Test coverage¶
| Area | Cases |
|---|---|
| cluster_by_evidence | 8 (empty, isolated, shared chunks, severity max, 3+ escalation, critical cap, threshold, no chunks) |
| refine_clusters_by_similarity | 4 (orthogonal no-merge, similar merge, description fallback, no signal short-circuit) |
| annotate_related_findings | 1 (in-place population) |
| End-to-end cluster_and_deepen | 3 (full pipeline mocked LLM, empty findings, invalid primitive dropped) |
All 16 cases passing. Full suite: 361 pass, no regressions.
Public API¶
```python
async def cluster_and_deepen(
    engagement_id: str,
    findings: list[Finding],
    *,
    llm: LLMClient,
    embedder: Embedder | None = None,
    min_shared_chunks: int = 1,
    similarity_threshold: float = 0.85,
) -> DeepenResult
```
How the iteration loop wires up¶
The deepening loop lives at the engagement-runner level (not inside Stage F). The pattern is:
```python
catalog = await build_catalog(profile, intake, archetype, [], 0, llm)
questions = await synthesize_questions(catalog, intake, archetype, llm)
validated = await validate_questions(questions, ...)
result = await run_investigation(engagement, validated.questions, ...)

for iteration in range(1, max_iterations + 1):
    deepen = await cluster_and_deepen(engagement.id, result.findings, llm=llm)
    if not deepen.follow_up_targets:
        break
    if budget.utilization > convergence_budget_pct:
        break

    # Next iteration: build catalog using prior findings + follow-up hints
    catalog = await build_catalog(
        profile, intake, archetype, result.findings, iteration, llm,
    )
    questions = await synthesize_questions(catalog, intake, archetype, llm)
    validated = await validate_questions(questions, ...)
    result = await run_investigation(engagement, validated.questions, ...)
```
The runner itself lands in a future commit (probably alongside Stage G's deliverable generation).
Known limits / future work¶
- Pattern detection capped at 60 findings. Large engagements (200+ findings) need hierarchical pattern synthesis: cluster findings into buckets, summarize each bucket, run pattern detection over summaries. Phase 2 hardening.
- Follow-up targets feed Stage B descriptively, not structurally. The next-round catalog uses follow-up descriptions as hints, not pre-built typed targets. Stage B re-runs full LLM generation. Could optimize to skip primitives with no follow-up targets, saving cost.
- No iteration convergence detector. Stage F doesn't decide when to stop iterating — that's the runner's job. Could add a convergence signal (e.g., new findings rate dropped below threshold).
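If that signal is added, it could be as small as the following sketch (entirely hypothetical; not part of Stage F today):

```python
def has_converged(new_this_round: int, total_before: int,
                  min_new_rate: float = 0.15) -> bool:
    """Hypothetical convergence check: stop iterating when this round's new
    findings are a small fraction of what was already known."""
    if total_before == 0:
        return False
    return (new_this_round / total_before) < min_new_rate
```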