Stage F — Deepen¶
Status: ✅ Complete
File: app/auditforge/findings.py (Stage F + Finding data model)
Tests: tests/test_auditforge_findings.py — 16 cases passing
Purpose¶
Stage F is what turns N tactical findings into M strategic insights. After Stage E produces a list of findings, Stage F:
- Clusters findings by shared evidence chunks (deterministic, free)
- Refines clusters by root_cause embedding similarity (catches cases where two findings share semantic root cause but no overlapping chunks)
- Rolls up severity within clusters; escalates by one tier when ≥3 findings share a root cause
- Detects cross-finding patterns via LLM (REASONING_HIGH) — the one place we want maximum reasoning quality, because patterns are what get top billing in the deliverable
- Generates follow-up targets for the next iteration round (REASONING_MID)
The deepening loop is what makes AuditForge produce defensible audits at scale: one round finds tactical issues; subsequent rounds investigate adjacent areas suggested by what was already found. The loop stops at the iteration cap or the budget cap.
Output: DeepenResult¶
```python
@dataclass
class DeepenResult:
    clusters: list[FindingCluster]            # findings grouped by relation
    patterns: list[str]                       # cross-finding pattern descriptions
    follow_up_targets: list[FollowUpTarget]   # next-round catalog targets

@dataclass
class FindingCluster:
    cluster_id: str                           # cl-{8 hex chars}
    finding_ids: list[str]
    shared_chunk_ids: list[str]               # evidence overlap
    rolled_up_severity: Severity              # max of constituent severities, escalated at 3+
    pattern_description: str | None           # filled by LLM if a pattern matches
    pattern_remediation_focus: str | None

@dataclass
class FollowUpTarget:
    primitive: str                            # one of the six primitives
    description: str                          # natural-language target description
    parent_finding_ids: list[str]             # which findings spawned this target
    priority_hint: float                      # 0-1; informs Stage B priority
```
Pipeline¶
```
findings (from Stage E)
  │
  ▼
1. cluster_by_evidence (pure; sketched below)
   - Greedy grouping by shared chunk_ids ≥ min_shared_chunks (default 1)
   - Severity rollup: max(severities) within cluster
   - Escalate by one tier if cluster has ≥3 findings (systemic signal)
  │
  ▼
2. refine_clusters_by_similarity (pure + embeddings)
   - Embed root_cause text (fallback to description) per cluster representative
   - Pairwise cosine similarity ≥ similarity_threshold (default 0.85) → merge
   - Catches semantic root-cause alignment that didn't share chunks
  │
  ▼
3. annotate_related_findings
   - Populate Finding.related_finding_ids in-place from cluster membership
  │
  ▼
4. _detect_patterns (LLM, REASONING_HIGH)          ┐
                                                   │ asyncio.gather
4. _generate_followup_targets (LLM, REASONING_MID) ┘
  │
  ▼
5. Attach pattern descriptions to clusters by finding_id overlap
  │
  ▼
DeepenResult
```
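Step 1's greedy grouping is small enough to sketch here. Attribute names such as `Finding.chunk_ids` and the use of `uuid` for cluster IDs are assumptions for illustration; the real implementation lives in findings.py:

```python
import uuid

def cluster_by_evidence_sketch(findings, min_shared_chunks: int = 1):
    """Greedy grouping: each finding joins the first cluster with which it
    shares >= min_shared_chunks evidence chunks, otherwise it starts a new one."""
    clusters: list[dict] = []
    for f in findings:
        chunk_set = set(f.chunk_ids)                     # assumed attribute
        for c in clusters:
            if len(chunk_set & c["chunks"]) >= min_shared_chunks:
                c["finding_ids"].append(f.id)
                c["chunks"] |= chunk_set
                break
        else:
            clusters.append({
                "cluster_id": f"cl-{uuid.uuid4().hex[:8]}",  # cl-{8 hex chars}
                "finding_ids": [f.id],
                "chunks": chunk_set,
            })
    return clusters
```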
Severity rollup logic¶
```python
sev_rank = {LOW: 1, MEDIUM: 2, HIGH: 3, CRITICAL: 4}

rolled_severity = max(sev_rank[f.severity] for f in cluster_findings)
if len(cluster_findings) >= 3:
    rolled_severity = min(4, rolled_severity + 1)  # escalate one tier
```
Three MEDIUM findings sharing a root cause become a HIGH cluster; three HIGH findings become CRITICAL. The escalation reflects auditor judgment: a single medium issue is one ticket, but three occurrences of the same medium issue are a systemic problem deserving partner attention.
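For completeness, the same rollup as a standalone sketch that maps the rank back to a Severity value. The enum shape (members carrying the ranks shown above) is an assumption:

```python
from enum import Enum

class Severity(Enum):       # assumed shape; the real enum lives in the data model
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4

def roll_up_severity(cluster_findings) -> Severity:
    rank = max(f.severity.value for f in cluster_findings)
    if len(cluster_findings) >= 3:                      # systemic signal
        rank = min(Severity.CRITICAL.value, rank + 1)   # escalate, capped at CRITICAL
    return Severity(rank)
```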
LLM pattern detection¶
The pattern-detection prompt is the one place we use REASONING_HIGH (Opus 4.7). Patterns are what get top billing in the deliverable — the firm's partner reads them first, signs off on them, and uses them to drive the remediation conversation with their client.
The prompt:
```
You are a senior auditor reviewing findings from a deep audit. Identify
cross-cutting patterns — situations where multiple findings stem from a
single root cause or systemic issue. A pattern matters when N findings
reduce to 1 explanation.

Output STRICT JSON: {"patterns":[{"description","finding_ids","remediation_focus"}]}
Up to 8 patterns.
```
Input is the serialized findings + clusters (capped at 60 findings to bound prompt cost; for very large audits we'd need to do hierarchical synthesis — Phase 2 hardening).
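A hedged sketch of consuming that output and performing step 5 of the pipeline. The response keys follow the prompt's JSON spec; the helper names are illustrative, not the actual internals of findings.py:

```python
import json

def parse_patterns(raw: str) -> list[dict]:
    """Parse the STRICT JSON pattern response; any failure yields []."""
    try:
        data = json.loads(raw)
        return [
            p for p in data.get("patterns", [])[:8]          # prompt allows up to 8
            if p.get("description") and p.get("finding_ids")
        ]
    except (json.JSONDecodeError, TypeError, AttributeError):
        return []

def attach_patterns(clusters: list[FindingCluster], patterns: list[dict]) -> None:
    """Step 5: a cluster takes the first pattern whose finding_ids overlap its own."""
    for cluster in clusters:
        for p in patterns:
            if set(cluster.finding_ids) & set(p["finding_ids"]):
                cluster.pattern_description = p["description"]
                cluster.pattern_remediation_focus = p.get("remediation_focus")
                break
```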
Follow-up target generation¶
Same input, REASONING_MID tier (cheaper because the output is more mechanical). Generates up to 20 targets per primitive, each with a priority_hint that Stage B's catalog generator weighs in the next iteration.
Invalid primitive names (LLM hallucination) are dropped — only the six known primitives pass through.
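A minimal sketch of that filter, using the FollowUpTarget dataclass from above. The raw-output field defaults and the 0-1 clamp are assumptions about what the LLM returns; the set of six primitive names is passed in rather than invented here:

```python
def filter_followup_targets(
    raw_targets: list[dict], known_primitives: set[str]
) -> list[FollowUpTarget]:
    """Keep only targets whose primitive is one of the six known primitives."""
    kept = []
    for t in raw_targets:
        if t.get("primitive") not in known_primitives:
            continue                                   # hallucinated primitive: drop
        kept.append(FollowUpTarget(
            primitive=t["primitive"],
            description=t.get("description", ""),
            parent_finding_ids=t.get("parent_finding_ids", []),
            # clamp to 0-1 in case the model returns something out of range
            priority_hint=min(1.0, max(0.0, float(t.get("priority_hint", 0.5)))),
        ))
    return kept
```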
Cost shape¶
Per deepen pass:
- 1 REASONING_HIGH call (pattern detection): ~$0.10–0.30 depending on prompt size
- 1 REASONING_MID call (follow-up targets): ~$0.04
- Total per deepen pass: ~$0.15–0.35
3 iteration rounds → 2 deepen passes (no deepen after final round) → ~$0.30–0.70. Negligible compared to Stage E's investigation cost.
Failure isolation¶
Each LLM call is wrapped in try/except. Failures return empty lists rather than aborting the whole stage:
- Pattern detection fails → no patterns, but clustering is preserved
- Follow-up target generation fails → no follow-ups, but the iteration-loop caller can choose to retry or exit the loop
The clustering steps are pure (no LLM); they always succeed.
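A minimal sketch of that isolation. The wrapper and the idea of passing already-built coroutines are illustrative, not the actual internals of findings.py:

```python
import asyncio

async def _safely(coro, fallback):
    """Failure isolation: any exception from an LLM step returns the fallback
    (an empty list) instead of aborting Stage F."""
    try:
        return await coro
    except Exception:
        return fallback

async def _run_llm_steps(detect_coro, followup_coro):
    # Pattern detection and follow-up generation run concurrently; either can
    # fail on its own while the pure clustering results still stand.
    return await asyncio.gather(
        _safely(detect_coro, []),
        _safely(followup_coro, []),
    )
```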
Test coverage¶
| Area | Cases |
|---|---|
| cluster_by_evidence | 8 (empty, isolated, shared chunks, severity max, 3+ escalation, critical cap, threshold, no chunks) |
| refine_clusters_by_similarity | 4 (orthogonal no-merge, similar merge, description fallback, no signal short-circuit) |
| annotate_related_findings | 1 (in-place population) |
| End-to-end cluster_and_deepen | 3 (full pipeline mocked LLM, empty findings, invalid primitive dropped) |
All 16 cases passing. Full suite: 361 pass, no regressions.
Public API¶
```python
async def cluster_and_deepen(
    engagement_id: str,
    findings: list[Finding],
    *,
    llm: LLMClient,
    embedder: Embedder | None = None,
    min_shared_chunks: int = 1,
    similarity_threshold: float = 0.85,
) -> DeepenResult
```
How the iteration loop wires up¶
The deepening loop lives at the engagement-runner level (not inside Stage F). The pattern is:
```python
catalog = await build_catalog(profile, intake, archetype, [], 0, llm)
questions = await synthesize_questions(catalog, intake, archetype, llm)
validated = await validate_questions(questions, ...)
result = await run_investigation(engagement, validated.questions, ...)

for iteration in range(1, max_iterations + 1):
    deepen = await cluster_and_deepen(engagement.id, result.findings, llm=llm)
    if not deepen.follow_up_targets:
        break
    if budget.utilization > convergence_budget_pct:
        break

    # Next iteration: build catalog using prior findings + follow-up hints
    catalog = await build_catalog(
        profile, intake, archetype, result.findings, iteration, llm,
    )
    questions = await synthesize_questions(catalog, intake, archetype, llm)
    validated = await validate_questions(questions, ...)
    result = await run_investigation(engagement, validated.questions, ...)
```
The runner itself lands in a future commit (probably alongside Stage G's deliverable generation).
Known limits / future work¶
- Pattern detection capped at 60 findings. Large engagements (200+ findings) need hierarchical pattern synthesis: cluster findings into buckets, summarize each bucket, run pattern detection over summaries. Phase 2 hardening.
- Follow-up targets feed Stage B descriptively, not structurally. The next-round catalog uses follow-up descriptions as hints, not pre-built typed targets. Stage B re-runs full LLM generation. Could optimize to skip primitives with no follow-up targets, saving cost.
- No iteration convergence detector. Stage F doesn't decide when to stop iterating — that's the runner's job. Could add a convergence signal (e.g., new findings rate dropped below threshold).
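If that signal is added, it could be as small as the following sketch (entirely hypothetical; not part of Stage F today):

```python
def has_converged(new_this_round: int, total_before: int,
                  min_new_rate: float = 0.15) -> bool:
    """Hypothetical convergence check: stop iterating when this round's new
    findings are a small fraction of what was already known."""
    if total_before == 0:
        return False
    return (new_this_round / total_before) < min_new_rate
```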