
Stage B — Catalog

Status: ✅ Complete
File: app/auditforge/catalog.py
Tests: tests/test_auditforge_catalog.py — 23 cases passing

Purpose

Stage B is the central engineering challenge of AuditForge. It takes the corpus profile (Stage A), the engagement intake, the archetype config, and prior findings (in deepening rounds) — and produces structured target lists per primitive. Each list ranks targets by combined corpus signal, intake alignment, and archetype emphasis, capped to bound downstream cost.

Without vertical packs, this is where the moat lives: the LLM does the catalog generation, and prompt design + intake quality + archetype weighting determine whether the resulting targets are sharp or generic.

Output: Catalog

@dataclass
class Catalog:
    engagement_id: str
    iteration: int                                # 0 = initial, 1+ = deepening
    concepts: list[ConceptTarget]                 # for conflict + consistency
    doc_pairs: list[DocPairTarget]                # for flow_down
    required_elements: list[RequiredElement]      # for coverage
    currency_rules: list[CurrencyRule]            # for currency
    defined_terms: list[DefinedTermTarget]        # for consistency
    citation_tuples: list[CitationTuple]          # for citation_integrity

Each list is capped (concepts: 30, doc_pairs: 15, required_elements: 30, currency_rules: 20, defined_terms: 20, citation_tuples: 50). Caps bound investigation cost — Stage E pays roughly (target_count × per-question-cost) per primitive. With these caps and ~$0.05 per question, full investigation costs $20–60 per audit.

Pipeline

Five LLM calls running in parallel via asyncio.gather, plus one pure derivation.

profile + intake + prior_findings + archetype.weights
        ┌──────────────┼──────────────┬──────────────┬──────────────┐
        ▼              ▼              ▼              ▼              ▼
 _gen_concepts   _gen_doc_pairs   _gen_required   _gen_currency   _gen_defined
 (LLM, MID)      (LLM, MID)       _elements       _rules          _terms
                                  (LLM, MID)      (LLM, MID)      (LLM, MID)
        │              │              │              │              │
        └──────────────┴──────────────┴──────────────┴──────────────┘
                          (asyncio.gather, parallel)
                                      │
                                      ▼
                         build_citation_tuples (pure)
                                      │
                                      ▼
                                   Catalog

Citations are pure — Stage A already extracted them via regex; Stage B just ranks by frequency and applies the archetype weight for citation_integrity_check.
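The fan-out step above can be sketched as follows. This is a minimal illustration of the parallel dispatch, not the real module: the `_gen_*` bodies here are stand-ins (in catalog.py each one issues an LLM call), and `gather_lists` is a hypothetical name for the glue.

```python
import asyncio

# Stand-in generators; the real ones issue REASONING_MID LLM calls.
async def _gen_concepts(ctx): return [{"label": "access control"}]
async def _gen_doc_pairs(ctx): return [{"parent": "policy", "child": "SOP"}]
async def _gen_required_elements(ctx): return []
async def _gen_currency_rules(ctx): return []
async def _gen_defined_terms(ctx): return []

async def gather_lists(ctx: str):
    # asyncio.gather preserves argument order, so each result lands in a
    # fixed slot regardless of which call finishes first.
    concepts, pairs, required, currency, terms = await asyncio.gather(
        _gen_concepts(ctx),
        _gen_doc_pairs(ctx),
        _gen_required_elements(ctx),
        _gen_currency_rules(ctx),
        _gen_defined_terms(ctx),
    )
    return concepts, pairs, required, currency, terms
```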

LLM cost shape

Five REASONING_MID calls per catalog round. Per Sonnet 4.6 pricing:

  • Per call: ~5K input tokens × $3/1M + ~1.5K output tokens × $15/1M ≈ $0.038
  • Total per catalog round: ~$0.19
  • Iterations 2 and 3 add another ~$0.38 → full audit catalog cost ~$0.57
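The arithmetic above, as a back-of-envelope model (constants match the text; the names are illustrative, not from the codebase):

```python
# Sonnet-class pricing from the text.
INPUT_PRICE_PER_TOKEN = 3 / 1_000_000    # $3 per 1M input tokens
OUTPUT_PRICE_PER_TOKEN = 15 / 1_000_000  # $15 per 1M output tokens

def call_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_PRICE_PER_TOKEN
            + output_tokens * OUTPUT_PRICE_PER_TOKEN)

per_call = call_cost(5_000, 1_500)  # ≈ $0.0375
per_round = 5 * per_call            # five primitive generators ≈ $0.19
full_audit = 3 * per_round          # three catalog rounds ≈ $0.56
```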

This is a small fraction of the investigation cost (Stage E) where the real budget goes.

Context block (build_context_block)

Pure function that renders the shared input block consumed by all six primitive prompts. Bounded length so token cost stays predictable.

Sections rendered:

  1. Corpus profile summary — total docs / chunks; top 10 doc types with counts; date range (first to last year, count); jurisdictions (top 10); inferred domain / audit_type / frameworks (Stage A LLM output)

  2. Topic clusters — top 25 with size + label

  3. Most-cited references — top 20 (kind:target → count) from Stage A citations, used for currency and citation_integrity catalog generation

  4. Auditor intake — domain, audit_purpose, frameworks, focus_areas, materiality, doc_hierarchy (top 8), known_concerns (top 8). Empty fields skipped.

  5. Prior findings — only when iteration > 0. Top 20 by severity, showing severity tier, primitive, and 200-char description excerpt.
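A condensed sketch of the renderer's shape, showing the two behaviors called out above — empty intake fields skipped, prior findings gated on iteration. Field names and dict shapes here are assumptions for illustration; the real function consumes typed Stage A objects.

```python
def build_context_block(profile: dict, intake: dict,
                        prior_findings: list[dict], iteration: int) -> str:
    """Render a bounded, shared context block (illustrative field names)."""
    lines = [f"Docs: {profile['total_docs']} / Chunks: {profile['total_chunks']}"]
    if profile.get("doc_types"):
        lines.append("Doc types: " + ", ".join(profile["doc_types"][:10]))
    # Empty intake fields are skipped so the block stays compact.
    for field in ("domain", "audit_purpose", "frameworks"):
        value = intake.get(field)
        if value:
            lines.append(f"{field}: {value}")
    # Prior findings appear only in deepening rounds (iteration > 0),
    # top 20 by severity, descriptions truncated to 200 chars.
    if iteration > 0:
        top = sorted(prior_findings, key=lambda f: f["severity"], reverse=True)[:20]
        lines += [f"[{f['severity']}] {f['description'][:200]}" for f in top]
    return "\n".join(lines)
```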

Per-primitive prompts

Each prompt is a system message tightly specifying:

  • Role ("You are an expert auditor building a target catalog")
  • The catalog target type
  • Output schema (strict JSON only, no surrounding prose)
  • Quality criteria specific to the primitive
  • Result cap

Prompt summaries:

Primitive           Bias signal                                                    Output cap
concepts            intake.focus_areas, intake.frameworks, cluster topology        25
doc_pairs           intake.doc_hierarchy (primary), corpus doc_type distribution   12
required_elements   intake.frameworks + Stage A inferred_frameworks                25
currency_rules      profile.citations (frequency-ranked), frameworks               15
defined_terms       LLM knowledge of the domain's high-risk terms                  15

In iteration > 0, the concept generator prompt is augmented with: "Generate FOLLOW-UP concepts informed by the prior findings — probe related areas, deeper sub-topics, or adjacent concerns."

All targets in iteration > 0 catalog rounds are tagged source="iteration" so downstream stages can attribute findings to deepening logic.

Priority ranking and weighting

Each target carries a priority: float in [0, 1]. The final priority is computed as:

final_priority = base_priority × archetype_weight × intake_boost

where:

  • base_priority is the LLM-assigned 0–1 score
  • archetype_weight comes from ArchetypeConfig.primitive_weights[primitive]
  • intake_boost (concepts only) is 1.25 if the label/seeds match a focus_area, a further ×1.15 if they match a framework, and 1.0 otherwise

Concepts feed both conflict_check and consistency_check, so the archetype weight is averaged across those two primitives.

Final priority is clamped to ≤ 1.0. Each list is sorted descending then capped to its MAX_* limit.
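The weighting rule reduces to a few lines; this sketch uses the constants from the text, but the function name and boolean-flag signature are assumptions (the real code matches labels/seeds against intake fields).

```python
def final_priority(base: float, archetype_weight: float,
                   matches_focus_area: bool = False,
                   matches_framework: bool = False) -> float:
    """base_priority x archetype_weight x intake_boost, clamped to <= 1.0."""
    boost = 1.0
    if matches_focus_area:
        boost *= 1.25   # focus_area match
    if matches_framework:
        boost *= 1.15   # framework match stacks multiplicatively
    return min(1.0, base * archetype_weight * boost)
```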

Citation tuples (build_citation_tuples)

Pure function — no LLM call. Promotes Stage A citations to CitationTuple objects with frequency-based priority:

priority = 0.5 + 0.5 × (frequency / max_frequency)
priority *= archetype_weight["citation_integrity_check"]
priority = min(1.0, priority)

Dedupes on (citing_doc, citing_section, to_target) tuples; picks the most-cited targets first. The cited_subject is rendered as "{kind}:{target}" (e.g., "far:52.204-21") so downstream primitives have a stable identifier.
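The frequency-to-priority mapping above can be sketched directly (input shape and function name are assumptions; the real function emits full CitationTuple objects rather than a dict):

```python
from collections import Counter

def citation_priorities(citations: list[tuple[str, str, str]],
                        archetype_weight: float) -> dict[str, float]:
    """Map each cited target to 0.5 + 0.5 * (freq / max_freq), weighted, clamped."""
    counts = Counter(target for _citing_doc, _citing_section, target in citations)
    max_freq = max(counts.values())
    return {
        target: min(1.0, (0.5 + 0.5 * freq / max_freq) * archetype_weight)
        for target, freq in counts.items()
    }
```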

Failure isolation

Each generator wraps its LLM call in try/except. If a call fails (timeout, unparseable JSON, provider error), the generator returns [] for that primitive — the rest of the catalog still completes. Logged as auditforge_catalog_llm_failed | step=X err=Y.

This matters for production: a transient Anthropic outage during catalog generation should not abort an audit that's already cost the firm money to profile. Stage D (validate) and Stage E (investigate) run with whatever catalog content is available.
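The guard pattern looks roughly like this; `safe_generate` is an illustrative name, and the log line mirrors the format quoted above.

```python
import asyncio
import logging

logger = logging.getLogger("auditforge")

async def safe_generate(step: str, coro) -> list:
    """Run one generator coroutine; collapse any failure to an empty list
    so sibling generators still populate the catalog."""
    try:
        return await coro
    except Exception as err:
        logger.warning("auditforge_catalog_llm_failed | step=%s err=%s", step, err)
        return []
```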

JSON parsing tolerance

parse_strict_json() strips code fences, leading prose, and finds the first/last { / } to extract the object. Returns {} on unparseable input. REASONING_MID models reliably output clean JSON when the system prompt asks for it, but the parser is defensive.

Per-target field validation:

  • Strings truncated to bounded lengths (label 200, description 600)
  • List items truncated to bounded lengths and counts
  • Priority clamped to [0, 1]
  • Optional strings normalize empty → None
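A defensive extractor along the lines described — strip code fences and surrounding prose, slice from the first `{` to the last `}`, and return `{}` on anything unparseable (including top-level arrays). This is a sketch of the behavior, not the exact implementation.

```python
import json
import re

def parse_strict_json(text: str) -> dict:
    """Extract a JSON object from possibly fenced / prose-wrapped LLM output."""
    cleaned = re.sub(r"```(?:json)?", "", text)   # drop code fences
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end <= start:
        return {}                                  # no object to extract
    try:
        obj = json.loads(cleaned[start:end + 1])
    except json.JSONDecodeError:
        return {}
    return obj if isinstance(obj, dict) else {}
```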

Test coverage

Area                      Cases
build_context_block       5 (profile, intake, omitted-empty, iteration gating, severity sort)
parse_strict_json         5 (clean, fenced, prose, garbage, array-only)
Priority helpers          4 (clamp normal, clamp bounds, clamp garbage, optional_str)
build_citation_tuples     4 (basic frequency, archetype weight, dedupe, empty)
_apply_concept_weighting  3 (focus_area boost, archetype average, cap at 1.0)
End-to-end mocked LLM     1 (all six primitives populated, REASONING_MID tier, 5 calls)
Iteration deepening       1 (prior findings in prompts, source tagged)

All 23 cases passing. The end-to-end test verifies the parallel dispatch glues correctly: scripted JSON per primitive flows through to typed targets in the right slots.

Public API

async def build_catalog(
    profile: CorpusProfile,
    intake: IntakeData,
    archetype: ArchetypeKind,
    prior_findings: list[Finding],
    iteration: int,
    llm: LLMClient,
) -> Catalog

Known limits / future work

  • No external knowledge lookup. Required-element generation relies on the LLM's training-data knowledge of frameworks. For low-resource or emerging frameworks, results may be thin. Phase 2: optional Context7-style framework reference cache.
  • Defined terms are LLM-proposed, not corpus-extracted. A term proposed by the LLM might not actually be defined in the corpus, in which case Stage D (validation) drops it via the relevance floor. We could pre-filter by extracting terms via regex-based "X means Y" / "X is defined as Y" patterns from the corpus first. Phase 1.5 hardening.
  • Currency rules rely on LLM knowledge of supersession chains. For obscure standards this is brittle. Phase 2: optional ecfr.gov / NIST publication-database lookup for currency verification.
  • Per-iteration token cost is roughly fixed. Three rounds of deepening cost ~3× the catalog cost. Future optimization: skip primitives that produced no findings in prior rounds.