
Stage B — Catalog

Status: ✅ Complete
File: app/auditforge/catalog.py
Tests: tests/test_auditforge_catalog.py — 23 cases passing

Purpose

Stage B is the central engineering challenge of AuditForge. It takes the corpus profile (Stage A), the engagement intake, the archetype config, and prior findings (in deepening rounds) — and produces structured target lists per primitive. Each list ranks targets by combined corpus signal, intake alignment, and archetype emphasis, capped to bound downstream cost.

Without vertical packs, this is where the moat lives: the LLM does the catalog generation, and prompt design + intake quality + archetype weighting determine whether the resulting targets are sharp or generic.

Output: Catalog

@dataclass
class Catalog:
    engagement_id: str
    iteration: int                                # 0 = initial, 1+ = deepening
    concepts: list[ConceptTarget]                 # for conflict + consistency
    doc_pairs: list[DocPairTarget]                # for flow_down
    required_elements: list[RequiredElement]      # for coverage
    currency_rules: list[CurrencyRule]            # for currency
    defined_terms: list[DefinedTermTarget]        # for consistency
    citation_tuples: list[CitationTuple]          # for citation_integrity

Each list is capped (concepts: 30, doc_pairs: 15, required_elements: 30, currency_rules: 20, defined_terms: 20, citation_tuples: 50). Caps bound investigation cost — Stage E pays roughly (target_count × per-question-cost) per primitive. With these caps and ~$0.05 per question, full investigation costs $20–60 per audit.

Pipeline

Five LLM calls running in parallel via asyncio.gather, plus one pure derivation.

profile + intake + prior_findings + archetype.weights
        ┌──────────────┼──────────────┬──────────────┬──────────────┐
        ▼              ▼              ▼              ▼              ▼
 _gen_concepts   _gen_doc_pairs   _gen_required   _gen_currency   _gen_defined
 (LLM, MID)      (LLM, MID)       _elements       _rules          _terms
                                  (LLM, MID)      (LLM, MID)      (LLM, MID)
        │              │              │              │              │
        └──────────────┴──────────────┴──────────────┴──────────────┘
                          (asyncio.gather, parallel)
                                      │
                                      ▼
                         build_citation_tuples (pure)
                                      │
                                      ▼
                                   Catalog

Citations are pure — Stage A already extracted them via regex; Stage B just ranks by frequency and applies the archetype weight for citation_integrity_check.
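The fan-out step above can be sketched as follows. This is a minimal illustration of the parallel dispatch, not the real module: the `_gen_*` bodies here are stand-ins (in catalog.py each one issues an LLM call), and `gather_lists` is a hypothetical name for the glue.

```python
import asyncio

# Stand-in generators; the real ones issue REASONING_MID LLM calls.
async def _gen_concepts(ctx): return [{"label": "access control"}]
async def _gen_doc_pairs(ctx): return [{"parent": "policy", "child": "SOP"}]
async def _gen_required_elements(ctx): return []
async def _gen_currency_rules(ctx): return []
async def _gen_defined_terms(ctx): return []

async def gather_lists(ctx: str):
    # asyncio.gather preserves argument order, so each result lands in a
    # fixed slot regardless of which call finishes first.
    concepts, pairs, required, currency, terms = await asyncio.gather(
        _gen_concepts(ctx),
        _gen_doc_pairs(ctx),
        _gen_required_elements(ctx),
        _gen_currency_rules(ctx),
        _gen_defined_terms(ctx),
    )
    return concepts, pairs, required, currency, terms
```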

LLM cost shape

Five REASONING_MID calls per catalog round. Per Sonnet 4.6 pricing:

  • Per call: ~5K input tokens × $3/1M + ~1.5K output tokens × $15/1M ≈ $0.038
  • Total per catalog round: ~$0.19
  • Iterations 2 and 3 add another ~$0.38 → full audit catalog cost ~$0.57
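The arithmetic above, as a back-of-envelope model (constants match the text; the names are illustrative, not from the codebase):

```python
# Sonnet-class pricing from the text.
INPUT_PRICE_PER_TOKEN = 3 / 1_000_000    # $3 per 1M input tokens
OUTPUT_PRICE_PER_TOKEN = 15 / 1_000_000  # $15 per 1M output tokens

def call_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_PRICE_PER_TOKEN
            + output_tokens * OUTPUT_PRICE_PER_TOKEN)

per_call = call_cost(5_000, 1_500)  # ≈ $0.0375
per_round = 5 * per_call            # five primitive generators ≈ $0.19
full_audit = 3 * per_round          # three catalog rounds ≈ $0.56
```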

This is a small fraction of the investigation cost (Stage E) where the real budget goes.

Context block (build_context_block)

Pure function that renders the shared input block consumed by all six primitive prompts. Bounded length so token cost stays predictable.

Sections rendered:

  1. Corpus profile summary — total docs / chunks; top 10 doc types with counts; date range (first to last year, count); jurisdictions (top 10); inferred domain / audit_type / frameworks (Stage A LLM output)

  2. Topic clusters — top 25 with size + label

  3. Most-cited references — top 20 (kind:target → count) from Stage A citations, used for currency and citation_integrity catalog generation

  4. Auditor intake — domain, audit_purpose, frameworks, focus_areas, materiality, doc_hierarchy (top 8), known_concerns (top 8). Empty fields skipped.

  5. Prior findings — only when iteration > 0. Top 20 by severity, showing severity tier, primitive, and 200-char description excerpt.
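A condensed sketch of the renderer's shape, showing the two behaviors called out above — empty intake fields skipped, prior findings gated on iteration. Field names and dict shapes here are assumptions for illustration; the real function consumes typed Stage A objects.

```python
def build_context_block(profile: dict, intake: dict,
                        prior_findings: list[dict], iteration: int) -> str:
    """Render a bounded, shared context block (illustrative field names)."""
    lines = [f"Docs: {profile['total_docs']} / Chunks: {profile['total_chunks']}"]
    if profile.get("doc_types"):
        lines.append("Doc types: " + ", ".join(profile["doc_types"][:10]))
    # Empty intake fields are skipped so the block stays compact.
    for field in ("domain", "audit_purpose", "frameworks"):
        value = intake.get(field)
        if value:
            lines.append(f"{field}: {value}")
    # Prior findings appear only in deepening rounds (iteration > 0),
    # top 20 by severity, descriptions truncated to 200 chars.
    if iteration > 0:
        top = sorted(prior_findings, key=lambda f: f["severity"], reverse=True)[:20]
        lines += [f"[{f['severity']}] {f['description'][:200]}" for f in top]
    return "\n".join(lines)
```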

Per-primitive prompts

Each prompt is a system message tightly specifying:

  • Role ("You are an expert auditor building a target catalog")
  • The catalog target type
  • Output schema (strict JSON only, no surrounding prose)
  • Quality criteria specific to the primitive
  • Result cap

Prompt summaries:

Primitive           Bias signal                                                    Output cap
concepts            intake.focus_areas, intake.frameworks, cluster topology        25
doc_pairs           intake.doc_hierarchy (primary), corpus doc_type distribution   12
required_elements   intake.frameworks + Stage A inferred_frameworks                25
currency_rules      profile.citations (frequency-ranked), frameworks               15
defined_terms       LLM knowledge of the domain's high-risk terms                  15

In iteration > 0, the concept generator prompt is augmented with: "Generate FOLLOW-UP concepts informed by the prior findings — probe related areas, deeper sub-topics, or adjacent concerns."

All targets in iteration > 0 catalog rounds are tagged source="iteration" so downstream stages can attribute findings to deepening logic.

Priority ranking and weighting

Each target carries a priority: float in [0, 1]. The final priority is computed as:

final_priority = base_priority × archetype_weight × intake_boost

where:

  • base_priority is the LLM-assigned 0–1 score
  • archetype_weight comes from ArchetypeConfig.primitive_weights[primitive]
  • intake_boost (concepts only) is 1.25 if the label/seeds match a focus_area, a further ×1.15 if they match a framework, and 1.0 otherwise

Concepts feed both conflict_check and consistency_check, so the archetype weight is averaged across those two primitives.

Final priority is clamped to ≤ 1.0. Each list is sorted descending then capped to its MAX_* limit.
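The weighting rule reduces to a few lines; this sketch uses the constants from the text, but the function name and boolean-flag signature are assumptions (the real code matches labels/seeds against intake fields).

```python
def final_priority(base: float, archetype_weight: float,
                   matches_focus_area: bool = False,
                   matches_framework: bool = False) -> float:
    """base_priority x archetype_weight x intake_boost, clamped to <= 1.0."""
    boost = 1.0
    if matches_focus_area:
        boost *= 1.25   # focus_area match
    if matches_framework:
        boost *= 1.15   # framework match stacks multiplicatively
    return min(1.0, base * archetype_weight * boost)
```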

Citation tuples (build_citation_tuples)

Pure function — no LLM call. Promotes Stage A citations to CitationTuple objects with frequency-based priority:

priority = 0.5 + 0.5 × (frequency / max_frequency)
priority *= archetype_weight["citation_integrity_check"]
priority = min(1.0, priority)

Dedupes on (citing_doc, citing_section, to_target) tuples; picks the most-cited targets first. The cited_subject is rendered as "{kind}:{target}" (e.g., "far:52.204-21") so downstream primitives have a stable identifier.
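The frequency-to-priority mapping above can be sketched directly (input shape and function name are assumptions; the real function emits full CitationTuple objects rather than a dict):

```python
from collections import Counter

def citation_priorities(citations: list[tuple[str, str, str]],
                        archetype_weight: float) -> dict[str, float]:
    """Map each cited target to 0.5 + 0.5 * (freq / max_freq), weighted, clamped."""
    counts = Counter(target for _citing_doc, _citing_section, target in citations)
    max_freq = max(counts.values())
    return {
        target: min(1.0, (0.5 + 0.5 * freq / max_freq) * archetype_weight)
        for target, freq in counts.items()
    }
```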

Failure isolation

Each generator wraps its LLM call in try/except. If a call fails (timeout, unparseable JSON, provider error), the generator returns [] for that primitive — the rest of the catalog still completes. Logged as auditforge_catalog_llm_failed | step=X err=Y.

This matters for production: a transient Anthropic outage during catalog generation should not abort an audit that's already cost the firm money to profile. Stage D (validate) and Stage E (investigate) run with whatever catalog content is available.
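The guard pattern looks roughly like this; `safe_generate` is an illustrative name, and the log line mirrors the format quoted above.

```python
import asyncio
import logging

logger = logging.getLogger("auditforge")

async def safe_generate(step: str, coro) -> list:
    """Run one generator coroutine; collapse any failure to an empty list
    so sibling generators still populate the catalog."""
    try:
        return await coro
    except Exception as err:
        logger.warning("auditforge_catalog_llm_failed | step=%s err=%s", step, err)
        return []
```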

JSON parsing tolerance

parse_strict_json() strips code fences, leading prose, and finds the first/last { / } to extract the object. Returns {} on unparseable input. REASONING_MID models reliably output clean JSON when the system prompt asks for it, but the parser is defensive.

Per-target field validation:

  • Strings truncated to bounded lengths (label 200, description 600)
  • List items truncated to bounded lengths and counts
  • Priority clamped to [0, 1]
  • Optional strings normalize empty → None
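A defensive extractor along the lines described — strip code fences and surrounding prose, slice from the first `{` to the last `}`, and return `{}` on anything unparseable (including top-level arrays). This is a sketch of the behavior, not the exact implementation.

```python
import json
import re

def parse_strict_json(text: str) -> dict:
    """Extract a JSON object from possibly fenced / prose-wrapped LLM output."""
    cleaned = re.sub(r"```(?:json)?", "", text)   # drop code fences
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end <= start:
        return {}                                  # no object to extract
    try:
        obj = json.loads(cleaned[start:end + 1])
    except json.JSONDecodeError:
        return {}
    return obj if isinstance(obj, dict) else {}
```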

Test coverage

Area                      Cases
build_context_block       5 (profile, intake, omitted-empty, iteration gating, severity sort)
parse_strict_json         5 (clean, fenced, prose, garbage, array-only)
Priority helpers          4 (clamp normal, clamp bounds, clamp garbage, optional_str)
build_citation_tuples     4 (basic frequency, archetype weight, dedupe, empty)
_apply_concept_weighting  3 (focus_area boost, archetype average, cap at 1.0)
End-to-end mocked LLM     1 (all six primitives populated, REASONING_MID tier, 5 calls)
Iteration deepening       1 (prior findings in prompts, source tagged)

All 23 cases passing. The end-to-end test verifies the parallel dispatch glues correctly: scripted JSON per primitive flows through to typed targets in the right slots.

Public API

async def build_catalog(
    profile: CorpusProfile,
    intake: IntakeData,
    archetype: ArchetypeKind,
    prior_findings: list[Finding],
    iteration: int,
    llm: LLMClient,
) -> Catalog

Known limits / future work

  • No external knowledge lookup. Required-element generation relies on the LLM's training-data knowledge of frameworks. For low-resource or emerging frameworks, results may be thin. Phase 2: optional Context7-style framework reference cache.
  • Defined terms are LLM-proposed, not corpus-extracted. A term proposed by the LLM might not actually be defined in the corpus, in which case Stage D (validation) drops it via the relevance floor. We could pre-filter by extracting terms via regex-based "X means Y" / "X is defined as Y" patterns from the corpus first. Phase 1.5 hardening.
  • Currency rules rely on LLM knowledge of supersession chains. For obscure standards this is brittle. Phase 2: optional ecfr.gov / NIST publication-database lookup for currency verification.
  • Per-iteration token cost is roughly fixed. Three rounds of deepening cost ~3× the catalog cost. Future optimization: skip primitives that produced no findings in prior rounds.