Stage B — Catalog¶
Status: ✅ Complete
File: app/auditforge/catalog.py
Tests: tests/test_auditforge_catalog.py — 23 cases passing
Purpose¶
Stage B is the central engineering challenge of AuditForge. It takes the corpus profile (Stage A), the engagement intake, the archetype config, and prior findings (in deepening rounds) — and produces structured target lists per primitive. Each list ranks targets by combined corpus signal, intake alignment, and archetype emphasis, capped to bound downstream cost.
Without vertical packs, this is where the moat lives: the LLM does the catalog generation, and prompt design + intake quality + archetype weighting determine whether the resulting targets are sharp or generic.
Output: Catalog¶
@dataclass
class Catalog:
    engagement_id: str
    iteration: int                              # 0 = initial, 1+ = deepening
    concepts: list[ConceptTarget]               # for conflict + consistency
    doc_pairs: list[DocPairTarget]              # for flow_down
    required_elements: list[RequiredElement]    # for coverage
    currency_rules: list[CurrencyRule]          # for currency
    defined_terms: list[DefinedTermTarget]      # for consistency
    citation_tuples: list[CitationTuple]        # for citation_integrity
Each list is capped (concepts: 30, doc_pairs: 15, required_elements: 30,
currency_rules: 20, defined_terms: 20, citation_tuples: 50). Caps bound
investigation cost — Stage E pays roughly (target_count × per-question-cost)
per primitive. With these caps and ~$0.05 per question, full investigation
costs $20–60 per audit.
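A back-of-envelope check on that range, under illustrative assumptions (three catalog rounds, one to two investigation questions per target; the actual fan-out is decided in Stage E):

```python
# Back-of-envelope only; the per-target question count is a Stage E decision.
CAPS = {"concepts": 30, "doc_pairs": 15, "required_elements": 30,
        "currency_rules": 20, "defined_terms": 20, "citation_tuples": 50}
PER_QUESTION = 0.05                                 # ~$0.05 per investigation question

targets_per_round = sum(CAPS.values())              # 165 targets per round at the caps
low = targets_per_round * PER_QUESTION * 3          # 3 rounds, 1 question/target -> ~$25
high = targets_per_round * PER_QUESTION * 3 * 2     # 3 rounds, 2 questions/target -> ~$50
print(f"~${low:.0f}-${high:.0f} per audit")         # lands inside the $20-60 estimate
```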
Pipeline¶
Five LLM calls + one pure derivation, all running in parallel via
asyncio.gather.
             profile + intake + prior_findings + archetype.weights
                                       │
       ┌───────────────┬───────────────┼───────────────┬───────────────┐
       ▼               ▼               ▼               ▼               ▼
 _gen_concepts  _gen_doc_pairs   _gen_required   _gen_currency   _gen_defined
  (LLM, MID)      (LLM, MID)       _elements        _rules          _terms
                                  (LLM, MID)      (LLM, MID)      (LLM, MID)
       │               │               │               │               │
       └───────────────┴───────────────┼───────────────┴───────────────┘
                                       │
                                       ▼
                          (asyncio.gather, parallel)
                                       │
                                       ▼
                          build_citation_tuples (pure)
                                       │
                                       ▼
                                    Catalog
Citations are pure — Stage A already extracted them via regex; Stage B just
ranks by frequency and applies the archetype weight for citation_integrity_check.
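A minimal, hypothetical sketch of that fan-out/join shape, with stub coroutines standing in for the real generators (which take the rendered context block and an LLMClient) and dicts standing in for the typed models:

```python
import asyncio

async def _gen_stub(name: str, ctx: str) -> list[dict]:
    await asyncio.sleep(0)                       # stands in for one REASONING_MID LLM call
    return [{"label": f"{name} target", "priority": 0.5}]

async def build_catalog_sketch(ctx: str) -> dict:
    concepts, doc_pairs, required, currency, defined = await asyncio.gather(
        _gen_stub("concepts", ctx),
        _gen_stub("doc_pairs", ctx),
        _gen_stub("required_elements", ctx),
        _gen_stub("currency_rules", ctx),
        _gen_stub("defined_terms", ctx),
    )
    # Pure step: Stage A citations are ranked and weighted, no LLM involved.
    citation_tuples = [{"cited_subject": "far:52.204-21", "priority": 1.0}]
    return {"concepts": concepts, "doc_pairs": doc_pairs,
            "required_elements": required, "currency_rules": currency,
            "defined_terms": defined, "citation_tuples": citation_tuples}

print(asyncio.run(build_catalog_sketch("rendered context block")))
```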
LLM cost shape¶
Five REASONING_MID calls per catalog round. Per Sonnet 4.6 pricing:

- Per call: ~5K input tokens × $3/1M + ~1.5K output tokens × $15/1M ≈ $0.038
- Total per catalog round: ~$0.19
- Iterations 2 and 3 add another $0.38 → full audit catalog cost ~$0.57
This is a small fraction of the investigation cost (Stage E) where the real budget goes.
Context block (build_context_block)¶
Pure function that renders the shared input block consumed by all six primitive prompts. Bounded length so token cost stays predictable.
Sections rendered:

- Corpus profile summary
  - total docs / chunks
  - Top 10 doc types with counts
  - Date range (first to last year, count)
  - Jurisdictions (top 10)
- Inferred domain / audit_type / frameworks (Stage A LLM output)
- Topic clusters — top 25 with size + label
- Most-cited references — top 20 (kind:target → count) from Stage A citations, used for currency and citation_integrity catalog generation
- Auditor intake — domain, audit_purpose, frameworks, focus_areas, materiality, doc_hierarchy (top 8), known_concerns (top 8). Empty fields skipped.
- Prior findings — only when iteration > 0. Top 20 by severity, showing severity tier, primitive, and a 200-char description excerpt.
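A condensed sketch of how such a bounded renderer might look, using dict-shaped stand-ins for CorpusProfile / IntakeData (attribute names here are assumptions, not the real models):

```python
def build_context_block_sketch(profile: dict, intake: dict,
                               prior_findings: list[dict], iteration: int) -> str:
    # Every section is hard-capped so token cost stays predictable.
    lines = [f"Corpus: {profile['total_docs']} docs / {profile['total_chunks']} chunks"]
    lines += [f"- {dt}: {n}" for dt, n in profile["doc_types"][:10]]
    if intake.get("focus_areas"):
        lines.append("Focus areas: " + ", ".join(intake["focus_areas"][:8]))
    if iteration > 0 and prior_findings:
        # Severity is treated as a numeric rank here; the real model uses tiers.
        top = sorted(prior_findings, key=lambda f: f["severity"], reverse=True)[:20]
        lines += [f"[{f['severity']}] {f['primitive']}: {f['description'][:200]}" for f in top]
    return "\n".join(lines)

print(build_context_block_sketch(
    {"total_docs": 120, "total_chunks": 4800, "doc_types": [("policy", 40), ("SOP", 35)]},
    {"focus_areas": ["access control", "incident response"]},
    [], iteration=0,
))
```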
Per-primitive prompts¶
Each prompt is a system message tightly specifying:

- Role ("You are an expert auditor building a target catalog")
- The catalog target type
- Output schema (strict JSON only, no surrounding prose)
- Quality criteria specific to the primitive
- Result cap
Prompt summaries:
| Primitive | Bias signal | Output cap |
|---|---|---|
| concepts | intake.focus_areas, intake.frameworks, cluster topology | 25 |
| doc_pairs | intake.doc_hierarchy (primary), corpus doc_type distribution | 12 |
| required_elements | intake.frameworks + Stage A inferred_frameworks | 25 |
| currency_rules | profile.citations (frequency-ranked), frameworks | 15 |
| defined_terms | LLM knowledge of the domain's high-risk terms | 15 |
In iteration > 0, the concept generator prompt is augmented with: "Generate FOLLOW-UP concepts informed by the prior findings — probe related areas, deeper sub-topics, or adjacent concerns."
All targets in iteration > 0 catalog rounds are tagged source="iteration"
so downstream stages can attribute findings to deepening logic.
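For concreteness, a hypothetical parsed concepts item; the field names (label, description, seeds, priority) are the ones referenced elsewhere on this page, not a confirmed schema:

```python
# Hypothetical _gen_concepts output item after parse_strict_json().
example_concept = {
    "label": "Incident response notification timelines",
    "description": "Deadlines and escalation paths for reporting security incidents.",
    "seeds": ["incident report", "notification deadline", "escalation"],
    "priority": 0.85,   # LLM-assigned base_priority, re-weighted as described below
}
# On deepening rounds (iteration > 0) the code tags the resulting target with
# source="iteration" before it enters the catalog.
```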
Priority ranking and weighting¶
Each target carries a priority: float in [0, 1]. The final priority is
computed as:

priority = base_priority × archetype_weight × intake_boost

where:

- base_priority is the LLM-assigned 0-1 score
- archetype_weight comes from ArchetypeConfig.primitive_weights[primitive]
- intake_boost (concepts only): ×1.25 if the label/seeds match a focus_area,
  ×1.15 if they match a framework, 1.0 otherwise
Concepts feed both conflict_check and consistency_check, so the
archetype weight is averaged across those two primitives.
Final priority is clamped to ≤ 1.0. Each list is sorted descending then capped to its MAX_* limit.
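A minimal sketch of the weighting just described, mirroring _apply_concept_weighting but over dict-shaped targets and with naive substring matching:

```python
def apply_concept_weighting_sketch(concepts: list[dict], archetype_weights: dict,
                                   focus_areas: list[str], frameworks: list[str],
                                   cap: int = 30) -> list[dict]:
    # Concepts feed both conflict_check and consistency_check, so average the two.
    archetype_weight = (archetype_weights["conflict_check"]
                        + archetype_weights["consistency_check"]) / 2
    focus = [f.lower() for f in focus_areas]
    frames = [f.lower() for f in frameworks]
    for c in concepts:
        text = " ".join([c["label"], *c.get("seeds", [])]).lower()
        boost = 1.0
        if any(f in text for f in focus):
            boost *= 1.25                        # intake focus_area match
        if any(f in text for f in frames):
            boost *= 1.15                        # intake framework match
        c["priority"] = min(1.0, c["priority"] * archetype_weight * boost)
    concepts.sort(key=lambda c: c["priority"], reverse=True)
    return concepts[:cap]                        # MAX_* cap for concepts is 30
```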
Citation tuples (build_citation_tuples)¶
Pure function — no LLM call. Promotes Stage A citations to CitationTuple
objects with frequency-based priority:
priority = 0.5 + 0.5 × (frequency / max_frequency)
priority *= archetype_weight["citation_integrity_check"]
priority = min(1.0, priority)
Dedupes within (citing_doc, citing_section, to_target) tuples; picks the
most-cited targets first. The cited_subject is rendered as
"{kind}:{target}" (e.g., "far:52.204-21") so downstream primitives
have a stable identifier.
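A sketch of that promotion step, assuming tuple-shaped Stage A citations and dict-shaped outputs rather than the real typed models:

```python
from collections import Counter

def build_citation_tuples_sketch(citations: list[tuple], archetype_weights: dict,
                                 cap: int = 50) -> list[dict]:
    # citations: (citing_doc, citing_section, kind, target) tuples from Stage A.
    freq = Counter((kind, target) for _, _, kind, target in citations)
    max_freq = max(freq.values(), default=1)
    seen, out = set(), []
    for citing_doc, citing_section, kind, target in citations:
        key = (citing_doc, citing_section, target)          # dedupe key
        if key in seen:
            continue
        seen.add(key)
        priority = 0.5 + 0.5 * (freq[(kind, target)] / max_freq)
        priority = min(1.0, priority * archetype_weights["citation_integrity_check"])
        out.append({"cited_subject": f"{kind}:{target}",    # e.g. "far:52.204-21"
                    "citing_doc": citing_doc,
                    "citing_section": citing_section,
                    "priority": priority})
    out.sort(key=lambda t: t["priority"], reverse=True)     # most-cited first
    return out[:cap]
```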
Failure isolation¶
Each generator wraps its LLM call in try/except. If a call fails (timeout,
unparseable JSON, provider error), the generator returns [] for that
primitive — the rest of the catalog still completes. Logged as
auditforge_catalog_llm_failed | step=X err=Y.
This matters for production: a transient Anthropic outage during catalog generation should not abort an audit that's already cost the firm money to profile. Stage D (validate) and Stage E (investigate) run with whatever catalog content is available.
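The isolation pattern, as a hypothetical wrapper (the real generators inline their own try/except around the LLM call):

```python
import logging

logger = logging.getLogger("auditforge")

async def _safe_generate(step: str, coro) -> list:
    # One failed generator degrades to an empty target list instead of
    # aborting the whole catalog round.
    try:
        return await coro
    except Exception as err:    # timeout, unparseable JSON, provider error, ...
        logger.warning("auditforge_catalog_llm_failed | step=%s err=%s", step, err)
        return []
```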
JSON parsing tolerance¶
parse_strict_json() strips code fences, leading prose, and finds the
first/last { / } to extract the object. Returns {} on unparseable
input. REASONING_MID models reliably output clean JSON when the system
prompt asks for it, but the parser is defensive.
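A minimal sketch of that tolerance (illustrative, not the module's exact implementation):

```python
import json

def parse_strict_json_sketch(raw: str) -> dict:
    # Ignore code fences and surrounding prose by slicing from the first "{"
    # to the last "}"; fall back to {} when nothing object-shaped parses.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return {}
    try:
        obj = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return {}
    return obj if isinstance(obj, dict) else {}
```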
Per-target field validation:

- Strings truncated to bounded lengths (label 200, description 600)
- List items truncated to bounded lengths and counts
- Priority clamped to [0, 1]
- Optional strings normalize empty → None
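A sketch of the field-level normalizers implied by this list; the fallback value for non-numeric priorities is an assumption, not documented behavior:

```python
def clamp_priority(value) -> float:
    try:
        return min(1.0, max(0.0, float(value)))
    except (TypeError, ValueError):
        return 0.0                             # assumed fallback for garbage input

def truncate(value: str, max_len: int) -> str:
    return value[:max_len]                     # e.g. label -> 200, description -> 600

def optional_str(value) -> str | None:
    if not isinstance(value, str) or not value.strip():
        return None                            # empty/whitespace normalizes to None
    return value.strip()
```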
Test coverage¶
| Area | Cases |
|---|---|
| build_context_block | 5 (profile, intake, omitted-empty, iteration gating, severity sort) |
| parse_strict_json | 5 (clean, fenced, prose, garbage, array-only) |
| Priority helpers | 4 (clamp normal, clamp bounds, clamp garbage, optional_str) |
| build_citation_tuples | 4 (basic frequency, archetype weight, dedupe, empty) |
| _apply_concept_weighting | 3 (focus_area boost, archetype average, cap at 1.0) |
| End-to-end mocked LLM | 1 (all six primitives populated, REASONING_MID tier, 5 calls) |
| Iteration deepening | 1 (prior findings in prompts, source tagged) |
All 23 cases passing. The end-to-end test verifies the parallel dispatch glues correctly: scripted JSON per primitive flows through to typed targets in the right slots.
Public API¶
async def build_catalog(
    profile: CorpusProfile,
    intake: IntakeData,
    archetype: ArchetypeKind,
    prior_findings: list[Finding],
    iteration: int,
    llm: LLMClient,
) -> Catalog
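A hypothetical call site (variable names illustrative; the inputs come from Stage A, the engagement intake, and the engagement's archetype):

```python
from app.auditforge.catalog import build_catalog

async def run_stage_b(profile, intake, archetype, llm):
    catalog = await build_catalog(profile, intake, archetype,
                                  prior_findings=[], iteration=0, llm=llm)
    # Deepening rounds pass the prior findings and bump iteration, which switches
    # the concept prompt to follow-up mode and tags targets source="iteration".
    return catalog
```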
Known limits / future work¶
- No external knowledge lookup. Required-element generation relies on the LLM's training-data knowledge of frameworks. For low-resource or emerging frameworks, results may be thin. Phase 2: optional Context7-style framework reference cache.
- Defined terms are LLM-proposed, not corpus-extracted. A term proposed by the LLM might not actually be defined in the corpus, in which case Stage D (validation) drops it via the relevance floor. We could pre-filter by first extracting terms the corpus itself defines via regex-based "X means Y" / "X is defined as Y" patterns (sketched after this list). Phase 1.5 hardening.
- Currency rules rely on LLM knowledge of supersession chains. For obscure standards this is brittle. Phase 2: optional ecfr.gov / NIST publication-database lookup for currency verification.
- Per-iteration token cost is roughly fixed. Three rounds of deepening cost ~3× the catalog cost. Future optimization: skip primitives that produced no findings in prior rounds.
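A sketch of that Phase 1.5 defined-term pre-filter, with illustrative regex patterns rather than a planned implementation:

```python
import re

DEFINITION_PATTERNS = [
    re.compile(r'"?(?P<term>[A-Z][\w \-]{2,60})"?\s+means\s+', re.IGNORECASE),
    re.compile(r'"?(?P<term>[A-Z][\w \-]{2,60})"?\s+is defined as\s+', re.IGNORECASE),
]

def extract_defined_terms(text: str) -> set[str]:
    # Harvest terms the corpus itself defines.
    terms: set[str] = set()
    for pattern in DEFINITION_PATTERNS:
        terms.update(m.group("term").strip().lower() for m in pattern.finditer(text))
    return terms

def prefilter(proposed: list[str], corpus_text: str) -> list[str]:
    # Keep only LLM-proposed terms that actually appear as corpus definitions.
    corpus_terms = extract_defined_terms(corpus_text)
    return [t for t in proposed if t.lower() in corpus_terms]
```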