
Stage C — Synthesize

Status: ✅ Complete
Files: app/auditforge/synthesizer.py, app/auditforge/primitives/*.py
Tests: tests/test_auditforge_synthesizer.py — 19 cases passing

Purpose

Stage C turns catalog targets into executable Questions. Each Question is self-describing: it carries the runtime prompt template, retrieval scope, expected evidence shape, severity weight, and budget cap so the orchestrator (Stage E) can run it without external context.

Synthesis is mostly pure — no LLM calls. The runtime prompts are parameterized templates that Stage E fills with retrieved context at execution time. This keeps Stage C cost essentially zero and makes question generation deterministic given the same catalog input.

Output: list[Question]

Sorted in descending order of archetype_weight × severity_weight, so Stage D validation has a stable preference order for breaking ties.

@dataclass
class Question:
    id: str                              # q-{12 hex chars}
    parent_id: str | None                # for follow-up chains
    primitive: str                       # which executor to dispatch to
    dimension: str                       # human-readable label

    archetype_weight: float              # set by synthesizer dispatch
    severity_weight: float               # derived from target priority

    scope: QuestionScope                 # retrieval scope
    prompt_template: str                 # runtime prompt with {placeholders}
    prompt_variables: dict               # values for placeholders
    expected_evidence_shape: dict        # validator/parser hints

    budget_cents: int                    # per-question soft cap

    # Filled by Stage E:
    retrieval_results: list[dict]
    llm_response: str | None
    finding_id: str | None

Pipeline

Catalog                                     Per-primitive synthesize() functions
                                            (template-only, no LLM)
                       ┌────────────────────────────────────────┐
catalog.concepts ─────►│ conflict.synthesize()                  │
                       │ consistency.synthesize()               │
catalog.defined_terms ►│                                        │
catalog.required ─────►│ coverage.synthesize()                  │
catalog.currency_rules►│ currency.synthesize()                  │
catalog.doc_pairs ────►│ flow_down.synthesize()                 │
catalog.citation_tuples│ citation_integrity.synthesize()        │
                       └────────────────────────────────────────┘
                          synthesize_questions() applies
                          archetype.primitive_weights to each
                          question's archetype_weight
                          Sort by archetype × severity desc
                                     list[Question]

Per-primitive synthesis

Each primitive's synthesize(targets, intake, llm) is async by interface convention but performs no awaited work in v1. Each maps targets to questions one-to-one, except flow_down, which expands each DocPair by its expected_clause_classes.
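The one-to-one vs. expanded mapping can be sketched as follows. This is illustrative only: DocPair's fields and the "general" fallback are assumptions based on the behavior described here and in the test list below, not the real catalog types.

```python
from dataclasses import dataclass, field

@dataclass
class DocPair:
    # Hypothetical shape of a catalog doc-pair target (illustrative names).
    parent_doc: str
    child_doc: str
    expected_clause_classes: list = field(default_factory=list)

def expand_flow_down_targets(doc_pairs: list[DocPair]) -> list[tuple[DocPair, str]]:
    """flow_down expands each DocPair into one target per clause class;
    a pair with no listed classes falls back to a single general check."""
    targets = []
    for pair in doc_pairs:
        classes = pair.expected_clause_classes or ["general"]
        for clause_class in classes:
            targets.append((pair, clause_class))
    return targets
```

Every other primitive skips the inner loop and emits exactly one question per target.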

| Primitive | Mapping | Severity weight tiers (priority → weight) |
| --- | --- | --- |
| conflict_check | 1 question per ConceptTarget | 0.8+ → 0.9, 0.6+ → 0.7, 0.4+ → 0.5, else 0.3 |
| consistency_check | 1 question per DefinedTermTarget | 0.8+ → 0.85, 0.6+ → 0.65, else 0.45 |
| coverage_check | 1 question per RequiredElement | 0.8+ → 0.9, 0.6+ → 0.7, else 0.5 |
| currency_check | 1 question per CurrencyRule | 0.8+ → 0.85, 0.6+ → 0.65, else 0.45 |
| flow_down_check | 1 question per (DocPair × clause_class) | 0.8+ → 0.95, 0.6+ → 0.75, else 0.55 |
| citation_integrity_check | 1 question per CitationTuple | 0.9+ → 0.7, 0.7+ → 0.5, else 0.35 |

Severity tier rationales:

  • flow_down_check is highest because flow-down gaps in govt-contractor work are typically high-stakes (compliance breach risk).
  • coverage_check is next because absent required elements often map directly to remediation engagements.
  • citation_integrity_check tops out lowest because citation issues are often clerical rather than substantive.
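The priority → weight tiering in the table reduces to a threshold walk. A minimal sketch, assuming the tiers are expressed as (threshold, weight) pairs sorted high to low (the real per-primitive constants may be inlined rather than table-driven):

```python
def severity_weight_for(priority: float,
                        tiers: list[tuple[float, float]],
                        floor: float) -> float:
    """Map a target priority to a severity weight via descending thresholds.
    `tiers` is [(threshold, weight), ...] sorted high-to-low; `floor` is the
    'else' weight from the table."""
    for threshold, weight in tiers:
        if priority >= threshold:
            return weight
    return floor

# The flow_down_check tiers from the table above:
FLOW_DOWN_TIERS = [(0.8, 0.95), (0.6, 0.75)]
```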

Runtime prompt templates

Each primitive's RUNTIME_PROMPT_TEMPLATE is the prompt Stage E sends to the LLM at execution. Templates use str.format-style placeholders:

  • {retrieved_context} — injected by Stage E from corpus retrieval
  • Per-primitive variables — populated at synthesis time and stored on the Question (e.g., {concept_label}, {element_name}, {parent_doc_type}, {cited_subject})

Each template specifies a strict-JSON output schema with:

  • A found_* boolean (whether a finding fired)
  • severity ∈ {critical, high, medium, low}
  • confidence ∈ [0, 1]
  • description, evidence (verbatim quotes, doc-anchored)
  • A remediation block: scope_of_work, estimated_effort_hours, risk_if_unaddressed

This unified output shape means Stage E can use one parser to convert LLM responses to Finding objects regardless of primitive.
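A sketch of how the template fill and the single parser fit together. The template text and the trimmed schema are invented for illustration; the real RUNTIME_PROMPT_TEMPLATEs live in each primitive module and carry the full remediation block.

```python
import json

# Hypothetical conflict-style template (illustrative, not the real one).
TEMPLATE = (
    "Check whether '{concept_label}' is treated consistently.\n"
    "Context:\n{retrieved_context}\n"
    'Respond with strict JSON: {{"found_conflict": bool, "severity": str, '
    '"confidence": float, "description": str, "evidence": [str]}}'
)

def render_prompt(template: str, synthesis_vars: dict, retrieved_context: str) -> str:
    # Stage E merges synthesis-time prompt_variables with runtime retrieval.
    return template.format(retrieved_context=retrieved_context, **synthesis_vars)

def parse_response(raw: str) -> dict:
    """One parser for every primitive, because all templates share the schema."""
    data = json.loads(raw)
    found_keys = [k for k in data if k.startswith("found_")]
    assert found_keys and isinstance(data[found_keys[0]], bool)
    assert data["severity"] in {"critical", "high", "medium", "low"}
    assert 0.0 <= data["confidence"] <= 1.0
    return data
```

Note the doubled braces in the template: str.format treats `{{`/`}}` as literal braces, so the JSON schema survives formatting.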

Evidence shapes

EVIDENCE_SHAPE per primitive declares minimum evidence requirements that Stage D validation and Stage E parsing enforce:

EVIDENCE_SHAPE = {
    "min_evidence_items": int,           # minimum verbatim quotes
    "verbatim_quotes_required": bool,    # absence-findings can skip quotes
    "fields": list[str],                 # expected JSON fields
}

coverage_check allows min_evidence_items=0 because absence findings have no positive evidence to quote (the absence itself is the evidence). The other primitives require at least one or two verbatim quotes.
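A minimal sketch of the enforcement check, assuming the EVIDENCE_SHAPE dict shape shown above (the real checks sit in Stage D validation and Stage E parsing):

```python
def meets_evidence_shape(parsed: dict, shape: dict) -> bool:
    """Check a parsed LLM response against a primitive's EVIDENCE_SHAPE."""
    evidence = parsed.get("evidence", [])
    # Minimum quote count (0 for absence-style findings like coverage_check).
    if len(evidence) < shape["min_evidence_items"]:
        return False
    # When quotes are required, each item must be a non-empty string.
    if shape["verbatim_quotes_required"]:
        if not all(isinstance(q, str) and q.strip() for q in evidence):
            return False
    # All declared JSON fields must be present.
    return all(field in parsed for field in shape["fields"])
```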

Top-level dispatch

async def synthesize_questions(
    catalog: Catalog,
    intake: IntakeData,
    archetype: ArchetypeKind,
    llm: LLMClient,
) -> list[Question]

For each primitive, fetch its module, call synthesize(targets, intake, llm), apply the archetype's weight to each returned question, accumulate, then sort by archetype_weight × severity_weight descending. Failures are isolated: if one primitive's synthesize() raises, that primitive contributes no questions, but the remaining primitives still run.
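The dispatch loop can be sketched as below. The `primitives` mapping and the stub question objects are assumptions for illustration; the real function resolves modules and catalog targets itself.

```python
import asyncio

async def synthesize_questions_sketch(catalog, intake, archetype, llm, primitives):
    """Illustrative dispatch (real signature shown above).
    `primitives` maps primitive name -> (module, targets) for this sketch."""
    questions = []
    for name, (module, targets) in primitives.items():
        try:
            qs = await module.synthesize(targets, intake, llm)
        except Exception:
            continue  # isolate failures: other primitives still run
        weight = archetype.primitive_weights.get(name, 0.0)
        for q in qs:
            q.archetype_weight = weight  # archetype weight applied at dispatch
        questions.extend(qs)
    # Descending composite weight gives Stage D its stable preference order.
    questions.sort(key=lambda q: q.archetype_weight * q.severity_weight,
                   reverse=True)
    return questions
```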

Per-question budget

Default budget_cents=5 (flow_down: 7 due to paired retrieval, citation_integrity: 4 due to typically short retrieval). Stage E enforces these soft caps; if a question is approaching its budget, the orchestrator may downgrade tier or truncate retrieval before the hard engagement-level cap fires.
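The "approaching its budget" trigger could be a simple headroom check; the 0.8 headroom fraction here is an assumption, not a documented constant:

```python
def over_soft_budget(spent_cents: int, budget_cents: int,
                     headroom: float = 0.8) -> bool:
    """Stage E-style soft-cap check (headroom fraction is an assumption):
    once spend crosses the threshold, downgrade tier or truncate retrieval
    rather than abort, leaving the hard engagement-level cap as backstop."""
    return spent_cents >= budget_cents * headroom
```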

Question identifiers

make_question_id() produces q-{12 hex chars}. Collisions are vanishingly rare, and the random IDs leak no business signal (nothing reveals question or engagement volume, as sequential IDs would).
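A sketch of the scheme, assuming `secrets.token_hex` as the randomness source (the real helper may differ):

```python
import secrets

def make_question_id() -> str:
    # q-{12 hex chars}: 6 random bytes = 48 bits, so collisions are
    # vanishingly rare at any realistic question volume.
    return f"q-{secrets.token_hex(6)}"
```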

Parent IDs (for deepening chains) are populated when a follow-up catalog round produces questions stemming from a specific finding. The lineage is preserved through Stage F so deliverables can show why a follow-up question was asked.

Test coverage

| Area | Cases |
| --- | --- |
| conflict.synthesize | 4 (per-target mapping, severity tiering, scope, evidence shape) |
| coverage.synthesize | 2 (with/without doc_type fallback) |
| flow_down.synthesize | 4 (per-clause expansion, general fallback, scope, severity) |
| consistency.synthesize | 1 (per-term mapping) |
| currency.synthesize | 1 (per-rule mapping with supersession context) |
| citation_integrity.synthesize | 1 (per-tuple mapping) |
| Top-level dispatch | 4 (all primitives invoked, archetype weights, sort order, empty catalog) |
| Question dataclass + IDs | 2 (id uniqueness, default values) |

All 19 cases passing. Full suite: 292 pass, no regressions.

Public API

from app.auditforge.synthesizer import synthesize_questions, Question, QuestionScope
from app.auditforge.primitives import conflict, coverage, currency, ...

# Top-level dispatch (recommended)
questions = await synthesize_questions(catalog, intake, archetype, llm)

# Direct per-primitive (testing or specialized flows)
qs = await conflict.synthesize(catalog.concepts, intake, llm)

Known limits / future work

  • All synthesis is template-based. For complex catalogs we may want LLM-driven question synthesis where the LLM proposes specialized per-target prompts. Phase 2 hardening if template prompts plateau.
  • No question deduplication at synthesis time — Stage D handles dedupe via embedding similarity. This means very similar concepts may produce near-duplicate questions until Stage D catches them. Acceptable trade-off; keeps Stage C deterministic.
  • flow_down_check expansion by clause-class can produce many questions for one DocPair with many classes. The MAX_DOC_PAIRS=15 catalog cap combined with the per-pair clause-class list bounds this; in extreme cases Stage D dedupe will prune.