Stage C — Synthesize¶
Status: ✅ Complete
Files: app/auditforge/synthesizer.py, app/auditforge/primitives/*.py
Tests: tests/test_auditforge_synthesizer.py — 19 cases passing
Purpose¶
Stage C turns catalog targets into executable Questions. Each Question is self-describing: it carries the runtime prompt template, retrieval scope, expected evidence shape, severity weight, and budget cap so the orchestrator (Stage E) can run it without external context.
Synthesis is mostly pure — no LLM calls. The runtime prompts are parameterized templates that Stage E fills with retrieved context at execution time. This keeps Stage C cost essentially zero and makes question generation deterministic given the same catalog input.
Output: list[Question]¶
Questions are sorted in descending order of archetype_weight × severity_weight, giving Stage D validation a stable preference order for breaking ties.
```python
@dataclass
class Question:
    id: str                        # q-{12 hex chars}
    parent_id: str | None          # for follow-up chains
    primitive: str                 # which executor to dispatch to
    dimension: str                 # human-readable label
    archetype_weight: float        # set by synthesizer dispatch
    severity_weight: float         # derived from target priority
    scope: QuestionScope           # retrieval scope
    prompt_template: str           # runtime prompt with {placeholders}
    prompt_variables: dict         # values for placeholders
    expected_evidence_shape: dict  # validator/parser hints
    budget_cents: int              # per-question soft cap
    # Filled by Stage E:
    retrieval_results: list[dict]
    llm_response: str | None
    finding_id: str | None
```
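A minimal sketch of the ordering contract (field names mirror the dataclass above; the ids and weights are invented for illustration):

```python
# Illustrative only: sort questions by the composite key Stage C uses.
questions = [
    {"id": "q-a", "archetype_weight": 1.0, "severity_weight": 0.9},
    {"id": "q-b", "archetype_weight": 0.8, "severity_weight": 0.9},
    {"id": "q-c", "archetype_weight": 1.0, "severity_weight": 0.95},
]
questions.sort(
    key=lambda q: q["archetype_weight"] * q["severity_weight"],
    reverse=True,
)
print([q["id"] for q in questions])  # ['q-c', 'q-a', 'q-b']
```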
Pipeline¶
```
Catalog                  Per-primitive synthesize() functions
                         (template-only, no LLM)
                       ┌──────────────────────────────────┐
catalog.concepts ─────►│ conflict.synthesize()            │
                       │ consistency.synthesize()         │
catalog.defined_terms ►│                                  │
catalog.required ─────►│ coverage.synthesize()            │
catalog.currency_rules►│ currency.synthesize()            │
catalog.doc_pairs ────►│ flow_down.synthesize()           │
catalog.citation_tuples│ citation_integrity.synthesize()  │
                       └──────────────────────────────────┘
                                        │
                                        ▼
                        synthesize_questions() applies
                        archetype.primitive_weights to each
                        question's archetype_weight
                                        │
                                        ▼
                        Sort by archetype × severity desc
                                        │
                                        ▼
                               list[Question]
```
Per-primitive synthesis¶
Each primitive's synthesize(targets, intake, llm) is async by interface convention but performs no awaited work in v1. Each primitive maps targets to questions 1-to-1, except flow_down, which expands each DocPair by its expected_clause_classes.
| Primitive | Mapping | Severity weight tiers (priority → weight) |
|---|---|---|
| conflict_check | 1 question per ConceptTarget | 0.8+ → 0.9, 0.6+ → 0.7, 0.4+ → 0.5, else 0.3 |
| consistency_check | 1 question per DefinedTermTarget | 0.8+ → 0.85, 0.6+ → 0.65, else 0.45 |
| coverage_check | 1 question per RequiredElement | 0.8+ → 0.9, 0.6+ → 0.7, else 0.5 |
| currency_check | 1 question per CurrencyRule | 0.8+ → 0.85, 0.6+ → 0.65, else 0.45 |
| flow_down_check | 1 question per (DocPair × clause_class) | 0.8+ → 0.95, 0.6+ → 0.75, else 0.55 |
| citation_integrity_check | 1 question per CitationTuple | 0.9+ → 0.7, 0.7+ → 0.5, else 0.35 |
Severity tier rationales:
- flow_down_check is highest because flow-down gaps in government-contractor work are typically high-stakes (compliance breach risk).
- coverage_check is next because absent required elements often map directly to remediation engagements.
- citation_integrity_check has the lowest top tier because citation issues are often clerical rather than substantive.
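As a sketch of the priority → weight tiering from the table above, here is the conflict_check column (0.8+ → 0.9, 0.6+ → 0.7, 0.4+ → 0.5, else 0.3); the function name is hypothetical:

```python
# Hypothetical helper mirroring conflict_check's severity tiers.
def conflict_severity_weight(priority: float) -> float:
    if priority >= 0.8:
        return 0.9
    if priority >= 0.6:
        return 0.7
    if priority >= 0.4:
        return 0.5
    return 0.3

assert conflict_severity_weight(0.85) == 0.9
assert conflict_severity_weight(0.65) == 0.7
assert conflict_severity_weight(0.45) == 0.5
assert conflict_severity_weight(0.1) == 0.3
```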
Runtime prompt templates¶
Each primitive's RUNTIME_PROMPT_TEMPLATE is the prompt Stage E sends
to the LLM at execution. Templates use str.format-style placeholders:
- {retrieved_context}: injected by Stage E from corpus retrieval
- Per-primitive variables: populated at synthesis time and stored on the Question (e.g. {concept_label}, {element_name}, {parent_doc_type}, {cited_subject})
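A concrete (invented) illustration of the two placeholder kinds: the template text below is made up, but the fill pattern matches the division of labor described above.

```python
# Invented template; {concept_label} is a synthesis-time variable,
# {retrieved_context} is injected by Stage E at execution time.
RUNTIME_PROMPT_TEMPLATE = (
    "Assess whether {concept_label} is treated consistently in the "
    "following excerpts:\n{retrieved_context}"
)

prompt_variables = {"concept_label": "termination for convenience"}
retrieved_context = "Section 12.3: ..."  # from corpus retrieval

prompt = RUNTIME_PROMPT_TEMPLATE.format(
    retrieved_context=retrieved_context, **prompt_variables
)
```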
Each template specifies a strict-JSON output schema with:
- A found_* boolean (whether a finding fired)
- severity ∈ {critical, high, medium, low}
- confidence ∈ [0, 1]
- description, evidence (verbatim quotes, doc-anchored)
- A remediation block: scope_of_work, estimated_effort_hours,
risk_if_unaddressed
This unified output shape means Stage E can use one parser to convert
LLM responses to Finding objects regardless of primitive.
Evidence shapes¶
EVIDENCE_SHAPE per primitive declares minimum evidence requirements
that Stage D validation and Stage E parsing enforce:
```python
EVIDENCE_SHAPE = {
    "min_evidence_items": int,         # minimum verbatim quotes
    "verbatim_quotes_required": bool,  # absence-findings can skip quotes
    "fields": list[str],               # expected JSON fields
}
```
coverage_check allows min_evidence_items=0 because absence findings
have no positive evidence to quote (the absence itself is the evidence).
Other primitives require at least 1-2 verbatim quotes.
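A minimal sketch of how such a shape might be enforced; the check function is hypothetical, and the shape values mirror the coverage_check exception described above.

```python
# Hypothetical enforcement of EVIDENCE_SHAPE minimums.
def evidence_ok(shape: dict, evidence: list[str]) -> bool:
    if len(evidence) < shape["min_evidence_items"]:
        return False
    if shape["verbatim_quotes_required"] and not evidence:
        return False
    return True

coverage_shape = {
    "min_evidence_items": 0,
    "verbatim_quotes_required": False,
    "fields": ["found_missing", "severity"],
}
conflict_shape = {
    "min_evidence_items": 1,
    "verbatim_quotes_required": True,
    "fields": ["found_conflict", "severity"],
}

assert evidence_ok(coverage_shape, [])       # absence finding: no quotes needed
assert not evidence_ok(conflict_shape, [])   # positive finding: quotes required
```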
Top-level dispatch¶
```python
async def synthesize_questions(
    catalog: Catalog,
    intake: IntakeData,
    archetype: ArchetypeKind,
    llm: LLMClient,
) -> list[Question]
```
For each primitive, fetch its module, call synthesize(targets, intake, llm), apply the archetype's weight to each returned question, accumulate, then sort by archetype_weight × severity_weight descending. Failures are isolated: if one primitive's synthesize raises, that primitive contributes no questions, but the rest still run.
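The loop above can be sketched as follows. The stub registry, target lookup, and question class are invented stand-ins; what matters is the failure isolation and the weight-then-sort flow.

```python
import asyncio
import logging

# Hypothetical stand-ins for the real primitive modules.
class _Q:
    def __init__(self, sev: float):
        self.archetype_weight = 1.0
        self.severity_weight = sev

async def _good(targets, intake, llm):
    return [_Q(0.9)]

async def _bad(targets, intake, llm):
    raise RuntimeError("boom")

PRIMITIVES = {"conflict_check": _good, "broken_check": _bad}

async def synthesize_questions(catalog, intake, archetype_weights, llm):
    questions = []
    for name, synth in PRIMITIVES.items():
        try:
            qs = await synth(catalog.get(name, []), intake, llm)
        except Exception:
            logging.exception("synthesis failed for %s", name)
            continue  # isolate the failure; keep the other primitives
        for q in qs:
            q.archetype_weight = archetype_weights.get(name, 1.0)
        questions.extend(qs)
    questions.sort(key=lambda q: q.archetype_weight * q.severity_weight,
                   reverse=True)
    return questions

result = asyncio.run(
    synthesize_questions({}, None, {"conflict_check": 0.8}, None)
)
assert len(result) == 1 and result[0].archetype_weight == 0.8
```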
Per-question budget¶
Default budget_cents=5 (flow_down: 7 due to paired retrieval,
citation_integrity: 4 due to typically short retrieval). Stage E
enforces these soft caps; if a question is approaching its budget,
the orchestrator may downgrade tier or truncate retrieval before the
hard engagement-level cap fires.
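The per-primitive defaults above amount to a small lookup; names here are illustrative assumptions, not the real module's API.

```python
# Hypothetical budget lookup mirroring the defaults described above.
BUDGET_CENTS = {"flow_down_check": 7, "citation_integrity_check": 4}
DEFAULT_BUDGET_CENTS = 5

def budget_for(primitive: str) -> int:
    return BUDGET_CENTS.get(primitive, DEFAULT_BUDGET_CENTS)

assert budget_for("conflict_check") == 5
assert budget_for("flow_down_check") == 7
assert budget_for("citation_integrity_check") == 4
```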
Question identifiers¶
make_question_id() produces q-{12 hex chars}. Collisions are vanishingly rare, and the ids leak no business information (no sequence or volume signal).
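A plausible implementation (the real one may differ): 12 hex chars is 48 random bits, so collisions are negligible at realistic volumes.

```python
import secrets

def make_question_id() -> str:
    # 6 random bytes -> 12 hex chars; no counter, so no volume signal leaks.
    return f"q-{secrets.token_hex(6)}"

qid = make_question_id()
assert qid.startswith("q-") and len(qid) == 14
```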
Parent IDs (for deepening chains) are populated when a follow-up catalog round produces questions stemming from a specific finding. The lineage is preserved through Stage F so deliverables can show why a follow-up question was asked.
Test coverage¶
| Area | Cases |
|---|---|
| conflict.synthesize | 4 (per-target mapping, severity tiering, scope, evidence shape) |
| coverage.synthesize | 2 (with/without doc_type fallback) |
| flow_down.synthesize | 4 (per-clause expansion, general fallback, scope, severity) |
| consistency.synthesize | 1 (per-term mapping) |
| currency.synthesize | 1 (per-rule mapping with supersession context) |
| citation_integrity.synthesize | 1 (per-tuple mapping) |
| Top-level dispatch | 4 (all primitives invoked, archetype weights, sort order, empty catalog) |
| Question dataclass + IDs | 2 (id uniqueness, default values) |
All 19 cases passing. Full suite: 292 pass, no regressions.
Public API¶
```python
from app.auditforge.synthesizer import synthesize_questions, Question, QuestionScope
from app.auditforge.primitives import conflict, coverage, currency, ...

# Top-level dispatch (recommended)
questions = await synthesize_questions(catalog, intake, archetype, llm)

# Direct per-primitive (testing or specialized flows)
qs = await conflict.synthesize(catalog.concepts, intake, llm)
```
Known limits / future work¶
- All synthesis is template-based. For complex catalogs we may want LLM-driven question synthesis where the LLM proposes specialized per-target prompts. Phase 2 hardening if template prompts plateau.
- No question deduplication at synthesis time — Stage D handles dedupe via embedding similarity. This means very similar concepts may produce near-duplicate questions until Stage D catches them. Acceptable trade-off; keeps Stage C deterministic.
- flow_down_check expansion by clause class can produce many questions for a single DocPair with many classes. The MAX_DOC_PAIRS=15 catalog cap, combined with the per-pair clause-class list, bounds this; in extreme cases Stage D dedupe will prune.