Stage D — Validate¶
Status: ✅ Complete
File: app/auditforge/validator.py
Tests: tests/test_auditforge_validator.py — 23 cases passing
Purpose¶
Stage D is a pre-flight on synthesized questions. It drops questions that won't produce useful findings before Stage E spends full LLM budget on them. Two filters in order (cheap → expensive elimination):
- Relevance floor — retrieve corpus chunks for each question's scope-aware query; drop if no chunk scores above the floor (the corpus has no purchase on this question).
- Near-duplicate dedupe — embed each question's dimension, greedy-cluster by cosine similarity, keep the highest-priority representative per cluster.
Information-gain scoring is reserved for a future iteration once we have real prior-findings density data to drive thresholds.
Output: ValidationResult¶
@dataclass
class ValidationResult:
    questions: list[Question]            # passed all filters
    dropped: list[tuple[Question, str]]  # (question, reason)
Reasons come in three flavors:
- "no retrieval results"
- "max relevance X.XXX < floor Y.YYY"
- "near-dup of q-XXXX (sim=Z.ZZZ)"
The dropped log is preserved on the engagement so the auditor can audit the validator's decisions if a finding seems missing.
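The reason strings follow fixed formats, which keeps the dropped log easy to grep. A minimal sketch of how they might be rendered (these helper names are hypothetical, not the validator's real internals):

```python
# Hypothetical helpers producing the drop-reason formats shown above.

def relevance_reason(score: float, floor: float) -> str:
    # e.g. "max relevance 0.123 < floor 0.350"
    return f"max relevance {score:.3f} < floor {floor:.3f}"

def near_dup_reason(question_id: str, sim: float) -> str:
    # e.g. "near-dup of q-3f8a (sim=0.941)"
    return f"near-dup of {question_id} (sim={sim:.3f})"
```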
Pipeline¶
list[Question]
│
▼
build_relevance_query(question) ◄── per-primitive query string
│
▼
retriever(client_id, query) ◄── concurrent (asyncio.gather)
│
▼
score = max(_rerank_score for top-5 chunks)
score < floor? ──► drop with reason
score ≥ floor? ──► stash chunks on question.retrieval_results
│
▼ (Stage E reuses these — saves a retrieval round-trip per question)
embedder(dimensions) ◄── default: Metis SentenceTransformer
│
▼
greedy similarity clustering
within cluster: keep max(archetype_weight × severity_weight)
drop the rest
│
▼
ValidationResult
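The relevance gate in the middle of the diagram can be sketched as follows. The `_rerank_score` key matches the scoring step above, but the chunk-dict shape is an assumption for illustration:

```python
# Sketch of the relevance gate: best rerank score among the top-5 chunks,
# compared against the floor. Chunk dict shape is assumed, not the real API.

def max_relevance(chunks: list[dict], top_k: int = 5) -> float:
    """Best _rerank_score among the first top_k chunks (0.0 if empty)."""
    scores = [c.get("_rerank_score", 0.0) for c in chunks[:top_k]]
    return max(scores, default=0.0)

def passes_floor(chunks: list[dict], floor: float = 0.35) -> bool:
    """True if the question should survive the relevance filter."""
    return max_relevance(chunks) >= floor
```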
Per-primitive relevance queries (build_relevance_query)¶
Pure function rendering the query string used for the relevance check.
Per-primitive selection of which prompt_variables to concatenate:
| Primitive | Query construction |
|---|---|
| conflict_check | concept_label seed_terms_str (skip seeds if "(none)") |
| consistency_check | term |
| coverage_check | element_name description (skip default desc) |
| currency_check | subject |
| flow_down_check | clause_class parent_doc_type child_doc_type |
| citation_integrity_check | citing_doc cited_target (strip kind: prefix) |
| (unknown) | dimension (fallback) |
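The table's dispatch can be made concrete as a sketch. The real function operates on Question objects, so the flat-dict input here is an assumption:

```python
# Illustrative sketch of the per-primitive dispatch in the table above.
# Input shape (a dict of prompt_variables) is an assumption.

def build_relevance_query(primitive: str, v: dict[str, str]) -> str:
    if primitive == "conflict_check":
        parts = [v["concept_label"]]
        if v.get("seed_terms_str", "(none)") != "(none)":  # skip "(none)" seeds
            parts.append(v["seed_terms_str"])
    elif primitive == "consistency_check":
        parts = [v["term"]]
    elif primitive == "coverage_check":
        parts = [v["element_name"]]
        if v.get("description"):  # skip a default/empty description
            parts.append(v["description"])
    elif primitive == "currency_check":
        parts = [v["subject"]]
    elif primitive == "flow_down_check":
        parts = [v["clause_class"], v["parent_doc_type"], v["child_doc_type"]]
    elif primitive == "citation_integrity_check":
        # Strip a "kind:" prefix, e.g. "policy:InfoSec" -> "InfoSec"
        parts = [v["citing_doc"], v["cited_target"].split(":", 1)[-1]]
    else:
        parts = [v["dimension"]]  # fallback for unknown primitives
    return " ".join(parts)
```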
Retrieval reuse¶
A meaningful optimization: Stage D already retrieves chunks to score
relevance. Surviving questions stash those chunks on
question.retrieval_results. Stage E reuses them rather than retrieving
again — one less round-trip per question to the FAISS index.
For questions where Stage E needs additional retrieval (e.g., flow_down's paired retrieval over parent and child docs), Stage E augments. The Stage D stash is a starting set, not a constraint.
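The reuse pattern amounts to a cache check before retrieval. A sketch, with attribute and parameter names assumed for illustration:

```python
# Sketch of Stage E's reuse pattern: prefer the chunks Stage D stashed,
# fall back to a fresh retrieval. Names here are assumptions.

def chunks_for_investigation(question, retriever, client_id: str, query: str):
    cached = getattr(question, "retrieval_results", None)
    if cached:  # Stage D already paid for these chunks
        return cached
    return retriever(client_id, query)  # augment / fresh retrieval
```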
Embedding for dedupe¶
Default embedder is _default_embedder, which lazy-imports the cached
Metis SentenceTransformer (_get_embedding_model). This means the model
only loads when needed; tests pass synthetic embedders to skip the load.
The embedder signature is:
Embedder = Callable[[list[str]], np.ndarray]
# Returns shape (n, d) of L2-normalized vectors so cosine == inner product.
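A toy embedder honoring that contract, purely illustrative (the real default wraps the cached Metis SentenceTransformer):

```python
import numpy as np

def l2_normalize(m: np.ndarray) -> np.ndarray:
    """Row-normalize so cosine similarity reduces to an inner product."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m / np.clip(norms, 1e-12, None)

def toy_embedder(texts: list[str]) -> np.ndarray:
    """Stand-in with the Embedder signature: returns (n, d) unit vectors.
    Tests pass synthetic embedders like this to skip the model load."""
    seed = sum(len(t) for t in texts) or 1
    rng = np.random.default_rng(seed)
    return l2_normalize(rng.standard_normal((len(texts), 8)))
```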
Dedupe over question.dimension was chosen because it's the most
information-dense human-readable identifier per question (e.g.,
"conflict: cybersecurity training" vs. "q-3f8a2b"). For richer
dedupe, future versions could embed the rendered relevance query
instead.
Greedy similarity clustering¶
Pairwise cosine similarity matrix → greedy clustering:
for i in range(n):
if dropped[i]: continue
for j in range(i+1, n):
if dropped[j]: continue
if sim[i, j] >= threshold:
keep the higher-priority of (i, j); drop the other
This is O(n²) but bounded by Stage C's question count cap (typically
50–150 questions per audit, well under any throughput concern).
The default threshold is 0.92. Empirically, a high threshold avoids over-aggressive deduplication; close-but-not-identical questions can produce different findings via different retrieval paths.
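The loop above can be made runnable as a short sketch, assuming L2-normalized embeddings (per the embedder contract) and a priority score per question:

```python
import numpy as np

def greedy_dedupe(embeddings: np.ndarray, priorities: list[float],
                  threshold: float = 0.92) -> list[int]:
    """Return indices of kept items. Embeddings must be L2-normalized so
    embeddings @ embeddings.T is the pairwise cosine similarity matrix.
    Sketch of the greedy clustering described above, not the real code."""
    n = len(priorities)
    sim = embeddings @ embeddings.T
    dropped = [False] * n
    for i in range(n):
        if dropped[i]:
            continue
        for j in range(i + 1, n):
            if dropped[j]:
                continue
            if sim[i, j] >= threshold:
                # Keep the higher-priority member of the pair, drop the other
                if priorities[j] > priorities[i]:
                    dropped[i] = True
                    break  # i is gone; stop comparing against it
                dropped[j] = True
    return [k for k in range(n) if not dropped[k]]
```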
Failure isolation¶
- Retriever raising for one query: caught by _safe_retrieve, returns empty list → question dropped with "no retrieval results."
- Embedder raising or returning wrong shape: caught, returns the input unchanged (skip dedupe). Better to over-keep than abort the audit.
LLM cost¶
Zero. Stage D is purely retrieval + numerical operations. The retrievals themselves don't hit the LLM — they're FAISS + BM25 hybrid search inside Metis's existing infrastructure.
This makes Stage D the cheapest stage to run, which is why it should eliminate as many questions as possible before Stage E's expensive LLM-driven investigation.
Test coverage¶
| Area | Cases |
|---|---|
| build_relevance_query | 9 (per-primitive, "(none)" skip, kind: prefix strip, fallback) |
| _chunk_score | 4 (rerank preferred, fallback, missing, garbage) |
| Relevance filter | 4 (drops low score, stashes results, empty results, retriever failure isolation) |
| Dedupe | 4 (no dups, all dups, grouped dups, single question) |
| Top-level pipeline | 2 (full pipeline drops + dedup, empty input) |
All 23 cases passing. Full suite: 315 pass, no regressions.
Public API¶
async def validate_questions(
    questions: list[Question],
    *,
    client_id: str,
    retriever: Retriever,
    llm: LLMClient,
    prior_findings: list[Finding] | None = None,
    relevance_floor: float = 0.35,
    dedupe_threshold: float = 0.92,
    embedder: Embedder | None = None,
) -> ValidationResult
llm and prior_findings are accepted for interface symmetry (info-gain
scoring will use them in the future) but unused at v1.
Tunable thresholds¶
Both thresholds are exposed as kwargs. Defaults work for typical engagements but can be tuned per archetype or per engagement based on auditor feedback:
- relevance_floor=0.35 — too low → execution wastes budget on noise; too high → useful questions dropped. Cross-encoder reranker scores from Metis fall in the 0.0–1.0 range with most relevant chunks at 0.5+.
- dedupe_threshold=0.92 — too low → distinct questions collapsed; too high → near-duplicates pass through and waste Stage E budget.
Known limits / future work¶
- No information-gain scoring at v1. Real audit data will inform whether to add it. Working hypothesis: in iteration 2+, IG scoring drops the bottom decile by expected gain given prior-findings density.
- Dedupe uses dimension only. Richer dedupe would embed the rendered relevance query or the prompt template variables. Acceptable now; revisit if dedupe quality plateaus.
- No question merging. When two near-duplicate questions are detected, we drop one. We could instead merge them (combine seed terms, broaden scope). Adds complexity for unclear quality benefit.