
Stage D — Validate

Status: ✅ Complete
File: app/auditforge/validator.py
Tests: tests/test_auditforge_validator.py — 23 cases passing

Purpose

Stage D is a pre-flight check on synthesized questions: it drops questions that won't produce useful findings before Stage E spends its full LLM budget on them. Two filters run in order, cheap elimination before expensive:

  1. Relevance floor — retrieve corpus chunks for each question's scope-aware query; drop if no chunk scores above the floor (the corpus has no purchase on this question).
  2. Near-duplicate dedupe — embed each question's dimension, greedy-cluster by cosine similarity, keep the highest-priority representative per cluster.

Information-gain scoring is reserved for a future iteration once we have real prior-findings density data to drive thresholds.

Output: ValidationResult

from dataclasses import dataclass

@dataclass
class ValidationResult:
    questions: list[Question]                 # passed all filters
    dropped: list[tuple[Question, str]]       # (question, reason)

Reasons come in three flavors:

  • "no retrieval results"
  • "max relevance X.XXX < floor Y.YYY"
  • "near-dup of q-XXXX (sim=Z.ZZZ)"

The dropped log is preserved on the engagement so the auditor can audit the validator's decisions if a finding seems missing.
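A quick sketch of that audit trail after a run (assuming the identifier lives on question.id; see the Public API section below for the full call):

for question, reason in result.dropped:   # result: ValidationResult from validate_questions
    print(f"{question.id}: {reason}")     # e.g. "q-3f8a2b: near-dup of q-91c0 (sim=0.944)"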

Pipeline

list[Question]
build_relevance_query(question)             ◄── per-primitive query string
retriever(client_id, query)                  ◄── concurrent (asyncio.gather)
score = max(_rerank_score for top-5 chunks)
score < floor? ──► drop with reason
score ≥ floor? ──► stash chunks on question.retrieval_results
      ▼  (Stage E reuses these — saves a retrieval round-trip per question)
embedder(dimensions)                         ◄── default: Metis SentenceTransformer
greedy similarity clustering
       within cluster: keep max(archetype_weight × severity_weight)
       drop the rest
ValidationResult
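In code terms, the relevance half of the diagram looks roughly like this. build_relevance_query, _safe_retrieve, and _chunk_score are the module's real names (they appear in the test-coverage table below); the glue function and its signature are illustrative:

import asyncio

async def _relevance_pass(questions, *, client_id, retriever, floor):
    # One retrieval per question, fired concurrently.
    results = await asyncio.gather(
        *(_safe_retrieve(retriever, client_id, build_relevance_query(q)) for q in questions)
    )
    kept, dropped = [], []
    for q, chunks in zip(questions, results):
        if not chunks:
            dropped.append((q, "no retrieval results"))
            continue
        score = max(_chunk_score(c) for c in chunks[:5])
        if score < floor:
            dropped.append((q, f"max relevance {score:.3f} < floor {floor:.3f}"))
        else:
            q.retrieval_results = chunks   # Stage E reuses these
            kept.append(q)
    return kept, dropped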

Per-primitive relevance queries (build_relevance_query)

A pure function that renders the query string used for the relevance check. Each primitive selects which prompt_variables to concatenate:

Primitive                   Query construction
conflict_check              concept_label + seed_terms_str (skip seeds if "(none)")
consistency_check           term
coverage_check              element_name + description (skip default desc)
currency_check              subject
flow_down_check             clause_class + parent_doc_type + child_doc_type
citation_integrity_check    citing_doc + cited_target (strip kind: prefix)
(unknown)                   dimension (fallback)
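A sketch of the dispatch for two of the rows (attribute names like question.primitive and the prompt_variables dict access are assumptions, not the exact source):

def build_relevance_query(question) -> str:
    v = question.prompt_variables
    if question.primitive == "conflict_check":
        parts = [v["concept_label"]]
        if v.get("seed_terms_str", "(none)") != "(none)":   # skip placeholder seeds
            parts.append(v["seed_terms_str"])
        return " ".join(parts)
    if question.primitive == "citation_integrity_check":
        target = v["cited_target"].split(":", 1)[-1]        # strip the kind: prefix
        return f"{v['citing_doc']} {target}"
    # ...consistency/coverage/currency/flow_down follow the table above...
    return question.dimension                                # fallback for unknown primitives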

Retrieval reuse

A meaningful optimization: Stage D already retrieves chunks to score relevance. Surviving questions stash those chunks on question.retrieval_results. Stage E reuses them rather than retrieving again — one less round-trip per question to the FAISS index.

For questions where Stage E needs additional retrieval (e.g., flow_down's paired retrieval over parent and child docs), Stage E augments. The Stage D stash is a starting set, not a constraint.
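On the Stage E side the pattern is roughly as follows (needs_paired_retrieval and child_doc_query are hypothetical helpers for illustration):

chunks = list(question.retrieval_results)      # Stage D's stash as the starting set
if needs_paired_retrieval(question):           # hypothetical: e.g. flow_down's parent/child pass
    chunks += await retriever(client_id, child_doc_query(question))  # hypothetical helper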

Embedding for dedupe

Default embedder is _default_embedder, which lazy-imports the cached Metis SentenceTransformer (_get_embedding_model). This means the model only loads when needed; tests pass synthetic embedders to skip the load.

The embedder signature is:

from collections.abc import Callable

import numpy as np

Embedder = Callable[[list[str]], np.ndarray]
# Returns shape (n, d) of L2-normalized vectors so cosine == inner product.

Dedupe over question.dimension was chosen because it's the most information-dense human-readable identifier per question (e.g., "conflict: cybersecurity training" vs. "q-3f8a2b"). For richer dedupe, future versions could embed the rendered relevance query instead.
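A synthetic test embedder that honors the shape and normalization contract might look like this (illustrative, not the project's actual test fixture):

import hashlib

import numpy as np

def synthetic_embedder(texts: list[str]) -> np.ndarray:
    # Deterministic fake embeddings: seed per text, L2-normalized rows.
    rows = []
    for text in texts:
        seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
        rows.append(np.random.default_rng(seed).standard_normal(16))
    arr = np.asarray(rows)
    return arr / np.linalg.norm(arr, axis=1, keepdims=True)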

Greedy similarity clustering

Pairwise cosine similarity matrix → greedy clustering:

for i in range(n):
    if dropped[i]:
        continue
    for j in range(i + 1, n):
        if dropped[j] or sim[i, j] < threshold:
            continue
        # keep the higher-priority of (i, j); drop the other
        if priority[i] >= priority[j]:
            dropped[j] = True
        else:
            dropped[i] = True
            break  # i is now dropped; stop comparing against it

This is O(n²) but bounded by Stage C's question count cap (typically 50–150 questions per audit, well under any throughput concern).

The default is threshold=0.92. Empirically a high threshold avoids over-aggressive deduplication; close-but-not-identical questions can produce different findings via different retrieval paths.
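Putting the pieces together, a self-contained sketch (names are hypothetical; the real implementation lives in validator.py and operates on Question objects, with priority = archetype_weight × severity_weight):

import numpy as np

def greedy_dedupe(emb: np.ndarray, priority: list[float], threshold: float = 0.92) -> list[int]:
    # emb rows are L2-normalized, so the inner product IS cosine similarity.
    sim = emb @ emb.T
    n = emb.shape[0]
    dropped = [False] * n
    for i in range(n):
        if dropped[i]:
            continue
        for j in range(i + 1, n):
            if dropped[j] or sim[i, j] < threshold:
                continue
            if priority[i] >= priority[j]:
                dropped[j] = True
            else:
                dropped[i] = True
                break
    return [k for k in range(n) if not dropped[k]]  # indices of kept questions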

Failure isolation

  • Retriever raising for one query: caught by _safe_retrieve (sketched below), returns an empty list → question dropped with "no retrieval results."
  • Embedder raising or returning wrong shape: caught, returns the input unchanged (skip dedupe). Better to over-keep than abort the audit.
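The retriever guard is the simpler of the two; roughly (signature assumed from the pipeline above):

import logging

logger = logging.getLogger(__name__)

async def _safe_retrieve(retriever, client_id: str, query: str) -> list:
    try:
        return await retriever(client_id, query)
    except Exception:
        # One bad query must not abort the whole audit.
        logger.exception("retrieval failed for query %r", query)
        return []   # downstream reads this as "no retrieval results"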

LLM cost

Zero. Stage D is purely retrieval + numerical operations. The retrievals themselves don't hit the LLM — they're FAISS + BM25 hybrid search inside Metis's existing infrastructure.

This makes Stage D the cheapest stage to run, which is why it should eliminate as many questions as possible before Stage E's expensive LLM-driven investigation.

Test coverage

Area                    Cases
build_relevance_query   9 (per-primitive, "(none)" skip, kind: prefix strip, fallback)
_chunk_score            4 (rerank preferred, fallback, missing, garbage)
Relevance filter        4 (drops low score, stashes results, empty results, retriever failure isolation)
Dedupe                  4 (no dups, all dups, grouped dups, single question)
Top-level pipeline      2 (full pipeline drops + dedupe, empty input)

All 23 cases passing. Full suite: 315 pass, no regressions.

Public API

async def validate_questions(
    questions: list[Question],
    *,
    client_id: str,
    retriever: Retriever,
    llm: LLMClient,
    prior_findings: list[Finding] | None = None,
    relevance_floor: float = 0.35,
    dedupe_threshold: float = 0.92,
    embedder: Embedder | None = None,
) -> ValidationResult

llm and prior_findings are accepted for interface symmetry (info-gain scoring will use them in the future) but unused at v1.
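A typical call, inside an async entry point (stage_c_questions, hybrid_retriever, and llm_client are stand-ins for the engagement's wired-up dependencies):

result = await validate_questions(
    stage_c_questions,
    client_id="acme-2025",
    retriever=hybrid_retriever,
    llm=llm_client,              # unused at v1, kept for interface symmetry
    relevance_floor=0.35,
    dedupe_threshold=0.92,
)
survivors = result.questions     # proceed to Stage E with chunks pre-stashed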

Tunable thresholds

Both thresholds are exposed as kwargs. Defaults work for typical engagements but can be tuned per archetype or per engagement based on auditor feedback:

  • relevance_floor=0.35 — too low → execution wastes budget on noise; too high → useful questions dropped. Cross-encoder reranker scores from Metis fall in the 0.0–1.0 range with most relevant chunks at 0.5+.
  • dedupe_threshold=0.92 — too low → distinct questions collapsed; too high → near-duplicates pass through and waste Stage E budget.

Known limits / future work

  • No information-gain scoring at v1. Real audit data will inform whether to add it. Working hypothesis: in iteration 2+, IG scoring drops the bottom decile of questions by expected gain given prior-findings density.
  • Dedupe uses dimension only. Richer dedupe would embed the rendered relevance query or the prompt template variables. Acceptable now; revisit if dedupe quality plateaus.
  • No question merging. When two near-duplicate questions are detected, we drop one. We could instead merge them (combine seed terms, broaden scope). Adds complexity for unclear quality benefit.