
Iterative Retrieval

Status: ✅ Complete
Files: app/auditforge/orchestrator.py (_run_question, _retrieve_with_dedupe)
Tests: tests/test_auditforge_iterative_retrieval.py — 10 cases passing

Why this exists

Single-shot retrieval misses findings that require evidence the initial query didn't surface. Example: a conflict_check question retrieves the master contract's cybersecurity-training clause but not the subcontract's. Without seeing both, the model can't determine whether they contradict each other, so the question silently produces a no-finding response and the audit misses a real conflict.

Iterative retrieval gives the model the option to request additional retrievals when initial context is insufficient. The model emits specific search queries; the orchestrator runs them, dedupes against existing chunks, augments the prompt, and re-runs. Bounded by max_followup_rounds (default 2).

This is a recall move: it surfaces findings that single-pass retrieval cannot. The cost is bounded — typical questions converge in round 0 without follow-ups; only questions that genuinely need more context trigger the extra calls.

How the model requests more evidence

The runtime prompt includes an optional alternative output format (appended only when rounds remain):

## OPTIONALLY: Request more evidence

If the retrieved context above is genuinely insufficient to make a
confident determination — for example, you can see the start of a
contradiction but not its other side, or you need to verify a referenced
clause not present here — respond with this alternative JSON instead of
the finding JSON:

{"action": "request_more_evidence", "queries": ["specific phrase 1",
"specific phrase 2"]}

Up to 3 queries. Each must be a SPECIFIC retrieval query (not a question
to a human). Use sparingly — only when additional retrieval would
materially change your answer. Otherwise, respond with the finding JSON
as specified above.

The instruction is biased toward giving an answer ("Use sparingly"). The suffix is not appended on the final allowed round, forcing the model to commit to a finding/no-finding response.
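As a rough sketch, the orchestrator only needs to distinguish this alternative payload from the normal finding JSON. Something like the following would do it (the function name and exact parsing are illustrative, not the code in _run_question):

import json


def parse_followup_request(response_text: str) -> list[str] | None:
    """Illustrative parser for the alternative output format.

    Returns the follow-up queries when the model chose
    {"action": "request_more_evidence", ...}, otherwise None so the
    response falls through to normal finding parsing.
    """
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict) or payload.get("action") != "request_more_evidence":
        return None
    queries = [q.strip() for q in payload.get("queries", []) if isinstance(q, str) and q.strip()]
    return queries[:3] or None  # the prompt allows at most 3 specific queries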

Pipeline

question (with stashed retrieval_results from Stage D)
round 0:
  format prompt + iterative-retrieval suffix
  LLM call (REASONING_MID)
  parse response
       ├─ action="request_more_evidence" + rounds_remaining?
       │      │
       │      ▼
       │   _retrieve_with_dedupe(queries, existing_chunks)
       │      - run each query against retriever
       │      - skip chunks whose faiss_id is already present
       │      - cap at _MAX_CHUNKS_PER_QUESTION (15) total
       │      │
       │      ├─ produced new chunks → augment context, loop to next round
       │      └─ no new chunks → still loop to next round (giving the model
       │                          another chance to commit)
       └─ found_X / no found_X → break, build_finding
build_finding (or None) — same evidence-verification + adversarial flow
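In code, the flow above corresponds roughly to the following loop (a sketch: helper names such as format_prompt, call_llm, parse_followup_request, and retrieve_with_dedupe are assumptions standing in for the real internals of _run_question):

async def run_question_rounds(question, retriever, max_followup_rounds: int = 2):
    """Sketch of the deepening loop; all helper names are assumed, not real."""
    for round_idx in range(max_followup_rounds + 1):
        rounds_remaining = round_idx < max_followup_rounds
        # The follow-up suffix is only offered while rounds remain (0..max-1).
        prompt = format_prompt(question, include_followup_suffix=rounds_remaining)
        response = await call_llm(prompt)          # REASONING_MID tier
        question.llm_response = response           # only the last response survives

        queries = parse_followup_request(response) if rounds_remaining else None
        if not queries:
            break  # model committed to a finding / no-finding response

        new_chunks = await retrieve_with_dedupe(
            retriever, queries, existing_chunks=question.retrieval_results
        )
        # Augment in place; even when nothing new came back we loop again so
        # the model gets another chance to commit.
        question.retrieval_results.extend(new_chunks)

    return build_finding(question)  # same evidence-verification + adversarial flow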

_retrieve_with_dedupe behavior

Pure-ish function (calls retriever async). Properties:

  • Deduplication by faiss_id: chunks already in existing_chunks skipped. Prevents the model from getting the same chunk twice in augmented context.
  • Bounded total: existing_chunks + new_chunks ≤ cap (default 15). Keeps prompt size predictable across deepening loops.
  • Per-query cap: each query contributes up to per_query_top_k (4) chunks before dedup. Bounds the impact of a single noisy query.
  • Failure isolation: one query throwing doesn't kill the others.
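A minimal sketch consistent with these properties, assuming each chunk carries a faiss_id attribute and the retriever exposes an async search(query, top_k=...) method (both assumptions; the real function sits in orchestrator.py):

import logging

logger = logging.getLogger(__name__)

_MAX_CHUNKS_PER_QUESTION = 15


async def retrieve_with_dedupe(retriever, queries, existing_chunks,
                               per_query_top_k: int = 4,
                               cap: int = _MAX_CHUNKS_PER_QUESTION):
    """Run follow-up queries, skipping chunks already in context (sketch)."""
    seen_ids = {chunk.faiss_id for chunk in existing_chunks}
    budget = cap - len(existing_chunks)          # bounded total across rounds
    new_chunks = []

    for query in queries:
        if budget <= 0:
            break
        try:
            results = await retriever.search(query, top_k=per_query_top_k)
        except Exception:                        # failure isolation per query
            logger.exception("follow-up query failed: %r", query)
            continue
        for chunk in results:
            if chunk.faiss_id in seen_ids:       # dedupe against existing context
                continue
            seen_ids.add(chunk.faiss_id)
            new_chunks.append(chunk)
            budget -= 1
            if budget <= 0:
                break

    return new_chunks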

Round bound

max_followup_rounds controls how many follow-up rounds are allowed:

  • 0 — single-pass behavior (suffix never appended)
  • 1 — one follow-up allowed (prompt + retrieve + final = 2 LLM calls max)
  • 2 (default) — two follow-ups (3 LLM calls max per question)

The suffix is appended on rounds 0..max-1; round max (the final round) runs the base prompt only, forcing the model to give a finding/no-finding response.

If the model still attempts request_more_evidence on the final round, build_finding sees the JSON has no found_X flag and returns None (treated as no-finding).
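Sketched out, the fallback is just the absence of a found_X flag (parse_json_or_none and the return shape are placeholders, not the real build_finding):

def finding_or_none(question):
    """Placeholder for build_finding's entry check."""
    payload = parse_json_or_none(question.llm_response)   # assumed helper
    if not payload:
        return None
    # A stray request_more_evidence on the final round carries no found_X
    # flag, so it lands here and is treated as a no-finding response.
    if not any(key.startswith("found_") and payload[key] for key in payload):
        return None
    return payload  # the real code builds a Finding and runs verification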

Cost shape

Per question:

  • Round 0 only (typical, no follow-up): 1 LLM call. ~$0.024.
  • Round 0 + 1 (one follow-up): 2 LLM calls + 1 retrieval. ~$0.048.
  • All 3 rounds: 3 LLM calls + 2 retrievals. ~$0.072.

Real distribution (estimate, pre-dogfooding): ~70% of questions converge in round 0, ~25% need 1 follow-up, ~5% need 2. Average overhead: 0.7×1 + 0.25×2 + 0.05×3 = 1.35× → ~35% cost increase per audit.

For 80-question audits: cost goes from ~$1.92 to ~$2.59 per iteration. Across 3 iterations: ~$5.76 → ~$7.78. Total audit cost rises from ~$10 (post-adversarial) to ~$12. Acceptable for the recall gain.
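Spelled out as arithmetic (every figure is the estimate above, not a measurement):

# Cost model from this section; every number is a pre-dogfooding estimate.
COST_PER_CALL = 0.024                                # ~$ per LLM call
ROUND_SHARE = {1: 0.70, 2: 0.25, 3: 0.05}            # LLM calls per question -> share

expected_calls = sum(calls * share for calls, share in ROUND_SHARE.items())
# 0.70*1 + 0.25*2 + 0.05*3 = 1.35 calls per question on average

questions_per_audit = 80
per_iteration = questions_per_audit * COST_PER_CALL * expected_calls   # ≈ $2.59 (vs $1.92)
three_iterations = 3 * per_iteration                                   # ≈ $7.78 (vs $5.76)
print(f"{expected_calls=:.2f} {per_iteration=:.2f} {three_iterations=:.2f}")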

Question state mutation

question.retrieval_results is augmented in place across rounds. After execution, the question carries the union of original Stage D chunks + all follow-up chunks. Stage F deepening (which clusters findings by evidence overlap) sees the full evidence set when analyzing related findings.

question.llm_response carries only the final response text (the last LLM output before the loop exits). Earlier round responses (request_more_evidence ones) are discarded — they're already reflected in the chunk list.

Integration points

Layer              Parameter
_run_question      max_followup_rounds: int = 2
run_investigation  max_followup_rounds: int = 2 (passes through)
run_audit          max_followup_rounds: int = 2 (passes through)

Disable globally with max_followup_rounds=0 on run_audit. Future: per-primitive override (some primitives benefit more than others; e.g., flow_down_check's paired retrieval pattern is the natural fit).
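Usage sketch (the positional argument to run_audit is a hypothetical stand-in; only the max_followup_rounds keyword comes from the table above):

# Placeholder invocation; everything except the max_followup_rounds keyword is assumed.
# (Called from an async context, like the rest of the orchestrator API.)
report = await run_audit(
    documents,                # hypothetical stand-in for the real inputs
    max_followup_rounds=0,    # single-pass: the follow-up suffix is never appended
)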

Test coverage

Area                                    Cases
_retrieve_with_dedupe                   3 (skip existing by id, cap total, isolate failures)
Single round (no follow-up)             1
Follow-up triggers retrieval + finding  1
Max rounds bound                        1
Disabled (max_followup_rounds=0)        1
No new evidence on follow-up            1
Round metadata captured per call        1
Malformed request                       1

All 10 cases passing. Full suite: 453 pass, no regressions.
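For orientation, the skip-existing-by-id case might look like the following against the retrieve_with_dedupe sketch above (fixture shapes and names are assumptions, not the contents of the actual test file):

import types

import pytest


def make_chunk(faiss_id: int):
    # Minimal stand-in for a retrieval chunk; only faiss_id matters for dedupe.
    return types.SimpleNamespace(faiss_id=faiss_id, text=f"chunk {faiss_id}")


class StubRetriever:
    async def search(self, query, top_k=4):
        # id 1 overlaps the existing context, id 2 is genuinely new
        return [make_chunk(1), make_chunk(2)]


@pytest.mark.asyncio
async def test_dedupe_skips_existing_ids():
    existing = [make_chunk(1)]
    new = await retrieve_with_dedupe(StubRetriever(), ["specific phrase"], existing)
    assert [c.faiss_id for c in new] == [2]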

Public API

No new public function — iterative retrieval is internal to _run_question. The behavior is controlled via max_followup_rounds on run_investigation / run_audit.

Known limits / future work

  • Per-primitive tuning. Different primitives benefit from different follow-up budgets. flow_down_check (paired retrieval pattern) likely benefits from 3+ rounds; coverage_check (binary presence/absence) often needs 0. Make max_followup_rounds per-primitive in v2.
  • Query quality is variable. The model proposes the specific retrieval phrases, and their quality directly affects follow-up recall. Logging which follow-up queries produced new findings will inform prompt refinement.
  • Cap on total chunks (15) is fixed. Some genuinely complex questions might benefit from 25-30 chunks; some need only 5. Adaptive cap based on question complexity is a future refinement.
  • Cost accounting is per-call, not per-round. RunResult doesn't surface "questions that took N rounds" telemetry. Adding to IterationRecord would help tune max_followup_rounds.
  • No semantic dedup across queries. Two follow-up queries returning near-duplicate (but different faiss_id) chunks both get added. Could add embedding-based dedup at the augmented-context level.