Iterative Retrieval¶
Status: ✅ Complete
Files: app/auditforge/orchestrator.py (_run_question, _retrieve_with_dedupe)
Tests: tests/test_auditforge_iterative_retrieval.py — 10 cases passing
Why this exists¶
Single-shot retrieval misses findings that require evidence the initial query didn't surface. Example: a conflict_check question retrieves the master contract's clause on cybersecurity training but not the subcontract's. Without seeing both, the model can't determine whether they contradict; the question silently produces a no-finding response and the audit misses a real conflict.
Iterative retrieval gives the model the option to request additional
retrievals when initial context is insufficient. The model emits
specific search queries; the orchestrator runs them, dedupes against
existing chunks, augments the prompt, and re-runs. Bounded by
max_followup_rounds (default 2).
This is a recall move: it surfaces findings that single-pass retrieval cannot. The cost is bounded — typical questions converge in round 0 without follow-ups; only questions that genuinely need more context trigger the extra calls.
How the model requests more evidence¶
The runtime prompt includes an optional alternative output format (appended only when rounds remain):
```
## OPTIONALLY: Request more evidence

If the retrieved context above is genuinely insufficient to make a
confident determination — for example, you can see the start of a
contradiction but not its other side, or you need to verify a referenced
clause not present here — respond with this alternative JSON instead of
the finding JSON:

{"action": "request_more_evidence", "queries": ["specific phrase 1",
 "specific phrase 2"]}

Up to 3 queries. Each must be a SPECIFIC retrieval query (not a question
to a human). Use sparingly — only when additional retrieval would
materially change your answer. Otherwise, respond with the finding JSON
as specified above.
```
The instruction is biased toward giving an answer ("Use sparingly"). The suffix is not appended on the final allowed round, forcing the model to commit to a finding/no-finding response.
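A minimal sketch of how the orchestrator might detect this alternative format (the helper name `parse_followup_request` is illustrative; the real parsing lives inside `_run_question`):

```python
import json

def parse_followup_request(response_text: str) -> list[str] | None:
    """Return the follow-up queries if the model chose the alternative
    format, or None if the response should be treated as a finding."""
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    if payload.get("action") != "request_more_evidence":
        return None
    queries = payload.get("queries") or []
    # Enforce the prompt contract: at most 3 non-empty string queries.
    return [q.strip() for q in queries if isinstance(q, str) and q.strip()][:3]
```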
Pipeline¶
```
question (with stashed retrieval_results from Stage D)
  │
  ▼
round 0:
  format prompt + iterative-retrieval suffix
  LLM call (REASONING_MID)
  parse response
  │
  ├─ action="request_more_evidence" + rounds_remaining?
  │   │
  │   ▼
  │   _retrieve_with_dedupe(queries, existing_chunks)
  │     - run each query against retriever
  │     - skip chunks whose faiss_id is already present
  │     - cap at _MAX_CHUNKS_PER_QUESTION (15) total
  │   │
  │   ├─ produced new chunks → augment context, loop to next round
  │   └─ no new chunks → loop to next round anyway (giving the
  │        model another chance to commit)
  │
  └─ found_X / no found_X → break, build_finding
      │
      ▼
build_finding (or None) — same evidence-verification + adversarial flow
```
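A condensed sketch of this loop, assuming hypothetical helpers (`format_prompt`, `call_llm`, and `build_finding` stand in for the orchestrator internals; `parse_followup_request` is sketched above and `retrieve_with_dedupe` in the next section):

```python
async def run_question_rounds(question, retriever, max_followup_rounds: int = 2):
    """Illustrative round loop; not the actual _run_question body."""
    response = ""
    for round_idx in range(max_followup_rounds + 1):
        # Suffix only while follow-up rounds remain; the final round
        # must commit to a finding/no-finding response.
        allow_followup = round_idx < max_followup_rounds
        prompt = format_prompt(question, include_suffix=allow_followup)
        response = await call_llm(prompt)  # REASONING_MID
        queries = parse_followup_request(response)
        if queries and allow_followup:
            new_chunks = await retrieve_with_dedupe(
                retriever, queries, question.retrieval_results
            )
            question.retrieval_results.extend(new_chunks)
            continue  # loop even if no new chunks arrived
        break  # finding JSON (or a final-round request) exits the loop
    question.llm_response = response
    return build_finding(question, response)  # may return None
```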
_retrieve_with_dedupe behavior¶
Pure-ish function (its only side effect is the async retriever call). Properties (sketched below):

- Deduplication by `faiss_id`: chunks already in `existing_chunks` are skipped. Prevents the model from getting the same chunk twice in augmented context.
- Bounded total: `existing_chunks + new_chunks ≤ cap` (default 15). Keeps prompt size predictable across deepening loops.
- Per-query cap: each query contributes up to `per_query_top_k` (4) chunks before dedup. Bounds the impact of a single noisy query.
- Failure isolation: one query throwing doesn't kill the others.
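A sketch of these four properties under assumed interfaces (chunks expose `.faiss_id`; `retriever.search(query, top_k=...)` stands in for the real async retriever call):

```python
async def retrieve_with_dedupe(retriever, queries, existing_chunks,
                               per_query_top_k: int = 4, cap: int = 15):
    """Return only new chunks, keeping existing + new under the total cap."""
    seen = {chunk.faiss_id for chunk in existing_chunks}
    budget = cap - len(existing_chunks)  # bounded total
    new_chunks = []
    for query in queries:
        if budget <= 0:
            break
        try:
            candidates = await retriever.search(query, top_k=per_query_top_k)
        except Exception:
            continue  # failure isolation: one bad query doesn't kill the rest
        for chunk in candidates:
            if budget <= 0:
                break
            if chunk.faiss_id in seen:  # dedup against existing context
                continue
            seen.add(chunk.faiss_id)
            new_chunks.append(chunk)
            budget -= 1
    return new_chunks
```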
Round bound¶
max_followup_rounds controls how many follow-up rounds are allowed:
- 0 — single-pass behavior (suffix never appended)
- 1 — one follow-up allowed (prompt + retrieve + final = 2 LLM calls max)
- 2 (default) — two follow-ups (3 LLM calls max per question)
The suffix is appended on rounds 0..max-1; round max (the final round)
runs the base prompt only, forcing the model to give a finding/no-finding
response.
If the model still attempts request_more_evidence on the final round,
build_finding sees the JSON has no found_X flag and returns None
(treated as no-finding).
Cost shape¶
Per question:

- Round 0 only (typical, no follow-up): 1 LLM call. ~$0.024.
- Round 0 + 1 (one follow-up): 2 LLM calls + 1 retrieval. ~$0.048.
- All 3 rounds: 3 LLM calls + 2 retrievals. ~$0.072.
Real distribution (estimate, pre-dogfooding): ~70% questions converge in round 0, ~25% need 1 follow-up, ~5% need 2. Average overhead: 0.7×1 + 0.25×2 + 0.05×3 = 1.35× → ~35% cost increase per audit.
For 80-question audits: cost goes from ~$1.92 to ~$2.59 per iteration. Across 3 iterations: ~$5.76 → ~$7.78. Total audit cost rises from ~$10 (post-adversarial) to ~$12. Acceptable for the recall gain.
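The arithmetic, as a quick sanity check (the dollar figures are the estimates above, not measured costs):

```python
cost_per_call = 0.024
share_by_calls = {1: 0.70, 2: 0.25, 3: 0.05}  # estimated calls-per-question distribution
avg_calls = sum(c * s for c, s in share_by_calls.items())   # 1.35x overhead
per_iteration = 80 * avg_calls * cost_per_call              # ~ $2.59
print(round(avg_calls, 2), round(per_iteration, 2), round(3 * per_iteration, 2))
# 1.35 2.59 7.78
```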
Question state mutation¶
question.retrieval_results is augmented in place across rounds.
After execution, the question carries the union of original Stage D
chunks + all follow-up chunks. Stage F deepening (which clusters
findings by evidence overlap) sees the full evidence set when
analyzing related findings.
question.llm_response carries only the final response text (the
last LLM output before the loop exits). Earlier round responses
(request_more_evidence ones) are discarded — they're already reflected
in the chunk list.
Integration points¶
| Layer | Parameter |
|---|---|
| `_run_question` | `max_followup_rounds: int = 2` |
| `run_investigation` | `max_followup_rounds: int = 2` (passes through) |
| `run_audit` | `max_followup_rounds: int = 2` (passes through) |
Disable globally with max_followup_rounds=0 on run_audit. Future:
per-primitive override (some primitives benefit more than others; e.g.,
flow_down_check's paired retrieval pattern is the natural fit).
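Disabling it looks like this (the positional `documents` argument is a placeholder; only `max_followup_rounds` is the documented knob):

```python
result = await run_audit(documents, max_followup_rounds=0)  # single-pass retrieval everywhere
```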
Test coverage¶
| Area | Cases |
|---|---|
| `_retrieve_with_dedupe` | 3 (skip existing by id, cap total, isolate failures) |
| Single round (no follow-up) | 1 |
| Follow-up triggers retrieval + finding | 1 |
| Max rounds bound | 1 |
| Disabled (`max_followup_rounds=0`) | 1 |
| No new evidence on follow-up | 1 |
| Round metadata captured per call | 1 |
| Malformed request | 1 |
All 10 cases passing. Full suite: 453 pass, no regressions.
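A hypothetical shape for the first dedupe case, assuming pytest-asyncio and minimal fakes (`Chunk`, `FakeRetriever`, and the `retrieve_with_dedupe` sketch above are illustrative, not the real fixtures):

```python
import pytest

class Chunk:
    def __init__(self, faiss_id: int):
        self.faiss_id = faiss_id

class FakeRetriever:
    async def search(self, query: str, top_k: int):
        return [Chunk(1), Chunk(2)]  # Chunk(1) duplicates existing context

@pytest.mark.asyncio
async def test_skips_existing_chunks_by_id():
    existing = [Chunk(1)]
    new = await retrieve_with_dedupe(FakeRetriever(), ["any query"], existing)
    assert [c.faiss_id for c in new] == [2]  # duplicate dropped, new chunk kept
```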
Public API¶
No new public function — iterative retrieval is internal to
_run_question. The behavior is controlled via max_followup_rounds
on run_investigation / run_audit.
Known limits / future work¶
- Per-primitive tuning. Different primitives benefit from different follow-up budgets: `flow_down_check` (paired retrieval pattern) likely benefits from 3+ rounds; `coverage_check` (binary presence/absence) often needs 0. Make `max_followup_rounds` per-primitive in v2.
- Query quality is variable. The model proposes the specific phrases, and their quality directly influences retrieval recall. Logging which follow-up queries produced new findings will inform prompt refinement.
- Cap on total chunks (15) is fixed. Some genuinely complex questions might benefit from 25-30 chunks; some need only 5. An adaptive cap based on question complexity is a future refinement.
- Cost accounting is per-call, not per-round. `RunResult` doesn't surface "questions that took N rounds" telemetry; adding it to `IterationRecord` would help tune `max_followup_rounds`.
- No semantic dedup across queries. Two follow-up queries returning near-duplicate (but different `faiss_id`) chunks both get added. Embedding-based dedup at the augmented-context level could address this.