Stage A — Corpus Profile¶
Status: ✅ Complete (commit f337f2c)
File: app/auditforge/profiler.py
Tests: tests/test_auditforge_profiler.py — 30 cases passing
Purpose¶
Stage A builds a structured description of the corpus shape that drives Stage B
(catalog) target generation. Output is a CorpusProfile dataclass.
Pipeline¶
client_id ──► ClientIndexCache.get_or_load ──► (FAISS index + chunk metadata)
│
┌──────────────────┬───────────────┴──────────────┬──────────────┐
▼ ▼ ▼ ▼
aggregate_metadata extract_citations cluster_chunks stratified_sample
(pure, free) (pure, regex, free) (faiss.Kmeans) (pure, free)
│ │ │ │
│ │ ▼ ▼
│ │ _label_cluster _infer_domain
│ │ (LLM × clusters) (LLM × 1)
│ │ │ │
└──────────────────┴──────────────────────────────┴──────────────┘
│
▼
CorpusProfile
LLM cost is bounded: ~25 cluster-label calls (MECHANICAL) + 1 domain-inference call (REASONING_MID) per profile run. Roughly $0.02–0.05 even on a 10K-chunk corpus.
Output: CorpusProfile¶
```python
@dataclass
class CorpusProfile:
    engagement_id: str
    total_docs: int
    total_chunks: int
    doc_type_distribution: dict[str, int]
    date_histogram: dict[str, int]          # year → count
    jurisdiction_distribution: dict[str, int]
    clusters: list[dict]                    # see below
    citations: list[dict]                   # see below
    inferred_domain: str
    inferred_audit_type: str
    inferred_frameworks: list[str]
```
Cluster shape¶
```python
{
    "id": int,
    "size": int,                           # member count
    "label": str,                          # 2–6 word LLM-generated topic label
    "representative_chunks": list[dict],   # ~3 chunks closest to centroid
    "member_faiss_ids": list[int],         # full membership for downstream stages
}
```
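How the representatives might be selected can be sketched in pure Python (hypothetical helper; the real selection happens inside cluster_chunks, vectorized over the FAISS embeddings):

```python
def pick_representatives(
    member_vecs: list[list[float]], centroid: list[float], k: int = 3
) -> list[int]:
    """Indices of the k member vectors closest to the centroid.

    Vectors are L2-normalized at ingest, so the inner product equals cosine
    similarity; a higher score means closer to the centroid.
    """
    sims = [sum(a * b for a, b in zip(v, centroid)) for v in member_vecs]
    return sorted(range(len(sims)), key=lambda i: -sims[i])[:k]
```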
Citation shape¶
```python
{
    "from_doc": str,      # citing doc title or path
    "from_section": str,  # citing section heading
    "to_target": str,     # cited identifier (e.g. "52.204-21")
    "kind": str,          # far|dfars|nist|iso|hipaa|aba|section_ref
}
```
Implementation details¶
Embeddings reconstruction¶
Metis stores FAISS as IndexIDMap(IndexFlatIP). The IDMap layer doesn't
support reconstruct(user_id) — that path falls through to the abstract
base and raises "reconstruct not implemented." Workaround:
- Downcast to the inner index: faiss.downcast_index(index.index)
- Bulk-reconstruct: inner.reconstruct_n(0, inner.ntotal)
- Map user_id → internal position via faiss.vector_to_array(index.id_map)
The inner IndexFlatIP supports reconstruct_n directly. _reconstruct_embeddings
encapsulates this; works for both IndexIDMap and bare IndexFlatIP.
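A pure-Python sketch of the ID-mapping step (the function name is illustrative; in profiler.py the vectors come from inner.reconstruct_n and the id array from faiss.vector_to_array):

```python
def order_by_user_id(
    vectors: list[list[float]],
    id_map: list[int],
    user_ids: list[int],
) -> list[list[float]]:
    """Reorder bulk-reconstructed vectors to match user_ids.

    vectors[i] is the embedding at internal position i (reconstruct_n output);
    id_map[i] is the user id stored at that position (vector_to_array output).
    """
    pos = {uid: i for i, uid in enumerate(id_map)}
    return [vectors[pos[uid]] for uid in user_ids]
```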
Clustering (cluster_chunks)¶
- Uses faiss.Kmeans (avoids an sklearn dependency)
- Cluster count via _suggested_k(n_chunks, max_k=25):
  - Below 10 chunks → 1 (clustering not meaningful)
  - Otherwise max(2, min(max_k, sqrt(n_chunks))), capped at n_chunks // 5 so clusters average at least five members
  - Sample k values: 10 → 2, 40 → 6, 100 → 10, 1K → 25 (capped)
- Per-cluster representatives: 3 chunks closest to centroid by inner-product similarity (vectors are L2-normalized at ingest)
- Centroid vectors are dropped before serialization (~384 floats per cluster × clusters = unnecessary bloat)
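Restating the cluster-count rules above as code (an illustrative re-derivation, not a copy of the real _suggested_k in profiler.py):

```python
import math

def suggested_k(n_chunks: int, max_k: int = 25) -> int:
    """Cluster count per the rules above."""
    if n_chunks < 10:
        return 1  # clustering not meaningful on tiny corpora
    k = max(2, min(max_k, int(math.sqrt(n_chunks))))
    # Cap so clusters average at least ~5 members.
    return max(2, min(k, n_chunks // 5))
```

This reproduces the sample values: 10 → 2, 40 → 6, 100 → 10, 1K → 25.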
Citation extraction (extract_citations)¶
Six framework patterns + generic section/clause references. Pure regex,
deterministic, dedupes within (from_doc, to_target, kind) triples.
| Pattern | Examples |
|---|---|
| far | "FAR 52.204-21", "FAR Part 52.204", "FAR 52.204-21(b)(1)(viii)" |
| dfars | "DFARS 252.204-7012", "DFARS subpart 252.204" |
| nist | "NIST SP 800-171 r2", "NIST 800-53" |
| iso | "ISO 27001", "ISO/IEC 27001:2022" |
| hipaa | "HIPAA Section 164.312", "45 CFR 164.502" |
| aba | "ABA Rule 1.6", "ABA Model Rule 1.6" |
| section_ref | "Section 3.2", "§ 3.2", "Article IV.B", "Clause 4.1.1" |
External-source verification (calling APIs like ecfr.gov to validate clause text) is a Phase 2 hardening item. Adds defensibility but introduces latency and external-API dependency.
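To make the extraction shape concrete, here is a simplified sketch of one pattern plus the dedupe step (the regex and helper name are illustrative, not the actual far pattern in profiler.py):

```python
import re

# Illustrative FAR pattern: "FAR 52.204-21", "FAR Part 52.204",
# "FAR 52.204-21(b)(1)(viii)".
FAR_RE = re.compile(r"\bFAR\s+(?:Part\s+)?(\d+\.\d+(?:-\d+)?(?:\([a-z0-9]+\))*)")

def extract_far(from_doc: str, text: str) -> list[dict]:
    """Extract FAR citations, deduping on (from_doc, to_target, kind)."""
    seen: set[tuple[str, str, str]] = set()
    out: list[dict] = []
    for m in FAR_RE.finditer(text):
        key = (from_doc, m.group(1), "far")
        if key not in seen:
            seen.add(key)
            out.append({"from_doc": from_doc, "to_target": m.group(1), "kind": "far"})
    return out
```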
Domain inference (_infer_domain)¶
Stratified sample (n=24 default) → REASONING_MID prompt with intake biasing
→ JSON parse → (domain, audit_type, frameworks[]).
Prompt biases: when intake has domain, frameworks, or audit_purpose
populated, the model is told to "treat as prior unless excerpts contradict."
This way intake quality directly improves profile output without overriding
clear corpus signals.
JSON parsing is tolerant of code fences, leading prose, and field truncation; falls back to empty fields on unparseable responses (rare with REASONING_MID quality but defensive).
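The tolerant parse can be sketched roughly like this (function and field handling are illustrative, not the actual implementation):

```python
import json
import re

def parse_domain_response(raw: str) -> dict:
    """Pull the first JSON object out of a possibly fenced/prose-wrapped reply.

    Falls back to empty fields when no parseable object is found.
    """
    empty = {"domain": "", "audit_type": "", "frameworks": []}
    m = re.search(r"\{.*\}", raw, re.DOTALL)  # skips code fences and leading prose
    if not m:
        return empty
    try:
        data = json.loads(m.group(0))
    except json.JSONDecodeError:
        return empty
    return {
        "domain": data.get("domain", ""),
        "audit_type": data.get("audit_type", ""),
        "frameworks": list(data.get("frameworks", [])),
    }
```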
Public API¶
```python
async def profile_corpus(
    engagement_id: str,
    client_id: str,
    intake: IntakeData,
    llm: LLMClient,
    *,
    max_clusters: int = 25,
    domain_sample_size: int = 24,
) -> CorpusProfile
```
Loads corpus via ClientIndexCache.get_or_load(client_id) and delegates to
the inner _profile_loaded_corpus (which is what tests call directly with
synthetic data). Per-engagement bucket loading via dedicated buckets is
deferred to Phase 4 hardening; for now an AuditForge engagement reuses the
existing client_id mechanism.
Test coverage¶
| Area | Cases |
|---|---|
| Citation extraction | 8 (per-framework, dedupe, multi-doc, empty, section_ref) |
| Metadata aggregation | 4 (distributions, fallbacks, jurisdiction optional, empty) |
| Stratified sample | 4 (balance, zero-n, empty, determinism) |
| Suggested k | 4 (tiny→1, capped, sqrt scaling, min size floor) |
| Domain response parse | 6 (clean, fenced, prose, garbage, length truncation, framework cap) |
| Cluster recovery | 3 (planted clusters, empty, tiny single-cluster) |
| End-to-end (mocked LLM) | 1 (100-chunk synthetic corpus through full pipeline) |
The end-to-end test verifies stage glue: metadata + citations + clusters + labels + domain inference all populate, MECHANICAL tier used for cluster labels, REASONING_MID for domain inference, and call counts match expected (N cluster_label calls + 1 domain_inference call).
Known limits / future work¶
- Citation extraction is regex-only. Some legal/regulatory citation styles fall outside our patterns. Phase 1.5 hardening: add LLM-based citation extraction as a fallback for ambiguous chunks.
- No external citation verification yet (e.g., calling ecfr.gov for FAR text). Phase 2 hardening.
- Cluster labels can be noisy at MECHANICAL tier on edge content. If quality becomes a problem, escalate to REASONING_MID via model_override.
- Per-engagement corpus isolation reuses client_id; switch to engagement.index_prefix-based loading in Phase 4 hardening.