Stage A — Corpus Profile

Status: ✅ Complete (commit f337f2c)
File: app/auditforge/profiler.py
Tests: tests/test_auditforge_profiler.py — 30 cases passing

Purpose

Stage A builds a structured description of the corpus shape that drives Stage B (catalog) target generation. Output is a CorpusProfile dataclass.

Pipeline

client_id ──► ClientIndexCache.get_or_load ──► (FAISS index + chunk metadata)
                ┌──────────────────┬───────────────┴──────────────┬──────────────┐
                ▼                  ▼                              ▼              ▼
         aggregate_metadata   extract_citations              cluster_chunks   stratified_sample
         (pure, free)         (pure, regex, free)            (faiss.Kmeans)   (pure, free)
                │                  │                              │              │
                │                  │                              ▼              ▼
                │                  │                       _label_cluster   _infer_domain
                │                  │                       (LLM × clusters) (LLM × 1)
                │                  │                              │              │
                └──────────────────┴──────────────────────────────┴──────────────┘
                                       CorpusProfile

LLM cost is bounded: ~25 cluster-label calls (MECHANICAL) + 1 domain-inference call (REASONING_MID) per profile run. Roughly $0.02–0.05 even on a 10K-chunk corpus.

Output: CorpusProfile

from dataclasses import dataclass

@dataclass
class CorpusProfile:
    engagement_id: str
    total_docs: int
    total_chunks: int
    doc_type_distribution: dict[str, int]
    date_histogram: dict[str, int]            # year → count
    jurisdiction_distribution: dict[str, int]

    clusters: list[dict]                      # see below
    citations: list[dict]                     # see below

    inferred_domain: str
    inferred_audit_type: str
    inferred_frameworks: list[str]

Cluster shape

{
    "id": int,
    "size": int,                              # member count
    "label": str,                             # 2–6 word LLM-generated topic label
    "representative_chunks": list[dict],      # ~3 chunks closest to centroid
    "member_faiss_ids": list[int],            # full membership for downstream stages
}

Citation shape

{
    "from_doc": str,                          # citing doc title or path
    "from_section": str,                      # citing section heading
    "to_target": str,                         # cited identifier (e.g. "52.204-21")
    "kind": str,                              # far|dfars|nist|iso|hipaa|aba|section_ref
}

Implementation details

Embeddings reconstruction

Metis stores FAISS as IndexIDMap(IndexFlatIP). The IDMap layer doesn't support reconstruct(user_id) — that path falls through to the abstract base and raises "reconstruct not implemented." Workaround:

  1. Downcast to inner index: faiss.downcast_index(index.index)
  2. Bulk-reconstruct: inner.reconstruct_n(0, inner.ntotal)
  3. Map user_id → internal position via faiss.vector_to_array(index.id_map)

The inner IndexFlatIP supports reconstruct_n directly. _reconstruct_embeddings encapsulates this and works for both IndexIDMap and bare IndexFlatIP (sketched below).
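
A minimal sketch of this helper, using only the FAISS Python API (the body is illustrative; the real implementation lives in profiler.py):

import faiss
import numpy as np

def _reconstruct_embeddings(index: faiss.Index) -> tuple[np.ndarray, np.ndarray]:
    """Return (vectors, user_ids) for IndexIDMap(IndexFlatIP) or bare IndexFlatIP."""
    if isinstance(index, faiss.IndexIDMap):
        inner = faiss.downcast_index(index.index)       # unwrap to the concrete IndexFlatIP
        vectors = inner.reconstruct_n(0, inner.ntotal)  # bulk-copy every stored vector
        user_ids = faiss.vector_to_array(index.id_map)  # internal position i -> user_ids[i]
    else:
        vectors = index.reconstruct_n(0, index.ntotal)  # flat index reconstructs directly
        user_ids = np.arange(index.ntotal, dtype=np.int64)
    return vectors, user_ids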

Clustering (cluster_chunks)

  • faiss.Kmeans (avoids sklearn dependency)
  • Cluster count via _suggested_k(n_chunks, max_k=25), sketched below:
      • Below 10 chunks → 1 (clustering not meaningful)
      • Otherwise → max(2, min(max_k, sqrt(n_chunks), n_chunks // 5)); the n_chunks // 5 cap keeps a minimum cluster size of roughly 5 members
      • Example k values: n=10 → 2, n=40 → 6, n=100 → 10, n=1,000 → 25 (capped at max_k)
  • Per-cluster representatives: 3 chunks closest to centroid by inner-product similarity (vectors are L2-normalized at ingest)
  • Centroid vectors are dropped before serialization (~384 floats per cluster × clusters = unnecessary bloat)
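
A sketch of the two pure pieces described above, consistent with the example values; _suggested_k matches the real helper name, while _representatives is a hypothetical name for the centroid-proximity step:

import math
import numpy as np

def _suggested_k(n_chunks: int, max_k: int = 25) -> int:
    if n_chunks < 10:
        return 1  # clustering not meaningful on tiny corpora
    # sqrt scaling, capped at max_k and at n_chunks // 5 so every
    # cluster keeps roughly five or more members.
    return max(2, min(max_k, int(math.sqrt(n_chunks)), n_chunks // 5))

def _representatives(vectors: np.ndarray, member_ids: np.ndarray,
                     centroid: np.ndarray, n: int = 3) -> list[int]:
    # Vectors are L2-normalized at ingest, so inner product == cosine similarity.
    sims = vectors[member_ids] @ centroid
    return member_ids[np.argsort(-sims)[:n]].tolist()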

Citation extraction (extract_citations)

Six framework patterns plus generic section/clause references. Pure regex, deterministic; dedupes on (from_doc, to_target, kind) triples.

Pattern       Examples
far           "FAR 52.204-21", "FAR Part 52.204", "FAR 52.204-21(b)(1)(viii)"
dfars         "DFARS 252.204-7012", "DFARS subpart 252.204"
nist          "NIST SP 800-171 r2", "NIST 800-53"
iso           "ISO 27001", "ISO/IEC 27001:2022"
hipaa         "HIPAA Section 164.312", "45 CFR 164.502"
aba           "ABA Rule 1.6", "ABA Model Rule 1.6"
section_ref   "Section 3.2", "§ 3.2", "Article IV.B", "Clause 4.1.1"
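
For illustration, the far pattern and the dedupe could look like this (an approximation that covers the examples above; the production regexes in extract_citations may differ):

import re

FAR_RE = re.compile(
    r"\bFAR\s+(?:Part\s+)?(\d{1,3}\.\d{3}(?:-\d+)?(?:\([a-z0-9]+\))*)",
    re.IGNORECASE,
)

def _far_citations(from_doc: str, from_section: str, text: str,
                   seen: set[tuple[str, str, str]]) -> list[dict]:
    found = []
    for match in FAR_RE.finditer(text):
        key = (from_doc, match.group(1), "far")
        if key in seen:          # dedupe on (from_doc, to_target, kind)
            continue
        seen.add(key)
        found.append({"from_doc": from_doc, "from_section": from_section,
                      "to_target": match.group(1), "kind": "far"})
    return found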

External-source verification (calling APIs like ecfr.gov to validate clause text) is a Phase 2 hardening item. Adds defensibility but introduces latency and external-API dependency.

Domain inference (_infer_domain)

Stratified sample (n=24 default) → REASONING_MID prompt with intake biasing → JSON parse → (domain, audit_type, frameworks[]).

Prompt biases: when intake has domain, frameworks, or audit_purpose populated, the model is told to "treat as prior unless excerpts contradict." This way intake quality directly improves profile output without overriding clear corpus signals.
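
A hypothetical sketch of how that prior could be assembled (helper name and wording are illustrative, and the intake fields are assumed to be plain strings/lists; the real prompt assembly lives in _infer_domain):

def _intake_prior_lines(intake: IntakeData) -> list[str]:
    # Collect whatever priors the intake actually carries.
    priors = []
    if intake.domain:
        priors.append(f"domain: {intake.domain}")
    if intake.frameworks:
        priors.append(f"frameworks: {', '.join(intake.frameworks)}")
    if intake.audit_purpose:
        priors.append(f"audit purpose: {intake.audit_purpose}")
    if not priors:
        return []
    return ["Treat this intake metadata as prior unless the excerpts contradict it:", *priors]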

JSON parsing is tolerant of code fences, leading prose, and field truncation; falls back to empty fields on unparseable responses (rare with REASONING_MID quality but defensive).
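
A minimal sketch of that tolerant parse, assuming the response carries a single JSON object (the function name is hypothetical):

import json
import re

def _empty_fields() -> dict:
    return {"domain": "", "audit_type": "", "frameworks": []}

def _parse_domain_response(raw: str) -> dict:
    text = re.sub(r"```(?:json)?", "", raw)    # strip markdown code fences
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return _empty_fields()                 # no JSON object found at all
    try:
        data = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return _empty_fields()                 # unparseable: fall back to empty fields
    return {
        "domain": str(data.get("domain", "")),
        "audit_type": str(data.get("audit_type", "")),
        "frameworks": [str(f) for f in data.get("frameworks", [])],
    }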

Public API

async def profile_corpus(
    engagement_id: str,
    client_id: str,
    intake: IntakeData,
    llm: LLMClient,
    *,
    max_clusters: int = 25,
    domain_sample_size: int = 24,
) -> CorpusProfile

Loads the corpus via ClientIndexCache.get_or_load(client_id) and delegates to the inner _profile_loaded_corpus (which is what tests call directly with synthetic data). Per-engagement loading via dedicated buckets is deferred to Phase 4 hardening; for now an AuditForge engagement reuses the existing client_id mechanism.
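
A typical call site, inside an async context (identifiers are hypothetical):

profile = await profile_corpus(
    engagement_id="eng-001",   # hypothetical ids
    client_id="client-001",
    intake=intake,             # IntakeData gathered at engagement setup
    llm=llm,                   # shared LLMClient
)
print(profile.inferred_domain, profile.inferred_audit_type)
print(f"{len(profile.clusters)} clusters, {len(profile.citations)} citations")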

Test coverage

Area                      Cases
Citation extraction       8 (per-framework, dedupe, multi-doc, empty, section_ref)
Metadata aggregation      4 (distributions, fallbacks, jurisdiction optional, empty)
Stratified sample         4 (balance, zero-n, empty, determinism)
Suggested k               4 (tiny→1, capped, sqrt scaling, min size floor)
Domain response parse     6 (clean, fenced, prose, garbage, length truncation, framework cap)
Cluster recovery          3 (planted clusters, empty, tiny single-cluster)
End-to-end (mocked LLM)   1 (100-chunk synthetic corpus through full pipeline)

The end-to-end test verifies the stage glue: metadata, citations, clusters, labels, and domain inference all populate; the MECHANICAL tier is used for cluster labels and REASONING_MID for domain inference; and call counts match the expected totals (N cluster_label calls + 1 domain_inference call).

Known limits / future work

  • Citation extraction is regex-only. Some legal/regulatory citation styles fall outside our patterns. Phase 1.5 hardening: add LLM-based citation extraction as a fallback for ambiguous chunks.
  • No external citation verification yet (e.g., calling ecfr.gov for FAR text). Phase 2 hardening.
  • Cluster labels can be noisy at MECHANICAL tier on edge content. If quality becomes a problem, escalate to REASONING_MID via model_override.
  • Per-engagement corpus isolation reuses client_id; switch to engagement.index_prefix-based loading in Phase 4 hardening.