Phase 25–27 — Self-serve corpus onboarding

Last updated: 2026-05-10

Closes the biggest gap between "demo product" and "billable product" identified during the GTM review: corpus ingestion used to be manual ops work — Base2ML did it for every prospect over a day or two. With Option C complete, a partner-firm admin can self-serve the entire onboarding flow inside the AuditForge UI:

  1. Click + Upload corpus on the Engagements tab
  2. Name the engagement, drag-and-drop files
  3. Click Start ingest → background task chunks + indexes
  4. SSE stream surfaces progress
  5. Land in the engagement detail view with the corpus already bound — fill intake, click Start audit

End-to-end: a partner with 20–200 documents goes from "I have a directory of PDFs" to "I'm running an audit" in 5–10 minutes, no Base2ML involvement.

Endpoints

POST   /auditforge/engagement/{id}/corpus/upload    # multipart, one file per request
DELETE /auditforge/engagement/{id}/corpus/file/{filename}
POST   /auditforge/engagement/{id}/corpus/ingest    # 202 + background task
GET    /auditforge/engagement/{id}/corpus/stream    # SSE progress

All four accept either admin-token or per-user session-token auth; mutations require partner+ role (associates get 403, per Phase 15) and are refused when the engagement is frozen (Phase 20) or the corpus is already ingested.
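
A minimal sketch of the gate these rules imply, assuming string roles ("partner", "admin") and a 409 for the frozen/already-ingested refusals, since the actual status codes aren't pinned down above:

from fastapi import HTTPException

def check_corpus_mutation(role: str, frozen: bool, corpus_status: str) -> None:
    """Gate for the three mutating corpus endpoints (upload / delete / ingest)."""
    if role not in ("partner", "admin"):     # associates fail here (Phase 15)
        raise HTTPException(status_code=403, detail="partner+ role required")
    if frozen:                               # Phase 20 freeze
        raise HTTPException(status_code=409, detail="engagement is frozen")
    if corpus_status == "ingested":          # corpus is immutable once ingested
        raise HTTPException(status_code=409, detail="corpus already ingested")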

Upload constraints

  • Allowed extensions: .txt .md .pdf .docx .csv .xlsx .png .jpg .jpeg .tiff .eml .mbox (the same set the existing ingest pipeline handles)
  • Max per-file size: 50 MB. Uploads are one HTTP request per file, and a 50 MB body finishes well within the ALB's 30-minute request window even on slow links
  • Max files per engagement: 500. A conservative ceiling; real corpora are usually 20–200 docs
  • Duplicate filenames: rejected with 409. This forces the partner to rename or delete, avoiding silent overwrite

Per-file upload (one HTTP request per file) means a network blip on file 47 of 50 only loses file 47 — the frontend retries that file individually. This is "chunked enough" without the complexity of true resumable multipart.
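
A sketch of the per-file validation those constraints add up to; the 409 for duplicates comes from the list above, while 413/415 for the size and type failures are assumed:

from pathlib import Path
from fastapi import HTTPException

ALLOWED_EXTS = {".txt", ".md", ".pdf", ".docx", ".csv", ".xlsx",
                ".png", ".jpg", ".jpeg", ".tiff", ".eml", ".mbox"}
MAX_FILE_BYTES = 50 * 1024 * 1024   # 50 MB per file, one HTTP request each
MAX_FILES = 500                     # per-engagement ceiling

def validate_upload(filename: str, size: int, existing: set[str]) -> None:
    if Path(filename).suffix.lower() not in ALLOWED_EXTS:
        raise HTTPException(status_code=415, detail=f"unsupported type: {filename}")
    if size > MAX_FILE_BYTES:
        raise HTTPException(status_code=413, detail="file exceeds 50 MB")
    if len(existing) >= MAX_FILES:
        raise HTTPException(status_code=413, detail="engagement already has 500 files")
    if filename in existing:
        # duplicates are rejected outright; no silent overwrite
        raise HTTPException(status_code=409, detail=f"duplicate filename: {filename}")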

Corpus state machine

empty → uploading → uploaded → ingesting → ingested
                            ↘ ingest_failed

engagement.corpus.status exposes this state. Once ingested:

  • engagement.client_id is set to engagement.id (the corpus IS the engagement)
  • The audit-run endpoint accepts that client_id and pulls indexes from the standard location
  • Further uploads / deletes are blocked (immutable corpus for that engagement)
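
The shipped code models this with a CorpusStatus dataclass (see Files below); as an illustration only, here is the same machine as a transition table over the status strings, where the retry edge out of ingest_failed and re-entering uploading for additional files are assumptions:

ALLOWED_TRANSITIONS: dict[str, set[str]] = {
    "empty":         {"uploading"},
    "uploading":     {"uploaded"},
    "uploaded":      {"uploading", "ingesting"},    # more uploads until ingest starts (assumed)
    "ingesting":     {"ingested", "ingest_failed"},
    "ingested":      set(),                         # terminal: corpus is immutable
    "ingest_failed": {"ingesting"},                 # assumed: retry re-enters ingesting
}

def advance(current: str, new: str) -> str:
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal corpus transition: {current} -> {new}")
    return new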

Storage layout

# Per-engagement isolated bucket (Phase 7, now default-on per Phase 25 deploy)
metis-af-<engagement_id>-<account_id>/
    auditforge/engagements/<engagement_id>/
        source/                         # uploaded source documents (Phase 25)
        findings.json                   # canonical findings (existing)
        audit_log/shard-*.jsonl         # per-LLM-call audit log (existing)

# Shared platform bucket — corpus indexes (existing PilotForge pattern)
mobilemetis-metis-indexes-<account_id>/
    <engagement_id>/
        index/index.faiss               # FAISS embedding index
        index/chunks.json               # chunk metadata
        index/bm25.pkl                  # BM25 sparse index

Source documents (the raw client data) live in the isolated bucket. Derived embeddings live in the shared bucket — the indexes don't contain raw client content, just numeric vectors and chunk metadata. Future hardening could move indexes into the per-engagement bucket as well; not in scope for Option C.
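
To make the split concrete, a hypothetical pair of path helpers matching the layout above; the bucket-name templates come straight from the listing, while the helper names and placeholder account id do not:

ACCOUNT_ID = "123456789012"   # placeholder

def source_location(engagement_id: str, filename: str) -> tuple[str, str]:
    """Raw client documents -> per-engagement isolated bucket."""
    bucket = f"metis-af-{engagement_id}-{ACCOUNT_ID}"
    return bucket, f"auditforge/engagements/{engagement_id}/source/{filename}"

def index_location(engagement_id: str, artifact: str) -> tuple[str, str]:
    """Derived indexes (index.faiss / chunks.json / bm25.pkl) -> shared bucket."""
    bucket = f"mobilemetis-metis-indexes-{ACCOUNT_ID}"
    return bucket, f"{engagement_id}/index/{artifact}"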

Async ingestion

POST /corpus/ingest returns 202 immediately. A FastAPI BackgroundTask runs the existing ingest.ingest.full_rebuild(slug=engagement_id, ...) pipeline (the same one PilotForge has used for SMB demos). Stage events are pushed onto a per-engagement asyncio.Queue:

{type: "ingest_start", file_count: N, total_bytes: M}
{type: "ingest_stage", stage: "loading",    message: "Loading N documents"}
{type: "ingest_stage", stage: "chunking",   message: "Chunking documents and building indexes"}
{type: "ingest_stage", stage: "finalizing", message: "Persisting index manifest"}
{type: "ingest_complete", engagement_id: "...", client_id: "..."}
# or
{type: "ingest_failed", error: "..."}

The events are stage transitions, not per-file granularity — full_rebuild is internally batched and doesn't emit per-file hooks. Per-file progress would require modifying the ingest pipeline; deliberately deferred since stage-level visibility is enough for "is anything happening" UX.
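
A condensed sketch of that shape; the intermediate stage events are elided here, and full_rebuild's remaining arguments stay omitted as they are above:

import asyncio
from fastapi import APIRouter, BackgroundTasks

from ingest.ingest import full_rebuild   # existing pipeline entry point

router = APIRouter(prefix="/auditforge/engagement")
queues: dict[str, asyncio.Queue] = {}    # one event queue per engagement

async def run_ingest(engagement_id: str) -> None:
    q = queues[engagement_id]
    try:
        # full_rebuild is synchronous; run it in a thread so the event loop
        # (and the SSE stream reading this queue) stays responsive.
        await asyncio.to_thread(full_rebuild, slug=engagement_id)
        await q.put({"type": "ingest_complete",
                     "engagement_id": engagement_id,
                     "client_id": engagement_id})   # the corpus IS the engagement
    except Exception as exc:
        await q.put({"type": "ingest_failed", "error": str(exc)})

@router.post("/{engagement_id}/corpus/ingest", status_code=202)
async def start_ingest(engagement_id: str, background: BackgroundTasks):
    queues.setdefault(engagement_id, asyncio.Queue())
    background.add_task(run_ingest, engagement_id)   # runs after the 202 is returned
    return {"status": "ingesting"}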

SSE stream

GET /corpus/stream emits the queue events as Server-Sent Events. Late subscribers get a snapshot event first so the UI can recover its state. A 30-second keepalive prevents proxy timeouts during long chunking phases. Closes when ingest_complete or ingest_failed arrives.
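
A sketch of that relay, reusing the per-engagement queues dict from the ingest sketch; the snapshot payload here is assumed:

import asyncio, json
from fastapi.responses import StreamingResponse

def sse(event: dict) -> str:
    return f"data: {json.dumps(event)}\n\n"

async def corpus_events(engagement_id: str):
    q = queues.setdefault(engagement_id, asyncio.Queue())
    # Late subscribers get a snapshot first so the UI can recover its state.
    yield sse({"type": "snapshot", "status": "ingesting"})
    while True:
        try:
            event = await asyncio.wait_for(q.get(), timeout=30)
        except asyncio.TimeoutError:
            yield ": keepalive\n\n"    # SSE comment frame; clients ignore it
            continue
        yield sse(event)
        if event["type"] in ("ingest_complete", "ingest_failed"):
            return                     # terminal event closes the stream

@router.get("/{engagement_id}/corpus/stream")
async def corpus_stream(engagement_id: str):
    return StreamingResponse(corpus_events(engagement_id),
                             media_type="text/event-stream")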

Frontend falls back to polling GET /engagement/{id} every 5 seconds if the SSE connection drops — same pattern as the audit-run stream.

UI

A new CorpusUploadModal component handles the entire flow:

  1. Setup: pick firm, name client → creates engagement record
  2. Upload: drag-and-drop or file picker → per-file upload with live status (pending/uploading/done/error). Per-file Retry button on failures. Remove uploaded files individually.
  3. Ingest: "Start ingest" → 202 → SSE progress → stage messages render
  4. Done: green "Corpus ingested" banner → "Open engagement" drops the partner into EngagementDetail

The modal is reachable from the Engagements tab via a new + Upload corpus button next to + New engagement.

What this doesn't do (deferred)

  • Classification override UI: full_rebuild does doc-type/jurisdiction classification internally; partners can't see or correct those choices pre-ingest. The existing PilotForge override mechanism could be wired in if real corpora reveal misclassification problems — deliberately deferred until we have evidence it bites.
  • True resumable multipart uploads: per-file granularity is the pragmatic substitute. A single 50MB PDF that drops mid-upload requires re-uploading that 50MB. Worth fixing if real corpora include many large files.
  • Per-engagement bucket for corpus indexes: indexes still live in the shared bucket. Source documents are isolated; indexes are derived embeddings without raw content. Hardening pass when a customer asks for it.

Files

  • app/auditforge/engagement.py — new CorpusStatus dataclass; EngagementStore.update_corpus
  • app/auditforge_endpoints.py — upload, delete, ingest, stream endpoints; per-engagement asyncio.Queue for SSE
  • frontend/src/api/auditforge.ts — CorpusStatus, uploadCorpusFile, ingestCorpus, deleteCorpusFile, CorpusProgressEvent
  • frontend/src/components/CorpusUploadModal.tsx — drag-drop UI, per-file retry, SSE consumer with polling fallback
  • frontend/src/components/EngagementList.tsx — + Upload corpus button
  • frontend/src/components/AuditForge.tsx — modal wiring
  • tests/test_auditforge_endpoints.py — 12 endpoint tests covering auth/role gating, validation, dedupe, lifecycle (the dedupe case is sketched below)
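
For flavor, roughly what the dedupe test might look like; the fixture names and the 200 on first upload are assumptions:

def test_duplicate_filename_rejected(client, partner_headers, engagement):
    url = f"/auditforge/engagement/{engagement.id}/corpus/upload"
    files = {"file": ("ledger.pdf", b"%PDF-1.4 fake body", "application/pdf")}

    first = client.post(url, files=files, headers=partner_headers)
    assert first.status_code == 200    # first upload lands

    second = client.post(url, files=files, headers=partner_headers)
    assert second.status_code == 409   # duplicate rejected, no silent overwrite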

Cost impact

  • AWS storage: ~$0.023/GB/mo. At 100 engagements × 500 MB = 50 GB = $1.15/mo
  • S3 PUT requests for uploads: rounding error
  • No new compute, no new queue service
  • LLM cost during ingest: unchanged (same chunking/classification pipeline as PilotForge)