Phase 25–27 — Self-serve corpus onboarding

Last updated: 2026-05-10

Closes the biggest gap between "demo product" and "billable product" identified during the GTM review: corpus ingestion used to be manual ops work — Base2ML did it for every prospect over a day or two. With Option C complete, a partner-firm admin can self-serve the entire onboarding flow inside the AuditForge UI:

  1. Click + Upload corpus on the Engagements tab
  2. Name the engagement, drag-and-drop files
  3. Click Start ingest → background task chunks + indexes
  4. SSE stream surfaces progress
  5. Land in the engagement detail view with the corpus already bound — fill intake, click Start audit

End-to-end: a partner with 20–200 documents goes from "I have a directory of PDFs" to "I'm running an audit" in 5–10 minutes, no Base2ML involvement.

Endpoints

POST   /auditforge/engagement/{id}/corpus/upload    # multipart, one file per request
DELETE /auditforge/engagement/{id}/corpus/file/{filename}
POST   /auditforge/engagement/{id}/corpus/ingest    # 202 + background task
GET    /auditforge/engagement/{id}/corpus/stream    # SSE progress

All four accept either admin-token or per-user session-token auth; mutations require partner+ role (associates get 403, per Phase 15) and are refused when the engagement is frozen (Phase 20) or the corpus is already ingested.
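
A minimal sketch of the gate these rules imply, assuming string roles ("partner", "admin") and a 409 for the frozen/already-ingested refusals, since the actual status codes aren't pinned down above:

from fastapi import HTTPException

def check_corpus_mutation(role: str, frozen: bool, corpus_status: str) -> None:
    """Gate for the three mutating corpus endpoints (upload / delete / ingest)."""
    if role not in ("partner", "admin"):     # associates fail here (Phase 15)
        raise HTTPException(status_code=403, detail="partner+ role required")
    if frozen:                               # Phase 20 freeze
        raise HTTPException(status_code=409, detail="engagement is frozen")
    if corpus_status == "ingested":          # corpus is immutable once ingested
        raise HTTPException(status_code=409, detail="corpus already ingested")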

Upload constraints

  • Allowed extensions: .txt .md .pdf .docx .csv .xlsx .png .jpg .jpeg .tiff .eml .mbox (the same set the existing ingest pipeline handles)
  • Max per-file size: 50 MB. Uploads are one HTTP request per file, and a 50 MB body finishes well within the ALB's 30-minute request window even on slow links
  • Max files per engagement: 500. A conservative ceiling; real corpora are usually 20–200 docs
  • Duplicate filenames: rejected with 409. This forces the partner to rename or delete, avoiding silent overwrite

Per-file upload (one HTTP request per file) means a network blip on file 47 of 50 only loses file 47 — the frontend retries that file individually. This is "chunked enough" without the complexity of true resumable multipart.
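
A sketch of the per-file validation those constraints add up to; the 409 for duplicates comes from the list above, while 413/415 for the size and type failures are assumed:

from pathlib import Path
from fastapi import HTTPException

ALLOWED_EXTS = {".txt", ".md", ".pdf", ".docx", ".csv", ".xlsx",
                ".png", ".jpg", ".jpeg", ".tiff", ".eml", ".mbox"}
MAX_FILE_BYTES = 50 * 1024 * 1024   # 50 MB per file, one HTTP request each
MAX_FILES = 500                     # per-engagement ceiling

def validate_upload(filename: str, size: int, existing: set[str]) -> None:
    if Path(filename).suffix.lower() not in ALLOWED_EXTS:
        raise HTTPException(status_code=415, detail=f"unsupported type: {filename}")
    if size > MAX_FILE_BYTES:
        raise HTTPException(status_code=413, detail="file exceeds 50 MB")
    if len(existing) >= MAX_FILES:
        raise HTTPException(status_code=413, detail="engagement already has 500 files")
    if filename in existing:
        # duplicates are rejected outright; no silent overwrite
        raise HTTPException(status_code=409, detail=f"duplicate filename: {filename}")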

Corpus state machine

empty → uploading → uploaded → ingesting → ingested
                            ↘ ingest_failed

engagement.corpus.status exposes this state. Once ingested:

  • engagement.client_id is set to engagement.id (the corpus IS the engagement)
  • The audit-run endpoint accepts that client_id and pulls indexes from the standard location
  • Further uploads / deletes are blocked (immutable corpus for that engagement)
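
The shipped code models this with a CorpusStatus dataclass (see Files below); as an illustration only, here is the same machine as a transition table over the status strings, where the retry edge out of ingest_failed and re-entering uploading for additional files are assumptions:

ALLOWED_TRANSITIONS: dict[str, set[str]] = {
    "empty":         {"uploading"},
    "uploading":     {"uploaded"},
    "uploaded":      {"uploading", "ingesting"},    # more uploads until ingest starts (assumed)
    "ingesting":     {"ingested", "ingest_failed"},
    "ingested":      set(),                         # terminal: corpus is immutable
    "ingest_failed": {"ingesting"},                 # assumed: retry re-enters ingesting
}

def advance(current: str, new: str) -> str:
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal corpus transition: {current} -> {new}")
    return new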

Storage layout

# Per-engagement isolated bucket (Phase 7, now default-on per Phase 25 deploy)
metis-af-<engagement_id>-<account_id>/
    auditforge/engagements/<engagement_id>/
        source/                         # uploaded source documents (Phase 25)
        findings.json                   # canonical findings (existing)
        audit_log/shard-*.jsonl         # per-LLM-call audit log (existing)

# Shared platform bucket — corpus indexes (existing PilotForge pattern)
mobilemetis-metis-indexes-<account_id>/
    <engagement_id>/
        index/index.faiss               # FAISS embedding index
        index/chunks.json               # chunk metadata
        index/bm25.pkl                  # BM25 sparse index

Source documents (the raw client data) live in the isolated bucket. Derived embeddings live in the shared bucket — the indexes don't contain raw client content, just numeric vectors and chunk metadata. Future hardening could move indexes into the per-engagement bucket as well; not in scope for Option C.
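
To make the split concrete, a hypothetical pair of path helpers matching the layout above; the bucket-name templates come straight from the listing, while the helper names and placeholder account id do not:

ACCOUNT_ID = "123456789012"   # placeholder

def source_location(engagement_id: str, filename: str) -> tuple[str, str]:
    """Raw client documents -> per-engagement isolated bucket."""
    bucket = f"metis-af-{engagement_id}-{ACCOUNT_ID}"
    return bucket, f"auditforge/engagements/{engagement_id}/source/{filename}"

def index_location(engagement_id: str, artifact: str) -> tuple[str, str]:
    """Derived indexes (index.faiss / chunks.json / bm25.pkl) -> shared bucket."""
    bucket = f"mobilemetis-metis-indexes-{ACCOUNT_ID}"
    return bucket, f"{engagement_id}/index/{artifact}"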

Async ingestion

POST /corpus/ingest returns 202 immediately. A FastAPI BackgroundTask runs the existing ingest.ingest.full_rebuild(slug=engagement_id, ...) pipeline (the same one PilotForge has used for SMB demos). Stage events are pushed onto a per-engagement asyncio.Queue:

{type: "ingest_start", file_count: N, total_bytes: M}
{type: "ingest_stage", stage: "loading",    message: "Loading N documents"}
{type: "ingest_stage", stage: "chunking",   message: "Chunking documents and building indexes"}
{type: "ingest_stage", stage: "finalizing", message: "Persisting index manifest"}
{type: "ingest_complete", engagement_id: "...", client_id: "..."}
# or
{type: "ingest_failed", error: "..."}

The events are stage transitions, not per-file granularity — full_rebuild is internally batched and doesn't emit per-file hooks. Per-file progress would require modifying the ingest pipeline; deliberately deferred since stage-level visibility is enough for "is anything happening" UX.
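
A condensed sketch of that shape; the intermediate stage events are elided here, and full_rebuild's remaining arguments stay omitted as they are above:

import asyncio
from fastapi import APIRouter, BackgroundTasks

from ingest.ingest import full_rebuild   # existing pipeline entry point

router = APIRouter(prefix="/auditforge/engagement")
queues: dict[str, asyncio.Queue] = {}    # one event queue per engagement

async def run_ingest(engagement_id: str) -> None:
    q = queues[engagement_id]
    try:
        # full_rebuild is synchronous; run it in a thread so the event loop
        # (and the SSE stream reading this queue) stays responsive.
        await asyncio.to_thread(full_rebuild, slug=engagement_id)
        await q.put({"type": "ingest_complete",
                     "engagement_id": engagement_id,
                     "client_id": engagement_id})   # the corpus IS the engagement
    except Exception as exc:
        await q.put({"type": "ingest_failed", "error": str(exc)})

@router.post("/{engagement_id}/corpus/ingest", status_code=202)
async def start_ingest(engagement_id: str, background: BackgroundTasks):
    queues.setdefault(engagement_id, asyncio.Queue())
    background.add_task(run_ingest, engagement_id)   # runs after the 202 is returned
    return {"status": "ingesting"}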

SSE stream

GET /corpus/stream emits the queue events as Server-Sent Events. Late subscribers get a snapshot event first so the UI can recover its state. A 30-second keepalive prevents proxy timeouts during long chunking phases. Closes when ingest_complete or ingest_failed arrives.
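
A sketch of that relay, reusing the per-engagement queues dict from the ingest sketch; the snapshot payload here is assumed:

import asyncio, json
from fastapi.responses import StreamingResponse

def sse(event: dict) -> str:
    return f"data: {json.dumps(event)}\n\n"

async def corpus_events(engagement_id: str):
    q = queues.setdefault(engagement_id, asyncio.Queue())
    # Late subscribers get a snapshot first so the UI can recover its state.
    yield sse({"type": "snapshot", "status": "ingesting"})
    while True:
        try:
            event = await asyncio.wait_for(q.get(), timeout=30)
        except asyncio.TimeoutError:
            yield ": keepalive\n\n"    # SSE comment frame; clients ignore it
            continue
        yield sse(event)
        if event["type"] in ("ingest_complete", "ingest_failed"):
            return                     # terminal event closes the stream

@router.get("/{engagement_id}/corpus/stream")
async def corpus_stream(engagement_id: str):
    return StreamingResponse(corpus_events(engagement_id),
                             media_type="text/event-stream")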

Frontend falls back to polling GET /engagement/{id} every 5 seconds if the SSE connection drops — same pattern as the audit-run stream.

UI

A new CorpusUploadModal component handles the entire flow:

  1. Setup: pick firm, name client → creates engagement record
  2. Upload: drag-and-drop or file picker → per-file upload with live status (pending/uploading/done/error). Per-file Retry button on failures. Remove uploaded files individually.
  3. Ingest: "Start ingest" → 202 → SSE progress → stage messages render
  4. Done: green "Corpus ingested" banner → "Open engagement" drops the partner into EngagementDetail

The modal is reachable from the Engagements tab via a new + Upload corpus button next to + New engagement.

What this doesn't do (deferred)

  • Classification override UI: full_rebuild does doc-type/jurisdiction classification internally; partners can't see or correct those choices pre-ingest. The existing PilotForge override mechanism could be wired in if real corpora reveal misclassification problems — deliberately deferred until we have evidence it bites.
  • True resumable multipart uploads: per-file granularity is the pragmatic substitute. A single 50MB PDF that drops mid-upload requires re-uploading that 50MB. Worth fixing if real corpora include many large files.
  • Per-engagement bucket for corpus indexes: indexes still live in the shared bucket. Source documents are isolated; indexes are derived embeddings without raw content. Hardening pass when a customer asks for it.

Files

  • app/auditforge/engagement.py — new CorpusStatus dataclass; EngagementStore.update_corpus
  • app/auditforge_endpoints.py — upload, delete, ingest, stream endpoints; per-engagement asyncio.Queue for SSE
  • frontend/src/api/auditforge.ts — CorpusStatus, uploadCorpusFile, ingestCorpus, deleteCorpusFile, CorpusProgressEvent
  • frontend/src/components/CorpusUploadModal.tsx — drag-drop UI, per-file retry, SSE consumer with polling fallback
  • frontend/src/components/EngagementList.tsx — + Upload corpus button
  • frontend/src/components/AuditForge.tsx — modal wiring
  • tests/test_auditforge_endpoints.py — 12 endpoint tests covering auth/role gating, validation, dedupe, lifecycle (the dedupe case is sketched below)
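
For flavor, roughly what the dedupe test might look like; the fixture names and the 200 on first upload are assumptions:

def test_duplicate_filename_rejected(client, partner_headers, engagement):
    url = f"/auditforge/engagement/{engagement.id}/corpus/upload"
    files = {"file": ("ledger.pdf", b"%PDF-1.4 fake body", "application/pdf")}

    first = client.post(url, files=files, headers=partner_headers)
    assert first.status_code == 200    # first upload lands

    second = client.post(url, files=files, headers=partner_headers)
    assert second.status_code == 409   # duplicate rejected, no silent overwrite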

Cost impact

  • AWS storage: ~$0.023/GB/mo. At 100 engagements × 500 MB = 50 GB = $1.15/mo
  • S3 PUT requests for uploads: rounding error
  • No new compute, no new queue service
  • LLM cost during ingest: unchanged (same chunking/classification pipeline as PilotForge)