# Phase 25–27 — Self-serve corpus onboarding
Last updated: 2026-05-10
Closes the biggest gap between "demo product" and "billable product" identified during the GTM review: corpus ingestion used to be manual ops work — Base2ML did it for every prospect over a day or two. With Option C complete, a partner-firm admin can self-serve the entire onboarding flow inside the AuditForge UI:
- Click `+ Upload corpus` on the Engagements tab
- Name the engagement, drag-and-drop files
- Click `Start ingest` → background task chunks + indexes
- SSE stream surfaces progress
- Land in the engagement detail view with the corpus already bound — fill intake, click `Start audit`
End-to-end: a partner with 20–200 documents goes from "I have a directory of PDFs" to "I'm running an audit" in 5–10 minutes, no Base2ML involvement.
## Endpoints

```
POST   /auditforge/engagement/{id}/corpus/upload            # multipart, one file per request
DELETE /auditforge/engagement/{id}/corpus/file/{filename}
POST   /auditforge/engagement/{id}/corpus/ingest            # 202 + background task
GET    /auditforge/engagement/{id}/corpus/stream            # SSE progress
```
All four accept either admin-token or per-user session-token auth; the partner+ role is required for mutations (associates get a 403, per Phase 15). Mutations are refused when the engagement is frozen (Phase 20) or when the corpus is already ingested.
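For illustration, the guard reduces to a few lines. This is a sketch, not the actual code: the `Role` enum, the `Engagement` record with `frozen`/`corpus_status` fields, and the 409 for the refusal cases are all assumptions (the refusal status code isn't pinned above).

```python
from dataclasses import dataclass
from enum import Enum

from fastapi import HTTPException

class Role(str, Enum):                 # role names assumed from Phase 15
    ASSOCIATE = "associate"
    PARTNER = "partner"
    ADMIN = "admin"

@dataclass
class Engagement:                      # minimal stand-in for the real record
    frozen: bool = False
    corpus_status: str = "empty"

def guard_corpus_mutation(engagement: Engagement, role: Role) -> None:
    """Raise unless this engagement's corpus may be mutated."""
    if role == Role.ASSOCIATE:                    # partner+ only (Phase 15)
        raise HTTPException(403, "partner role or above required")
    if engagement.frozen:                         # frozen engagement (Phase 20)
        raise HTTPException(409, "engagement is frozen")
    if engagement.corpus_status == "ingested":    # immutable once ingested
        raise HTTPException(409, "corpus already ingested")
```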
## Upload constraints

| Constraint | Value | Why |
|---|---|---|
| Allowed extensions | `.txt` `.md` `.pdf` `.docx` `.csv` `.xlsx` `.png` `.jpg` `.jpeg` `.tiff` `.eml` `.mbox` | Same set the existing ingest pipeline handles |
| Max per-file size | 50 MB | One HTTP request per file; 50 MB stays well under ALB's 30-minute request-body limit even on slow links |
| Max files per engagement | 500 | Conservative ceiling; real corpora are usually 20–200 docs |
| Duplicate filenames | Rejected with 409 | Forces the partner to rename or delete; avoids silent overwrite |
Per-file upload (one HTTP request per file) means a network blip on file 47 of 50 loses only file 47 — the frontend retries that file individually. This is "chunked enough" without the complexity of true resumable multipart.
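The real retry logic lives in the React modal; the control flow looks roughly like this Python sketch. The base URL, auth header, multipart field name, and retry count are placeholders.

```python
import pathlib

import requests

def upload_file(base: str, engagement_id: str, path: pathlib.Path,
                token: str, retries: int = 3) -> bool:
    """One HTTP request per file; a failure costs only this file."""
    url = f"{base}/auditforge/engagement/{engagement_id}/corpus/upload"
    for _ in range(retries):
        try:
            with path.open("rb") as fh:
                resp = requests.post(
                    url,
                    files={"file": (path.name, fh)},   # field name assumed
                    headers={"Authorization": f"Bearer {token}"},
                    timeout=120,
                )
        except requests.RequestException:
            continue                    # network blip -> retry just this file
        if resp.ok:
            return True
        if resp.status_code < 500:      # e.g. 409 duplicate: retrying won't help
            return False
    return False
```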
## Corpus state machine

`engagement.corpus.status` exposes the corpus lifecycle state. Once ingested:
- `engagement.client_id` is set to `engagement.id` (the corpus IS the engagement)
- The audit-run endpoint accepts that `client_id` and pulls indexes from the standard location
- Further uploads / deletes are blocked (immutable corpus for that engagement)
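A plausible reconstruction of the lifecycle as code — only `ingested` and its invariants are stated above; the other status names and the transition table are assumptions for illustration:

```python
from enum import Enum

class CorpusStatus(str, Enum):
    EMPTY = "empty"          # engagement created, no files yet (assumed)
    UPLOADING = "uploading"  # at least one file uploaded (assumed)
    INGESTING = "ingesting"  # background task running (assumed)
    INGESTED = "ingested"    # terminal: corpus bound, mutations blocked
    FAILED = "failed"        # ingest_failed (assumed)

ALLOWED: dict[CorpusStatus, set[CorpusStatus]] = {
    CorpusStatus.EMPTY: {CorpusStatus.UPLOADING},
    CorpusStatus.UPLOADING: {CorpusStatus.INGESTING},
    CorpusStatus.INGESTING: {CorpusStatus.INGESTED, CorpusStatus.FAILED},
    CorpusStatus.FAILED: {CorpusStatus.INGESTING},   # re-ingest assumed allowed
    CorpusStatus.INGESTED: set(),                    # immutable once ingested
}

def transition(cur: CorpusStatus, new: CorpusStatus) -> CorpusStatus:
    if new not in ALLOWED[cur]:
        raise ValueError(f"illegal corpus transition {cur.value} -> {new.value}")
    return new
```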
## Storage layout

```
# Per-engagement isolated bucket (Phase 7, now default-on per Phase 25 deploy)
metis-af-<engagement_id>-<account_id>/
  auditforge/engagements/<engagement_id>/
    source/                    # uploaded source documents (Phase 25)
    findings.json              # canonical findings (existing)
    audit_log/shard-*.jsonl    # per-LLM-call audit log (existing)

# Shared platform bucket — corpus indexes (existing PilotForge pattern)
mobilemetis-metis-indexes-<account_id>/
  <engagement_id>/
    index/index.faiss          # FAISS embedding index
    index/chunks.json          # chunk metadata
    index/bm25.pkl             # BM25 sparse index
```
Source documents (the raw client data) live in the isolated bucket. Derived embeddings live in the shared bucket — the indexes don't contain raw client content, just numeric vectors and chunk metadata. Future hardening could move indexes into the per-engagement bucket as well; not in scope for Option C.
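The layout reduces to a couple of key helpers — the function names here are invented; the bucket-name and key templates come straight from the listing above:

```python
def isolated_bucket(engagement_id: str, account_id: str) -> str:
    # raw client documents live here (per-engagement isolation)
    return f"metis-af-{engagement_id}-{account_id}"

def source_key(engagement_id: str, filename: str) -> str:
    return f"auditforge/engagements/{engagement_id}/source/{filename}"

def index_key(engagement_id: str, artifact: str) -> str:
    # derived artifacts: index.faiss, chunks.json, bm25.pkl (shared bucket)
    return f"{engagement_id}/index/{artifact}"
```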
## Async ingestion

`POST /corpus/ingest` returns 202 immediately. A FastAPI `BackgroundTask` runs the existing `ingest.ingest.full_rebuild(slug=engagement_id, ...)` pipeline (the same one PilotForge has used for SMB demos). Stage events emit onto a per-engagement `asyncio.Queue`:

```
{type: "ingest_start", file_count: N, total_bytes: M}
{type: "ingest_stage", stage: "loading", message: "Loading N documents"}
{type: "ingest_stage", stage: "chunking", message: "Chunking documents and building indexes"}
{type: "ingest_stage", stage: "finalizing", message: "Persisting index manifest"}
{type: "ingest_complete", engagement_id: "...", client_id: "..."}
# or
{type: "ingest_failed", error: "..."}
```
The events are stage transitions, not per-file granularity — `full_rebuild` is internally batched and doesn't emit per-file hooks. Per-file progress would require modifying the ingest pipeline; deliberately deferred since stage-level visibility is enough for "is anything happening" UX.
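A minimal sketch of the 202-then-background-task flow, assuming a module-level registry of per-engagement queues; the `full_rebuild` stub, route shape, and response body are illustrative:

```python
import asyncio

from fastapi import BackgroundTasks, FastAPI

app = FastAPI()
queues: dict[str, asyncio.Queue] = {}    # one event queue per engagement

def full_rebuild(slug: str) -> None:
    """Stand-in for the existing ingest.ingest.full_rebuild pipeline."""

async def run_ingest(engagement_id: str) -> None:
    q = queues[engagement_id]
    try:
        await q.put({"type": "ingest_stage", "stage": "loading",
                     "message": "Loading documents"})
        # full_rebuild is synchronous and batched, so push it off the event loop
        await asyncio.to_thread(full_rebuild, slug=engagement_id)
        await q.put({"type": "ingest_complete", "engagement_id": engagement_id})
    except Exception as exc:
        await q.put({"type": "ingest_failed", "error": str(exc)})

@app.post("/auditforge/engagement/{engagement_id}/corpus/ingest", status_code=202)
async def start_ingest(engagement_id: str, background: BackgroundTasks):
    queues.setdefault(engagement_id, asyncio.Queue())
    background.add_task(run_ingest, engagement_id)   # runs after the 202 returns
    return {"status": "ingesting"}
```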
## SSE stream

`GET /corpus/stream` emits the queue events as Server-Sent Events. Late subscribers get a snapshot event first so the UI can recover its state. A 30-second keepalive prevents proxy timeouts during long chunking phases. The stream closes when `ingest_complete` or `ingest_failed` arrives.
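Roughly how that behaves, as a sketch — the snapshot payload shape is an assumption, and the queue registry is carried over from the previous sketch:

```python
import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
queues: dict[str, asyncio.Queue] = {}
latest: dict[str, dict] = {}        # last event per engagement, for snapshots

@app.get("/auditforge/engagement/{engagement_id}/corpus/stream")
async def stream(engagement_id: str) -> StreamingResponse:
    async def gen():
        # Late subscriber: replay the current state first
        if engagement_id in latest:
            snap = {"type": "snapshot", "last": latest[engagement_id]}
            yield f"data: {json.dumps(snap)}\n\n"
        q = queues.setdefault(engagement_id, asyncio.Queue())
        while True:
            try:
                event = await asyncio.wait_for(q.get(), timeout=30)
            except asyncio.TimeoutError:
                yield ": keepalive\n\n"   # SSE comment; defeats proxy idle timeouts
                continue
            latest[engagement_id] = event
            yield f"data: {json.dumps(event)}\n\n"
            if event.get("type") in ("ingest_complete", "ingest_failed"):
                return                    # close the stream on terminal events
    return StreamingResponse(gen(), media_type="text/event-stream")
```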
The frontend falls back to polling `GET /engagement/{id}` every 5 seconds if the SSE connection drops — same pattern as the audit-run stream.
## UI

A new `CorpusUploadModal` component handles the entire flow:
- Setup: pick firm, name client → creates engagement record
- Upload: drag-and-drop or file picker → per-file upload with live status (pending/uploading/done/error). Per-file Retry button on failures. Remove uploaded files individually.
- Ingest: "Start ingest" → 202 → SSE progress → stage messages render
- Done: green "Corpus ingested" banner → "Open engagement" drops the partner into `EngagementDetail`

The modal is reachable from the Engagements tab via a new `+ Upload corpus` button next to `+ New engagement`.
## What this doesn't do (deferred)

- Classification override UI: `full_rebuild` does doc-type/jurisdiction classification internally; partners can't see or correct those choices pre-ingest. The existing PilotForge override mechanism could be wired in if real corpora reveal misclassification problems — deliberately deferred until we have evidence it bites.
- True resumable multipart uploads: per-file granularity is the pragmatic substitute. A single 50 MB PDF that drops mid-upload requires re-uploading that 50 MB. Worth fixing if real corpora include many large files.
- Per-engagement bucket for corpus indexes: indexes still live in the shared bucket. Source documents are isolated; indexes are derived embeddings without raw content. Hardening pass when a customer asks for it.
## Files

- `app/auditforge/engagement.py` — new `CorpusStatus` dataclass; `EngagementStore.update_corpus`
- `app/auditforge_endpoints.py` — upload, delete, ingest, stream endpoints; per-engagement `asyncio.Queue` for SSE
- `frontend/src/api/auditforge.ts` — `CorpusStatus`, `uploadCorpusFile`, `ingestCorpus`, `deleteCorpusFile`, `CorpusProgressEvent`
- `frontend/src/components/CorpusUploadModal.tsx` — drag-drop UI, per-file retry, SSE consumer with polling fallback
- `frontend/src/components/EngagementList.tsx` — `+ Upload corpus` button
- `frontend/src/components/AuditForge.tsx` — modal wiring
- `tests/test_auditforge_endpoints.py` — 12 endpoint tests covering auth/role gating, validation, dedupe, lifecycle
## Cost impact
- AWS storage: ~$0.023/GB/mo. At 100 engagements × 500 MB = 50 GB = $1.15/mo
- S3 PUT requests for uploads: rounding error
- No new compute, no new queue service
- LLM cost during ingest: unchanged (same chunking/classification pipeline as PilotForge)