Business Continuity + Disaster Recovery

DRAFT — 2026-05-10. Pending review by counsel and a security/operations consultant. Customers should treat this as good-faith disclosure of current operational posture, not a binding SLA.

Recovery objectives

| Metric | Target |
| --- | --- |
| Recovery Time Objective (RTO) | < 4 hours for service restoration after a regional AWS event |
| Recovery Point Objective (RPO) | < 1 hour for customer data (S3 versioning + cross-AZ replication) |
| Maximum Tolerable Downtime (MTD) | < 24 hours before customer escalation triggers |

These are operational targets reflecting how the platform is built today. They are not contractual SLAs unless negotiated in a platform-license agreement.

Architecture properties supporting BC/DR

| Property | How it's achieved |
| --- | --- |
| Compute redundancy | ECS Fargate with multi-AZ task placement; the ALB health-checks failed tasks out of rotation; AWS auto-replaces failed tasks within minutes |
| Storage durability | S3 Standard class: 11 nines of durability across three or more AZs in the region |
| Storage versioning | Every per-engagement bucket has S3 versioning enabled (Phase 7); recently deleted or overwritten objects are recoverable for the bucket's retention window |
| Network redundancy | ALB spanning multiple AZs in us-east-1 |
| Configuration recovery | All infrastructure is defined as Terraform in infra/; a full account rebuild from the state file is documented |
| Secrets recovery | API keys and admin tokens are stored in AWS SSM Parameter Store with versioning; rotation is a single Terraform apply |
| Source data recovery | Customer source documents are stored in their per-engagement bucket, and partners retain their own copies of uploads, so end-to-end recoverability is preserved even after catastrophic loss on our side |
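A quick way to spot-check the storage-versioning property is to ask S3 directly. The sketch below is illustrative only: the auditforge- bucket-name prefix is an assumption, not the actual account layout.

```python
"""Spot-check that versioning is enabled on AuditForge buckets."""
import boto3

s3 = boto3.client("s3")

def versioning_status(bucket: str) -> str:
    # get_bucket_versioning returns a dict without a "Status" key if
    # versioning has never been enabled on the bucket.
    resp = s3.get_bucket_versioning(Bucket=bucket)
    return resp.get("Status", "Disabled")

# "auditforge-" is a hypothetical naming prefix for per-engagement buckets.
for b in s3.list_buckets()["Buckets"]:
    name = b["Name"]
    if name.startswith("auditforge-"):
        print(f"{name}: {versioning_status(name)}")
```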

Backup strategy

  • S3 versioning is enabled on all buckets created by AuditForge. Object overwrites and deletes are recoverable for the bucket's retention window; no manual snapshot operation is required.
  • Engagement records (auditforge/engagements.json) are written both to the shared platform bucket and to local disk on the running ECS task. The S3 copy is authoritative; the local copy is a cache.
  • Per-LLM-call audit logs (auditforge/engagements/<id>/audit_log/shard-*.jsonl) are flushed to S3 on a rotating-shard basis (Phase 14) and also persisted locally on the running task. Catastrophic task loss between shard flushes loses at most the events buffered in the active shard, i.e., on the order of minutes of audit activity (see the sketch after this list).
  • Cross-region backup is not enabled today; all customer data lives in us-east-1. A region-wide AWS event would affect availability, not durability: data remains recoverable, but the service is offline until us-east-1 restores.
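To make the bounded-loss window concrete, here is a minimal sketch of a rotating-shard writer in the shape described above. The names, shard-size threshold, and flush policy are assumptions for illustration; the actual Phase 14 implementation may differ.

```python
"""Illustrative rotating-shard audit-log writer (not the Phase 14 code)."""
import json

import boto3

class ShardedAuditLog:
    def __init__(self, bucket: str, engagement_id: str, max_events: int = 500):
        # max_events is an assumed rotation threshold for illustration.
        self.s3 = boto3.client("s3")
        self.bucket = bucket
        self.prefix = f"auditforge/engagements/{engagement_id}/audit_log"
        self.max_events = max_events
        self.shard_no = 0
        self.buffer: list[dict] = []

    def append(self, event: dict) -> None:
        self.buffer.append(event)
        if len(self.buffer) >= self.max_events:
            self.flush()

    def flush(self) -> None:
        # Events still in self.buffer are the only ones at risk if the task
        # dies between flushes -- the bounded-loss window described above.
        if not self.buffer:
            return
        body = "\n".join(json.dumps(e) for e in self.buffer) + "\n"
        key = f"{self.prefix}/shard-{self.shard_no:05d}.jsonl"
        self.s3.put_object(Bucket=self.bucket, Key=key, Body=body.encode())
        self.shard_no += 1
        self.buffer.clear()
```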

Single-region risk (honest disclosure)

AuditForge runs entirely in us-east-1. AWS has had well-documented us-east-1 outages (most notably in 2017 and 2021). During such an event:

  • Existing engagements remain in S3 (data is durable) but are not accessible until the region recovers
  • New engagements cannot be created
  • Audit runs in flight may abort or stall

Cross-region replication is on the roadmap (see roadmap.md). It is not deployed today because:

  1. Single-region durability at 11 nines is sufficient for the data-loss risk profile
  2. Cross-region replication roughly doubles S3 storage cost and adds operational complexity that isn't justified at current customer volume
  3. Customers retain their own copies of source documents, providing a parallel recovery path

We will deploy cross-region replication when (a) a paying customer asks for it as a contract requirement, or (b) customer count crosses 10 firms.

Restoration playbook

For a regional AWS outage:

  1. Monitor the AWS status dashboard for region recovery; no action is possible from our side during the outage
  2. On region recovery, validate ECS task health, ALB health-check pass rates, and S3 read/write across a sample of engagement buckets (a validation sketch follows this list)
  3. Verify audit-log shard-flush continuity (in-flight events from before the outage should resume flushing)
  4. Communicate with customers: status update via docs.base2ml.com and direct email to affected firms
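A minimal validation sketch for step 2. The cluster, service, and bucket names here are placeholders, not the real deployment values.

```python
"""Post-recovery validation: ECS task health plus an S3 round-trip."""
import uuid

import boto3

ecs = boto3.client("ecs", region_name="us-east-1")
s3 = boto3.client("s3", region_name="us-east-1")

def ecs_healthy(cluster: str, service: str) -> bool:
    svc = ecs.describe_services(cluster=cluster, services=[service])["services"][0]
    # Healthy when every desired task is actually running.
    return svc["runningCount"] == svc["desiredCount"]

def s3_round_trip(bucket: str) -> bool:
    # Write, read back, and delete a throwaway object under a scratch prefix.
    key = f"auditforge/_dr_check/{uuid.uuid4()}"
    s3.put_object(Bucket=bucket, Key=key, Body=b"ok")
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    s3.delete_object(Bucket=bucket, Key=key)
    return body == b"ok"

print("ECS:", ecs_healthy("metis-cluster", "metis-demo"))   # placeholder names
print("S3:", s3_round_trip("auditforge-platform"))          # placeholder bucket
```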

For a localized infrastructure issue (e.g., a corrupted task, a misconfigured deployment):

  1. Roll back the ECS service to the prior task-definition revision (aws ecs update-service --cluster <cluster> --service <service> --task-definition metis-demo:N; see the sketch after this list)
  2. If the prior revision is also broken, redeploy from a known-good ECR image tag
  3. If both images are bad, rebuild from the auditforge branch's HEAD-1 commit and force-deploy
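The same rollback, sketched with boto3; the cluster and service names are placeholders.

```python
"""Roll an ECS service back to the prior task-definition revision."""
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

def rollback(cluster: str, service: str) -> str:
    svc = ecs.describe_services(cluster=cluster, services=[service])["services"][0]
    # taskDefinition is a full ARN ending in "<family>:<revision>".
    family_arn, rev = svc["taskDefinition"].rsplit(":", 1)
    if int(rev) <= 1:
        raise RuntimeError("no prior revision to roll back to")
    prior = f"{family_arn}:{int(rev) - 1}"
    ecs.update_service(cluster=cluster, service=service,
                       taskDefinition=prior, forceNewDeployment=True)
    return prior

print("Rolled back to", rollback("metis-cluster", "metis-demo"))  # placeholders
```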

For a corrupted engagement record:

  1. Use S3 versioning to retrieve the prior version of engagements.json (see the sketch after this list)
  2. Manually patch the specific engagement entry
  3. Re-upload to S3; the ECS task picks up the change on its next load (the cache TTL is short)
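A minimal sketch of the restore. The bucket name is a placeholder, and the sketch assumes the prior good version is the newest non-current one.

```python
"""Retrieve the prior version of engagements.json, patch it, re-upload."""
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "auditforge-platform"            # placeholder, not the real bucket
KEY = "auditforge/engagements.json"

# Versions are returned newest-first: index 0 is the current (corrupted)
# object, index 1 is the most recent prior version.
versions = [
    v for v in s3.list_object_versions(Bucket=BUCKET, Prefix=KEY)["Versions"]
    if v["Key"] == KEY
]
prior_id = versions[1]["VersionId"]

doc = json.loads(
    s3.get_object(Bucket=BUCKET, Key=KEY, VersionId=prior_id)["Body"].read()
)
# ... manually patch the affected engagement entry in `doc` here ...
s3.put_object(Bucket=BUCKET, Key=KEY, Body=json.dumps(doc, indent=2).encode())
```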

For a corrupted findings store:

  1. Use S3 versioning to retrieve the prior engagements/<id>/findings.json
  2. Replace the current version
  3. Findings cache invalidates on next read
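The restore is mechanically identical to the engagements.json sketch above: substitute the key auditforge/engagements/<id>/findings.json for the affected engagement, pull the prior version by VersionId, and re-upload the repaired document.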

Tested vs. untested

| Procedure | Tested in production? | Notes |
| --- | --- | --- |
| ECS task rollback to prior revision | ✅ Multiple times during deployment iteration | Standard ops |
| S3 versioning restore | ⚠️ Tested manually | No recovery rehearsal |
| Full-region failover | ❌ | No cross-region deployment to fail over to |
| Customer-facing communications cadence | ⚠️ Sample post-mortem written | No full SEV-1 rehearsal |

Roadmap

  • Cross-region S3 replication (post first 10 paying customers)
  • On-call rotation (post first hire)
  • Quarterly tabletop exercises (post first hire)
  • Third-party BC/DR audit (with SOC 2 Type 2 engagement, mid-2027)

Contact

For BC/DR questions or to request a copy of historical incident reports, email chris@base2ml.com.