# Business Continuity + Disaster Recovery
DRAFT — 2026-05-10. Pending review by counsel and a security/operations consultant. Customers should treat this as good-faith disclosure of current operational posture, not a binding SLA.
## Recovery objectives
| Metric | Target |
|---|---|
| Recovery Time Objective (RTO) | < 4 hours for service restoration after a regional AWS event |
| Recovery Point Objective (RPO) | < 1 hour for customer data (S3 versioning + cross-AZ replication) |
| Maximum Tolerable Downtime (MTD) | < 24 hours before customer escalation triggers |
These are operational targets reflecting how the platform is built today. They are not contractual SLAs unless negotiated in a platform-license agreement.
## Architecture properties supporting BC/DR
| Property | How it's achieved |
|---|---|
| Compute redundancy | ECS Fargate with multi-AZ task placement; ALB health-checks failed tasks out of rotation; AWS auto-replaces failed tasks within minutes |
| Storage durability | S3 standard class — 11 nines of durability across three or more AZs in the region |
| Storage versioning | Every per-engagement bucket has S3 versioning enabled (Phase 7); recently-deleted or overwritten objects are recoverable for the bucket's retention window |
| Network redundancy | ALB across multiple AZs in us-east-1 |
| Configuration recovery | All infrastructure defined as Terraform in `infra/`; full account rebuild from the state file is documented |
| Secrets recovery | API keys + admin tokens stored in AWS SSM Parameter Store with versioning; rotation is a single Terraform apply (see the sketch after this table) |
| Source data recovery | Customer source documents are stored in their per-engagement bucket; partners retain their own copies of uploads, so end-to-end recoverability is preserved even on catastrophic loss |
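A minimal sketch of exercising the "Configuration recovery" and "Secrets recovery" rows above. The parameter path `/auditforge/admin-token` is a hypothetical example, not a documented path; the Terraform commands assume the `infra/` layout described in the table.

```bash
# Inspect prior versions of a secret in SSM Parameter Store
# (parameter path is hypothetical):
aws ssm get-parameter-history \
  --name /auditforge/admin-token \
  --with-decryption \
  --query 'Parameters[].[Version,LastModifiedDate]'

# Configuration recovery: rebuild from the Terraform definitions in infra/.
cd infra/
terraform init && terraform plan   # review the diff before applying
terraform apply
```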
## Backup strategy
- S3 versioning is enabled on all buckets created by AuditForge. Object overwrites and deletes are recoverable for the bucket's retention window; no manual snapshot operation is required (a verification sketch follows this list).
- Engagement records (`auditforge/engagements.json`) are written to the shared platform bucket and to local disk on the running ECS task. The S3 copy is authoritative; local is a cache.
- Per-LLM-call audit logs (`auditforge/engagements/<id>/audit_log/shard-*.jsonl`) are flushed to S3 on a rotating-shard basis (Phase 14) and also persisted locally on the running task. Catastrophic task loss between shard flushes loses at most the events buffered in the active shard (on the order of minutes of audit time).
- Cross-region backup is not enabled today. All customer data lives in `us-east-1`. A region-wide AWS event would impact availability, not durability: data is recoverable, but service is offline until `us-east-1` restores.
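To confirm versioning on a given bucket, as referenced in the first bullet above; the bucket name is a placeholder:

```bash
BUCKET='<engagement-bucket>'   # placeholder: any AuditForge-created bucket
# Expect {"Status": "Enabled"} on every bucket the platform creates.
aws s3api get-bucket-versioning --bucket "$BUCKET"
```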
## Single-region risk (honest disclosure)
AuditForge runs entirely in us-east-1, a region with a documented history of major AWS outages (most notably in 2017 and 2021). During such an event:
- Existing engagements remain in S3 (data is durable) but are not accessible until the region recovers
- New engagements cannot be created
- Audit runs in flight may abort or stall
Cross-region replication is on the roadmap (see roadmap.md). It is not deployed today because:
- Single-region durability at 11 nines is sufficient for the data-loss risk profile
- Cross-region replication doubles S3 cost and adds operational complexity that's not justified at current customer volume
- Customers retain their own copies of source documents, providing a parallel recovery path
We will deploy cross-region replication when (a) a paying customer asks for it as a contract requirement, or (b) customer count crosses 10 firms.
## Restoration playbook
For a regional AWS outage:
- Monitor AWS status dashboard for region recovery; no action possible from our side during the outage
- On region recovery, validate ECS task health, ALB health-check pass rates, and S3 read/write across a sample of engagement buckets (scripted sketch after this list)
- Verify audit-log shard-flush continuity (in-flight events from before the outage should resume flushing)
- Customer communication: status update via docs.base2ml.com and direct email to affected firms
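The validation in the second step can be scripted. A sketch, where every name is a placeholder for the real cluster, service, target group, and bucket:

```bash
CLUSTER='<cluster>'; SERVICE='<service>'                    # placeholders
TG_ARN='<target-group-arn>'; BUCKET='<engagement-bucket>'   # placeholders

# ECS task health: running count should match desired count.
aws ecs describe-services --cluster "$CLUSTER" --services "$SERVICE" \
  --query 'services[0].[runningCount,desiredCount]'

# ALB health checks: all targets should report "healthy".
aws elbv2 describe-target-health --target-group-arn "$TG_ARN" \
  --query 'TargetHealthDescriptions[].TargetHealth.State'

# S3 read/write round-trip against a sample engagement bucket.
echo dr-probe > /tmp/dr-probe
aws s3 cp /tmp/dr-probe "s3://$BUCKET/dr-probe"
aws s3 cp "s3://$BUCKET/dr-probe" -
```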
For a localized infrastructure issue (e.g., a corrupted task, a misconfigured deployment):
- Roll back the ECS service to the prior task-definition revision (`aws ecs update-service --task-definition metis-demo:N`), as sketched below
- If the prior revision is also broken, redeploy from a known-good ECR image tag
- If both Docker images are bad, rebuild from the `auditforge` branch `HEAD~1` commit and force-deploy
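An expansion of the rollback command from the first step. The doc gives only the task-definition flag, so the cluster and service names here are placeholders:

```bash
CLUSTER='<cluster>'; SERVICE='<service>'   # placeholders

# Pin the service back to the prior task-definition revision (metis-demo:N).
aws ecs update-service \
  --cluster "$CLUSTER" \
  --service "$SERVICE" \
  --task-definition metis-demo:N

# Watch the rollback deployment converge.
aws ecs describe-services --cluster "$CLUSTER" --services "$SERVICE" \
  --query 'services[0].deployments[].[status,taskDefinition,runningCount]'
```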
For a corrupted engagement record:
- Use S3 versioning to retrieve the prior `engagements.json` (sketch below)
- Manually patch the specific engagement entry
- Re-upload to S3; the ECS task picks up changes on next load (cache TTL is short)
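A sketch of the versioned restore; the shared platform bucket name and the known-good version ID are placeholders:

```bash
BUCKET='<platform-bucket>'           # placeholder
KEY=auditforge/engagements.json

# List versions and pick the last known-good VersionId.
aws s3api list-object-versions --bucket "$BUCKET" --prefix "$KEY" \
  --query 'Versions[].[VersionId,LastModified,IsLatest]'

# Fetch that version locally, patch the broken engagement entry, re-upload.
VERSION_ID='<good-version-id>'       # placeholder
aws s3api get-object --bucket "$BUCKET" --key "$KEY" \
  --version-id "$VERSION_ID" engagements.json
# ...edit engagements.json by hand...
aws s3 cp engagements.json "s3://$BUCKET/$KEY"
```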
For a corrupted findings store:
- Use S3 versioning to retrieve the prior `engagements/<id>/findings.json`
- Replace the current version
- The findings cache invalidates on next read
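The versioned-restore sketch shown for `engagements.json` above applies here unchanged, with `BUCKET` and `KEY` pointed at `engagements/<id>/findings.json` for the affected engagement.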
## Tested vs. untested
| Procedure | Tested in production? | Notes |
|---|---|---|
| ECS task rollback to prior revision | ✅ Multiple times during deployment iteration | Standard ops |
| S3 versioning restore | ⚠️ Tested manually | No full recovery rehearsal |
| Full-region failover | ❌ Never | No cross-region deployment to fail over to |
| Customer-facing communications cadence | ⚠️ Partially | Sample post-mortem written, but no full SEV-1 rehearsal |
## Roadmap
- Cross-region S3 replication (post first 10 paying customers)
- On-call rotation (post first hire)
- Quarterly tabletop exercises (post first hire)
- Third-party BC/DR audit (with SOC 2 Type 2 engagement, mid-2027)
## Contact
For BC/DR questions or to request a copy of historical incident reports, email chris@base2ml.com.