Business Continuity + Disaster Recovery

DRAFT — 2026-05-10. Pending review by counsel and a security/operations consultant. Customers should treat this as good-faith disclosure of current operational posture, not a binding SLA.

Recovery objectives

| Metric | Target |
| --- | --- |
| Recovery Time Objective (RTO) | < 4 hours for service restoration after a regional AWS event |
| Recovery Point Objective (RPO) | < 1 hour for customer data (S3 versioning + cross-AZ replication) |
| Maximum Tolerable Downtime (MTD) | < 24 hours before customer escalation triggers |

These are operational targets reflecting how the platform is built today. They are not contractual SLAs unless negotiated in a platform-license agreement.

Architecture properties supporting BC/DR

| Property | How it's achieved |
| --- | --- |
| Compute redundancy | ECS Fargate with multi-AZ task placement; the ALB health-checks failed tasks out of rotation; AWS auto-replaces failed tasks within minutes |
| Storage durability | S3 Standard class: 11 nines of durability across three or more AZs in the region |
| Storage versioning | Every per-engagement bucket has S3 versioning enabled (Phase 7); recently deleted or overwritten objects are recoverable for the bucket's retention window |
| Network redundancy | ALB spanning multiple AZs in us-east-1 |
| Configuration recovery | All infrastructure is defined as Terraform in infra/; a full account rebuild from the state file is documented |
| Secrets recovery | API keys and admin tokens are stored in AWS SSM Parameter Store with versioning; rotation is a single Terraform apply |
| Source data recovery | Customer source documents are stored in their per-engagement bucket, and partners retain their own copies of uploads, so end-to-end recoverability is preserved even after catastrophic loss on our side |
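A quick way to spot-check the storage-versioning property is to ask S3 directly. The sketch below is illustrative only: the auditforge- bucket-name prefix is an assumption, not the actual account layout.

```python
"""Spot-check that versioning is enabled on AuditForge buckets."""
import boto3

s3 = boto3.client("s3")

def versioning_status(bucket: str) -> str:
    # get_bucket_versioning returns a dict without a "Status" key if
    # versioning has never been enabled on the bucket.
    resp = s3.get_bucket_versioning(Bucket=bucket)
    return resp.get("Status", "Disabled")

# "auditforge-" is a hypothetical naming prefix for per-engagement buckets.
for b in s3.list_buckets()["Buckets"]:
    name = b["Name"]
    if name.startswith("auditforge-"):
        print(f"{name}: {versioning_status(name)}")
```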

Backup strategy

  • S3 versioning is enabled on all buckets created by AuditForge. Object overwrites and deletes are recoverable for the bucket's retention window; no manual snapshot operation is required.
  • Engagement records (auditforge/engagements.json) are written both to the shared platform bucket and to local disk on the running ECS task. The S3 copy is authoritative; the local copy is a cache.
  • Per-LLM-call audit logs (auditforge/engagements/<id>/audit_log/shard-*.jsonl) are flushed to S3 on a rotating-shard basis (Phase 14) and also persisted locally on the running task. Catastrophic task loss between shard flushes loses at most the events buffered in the active shard, i.e., on the order of minutes of audit activity (see the sketch after this list).
  • Cross-region backup is not enabled today; all customer data lives in us-east-1. A region-wide AWS event would affect availability, not durability: data remains recoverable, but the service is offline until us-east-1 restores.
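To make the bounded-loss window concrete, here is a minimal sketch of a rotating-shard writer in the shape described above. The names, shard-size threshold, and flush policy are assumptions for illustration; the actual Phase 14 implementation may differ.

```python
"""Illustrative rotating-shard audit-log writer (not the Phase 14 code)."""
import json

import boto3

class ShardedAuditLog:
    def __init__(self, bucket: str, engagement_id: str, max_events: int = 500):
        # max_events is an assumed rotation threshold for illustration.
        self.s3 = boto3.client("s3")
        self.bucket = bucket
        self.prefix = f"auditforge/engagements/{engagement_id}/audit_log"
        self.max_events = max_events
        self.shard_no = 0
        self.buffer: list[dict] = []

    def append(self, event: dict) -> None:
        self.buffer.append(event)
        if len(self.buffer) >= self.max_events:
            self.flush()

    def flush(self) -> None:
        # Events still in self.buffer are the only ones at risk if the task
        # dies between flushes -- the bounded-loss window described above.
        if not self.buffer:
            return
        body = "\n".join(json.dumps(e) for e in self.buffer) + "\n"
        key = f"{self.prefix}/shard-{self.shard_no:05d}.jsonl"
        self.s3.put_object(Bucket=self.bucket, Key=key, Body=body.encode())
        self.shard_no += 1
        self.buffer.clear()
```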

Single-region risk (honest disclosure)

AuditForge runs entirely in us-east-1. AWS has had well-documented us-east-1 outages (most notably in 2017 and 2021). During such an event:

  • Existing engagements remain in S3 (data is durable) but are not accessible until the region recovers
  • New engagements cannot be created
  • Audit runs in flight may abort or stall

Cross-region replication is on the roadmap (see roadmap.md). It is not deployed today because:

  1. Single-region durability at 11 nines is sufficient for the data-loss risk profile
  2. Cross-region replication roughly doubles S3 storage cost and adds operational complexity that isn't justified at current customer volume
  3. Customers retain their own copies of source documents, providing a parallel recovery path

We will deploy cross-region replication when (a) a paying customer asks for it as a contract requirement, or (b) customer count crosses 10 firms.

Restoration playbook

For a regional AWS outage:

  1. Monitor the AWS status dashboard for region recovery; no action is possible from our side during the outage
  2. On region recovery, validate ECS task health, ALB health-check pass rates, and S3 read/write across a sample of engagement buckets (a validation sketch follows this list)
  3. Verify audit-log shard-flush continuity (in-flight events from before the outage should resume flushing)
  4. Communicate with customers: status update via docs.base2ml.com and direct email to affected firms
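A minimal validation sketch for step 2. The cluster, service, and bucket names here are placeholders, not the real deployment values.

```python
"""Post-recovery validation: ECS task health plus an S3 round-trip."""
import uuid

import boto3

ecs = boto3.client("ecs", region_name="us-east-1")
s3 = boto3.client("s3", region_name="us-east-1")

def ecs_healthy(cluster: str, service: str) -> bool:
    svc = ecs.describe_services(cluster=cluster, services=[service])["services"][0]
    # Healthy when every desired task is actually running.
    return svc["runningCount"] == svc["desiredCount"]

def s3_round_trip(bucket: str) -> bool:
    # Write, read back, and delete a throwaway object under a scratch prefix.
    key = f"auditforge/_dr_check/{uuid.uuid4()}"
    s3.put_object(Bucket=bucket, Key=key, Body=b"ok")
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    s3.delete_object(Bucket=bucket, Key=key)
    return body == b"ok"

print("ECS:", ecs_healthy("metis-cluster", "metis-demo"))   # placeholder names
print("S3:", s3_round_trip("auditforge-platform"))          # placeholder bucket
```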

For a localized infrastructure issue (e.g., a corrupted task, a misconfigured deployment):

  1. Roll back the ECS service to the prior task-definition revision (aws ecs update-service --cluster <cluster> --service <service> --task-definition metis-demo:N; see the sketch after this list)
  2. If the prior revision is also broken, redeploy from a known-good ECR image tag
  3. If both images are bad, rebuild from the auditforge branch's HEAD-1 commit and force-deploy
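The same rollback, sketched with boto3; the cluster and service names are placeholders.

```python
"""Roll an ECS service back to the prior task-definition revision."""
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

def rollback(cluster: str, service: str) -> str:
    svc = ecs.describe_services(cluster=cluster, services=[service])["services"][0]
    # taskDefinition is a full ARN ending in "<family>:<revision>".
    family_arn, rev = svc["taskDefinition"].rsplit(":", 1)
    if int(rev) <= 1:
        raise RuntimeError("no prior revision to roll back to")
    prior = f"{family_arn}:{int(rev) - 1}"
    ecs.update_service(cluster=cluster, service=service,
                       taskDefinition=prior, forceNewDeployment=True)
    return prior

print("Rolled back to", rollback("metis-cluster", "metis-demo"))  # placeholders
```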

For a corrupted engagement record:

  1. Use S3 versioning to retrieve the prior version of engagements.json (see the sketch after this list)
  2. Manually patch the specific engagement entry
  3. Re-upload to S3; the ECS task picks up the change on its next load (the cache TTL is short)
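A minimal sketch of the restore. The bucket name is a placeholder, and the sketch assumes the prior good version is the newest non-current one.

```python
"""Retrieve the prior version of engagements.json, patch it, re-upload."""
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "auditforge-platform"            # placeholder, not the real bucket
KEY = "auditforge/engagements.json"

# Versions are returned newest-first: index 0 is the current (corrupted)
# object, index 1 is the most recent prior version.
versions = [
    v for v in s3.list_object_versions(Bucket=BUCKET, Prefix=KEY)["Versions"]
    if v["Key"] == KEY
]
prior_id = versions[1]["VersionId"]

doc = json.loads(
    s3.get_object(Bucket=BUCKET, Key=KEY, VersionId=prior_id)["Body"].read()
)
# ... manually patch the affected engagement entry in `doc` here ...
s3.put_object(Bucket=BUCKET, Key=KEY, Body=json.dumps(doc, indent=2).encode())
```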

For a corrupted findings store:

  1. Use S3 versioning to retrieve the prior engagements/<id>/findings.json
  2. Replace the current version
  3. Findings cache invalidates on next read
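The restore is mechanically identical to the engagements.json sketch above: substitute the key auditforge/engagements/<id>/findings.json for the affected engagement, pull the prior version by VersionId, and re-upload the repaired document.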

Tested vs. untested

| Procedure | Tested in production? | Notes |
| --- | --- | --- |
| ECS task rollback to prior revision | ✅ Multiple times during deployment iteration | Standard ops |
| S3 versioning restore | ⚠️ Tested manually | No recovery rehearsal |
| Full-region failover | ❌ | No cross-region deployment to fail over to |
| Customer-facing communications cadence | ⚠️ Sample post-mortem written | No full SEV-1 rehearsal |

Roadmap

  • Cross-region S3 replication (post first 10 paying customers)
  • On-call rotation (post first hire)
  • Quarterly tabletop exercises (post first hire)
  • Third-party BC/DR audit (with SOC 2 Type 2 engagement, mid-2027)

Contact

For BC/DR questions or to request a copy of historical incident reports, email chris@base2ml.com.