11 — Disaster Recovery

Task 9: design disaster recovery. DR for BitVault is unusually tractable because of the platform design: clusters are cattle (rebuild from IaC + Git), so the only irreplaceable things are data (metadata DB + object bytes) and keys (KMS). Decision in ADR-0033.


1. The DR thesis

Re-creating a cluster is declarative: tofu apply rebuilds the substrate, ArgoCD re-syncs everything in-cluster from Git, and data is restored from backups. DR is a procedure, not heroics — and it is testable on demand. The whole GitOps + IaC investment pays off here.

What is irreplaceable (and thus the focus of DR):

  1. Metadata DB (Postgres) — the source of truth; without it the bytes are unreadable noise (storage/08).
  2. Object-storage bytes — the files; durable + replicated (ADR-0020).
  3. KMS keys — lose them and encrypted data is gone (ADR-0014).

Everything else (clusters, app config, derived indexes) is rebuildable.


2. RTO / RPO targets

| Component | RPO (data loss) | RTO (downtime) | Mechanism |
|---|---|---|---|
| Metadata DB (Postgres) | seconds–minutes | < 1 h | continuous WAL archiving → PITR; cross-region replica (12) |
| Object bytes | ~0 | minutes (failover) | provider durability + cross-region/provider replication (ADR-0020) |
| KMS keys | 0 | minutes | multi-region keys + deletion protection (12) |
| Cluster / control plane | 0 (state in Git/IaC) | 30–60 min | tofu apply + ArgoCD re-sync |
| Derived: OpenSearch | loose (rebuildable) | rebuild time | re-index from source (ADR-0009) |
| Derived: Redis/NATS | loose | minutes | cache cold-start; journal replay (sync/07) |

Targets are validated by game days (§6), not assumed.
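
As one way to make that validation mechanical, a game-day harness could compare measured recovery times with the targets in the table. The sketch below is illustrative only: the two RTO bounds are read from the table (< 1 h, 30–60 min), while the dictionary keys, the function name, and the measured values are hypothetical.

```python
from datetime import timedelta

# RTO upper bounds taken from the targets table; the keys are hypothetical labels.
RTO_TARGETS = {
    "metadata_db": timedelta(hours=1),   # "< 1 h"
    "cluster": timedelta(minutes=60),    # upper end of "30–60 min"
}

def rto_breaches(measured: dict[str, timedelta]) -> dict[str, timedelta]:
    """Return, per component, how far a measured recovery time overshot its target."""
    breaches = {}
    for component, target in RTO_TARGETS.items():
        if component not in measured:
            continue  # component was not exercised in this game day
        overshoot = measured[component] - target
        if overshoot > timedelta(0):
            breaches[component] = overshoot
    return breaches

# Example game-day measurements (made-up numbers).
example = {
    "metadata_db": timedelta(minutes=42),
    "cluster": timedelta(minutes=75),
}
print(rto_breaches(example))
```

An empty result means every exercised component met its target; anything else is a finding to feed back into the runbook or the target itself.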


3. DR scenarios (escalating)

| Scenario | Response | Automatic? |
|---|---|---|
| Pod/node failure | K8s reschedules; PDBs hold quorum | yes (resilience, not DR) |
| AZ failure | multi-AZ nodes + DB replicas + storage → ride through | yes |
| Cluster loss/corruption | tofu apply rebuild → ArgoCD re-sync → restore PVs/data (Velero + PITR, 12) | semi (runbook) |
| Region loss | fail over to DR region (§4) | runbook + DNS |
| Logical disaster (ransomware, bad migration, mass delete) | PITR to before the event; object-lock immutable backups; sync bulk-delete brake (sync/ADR-0027) | runbook |
| Accidental deletion | trash + version history + PITR (storage/07) | self-serve |
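
For the logical-disaster row, "PITR to before the event" comes down to choosing a recovery target just ahead of the first bad write. The following is a minimal sketch, assuming the incident time is known; the helper name and the safety margin are hypothetical. `recovery_target_time` and `recovery_target_inclusive` are standard PostgreSQL recovery settings, but how BitVault's restore tooling consumes them is not specified here.

```python
from datetime import datetime, timedelta, timezone

def pitr_settings(event_time: datetime,
                  oldest_restorable: datetime,
                  margin: timedelta = timedelta(seconds=30)) -> dict[str, str]:
    """Pick a recovery target shortly before a damaging event.

    event_time        -- when the bad migration / mass delete started (tz-aware)
    oldest_restorable -- start of the window covered by base backup + archived WAL
    margin            -- hypothetical safety buffer before the event
    """
    if event_time.tzinfo is None or oldest_restorable.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    target = event_time - margin
    if target < oldest_restorable:
        raise ValueError("target predates the restorable window")
    return {
        "recovery_target_time":
            target.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S+00"),
        # Stop just before the target rather than just after it.
        "recovery_target_inclusive": "off",
    }

# Example with made-up timestamps.
event = datetime(2024, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
oldest = datetime(2024, 4, 24, tzinfo=timezone.utc)
settings = pitr_settings(event, oldest)
print(settings)
```

Failing loudly when the target falls outside the restorable window is deliberate: a silent fallback to the oldest backup would hide a retention gap.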

4. Region failover (warm standby)

flowchart TB
    classDef p fill:#bbf7d0,stroke:#15803d,color:#111827;
    classDef d fill:#fde68a,stroke:#b45309,color:#111827;
    classDef x fill:#fecaca,stroke:#b91c1c,color:#111827;
    subgraph PRIMARY["Primary region"]
      pdb[("Postgres primary")]:::p
      pobj[("Object storage")]:::p
      pcl["cluster (IaC + GitOps)"]:::p
    end
    subgraph DR["DR region (warm standby)"]
      ddb[("Postgres cross-region replica")]:::d
      dobj[("Replicated object storage")]:::d
      dcl["cluster: tofu apply + ArgoCD sync"]:::d
    end
    pdb -->|stream WAL| ddb
    pobj -->|cross-region replication| dobj
    fail["Primary region lost"]:::x --> promote["promote replica → primary"]:::d
    promote --> dcl
    dcl --> dns["DNS / global LB cutover → DR region"]:::d
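
The ordering in the diagram (promote the replica, bring up the DR cluster, then cut traffic over) can be expressed as an ordered runbook that halts on the first failed step and records elapsed time for comparison with the RTO. This is a sketch with placeholder steps; the step functions are hypothetical stand-ins, not BitVault tooling.

```python
import time
from typing import Callable

Step = tuple[str, Callable[[], bool]]

def run_failover(steps: list[Step]) -> tuple[list[str], float]:
    """Run steps in order and stop at the first one that reports failure.

    Returns the names of the completed steps and the total elapsed seconds.
    """
    completed = []
    start = time.monotonic()
    for name, action in steps:
        if not action():
            break  # later steps depend on earlier ones, so do not continue
        completed.append(name)
    return completed, time.monotonic() - start

# Placeholder actions; real ones would call the DB, IaC, and DNS tooling.
steps = [
    ("promote replica to primary", lambda: True),
    ("tofu apply + ArgoCD sync in DR region", lambda: True),
    ("DNS / global LB cutover", lambda: True),
]
done, elapsed = run_failover(steps)
print(done, f"{elapsed:.3f}s")
```

Keeping DNS cutover last means traffic only moves once the promoted database and the DR cluster are both reported healthy.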

5. The cross-system consistency nuance (don’t miss this)

A consistent restore needs the metadata DB and the object bytes it references to agree. Two safeguards make this hold:
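
Independently of those safeguards, the agreement itself can be verified after a restore by checking that every object key the metadata references is present in the object store. The following is a minimal sketch over plain sets; in practice the two inputs would come from a metadata query and a bucket listing, and the function and key names are hypothetical.

```python
def restore_consistency(referenced_keys: set[str],
                        stored_keys: set[str]) -> dict[str, set[str]]:
    """Compare keys referenced by restored metadata with keys present in object storage."""
    return {
        # Metadata points at bytes that do not exist: user-visible data loss.
        "missing_objects": referenced_keys - stored_keys,
        # Bytes with no metadata row: unreachable through the metadata DB.
        "orphaned_objects": stored_keys - referenced_keys,
    }

# Example with made-up keys.
report = restore_consistency({"a", "b", "c"}, {"b", "c", "d"})
print(report)
```

A non-empty `missing_objects` set is the case that matters for DR: it means the metadata was restored to a point the object bytes do not cover.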


6. Game days & runbooks


7. Tradeoffs / Alternatives / Scaling

Tradeoffs. Warm standby (cross-region replica + object replication) roughly doubles storage cost and adds continuous replication egress; the alternative (cold, backup-restore only) is cheaper, but RTO balloons to hours. We split the difference: warm for state (low RPO), rebuild-on-demand for compute (low cost). That is the cost-effective middle.

Alternatives considered.

Scaling concerns.

References