11 — Disaster Recovery
Task 9: design disaster recovery. DR for BitVault is unusually tractable because of the platform design: clusters are cattle (rebuild from IaC + Git), so the only irreplaceable things are data (metadata DB + object bytes) and keys (KMS). Decision in ADR-0033.
1. The DR thesis
Re-creating a cluster is declarative: tofu apply rebuilds the substrate, ArgoCD re-syncs everything in-cluster from Git, and data is restored from backups. DR is a procedure, not heroics, and it is testable on demand. The whole GitOps + IaC investment pays off here.
What is irreplaceable (and thus the focus of DR):
- Metadata DB (Postgres) — the source of truth; without it the bytes are unreadable noise (storage/08).
- Object-storage bytes — the files; durable + replicated (ADR-0020).
- KMS keys — lose them and encrypted data is gone (ADR-0014).
Everything else (clusters, app config, derived indexes) is rebuildable.
2. RTO / RPO targets
| Component | RPO (data loss) | RTO (downtime) | Mechanism |
|---|---|---|---|
| Metadata DB (Postgres) | seconds–minutes | < 1 h | continuous WAL archiving → PITR; cross-region replica (12) |
| Object bytes | ~0 | minutes (failover) | provider durability + cross-region/provider replication (ADR-0020) |
| KMS keys | 0 | minutes | multi-region keys + deletion protection (12) |
| Cluster / control plane | 0 (state in Git/IaC) | 30–60 min | tofu apply + ArgoCD re-sync |
| Derived: OpenSearch | loose (rebuildable) | rebuild time | re-index from source (ADR-0009) |
| Derived: Redis/NATS | loose | minutes | cache cold-start; journal replay (sync/07) |
Targets are validated by game days (§6), not assumed.
3. DR scenarios (escalating)
| Scenario | Response | Automatic? |
|---|---|---|
| Pod/node failure | K8s reschedules; PDBs hold quorum | yes (resilience, not DR) |
| AZ failure | multi-AZ nodes + DB replicas + storage → ride through | yes |
| Cluster loss/corruption | tofu apply rebuild → ArgoCD re-sync → restore PVs/data (Velero + PITR, 12) | semi (runbook) |
| Region loss | fail over to DR region (§4) | runbook + DNS |
| Logical disaster (ransomware, bad migration, mass delete) | PITR to before the event; object-lock immutable backups; sync bulk-delete brake (sync/ADR-0027) | runbook |
| Accidental deletion | trash + version history + PITR (storage/07) | self-serve |
4. Region failover (warm standby)
flowchart TB
classDef p fill:#bbf7d0,stroke:#15803d,color:#111827;
classDef d fill:#fde68a,stroke:#b45309,color:#111827;
classDef x fill:#fecaca,stroke:#b91c1c,color:#111827;
subgraph PRIMARY["Primary region"]
pdb[("Postgres primary")]:::p
pobj[("Object storage")]:::p
pcl["cluster (IaC + GitOps)"]:::p
end
subgraph DR["DR region (warm standby)"]
ddb[("Postgres cross-region replica")]:::d
dobj[("Replicated object storage")]:::d
dcl["cluster: tofu apply + ArgoCD sync"]:::d
end
pdb -->|stream WAL| ddb
pobj -->|cross-region replication| dobj
fail["Primary region lost"]:::x --> promote["promote replica → primary"]:::d
promote --> dcl
dcl --> dns["DNS / global LB cutover → DR region"]:::d
- Warm standby for state (cross-region Postgres replica + replicated object storage) keeps RPO low; cold/fast-rebuild for compute (the DR cluster is tofu apply + ArgoCD sync, since it’s all in Git) keeps cost down.
- Promote the DB replica, sync the DR cluster, cut over DNS.
- Consistent with NG9 (active-active is a non-goal): this is standby failover, and the architecture does not preclude going multi-region active later.
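The failover order above (promote the replica, sync the DR cluster, cut over DNS) can be sketched as an ordered runbook runner. This is a minimal illustration, not BitVault's tooling: `Step`, `FailoverRun`, `run_failover`, and the three stub actions are hypothetical names, and each real action would wrap whatever the runbook actually invokes.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str
    action: Callable[[], bool]  # returns True on success (hypothetical contract)


@dataclass
class FailoverRun:
    completed: List[str] = field(default_factory=list)
    failed_at: Optional[str] = None


def run_failover(steps: List[Step]) -> FailoverRun:
    """Run steps strictly in order and stop at the first failure.

    DNS must not be cut over to a region whose database is still a
    read-only replica, so a failed step halts everything after it.
    """
    run = FailoverRun()
    for step in steps:
        if not step.action():
            run.failed_at = step.name
            break
        run.completed.append(step.name)
    return run


# Stub actions standing in for the real runbook commands (assumptions).
def promote_replica() -> bool:
    return True


def sync_dr_cluster() -> bool:
    return True


def cutover_dns() -> bool:
    return True


REGION_FAILOVER = [
    Step("promote_replica", promote_replica),
    Step("sync_dr_cluster", sync_dr_cluster),
    Step("cutover_dns", cutover_dns),
]

result = run_failover(REGION_FAILOVER)
```

Keeping the sequence as data makes the same list usable in a game day, where each stub is swapped for the real action and the returned `FailoverRun` records how far the drill got.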
5. The cross-system consistency nuance (don’t miss this)
A consistent restore needs the metadata DB and the object bytes it references to agree. Two safeguards make this hold:
- Content-addressed, immutable blobs (storage/02): a blob referenced by a restored metadata version still has the same bytes; restore metadata to time T and the referenced chunks are valid.
- Backup retention ≥ GC grace (storage/11 GC): GC must not delete a blob that any restorable metadata snapshot references. So the GC grace window and PITR window are aligned, and backups capture a metadata+blob set that is mutually consistent. Miss this and a PITR restore can dangle.
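The rule that GC must not delete a blob that any restorable metadata snapshot references can be expressed as two small checks. This is a sketch under stated assumptions: `windows_aligned` and `safe_to_collect` are hypothetical helpers, and the 14-day window in the example is an illustrative value, not a BitVault setting.

```python
from datetime import datetime, timedelta, timezone


def windows_aligned(gc_grace: timedelta, pitr_window: timedelta) -> bool:
    """True if the GC grace period covers the whole restorable window."""
    return gc_grace >= pitr_window


def safe_to_collect(
    unreferenced_since: datetime,
    now: datetime,
    pitr_window: timedelta,
) -> bool:
    """True if no restorable point in time can still reference the blob.

    Any metadata state older than `unreferenced_since` may point at the
    blob, so the blob is collectable only once that moment has fallen
    out of the PITR window.
    """
    oldest_restorable = now - pitr_window
    return unreferenced_since < oldest_restorable


# Illustrative values only (assumptions).
now = datetime(2024, 1, 31, tzinfo=timezone.utc)
window = timedelta(days=14)

old_blob = safe_to_collect(datetime(2024, 1, 10, tzinfo=timezone.utc), now, window)
recent_blob = safe_to_collect(datetime(2024, 1, 20, tzinfo=timezone.utc), now, window)
```

Here `old_blob` is collectable because it was dereferenced before the oldest restorable point (17 January), while `recent_blob` is not: a restore to 18 January would still reference it.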
6. Game days & runbooks
- Runbooks for each scenario (cluster rebuild, region failover, PITR restore), stored in docs/runbooks/ (10 docs structure).
- Scheduled game days: actually destroy a nonprod cluster and rebuild from IaC+Git; actually restore a Postgres PITR and verify. An untested DR plan is fiction.
- DR metrics: measured RTO/RPO from drills vs targets (§2), tracked over time.
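Comparing drill measurements against the §2 targets could look like the following sketch. The names (`Target`, `DrillResult`, `evaluate`, the component keys) are hypothetical. The 1 h and 60 min RTO figures come from the §2 table; the 5-minute RPO is an assumed stand-in for "seconds–minutes", which the table does not pin to a number.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Dict, List


@dataclass(frozen=True)
class Target:
    rpo: timedelta
    rto: timedelta


@dataclass(frozen=True)
class DrillResult:
    component: str
    measured_rpo: timedelta
    measured_rto: timedelta


# RTO values follow the §2 table; the metadata-db RPO is an assumption.
TARGETS: Dict[str, Target] = {
    "metadata-db": Target(rpo=timedelta(minutes=5), rto=timedelta(hours=1)),
    "cluster": Target(rpo=timedelta(0), rto=timedelta(minutes=60)),
}


def evaluate(result: DrillResult) -> List[str]:
    """Return one message per target the drill missed (empty list = pass)."""
    target = TARGETS[result.component]
    misses: List[str] = []
    if result.measured_rpo > target.rpo:
        misses.append(
            f"{result.component}: RPO {result.measured_rpo} exceeds {target.rpo}"
        )
    if result.measured_rto > target.rto:
        misses.append(
            f"{result.component}: RTO {result.measured_rto} exceeds {target.rto}"
        )
    return misses


slow_restore = DrillResult("metadata-db", timedelta(minutes=2), timedelta(minutes=75))
misses = evaluate(slow_restore)
```

Storing each `DrillResult` with its drill date gives the over-time trend: a target that passes today but drifts toward its limit is visible before it becomes a miss.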
7. Tradeoffs / Alternatives / Scaling
Tradeoffs. Warm standby (cross-region replica + replication) continuously costs roughly double the storage plus egress; the alternative (cold/backup-restore only) is cheaper but RTO balloons to hours. We split: warm for state (low RPO), rebuild-on-demand for compute (low cost). This is the cost-effective middle.
Alternatives considered.
- Active-active multi-region: lowest RTO, but excluded by NG9 (huge complexity, conflict resolution across regions). Rejected for now.
- Cold DR (backups only, no standby): cheapest, RTO in hours, risk of stale/region-bound backups. Insufficient for prod metadata; fine for nonprod.
- Pilot light (minimal always-on DR core, scale up on failover): a reasonable middle — effectively our “warm state + cold compute” stance.
Scaling concerns.
- Cross-region replication lag sets achievable RPO → monitor replica lag.
- PITR restore time grows with WAL volume → periodic base backups bound replay.
- Failover automation vs human-in-the-loop → we keep region failover runbook-driven (with automation assists) to avoid flapping on transient region blips.
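The point that periodic base backups bound WAL replay can be made concrete with a little arithmetic. This is a rough model that assumes steady WAL generation and a constant replay rate; the function names and all numbers below are illustrative assumptions, not measured BitVault figures.

```python
MIB = 1024 * 1024


def replay_seconds(wal_bytes_since_base: int, replay_bytes_per_sec: float) -> float:
    """Time to replay the WAL accumulated since the last base backup."""
    if replay_bytes_per_sec <= 0:
        raise ValueError("replay rate must be positive")
    return wal_bytes_since_base / replay_bytes_per_sec


def max_base_backup_interval(
    wal_bytes_per_sec: float,
    replay_bytes_per_sec: float,
    replay_budget_sec: float,
) -> float:
    """Longest gap between base backups that keeps replay within budget.

    Worst case, a restore replays a full interval of WAL:
        wal_rate * interval / replay_rate <= budget
    so  interval <= budget * replay_rate / wal_rate.
    """
    if wal_bytes_per_sec <= 0:
        raise ValueError("WAL rate must be positive")
    return replay_budget_sec * replay_bytes_per_sec / wal_bytes_per_sec


# Assumed numbers: 2 MiB/s of WAL, 50 MiB/s replay, 30 min replay budget.
interval = max_base_backup_interval(2 * MIB, 50 * MIB, 30 * 60)
worst_case = replay_seconds(int(2 * MIB * interval), 50 * MIB)
```

With these assumed rates the interval comes out at 45,000 seconds (12.5 hours), and replaying a full interval of WAL takes exactly the 30-minute budget. Real replay rates vary with workload, so the restore time measured in a game day should take precedence over this estimate.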
References
- CloudNativePG replicas & PITR: https://cloudnative-pg.io/documentation/current/recovery/
- Velero DR (cross-cluster restore): https://velero.io/docs/main/disaster-case/
- AWS/GCP/Azure cross-region replication for object storage (per provider docs)