11 — Disaster Recovery

Task 9: design disaster recovery. DR for BitVault is unusually tractable because of the platform design: clusters are cattle (rebuild from IaC + Git), so the only irreplaceable things are data (metadata DB + object bytes) and keys (KMS). Decision in ADR-0033.

1. The DR thesis

Re-creating a cluster is declarative: tofu apply rebuilds the substrate, ArgoCD re-syncs everything in-cluster from Git, and data is restored from backups. DR is a procedure, not heroics — and it is testable on demand. The whole GitOps + IaC investment pays off here.

What is irreplaceable (and thus the focus of DR):

Metadata DB (Postgres) — the source of truth; without it the bytes are unreadable noise (storage/08).
Object-storage bytes — the files; durable + replicated (ADR-0020).
KMS keys — lose them and encrypted data is gone (ADR-0014).

Everything else (clusters, app config, derived indexes) is rebuildable.
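As a rough illustration of why the rebuild is a procedure rather than heroics, the compute side can be reduced to an ordered command list. This is a minimal sketch, not the project's actual tooling: the root application name `platform-root`, the timeout, and the helper names `rebuild_plan` and `run_plan` are assumptions made for the example, and data restore (PITR, Velero) is deliberately left out.

```python
import subprocess
from typing import List


def rebuild_plan(root_app: str = "platform-root", timeout_s: int = 1800) -> List[List[str]]:
    """Ordered commands for a declarative cluster rebuild.

    `root_app` is a hypothetical ArgoCD application name; substitute your own.
    """
    return [
        ["tofu", "init", "-input=false"],
        ["tofu", "apply", "-input=false", "-auto-approve"],  # rebuild the substrate
        ["argocd", "app", "sync", root_app],                  # re-sync in-cluster state from Git
        ["argocd", "app", "wait", root_app, "--health", "--timeout", str(timeout_s)],
    ]


def run_plan(plan: List[List[str]], dry_run: bool = True) -> List[str]:
    """Run each command in order; stop on the first failure.

    With dry_run=True (the default) commands are only printed.
    """
    executed = []
    for argv in plan:
        line = " ".join(argv)
        if dry_run:
            print(f"[dry-run] {line}")
        else:
            subprocess.run(argv, check=True)  # raises CalledProcessError on non-zero exit
        executed.append(line)
    return executed


if __name__ == "__main__":
    run_plan(rebuild_plan())
```

Keeping the plan as data makes it easy to print in a runbook and to time each step during a game day.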

2. RTO / RPO targets

Component	RPO (data loss)	RTO (downtime)	Mechanism
Metadata DB (Postgres)	seconds–minutes	< 1 h	continuous WAL archiving → PITR; cross-region replica (12)
Object bytes	~0	minutes (failover)	provider durability + cross-region/provider replication (ADR-0020)
KMS keys	0	minutes	multi-region keys + deletion protection (12)
Cluster / control plane	0 (state in Git/IaC)	30–60 min	`tofu apply` + ArgoCD re-sync
Derived: OpenSearch	loose (rebuildable)	rebuild time	re-index from source (ADR-0009)
Derived: Redis/NATS	loose	minutes	cache cold-start; journal replay (sync/07)

Targets are validated by game days (§6), not assumed.

3. DR scenarios (escalating)

Scenario	Response	Automatic?
Pod/node failure	K8s reschedules; PDBs hold quorum	yes (resilience, not DR)
AZ failure	multi-AZ nodes + DB replicas + storage → ride through	yes
Cluster loss/corruption	`tofu apply` rebuild → ArgoCD re-sync → restore PVs/data (Velero + PITR, 12)	semi (runbook)
Region loss	fail over to DR region (§4)	runbook + DNS
Logical disaster (ransomware, bad migration, mass delete)	PITR to before the event; object-lock immutable backups; sync bulk-delete brake (sync/ADR-0027)	runbook
Accidental deletion	trash + version history + PITR (storage/07)	self-serve

4. Region failover (warm standby)

flowchart TB
    classDef p fill:#bbf7d0,stroke:#15803d,color:#111827;
    classDef d fill:#fde68a,stroke:#b45309,color:#111827;
    classDef x fill:#fecaca,stroke:#b91c1c,color:#111827;
    subgraph PRIMARY["Primary region"]
      pdb[("Postgres primary")]:::p
      pobj[("Object storage")]:::p
      pcl["cluster (IaC + GitOps)"]:::p
    end
    subgraph DR["DR region (warm standby)"]
      ddb[("Postgres cross-region replica")]:::d
      dobj[("Replicated object storage")]:::d
      dcl["cluster: tofu apply + ArgoCD sync"]:::d
    end
    pdb -->|stream WAL| ddb
    pobj -->|cross-region replication| dobj
    fail["Primary region lost"]:::x --> promote["promote replica → primary"]:::d
    promote --> dcl
    dcl --> dns["DNS / global LB cutover → DR region"]:::d

Warm standby for state (cross-region Postgres replica + replicated object storage) keeps RPO low; cold/fast-rebuild for compute (the DR cluster is tofu apply + ArgoCD sync, since it’s all in Git) keeps cost down.
Promote the DB replica, sync the DR cluster, cut over DNS.
Consistent with NG9 (active-active is a non-goal): this is standby failover, and the architecture does not preclude going multi-region active later.
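The failover order above (promote replica, sync DR cluster, cut over DNS) can be sketched as a small step runner that asks a human to confirm each step. This is only an illustration under stated assumptions: `Step`, `run_failover`, and the stubbed actions are hypothetical names, and the real actions would call the database operator, `tofu`/ArgoCD, and the DNS provider.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], bool]  # returns True on success


def run_failover(steps: List[Step], confirm: Callable[[str], bool]) -> List[str]:
    """Run steps strictly in order.

    Stops quietly if the operator declines a step, and raises if a step fails,
    so a later step (e.g. DNS cutover) never runs on top of a failed earlier one.
    """
    done: List[str] = []
    for step in steps:
        if not confirm(step.name):
            break
        if not step.action():
            raise RuntimeError(f"failover step failed: {step.name}")
        done.append(step.name)
    return done


# Stubbed actions; placeholders for the real integrations.
STEPS = [
    Step("promote-db-replica", lambda: True),
    Step("sync-dr-cluster", lambda: True),
    Step("cutover-dns", lambda: True),
]

if __name__ == "__main__":
    completed = run_failover(STEPS, confirm=lambda name: input(f"run {name}? [y/N] ") == "y")
    print("completed:", completed)
```

The per-step confirmation keeps the procedure runbook-driven while still letting automation do the repetitive work.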

5. The cross-system consistency nuance (don’t miss this)

A consistent restore needs the metadata DB and the object bytes it references to agree. Two safeguards make this hold:

Content-addressed, immutable blobs (storage/02): a blob referenced by a restored metadata version still has the same bytes — restore metadata to time T and the referenced chunks are valid.
GC grace ≥ restore window (storage/11 GC): GC must not delete a blob that any restorable metadata snapshot references, so the GC grace window must be at least as long as the PITR window. The two are kept aligned so that backups capture a metadata+blob set that is mutually consistent. Miss this and a PITR restore can yield metadata rows that point at blobs GC has already deleted.
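The invariant that GC must not delete a blob that any restorable metadata snapshot references can be checked mechanically. The sketch below reads that invariant as "GC grace is at least the PITR window", which is an interpretation rather than a quote from storage/11. All function names are hypothetical, and blob identifiers are modelled as plain strings.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Set


def gc_grace_is_safe(gc_grace: timedelta, pitr_window: timedelta) -> bool:
    """A blob unreferenced at time T may still be referenced by a restore to
    any point in the PITR window, so grace must cover that whole window."""
    return gc_grace >= pitr_window


def gc_eligible(unreferenced_since: Optional[datetime], now: datetime, gc_grace: timedelta) -> bool:
    """A blob may be collected only after it has been unreferenced for the full grace period."""
    if unreferenced_since is None:  # still referenced by live metadata
        return False
    return now - unreferenced_since >= gc_grace


def dangling_refs(referenced: Iterable[str], present: Set[str]) -> Set[str]:
    """Post-restore verification: blob IDs the restored metadata references
    but the object store no longer holds."""
    return set(referenced) - present


if __name__ == "__main__":
    print(gc_grace_is_safe(timedelta(days=30), timedelta(days=14)))
    print(dangling_refs(["sha256:aa", "sha256:bb"], {"sha256:aa"}))
```

Running `dangling_refs` as part of a PITR drill turns the consistency requirement into a pass/fail check: an empty result means the restored metadata and the blob store agree.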

6. Game days & runbooks

Runbooks for each scenario (cluster rebuild, region failover, PITR restore), stored in docs/runbooks/ (10 docs structure).
Scheduled game days: actually destroy a nonprod cluster and rebuild from IaC+Git; actually restore a Postgres PITR and verify. An untested DR plan is fiction.
DR metrics: measured RTO/RPO from drills vs targets (§2), tracked over time.
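Comparing drill measurements against the §2 targets can be as simple as the sketch below. The numeric targets are assumptions chosen for the example: the table gives "seconds–minutes" for metadata RPO, read here as 5 minutes, and the cluster RTO is taken as the upper bound of 60 minutes. `Target`, `DrillResult`, and `breaches` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Dict, List


@dataclass(frozen=True)
class Target:
    rpo: timedelta
    rto: timedelta


@dataclass(frozen=True)
class DrillResult:
    component: str
    measured_rpo: timedelta
    measured_rto: timedelta


# Assumed numeric readings of the targets table, for illustration only.
TARGETS: Dict[str, Target] = {
    "metadata-db": Target(rpo=timedelta(minutes=5), rto=timedelta(hours=1)),
    "cluster": Target(rpo=timedelta(0), rto=timedelta(minutes=60)),
}


def breaches(targets: Dict[str, Target], results: List[DrillResult]) -> List[str]:
    """Return one message per target exceeded in a drill."""
    out: List[str] = []
    for r in results:
        t = targets[r.component]
        if r.measured_rpo > t.rpo:
            out.append(f"{r.component}: RPO {r.measured_rpo} exceeds target {t.rpo}")
        if r.measured_rto > t.rto:
            out.append(f"{r.component}: RTO {r.measured_rto} exceeds target {t.rto}")
    return out


if __name__ == "__main__":
    drill = [
        DrillResult("metadata-db", timedelta(seconds=40), timedelta(minutes=75)),
        DrillResult("cluster", timedelta(0), timedelta(minutes=42)),
    ]
    for line in breaches(TARGETS, drill):
        print(line)
```

Storing each drill's results alongside the date gives the "tracked over time" trend without extra tooling.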

7. Tradeoffs / Alternatives / Scaling

Tradeoffs. Warm standby (cross-region replica + replication) roughly doubles storage cost and adds continuous replication egress; the alternative (cold/backup-restore only) is cheaper but RTO balloons to hours. We split: warm for state (low RPO), rebuild-on-demand for compute (low cost), which is the cost-effective middle.

Alternatives considered.

Active-active multi-region: lowest RTO, but it conflicts with NG9 (huge complexity, conflict resolution across regions). Rejected for now.
Cold DR (backups only, no standby): cheapest, but RTO is in hours and backups risk being stale or region-bound. Insufficient for prod metadata; fine for nonprod.
Pilot light (minimal always-on DR core, scale up on failover): a reasonable middle — effectively our “warm state + cold compute” stance.

Scaling concerns.

Cross-region replication lag sets achievable RPO → monitor replica lag.
PITR restore time grows with WAL volume → periodic base backups bound replay.
Failover automation vs human-in-the-loop → we keep region failover runbook-driven (with automation assists) to avoid flapping on transient region blips.
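One way an "automation assist" can avoid flapping is to recommend failover only after a run of consecutive failed health probes, leaving the decision itself to the on-call. This is a minimal sketch under that assumption; `recommend_failover` and the threshold of 5 probes are hypothetical, not values from the design.

```python
from typing import Sequence


def recommend_failover(health_checks: Sequence[bool], required_failures: int = 5) -> bool:
    """Return True only if the most recent `required_failures` probes all failed.

    `health_checks` is ordered oldest to newest; True means the primary
    region answered the probe. A single success inside the window resets
    the recommendation, so a transient blip does not trigger a page.
    """
    if required_failures <= 0:
        raise ValueError("required_failures must be positive")
    if len(health_checks) < required_failures:
        return False  # not enough evidence yet
    return not any(health_checks[-required_failures:])


if __name__ == "__main__":
    print(recommend_failover([True, False, False, True, False]))
    print(recommend_failover([True, False, False, False, False, False]))
```

The window length trades detection delay against false alarms: with probes every 30 seconds, a threshold of 5 adds about 2.5 minutes before the runbook is suggested.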

References

CloudNativePG replicas & PITR: https://cloudnative-pg.io/documentation/current/recovery/
Velero DR (cross-cluster restore): https://velero.io/docs/main/disaster-case/
AWS/GCP/Azure cross-region replication for object storage (per provider docs)