ADR-0033 — Backup & disaster recovery (Velero + PITR; RTO/RPO targets)
- Status: Deferred
- Date: 2026-06-11
- Related: platform/11 DR, platform/12 backup, ADR-0004, ADR-0014, ADR-0020
V1 Freeze (2026-06-12): Deferred. V1 = documented pg_dump + object-store versioning. Formal backup/DR (Velero/PITR, RTO/RPO) re-opens at P4.
Context
BitVault is a data-custody product; loss of the metadata DB or KMS keys is unrecoverable, while clusters and derived stores are rebuildable. We need explicit, tested recovery objectives and a backup posture that survives ransomware and region loss.
Decision
Back up the irreplaceable, rebuild the rest, and test restores:
- Postgres metadata (critical): CloudNativePG continuous WAL archiving + base backups → PITR, to a separate account/region with object-lock. RPO seconds to minutes, RTO < 1 h.
- Object bytes: provider durability + versioning + cross-region/provider replication (ADR-0020) — RPO ~0.
- KMS keys: multi-region + deletion protection (lose keys ⇒ lose data, ADR-0014).
- Derived stores (OpenSearch/Redis/NATS): rebuildable from source/journal — minimal backup (ADR-0009).
- Cluster: state in Git + IaC → rebuild via tofu apply + ArgoCD re-sync (ADR-0028, ADR-0031); Velero for PVs/objects not in Git + cross-cluster restore.
- 3-2-1 + immutable (object-lock) backups in a separate trust domain, encrypted.
- DR = warm standby for state, rebuild-on-demand for compute; region failover is runbook-driven (consistent with NG9 — not active-active).
- Restore drills + game days are mandatory; backup/PITR retention is aligned with GC grace so a restore can never reference a GC’d blob (storage/11).
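The retention/GC alignment in the last bullet can be read as a simple invariant: the GC grace period must be at least as long as the PITR retention window. The sketch below checks that invariant. It is a minimal illustration, not the project's tooling: the function name, the safety_margin parameter, and the day counts are hypothetical, and it assumes a blob is physically deleted no earlier than its unreference time plus the grace period.

```python
from datetime import timedelta


def restore_is_gc_safe(
    pitr_retention: timedelta,
    gc_grace: timedelta,
    safety_margin: timedelta = timedelta(0),
) -> bool:
    """Return True if any point in the PITR window can be restored without
    referencing a blob that GC has already removed.

    Assumption: a blob unreferenced at time t is deleted no earlier than
    t + gc_grace, and the oldest restorable point is now - pitr_retention.
    A restore to that oldest point may reference blobs unreferenced just
    after it, so they must survive at least pitr_retention (plus margin).
    """
    return gc_grace >= pitr_retention + safety_margin


# Hypothetical values, for illustration only.
print(restore_is_gc_safe(timedelta(days=14), timedelta(days=30)))  # True
print(restore_is_gc_safe(timedelta(days=30), timedelta(days=7)))   # False
```

A check like this could run in CI against the configured retention and grace values, so that lengthening backup retention without lengthening GC grace fails early.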
Consequences
Positive
- Clear, tested RTO/RPO; ransomware-resistant (immutable, separate-domain backups); cluster rebuild is declarative and fast.
- Cost-balanced: warm only for low-RPO state, cold/rebuild for compute.
Negative / costs
- Cross-region replication + immutable retention cost storage/egress (bounded by tiering to cold, storage/10).
- Region failover is a runbook (not instant) — deliberate, to avoid flapping.
Alternatives considered
- Snapshots only (no WAL): RPO = snapshot interval; rejected for metadata.
- Active-active multi-region: lowest RTO, but the added complexity conflicts with NG9 (not active-active). Rejected for now.
- Cold DR only: cheapest, RTO in hours; fine for nonprod, insufficient for prod metadata.
- Mutable backups: ransomware deletes them; object-lock is non-negotiable.
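To make the snapshots-only rejection concrete, the sketch below compares worst-case RPO for the two approaches. It assumes WAL shipping is bounded by PostgreSQL's archive_timeout setting (which forces a segment switch) plus an upload lag; the function names and all numbers are hypothetical and not measured values from this system.

```python
def snapshot_only_rpo_s(snapshot_interval_s: float) -> float:
    """Worst case: a failure just before the next snapshot loses a full interval."""
    return snapshot_interval_s


def wal_archiving_rpo_s(archive_timeout_s: float, upload_lag_s: float) -> float:
    """Worst case: the current WAL segment has not been switched and shipped yet."""
    return archive_timeout_s + upload_lag_s


# Hypothetical values: snapshots every 6 hours vs. a 60 s archive timeout
# with 30 s of upload lag to the backup bucket.
print(snapshot_only_rpo_s(6 * 3600.0))   # 21600.0 seconds
print(wal_archiving_rpo_s(60.0, 30.0))   # 90.0 seconds
```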
Scaling
Object-store “backup” is replication, not a copy (copying PBs is infeasible); frequent base backups bound PITR replay time; restore drills run on samples regularly and at full scale periodically.
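The claim that frequent base backups bound PITR replay can be sketched as arithmetic: worst-case restore time is roughly the time to fetch the base backup plus the time to replay all WAL written since it was taken. The model below is a rough estimate under stated assumptions (constant WAL rate, constant throughput, no parallelism); the function names and every number are hypothetical, and only the 1 h RTO target comes from this ADR.

```python
RTO_TARGET_S = 3600.0  # RTO < 1 h for Postgres metadata, per the Decision section


def estimated_restore_s(
    base_backup_mib: float,
    fetch_mib_per_s: float,
    wal_mib_per_s: float,
    base_backup_interval_s: float,
    replay_mib_per_s: float,
) -> float:
    """Worst-case restore time: fetch the base backup, then replay a full
    interval's worth of WAL (failure just before the next base backup)."""
    fetch_s = base_backup_mib / fetch_mib_per_s
    worst_case_wal_mib = wal_mib_per_s * base_backup_interval_s
    replay_s = worst_case_wal_mib / replay_mib_per_s
    return fetch_s + replay_s


# Hypothetical workload: 200 GiB base backup fetched at 200 MiB/s,
# 2 MiB/s of WAL, replayed at 100 MiB/s.
daily = estimated_restore_s(204800, 200, 2, 24 * 3600, 100)
weekly = estimated_restore_s(204800, 200, 2, 7 * 24 * 3600, 100)

print(daily, daily < RTO_TARGET_S)    # 2752.0 True
print(weekly, weekly < RTO_TARGET_S)  # 13120.0 False
```

Under these assumed rates, daily base backups keep the estimate inside the RTO target while weekly ones do not; restore drills are what would confirm or correct numbers like these.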