ADR-0033 — Backup & disaster recovery (Velero + PITR; RTO/RPO targets)

Status: Deferred
Date: 2026-06-11
Related: platform/11 DR, platform/12 backup, ADR-0004, ADR-0014, ADR-0020

V1 Freeze (2026-06-12): Deferred. V1 = documented pg_dump + object-store versioning. Formal backup/DR (Velero/PITR, RTO/RPO) re-opens at P4.

Context

BitVault is a data-custody product; loss of the metadata DB or KMS keys is unrecoverable, while clusters and derived stores are rebuildable. We need explicit, tested recovery objectives and a backup posture that survives ransomware and region loss.

Decision

Back up the irreplaceable, rebuild the rest, and test restores:

Postgres metadata (critical): CloudNativePG continuous WAL archiving + base backups → PITR, to a separate account/region with object-lock. RPO seconds to minutes, RTO < 1 h.
Object bytes: provider durability + versioning + cross-region/provider replication (ADR-0020) — RPO ~0.
KMS keys: multi-region + deletion protection (lose keys ⇒ lose data, ADR-0014).
Derived stores (OpenSearch/Redis/NATS): rebuildable from source/journal — minimal backup (ADR-0009).
Cluster: state in Git + IaC → rebuild via tofu apply + ArgoCD re-sync (ADR-0028, ADR-0031); Velero for PVs/objects not in Git + cross-cluster restore.
3-2-1 + immutable (object-lock) backups in a separate trust domain, encrypted.
DR = warm standby for state, rebuild-on-demand for compute; region failover is runbook-driven (consistent with NG9 — not active-active).
Restore drills + game days are mandatory; backup/PITR retention is aligned with GC grace so a restore can never reference a GC’d blob (storage/11).
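
The retention/GC alignment above can be stated as a simple invariant. The sketch below is illustrative only: the function name, the margin parameter, and all durations are assumptions, not values from this ADR or from storage/11. It assumes a blob dereferenced at time t is deleted at t + GC grace, so the oldest restorable point at that moment must not be earlier than t.

```python
from datetime import timedelta


def restore_is_gc_safe(pitr_retention: timedelta,
                       gc_grace: timedelta,
                       margin: timedelta = timedelta(0)) -> bool:
    """Return True if no restorable point can reference a collected blob.

    A blob dereferenced at time t is deleted at t + gc_grace. At that moment
    the oldest restorable point is (t + gc_grace) - pitr_retention. Requiring
    that point to be no earlier than t gives gc_grace >= pitr_retention.
    The optional margin is a hypothetical safety buffer on top of that.
    """
    return gc_grace >= pitr_retention + margin


# Hypothetical durations, for illustration only.
print(restore_is_gc_safe(timedelta(days=14), timedelta(days=21), timedelta(days=1)))  # True
print(restore_is_gc_safe(timedelta(days=30), timedelta(days=21)))                     # False
```

A check of this shape could run in CI against the configured retention and grace values, so that a change to either one fails fast instead of surfacing during a restore.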

Consequences

Positive

Clear, tested RTO/RPO; ransomware-resistant (immutable, separate-domain backups); cluster rebuild is declarative and fast.
Cost-balanced: warm only for low-RPO state, cold/rebuild for compute.

Negative / costs

Cross-region replication + immutable retention cost storage/egress (bounded by tiering to cold, storage/10).
Region failover is a runbook (not instant) — deliberate, to avoid flapping.

Alternatives considered

Snapshots only (no WAL): RPO = snapshot interval; rejected for metadata.
Active-active multi-region: lowest RTO, but its complexity conflicts with NG9. Rejected for now.
Cold DR only: cheapest, RTO in hours; fine for nonprod, insufficient for prod metadata.
Mutable backups: ransomware deletes them; object-lock is non-negotiable.
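
As a rough illustration of how the alternatives above compare against the stated targets, the sketch below scores each option on worst-case RPO and RTO. Every number is a hypothetical placeholder (for example a 6 h snapshot interval and a 5 minute reading of "seconds to minutes"); only the RTO < 1 h target comes from this ADR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    worst_rpo_s: float  # worst-case data loss window, in seconds
    worst_rto_s: float  # worst-case time to restore service, in seconds


def meets_targets(option: Option, rpo_target_s: float, rto_target_s: float) -> bool:
    """True if the option satisfies both recovery objectives."""
    return option.worst_rpo_s <= rpo_target_s and option.worst_rto_s < rto_target_s


# Assumed targets: RPO within 5 minutes (hypothetical), RTO under 1 hour (from the ADR).
RPO_TARGET_S = 5 * 60
RTO_TARGET_S = 60 * 60

# Hypothetical figures for each alternative.
options = [
    Option("snapshots only (6 h interval)", worst_rpo_s=6 * 3600, worst_rto_s=45 * 60),
    Option("cold DR only", worst_rpo_s=60, worst_rto_s=4 * 3600),
    Option("WAL archiving + warm standby", worst_rpo_s=60, worst_rto_s=30 * 60),
]

for opt in options:
    verdict = "ok" if meets_targets(opt, RPO_TARGET_S, RTO_TARGET_S) else "rejected"
    print(f"{opt.name}: {verdict}")
```

With snapshots only, the worst-case RPO equals the snapshot interval, which is why that row fails regardless of how fast the restore is.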

Scaling

Object-store “backup” is replication, not copy (copying PB-scale data is infeasible); frequent base backups bound PITR replay time; restore drills run regularly on a sample and periodically at full scale.
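
To make the "base backups bound PITR replay" point concrete, here is a back-of-the-envelope estimate. It assumes a simple model in which the worst-case restore replays all WAL generated since the last base backup at a constant rate; the function name and every figure (WAL rate, replay rate, base restore time) are hypothetical and would need to be measured in a restore drill.

```python
def worst_case_restore_minutes(base_backup_interval_h: float,
                               wal_gb_per_h: float,
                               replay_gb_per_h: float,
                               base_restore_min: float) -> float:
    """Estimate worst-case PITR restore time in minutes.

    Worst case: the target point is just before the next base backup, so a
    full interval's worth of WAL must be replayed on top of the base restore.
    """
    wal_to_replay_gb = base_backup_interval_h * wal_gb_per_h
    replay_min = wal_to_replay_gb * 60 / replay_gb_per_h
    return base_restore_min + replay_min


RTO_TARGET_MIN = 60  # RTO < 1 h, from the Decision section

# Hypothetical workload: 2 GB/h of WAL, 120 GB/h replay, 20 min base restore.
daily = worst_case_restore_minutes(24, 2.0, 120.0, 20.0)    # 20 + 24 = 44 minutes
weekly = worst_case_restore_minutes(168, 2.0, 120.0, 20.0)  # 20 + 168 = 188 minutes

print(f"daily base backups:  {daily:.0f} min, within RTO: {daily < RTO_TARGET_MIN}")
print(f"weekly base backups: {weekly:.0f} min, within RTO: {weekly < RTO_TARGET_MIN}")
```

Under these assumed numbers, daily base backups stay inside the 1 h RTO while weekly ones do not, which is the sense in which base backup frequency bounds replay.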