12 — Backup Strategies

Task 10: design backup strategies. What we back up, how, where, for how long, and — the part most teams skip — how we prove the backups restore. Decision in ADR-0033.

1. What to back up (and how)

Asset	Criticality	Method	Notes
Postgres (metadata)	🔴 critical	CloudNativePG: continuous WAL archiving + periodic base backups → object storage → PITR	the source of truth; durability ≥ data (storage/08)
Object bytes (files)	🔴 critical	provider durability + versioning + cross-region/provider replication (ADR-0020)	content-addressed/immutable → “backup” = replication + versions
KMS keys	🔴 critical	provider-managed durability + multi-region keys + deletion protection	lose keys ⇒ lose encrypted data (ADR-0014); material never exported
OpenSearch	🟡 rebuildable	snapshots → object storage (low priority)	re-indexable from source (ADR-0009)
Redis	🟢 mostly cache	RDB/AOF snapshot of persistent bits	cache cold-starts; not authoritative
NATS JetStream	🟡	stream replicas + snapshots	journal is replayable (sync/07)
K8s PVs / stateful objects	🟡	Velero (resources + CSI volume snapshots)	cross-cluster restore for DR
Cluster config	🟢 in Git	GitOps repo + IaC (re-appliable)	no separate backup needed

The backup priority mirrors the data model: metadata + keys are sacred (lose them and everything is lost); bytes are durable + replicated; derived stores are rebuildable and barely need backup.

2. Backup topology

flowchart TB
    classDef src fill:#bbf7d0,stroke:#15803d,color:#111827;
    classDef b fill:#fde68a,stroke:#b45309,color:#111827;
    classDef v fill:#c7d2fe,stroke:#3730a3,color:#111827;
    pg[("Postgres (metadata)")]:::src -->|WAL + base| pgb[("backup bucket<br/>separate account + region + OBJECT-LOCK")]:::b
    obj[("Object storage (bytes)")]:::src -->|versioning + replication| objr[("replica region / provider")]:::b
    pv[("Stateful PVs")]:::src -->|Velero + CSI snapshot| velb[("Velero backup store (object-lock)")]:::b
    os[("OpenSearch")]:::src -->|snapshot| ossnap[("snapshot bucket")]:::b
    pgb & velb -. tiered to cold + GFS retention .-> arch[("archive tier ([storage/10])")]:::v
    pgb --> drill["scheduled restore drill → verify integrity"]:::v

3. Backup principles

3-2-1 (+1 immutable): ≥3 copies, ≥2 media/locations, ≥1 offsite, +1 immutable (object-lock / WORM) so ransomware or a rogue admin cannot delete or alter backups (storage/07 WORM, sync/ADR-0027).
Separate trust domain: backups live in a separate cloud account/project + region with least-privilege, append-only write access — a compromise of prod must not reach the backups.
Encrypted backups (independent keys from prod data keys).
Retention (GFS): daily (short), weekly (medium), monthly/yearly (long), tiered to cold storage as they age (storage/10).
Restore testing is mandatory: automated, scheduled restore drills with integrity verification. A backup you have never restored is Schrödinger’s backup — it is both valid and invalid until you test it. Restore success/time is a tracked metric (11 §6).
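The GFS retention rule above can be sketched as a pruning function. This is a minimal illustration, not the policy from ADR-0033: the function name `gfs_keep` and the counts (7 dailies, 4 weeklies, 12 monthlies) are assumptions chosen for the example.

```python
from datetime import date, timedelta


def gfs_keep(backups, daily=7, weekly=4, monthly=12):
    """Return the backup dates retained under a simple GFS policy.

    The default counts are illustrative assumptions, not documented values.
    """
    newest_first = sorted(set(backups), reverse=True)
    keep = set(newest_first[:daily])  # "son": the most recent dailies

    def keep_newest_per_bucket(bucket_of, limit):
        # Walk from newest to oldest and keep the newest backup in each
        # of the first `limit` distinct buckets (weeks or months).
        seen = set()
        for d in newest_first:
            bucket = bucket_of(d)
            if bucket in seen:
                continue
            if len(seen) == limit:
                break
            seen.add(bucket)
            keep.add(d)

    # "father": newest backup per ISO (year, week)
    keep_newest_per_bucket(lambda d: tuple(d.isocalendar()[:2]), weekly)
    # "grandfather": newest backup per (year, month)
    keep_newest_per_bucket(lambda d: (d.year, d.month), monthly)
    return keep


# Example input: one backup per day from 2024-02-01 to 2024-03-31.
all_backups = [date(2024, 3, 31) - timedelta(days=i) for i in range(60)]
retained = gfs_keep(all_backups)
to_delete = sorted(set(all_backups) - retained)
```

Anything in `to_delete` would be a candidate for expiry or for tiering to cold storage. With object-lock in place, the lock duration would also have to permit the deletion.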

4. The backup ⇄ GC consistency rule (subtle, important)

GC deletes unreferenced blobs after a grace period (storage/11). A PITR restore of the metadata DB to time T references every blob that was live at T, including blobs that were dereferenced (and possibly reclaimed) after T. The backup/PITR retention window must therefore be no longer than the period for which GC's grace plus version retention guarantees those blobs still exist; equivalently, GC must not reclaim a blob still referenceable by any restorable metadata snapshot. We align the PITR window with GC/version retention so a restore can never dangle (11 §5). This cross-system invariant is easy to miss, and violating it corrupts restores.
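The invariant can be stated as two small checks: one at configuration time and one per blob at GC time. This is a hedged sketch; the names `pitr_window_is_safe` and `may_reclaim` and all parameters are hypothetical, and `gc_hold` stands for whatever total period (grace plus version retention) the GC actually guarantees.

```python
from datetime import datetime, timedelta, timezone


def pitr_window_is_safe(pitr_window: timedelta, gc_hold: timedelta) -> bool:
    """Config-time check: the restorable window must not outlive the blobs.

    gc_hold is the assumed minimum time a blob survives after it becomes
    unreferenced (grace period plus version retention).
    """
    return pitr_window <= gc_hold


def may_reclaim(unreferenced_since: datetime,
                oldest_restorable: datetime,
                now: datetime,
                gc_grace: timedelta) -> bool:
    """GC-time check for a single blob.

    A blob is reclaimable only if its grace period has elapsed AND it was
    already unreferenced before the oldest point a PITR restore can reach,
    so no restorable metadata snapshot can still point at it.
    """
    past_grace = now - unreferenced_since >= gc_grace
    invisible_to_restores = unreferenced_since < oldest_restorable
    return past_grace and invisible_to_restores


now = datetime(2024, 6, 30, tzinfo=timezone.utc)
oldest_restorable = now - timedelta(days=14)  # assumed 14-day PITR window
grace = timedelta(days=7)                     # assumed GC grace period

# Unreferenced 10 days ago: past grace, but a restore to 12 days ago
# would still reference it, so it must be kept.
recent = may_reclaim(now - timedelta(days=10), oldest_restorable, now, grace)
# Unreferenced 20 days ago: no restorable snapshot can reference it.
old = may_reclaim(now - timedelta(days=20), oldest_restorable, now, grace)
```

In this example `recent` is `False` and `old` is `True`. The first case is the one a grace-period-only GC gets wrong.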

5. Self-host backups

Self-host ships the same primitives at a smaller scale: CloudNativePG PITR (or pg_dump for lite), object-storage versioning, and a documented restore runbook — so a self-hoster’s data is as recoverable as SaaS, proportional to their tier (ADR-0012).
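For the lite tier, the pg_dump path might look like the sketch below. It only builds the command lines, and the helper names, DSN, and paths are placeholders rather than part of the documented runbook. The flags themselves are standard `pg_dump` and `pg_restore` options.

```python
from typing import List


def pg_dump_argv(dsn: str, out_path: str) -> List[str]:
    """Command for a custom-format logical dump (readable by pg_restore)."""
    return [
        "pg_dump",
        "--format=custom",      # compressed archive format for pg_restore
        "--no-owner",           # do not tie the dump to source role names
        f"--file={out_path}",
        f"--dbname={dsn}",
    ]


def pg_restore_argv(dsn: str, dump_path: str) -> List[str]:
    """Command to restore that dump into an existing database."""
    return [
        "pg_restore",
        "--clean",              # drop objects before recreating them
        "--if-exists",          # avoid errors when an object is absent
        "--no-owner",
        "--exit-on-error",      # stop on the first error
        f"--dbname={dsn}",
        dump_path,
    ]


# Placeholder values for illustration only.
dump_cmd = pg_dump_argv("postgresql://localhost/metadata", "/backups/metadata.dump")
restore_cmd = pg_restore_argv("postgresql://localhost/metadata_drill", "/backups/metadata.dump")
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Restoring into a scratch database such as the hypothetical `metadata_drill` doubles as a restore drill for the lite tier.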

6. Tradeoffs / Alternatives / Scaling

Tradeoffs. Immutable, multi-region, frequently-tested backups cost storage + egress + operational effort. For a data-custody product this is the cost of being trustworthy; retention tiering + cold storage bound the bill.

Alternatives considered.

Snapshots only (no continuous WAL): simpler, but RPO = snapshot interval (hours of loss). Rejected for metadata; WAL/PITR gives seconds-RPO.
pg_dump logical backups as primary: fine for small/self-host lite; too slow/coarse for large prod metadata; PITR is primary, logical dumps a secondary safety net.
Back up the object store by copying: infeasible at PB scale — replication + versioning is the backup; copying petabytes is not.
Mutable backups: a ransomware foothold deletes them → object-lock is non-negotiable.
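The RPO gap between snapshots-only and continuous WAL archiving can be made concrete with a small calculation. The 6-hour snapshot interval and 60-second WAL archive cadence below are assumptions for illustration, not values taken from this design.

```python
from datetime import datetime, timedelta


def data_loss(failure_at, recovery_points):
    """Time between the failure and the latest recovery point before it."""
    usable = [p for p in recovery_points if p <= failure_at]
    if not usable:
        raise ValueError("no recovery point precedes the failure")
    return failure_at - max(usable)


midnight = datetime(2024, 1, 1)
failure = midnight + timedelta(hours=11, minutes=59, seconds=30)

# Snapshots only: one recovery point every 6 hours (assumed interval).
snapshots = [midnight + timedelta(hours=6 * i) for i in range(4)]
# WAL archiving: a segment shipped every 60 seconds (assumed cadence).
wal_segments = [midnight + timedelta(seconds=60 * i) for i in range(24 * 60)]

snapshot_loss = data_loss(failure, snapshots)   # loses 5:59:30 of writes
wal_loss = data_loss(failure, wal_segments)     # loses 0:00:30 of writes
```

The same failure costs nearly six hours of metadata under snapshots and half a minute under WAL archiving, which is why snapshots-only was rejected for the source of truth.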

Scaling concerns.

Object-store “backup” at PB scale = replication, not copy (§1); cost managed by tiering replicas to cold (storage/10).
WAL volume on a busy metadata DB → frequent base backups bound PITR replay time (11).
Restore-drill cost → drill on a sampled/representative dataset regularly, full-scale periodically.
Backup of millions of small PVs → prefer app-level backups (PITR) over volume snapshots where possible; Velero for what isn’t in Git/PITR.
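A sampled restore drill for content-addressed bytes can be sketched as follows. It assumes, purely for illustration, that a blob's address is the hex SHA-256 of its content; the actual addressing scheme is defined elsewhere. The dict stands in for a restored object store, and `put` and `verify_sample` are hypothetical helper names.

```python
import hashlib
import random


def put(store, data: bytes) -> str:
    """Store bytes under their content address (assumed: hex SHA-256)."""
    key = hashlib.sha256(data).hexdigest()
    store[key] = data
    return key


def verify_sample(store, sample_size: int, seed: int = 0):
    """Re-hash a random sample of restored blobs.

    Returns the addresses whose bytes no longer match their address.
    A fixed seed keeps a drill reproducible; rotate it between drills
    to cover different blobs over time.
    """
    rng = random.Random(seed)
    keys = sorted(store)
    chosen = rng.sample(keys, min(sample_size, len(keys)))
    return [k for k in chosen if hashlib.sha256(store[k]).hexdigest() != k]


restored = {}
for i in range(5):
    put(restored, f"blob-{i}".encode())

clean_result = verify_sample(restored, sample_size=5)   # no mismatches

victim = sorted(restored)[0]
restored[victim] = b"corrupted"                         # simulate bit rot
corrupt_result = verify_sample(restored, sample_size=5)  # reports victim
```

A regular drill would use a small `sample_size` against the replica; the periodic full-scale drill corresponds to sampling every key.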

References

CloudNativePG backup/recovery (WAL, PITR): https://cloudnative-pg.io/documentation/current/backup/
Velero: https://velero.io/docs/main/
S3 Object Lock (immutable backups): https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html
3-2-1 backup rule: https://www.cisa.gov/news-events/news/world-backup-day-3-2-1-rule