12 — Backup Strategies
Task 10: design backup strategies. What we back up, how, where, for how long, and — the part most teams skip — how we prove the backups restore. Decision in ADR-0033.
1. What to back up (and how)
| Asset | Criticality | Method | Notes |
|---|---|---|---|
| Postgres (metadata) | 🔴 critical | CloudNativePG: continuous WAL archiving + periodic base backups → object storage → PITR | the source of truth; metadata durability must be ≥ that of the data (storage/08) |
| Object bytes (files) | 🔴 critical | provider durability + versioning + cross-region/provider replication (ADR-0020) | content-addressed/immutable → “backup” = replication + versions |
| KMS keys | 🔴 critical | provider-managed durability + multi-region keys + deletion protection | lose keys ⇒ lose encrypted data (ADR-0014); material never exported |
| OpenSearch | 🟡 rebuildable | snapshots → object storage (low priority) | re-indexable from source (ADR-0009) |
| Redis | 🟢 mostly cache | RDB/AOF snapshot of persistent bits | cache cold-starts; not authoritative |
| NATS JetStream | 🟡 | stream replicas + snapshots | journal is replayable (sync/07) |
| K8s PVs / stateful objects | 🟡 | Velero (resources + CSI volume snapshots) | cross-cluster restore for DR |
| Cluster config | 🟢 in Git | GitOps repo + IaC (re-appliable) | no separate backup needed |
The backup priority mirrors the data model: metadata + keys are sacred (lose them and everything is lost); bytes are durable + replicated; derived stores are rebuildable and barely need backup.
2. Backup topology
flowchart TB
classDef src fill:#bbf7d0,stroke:#15803d,color:#111827;
classDef b fill:#fde68a,stroke:#b45309,color:#111827;
classDef v fill:#c7d2fe,stroke:#3730a3,color:#111827;
pg[("Postgres (metadata)")]:::src -->|WAL + base| pgb[("backup bucket<br/>separate account + region + OBJECT-LOCK")]:::b
obj[("Object storage (bytes)")]:::src -->|versioning + replication| objr[("replica region / provider")]:::b
pv[("Stateful PVs")]:::src -->|Velero + CSI snapshot| velb[("Velero backup store (object-lock)")]:::b
os[("OpenSearch")]:::src -->|snapshot| ossnap[("snapshot bucket")]:::b
pgb & velb -. tiered to cold + GFS retention .-> arch[("archive tier (storage/10)")]:::v
pgb --> drill["scheduled restore drill → verify integrity"]:::v
3. Backup principles
- 3-2-1 (+1 immutable): ≥3 copies, ≥2 media/locations, ≥1 offsite, +1 immutable (object-lock / WORM) so ransomware or a rogue admin cannot delete or alter backups (storage/07 WORM, sync/ADR-0027).
- Separate trust domain: backups live in a separate cloud account/project + region with least-privilege, append-only write access — a compromise of prod must not reach the backups.
- Encrypted backups (independent keys from prod data keys).
- Retention (GFS): daily (short), weekly (medium), monthly/yearly (long), tiered to cold storage as they age (storage/10).
- Restore testing is mandatory: automated, scheduled restore drills with integrity verification. A backup you have never restored is Schrödinger’s backup — it is both valid and invalid until you test it. Restore success/time is a tracked metric (11 §6).
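The GFS retention rule in the list above can be sketched as a small pruning function. This is a minimal illustration, not the policy from ADR-0033: the function name `gfs_keep` and the tier sizes (7 daily, 4 weekly, 12 monthly) are assumptions, since the text only says short/medium/long.

```python
from datetime import date, timedelta


def gfs_keep(backup_dates, today, daily=7, weekly=4, monthly=12):
    """Return the set of backup dates to keep under a simple GFS policy.

    Tier sizes are illustrative assumptions, not values from the design.
    """
    keep = set()
    newest_first = sorted((d for d in backup_dates if d <= today), reverse=True)

    # Sons: the most recent `daily` backups.
    keep.update(newest_first[:daily])

    # Fathers: the newest backup in each of the most recent `weekly` ISO weeks.
    weeks = {}
    for d in newest_first:
        weeks.setdefault(tuple(d.isocalendar())[:2], d)
    keep.update(list(weeks.values())[:weekly])

    # Grandfathers: the newest backup in each of the most recent `monthly` months.
    months = {}
    for d in newest_first:
        months.setdefault((d.year, d.month), d)
    keep.update(list(months.values())[:monthly])

    return keep


if __name__ == "__main__":
    days = [date(2024, 3, 31) - timedelta(days=i) for i in range(60)]
    keep = gfs_keep(days, today=date(2024, 3, 31))
    prune = sorted(set(days) - keep)
    print(f"keep {len(keep)} backups, prune {len(prune)}")
```

Anything outside the returned set is a candidate for expiry. With object-lock in place, the lock period would have to be no longer than the shortest tier, otherwise pruning cannot delete what the policy has released.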
4. The backup ⇄ GC consistency rule (subtle, important)
GC deletes unreferenced blobs after a grace period (storage/11).
A PITR restore of the metadata DB to time T will reference blobs that existed at T.
Therefore the backup/PITR retention window must be no longer than the period for which
GC's grace + version retention guarantees that blobs still exist. Equivalently, GC must
not reclaim a blob still referenceable by any restorable metadata snapshot. We align the
PITR window with GC/version retention so a restore can never dangle (11 §5).
This cross-system invariant is easy to miss and corrupts restores if violated.
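Both forms of the invariant can be written as one-line checks. The sketch below is illustrative only: the function names and the `min_blob_survival` parameter (how long an unreferenced blob is guaranteed to survive, from GC grace plus any version retention) are assumptions, not names from storage/11.

```python
from datetime import datetime, timedelta


def retention_is_safe(pitr_window: timedelta, min_blob_survival: timedelta) -> bool:
    """Config-level form: the PITR window must not outlive guaranteed blob survival."""
    return pitr_window <= min_blob_survival


def gc_may_reclaim(unreferenced_at: datetime, now: datetime, pitr_window: timedelta) -> bool:
    """Per-blob form: reclaim only if no restorable point in time can still reference it.

    A restore can target any time T in [now - pitr_window, now]. A blob that was
    still referenced at or after the earliest such T must be kept.
    """
    earliest_restorable = now - pitr_window
    return unreferenced_at < earliest_restorable


if __name__ == "__main__":
    now = datetime(2024, 6, 1)
    print(retention_is_safe(timedelta(days=14), timedelta(days=30)))
    print(gc_may_reclaim(now - timedelta(days=10), now, timedelta(days=30)))
```

The first check suits a CI or config-validation step; the second is the guard GC itself would apply before deleting a blob.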
5. Self-host backups
Self-host ships the same primitives at a smaller scale: CloudNativePG PITR (or pg_dump
for lite), object-storage versioning, and a documented restore runbook — so a
self-hoster’s data is as recoverable as SaaS, proportional to their tier
(ADR-0012).
6. Tradeoffs / Alternatives / Scaling
Tradeoffs. Immutable, multi-region, frequently-tested backups cost storage + egress + operational effort. For a data-custody product this is the cost of being trustworthy; retention tiering + cold storage bound the bill.
Alternatives considered.
- Snapshots only (no continuous WAL): simpler, but RPO = snapshot interval (hours of loss). Rejected for metadata; WAL/PITR gives seconds-RPO.
- pg_dump logical backups as primary: fine for small/self-host lite; too slow and coarse for large prod metadata. PITR is primary, logical dumps a secondary safety net.
- Back up the object store by copying: infeasible at PB scale. Replication + versioning is the backup; copying petabytes is not.
- Mutable backups: a ransomware foothold deletes them → object-lock is non-negotiable.
Scaling concerns.
- Object-store “backup” at PB scale = replication, not copy (§1); cost managed by tiering replicas to cold (storage/10).
- WAL volume on a busy metadata DB → frequent base backups bound PITR replay time (11).
- Restore-drill cost → drill on a sampled/representative dataset regularly, full-scale periodically.
- Backup of millions of small PVs → prefer app-level backups (PITR) over volume snapshots where possible; Velero for what isn’t in Git/PITR.
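The sampled restore drill mentioned above could verify integrity cheaply because blobs are content-addressed. The following is a rough sketch under stated assumptions: blob keys are taken to be hex SHA-256 digests of the content (the text says only "content-addressed"), and `pick_sample`, `verify_restored`, and the `read_blob` callback are hypothetical names.

```python
import hashlib
import random


def pick_sample(keys, fraction, seed):
    """Pick a reproducible random sample of blob keys (at least one if any exist)."""
    keys = sorted(keys)
    k = min(len(keys), max(1, round(len(keys) * fraction)))
    return random.Random(seed).sample(keys, k)


def verify_restored(sample, read_blob):
    """Return the keys whose restored bytes are missing or do not hash to the key.

    `read_blob(key)` is a caller-supplied function returning bytes or None.
    Assumes key == sha256(content).hexdigest().
    """
    failures = []
    for key in sample:
        data = read_blob(key)
        if data is None or hashlib.sha256(data).hexdigest() != key:
            failures.append(key)
    return failures


if __name__ == "__main__":
    blobs = [b"alpha", b"beta", b"gamma", b"delta"]
    restored = {hashlib.sha256(b).hexdigest(): b for b in blobs}
    sample = pick_sample(restored, fraction=0.5, seed=1)
    print("failures:", verify_restored(sample, restored.get))
```

Fixing the seed per drill makes a failed run reproducible; rotating it between drills spreads coverage across the dataset over time.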
References
- CloudNativePG backup/recovery (WAL, PITR): https://cloudnative-pg.io/documentation/current/backup/
- Velero: https://velero.io/docs/main/
- S3 Object Lock (immutable backups): https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html
- 3-2-1 backup rule: https://www.cisa.gov/news-events/news/world-backup-day-3-2-1-rule