12 — Backup Strategies

Task 10: design backup strategies. What we back up, how, where, for how long, and — the part most teams skip — how we prove the backups restore. Decision in ADR-0033.


1. What to back up (and how)

| Asset | Criticality | Method | Notes |
|---|---|---|---|
| Postgres (metadata) | 🔴 critical | CloudNativePG: continuous WAL archiving + periodic base backups → object storage → PITR | the source of truth; durability data (storage/08) |
| Object bytes (files) | 🔴 critical | provider durability + versioning + cross-region/provider replication (ADR-0020) | content-addressed/immutable → “backup” = replication + versions |
| KMS keys | 🔴 critical | provider-managed durability + multi-region keys + deletion protection | lose keys ⇒ lose encrypted data (ADR-0014); material never exported |
| OpenSearch | 🟡 rebuildable | snapshots → object storage (low priority) | re-indexable from source (ADR-0009) |
| Redis | 🟢 mostly cache | RDB/AOF snapshot of persistent bits | cache cold-starts; not authoritative |
| NATS JetStream | 🟡 | stream replicas + snapshots | journal is replayable (sync/07) |
| K8s PVs / stateful objects | 🟡 | Velero (resources + CSI volume snapshots) | cross-cluster restore for DR |
| Cluster config | 🟢 in Git | GitOps repo + IaC (re-appliable) | no separate backup needed |

The backup priority mirrors the data model: metadata + keys are sacred (lose them and everything is lost); bytes are durable + replicated; derived stores are rebuildable and barely need backup.


2. Backup topology

flowchart TB
    classDef src fill:#bbf7d0,stroke:#15803d,color:#111827;
    classDef b fill:#fde68a,stroke:#b45309,color:#111827;
    classDef v fill:#c7d2fe,stroke:#3730a3,color:#111827;
    pg[("Postgres (metadata)")]:::src -->|WAL + base| pgb[("backup bucket<br/>separate account + region + OBJECT-LOCK")]:::b
    obj[("Object storage (bytes)")]:::src -->|versioning + replication| objr[("replica region / provider")]:::b
    pv[("Stateful PVs")]:::src -->|Velero + CSI snapshot| velb[("Velero backup store (object-lock)")]:::b
    os[("OpenSearch")]:::src -->|snapshot| ossnap[("snapshot bucket")]:::b
    pgb & velb -. tiered to cold + GFS retention .-> arch[("archive tier ([storage/10])")]:::v
    pgb --> drill["scheduled restore drill → verify integrity"]:::v
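
The "GFS retention" edge in the diagram refers to grandfather-father-son pruning. The sketch below shows one simple way to select which backups survive; the function name `gfs_keep` and the default counts (7 daily, 4 weekly, 12 monthly) are illustrative assumptions, not values taken from this design or from ADR-0033.

```python
from datetime import date, timedelta


def gfs_keep(backup_days, daily=7, weekly=4, monthly=12):
    """Return the set of backup dates to retain under a simple GFS policy.

    Hypothetical helper: the counts are placeholders, not the project's policy.
    """
    newest_first = sorted(set(backup_days), reverse=True)
    keep = set(newest_first[:daily])  # sons: the most recent daily backups

    # Fathers: the newest backup in each of the most recent ISO weeks.
    weeks = {}
    for d in newest_first:
        weeks.setdefault(tuple(d.isocalendar()[:2]), d)
    keep.update(list(weeks.values())[:weekly])

    # Grandfathers: the newest backup in each of the most recent months.
    months = {}
    for d in newest_first:
        months.setdefault((d.year, d.month), d)
    keep.update(list(months.values())[:monthly])
    return keep


# Example: 60 consecutive daily backups ending on 2024-03-31.
days = [date(2024, 3, 31) - timedelta(days=i) for i in range(60)]
kept = gfs_keep(days, daily=7, weekly=4, monthly=3)
```

Everything outside `kept` would be eligible for expiry, subject to any object-lock period on the backup bucket.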

3. Backup principles


4. The backup ⇄ GC consistency rule (subtle, important)

GC deletes unreferenced blobs after a grace period (storage/11). A PITR restore of the metadata DB to time T brings back references to every blob that existed at T, including blobs that have since become unreferenced. Therefore the backup/PITR retention window must be ≤ the period for which GC’s grace + version retention guarantees those blobs still exist; equivalently, GC must not reclaim a blob that is still referenceable from any restorable metadata snapshot. We align the PITR window with GC/version retention so a restore can never dangle (11 §5). This cross-system invariant is easy to miss, and violating it produces restores whose metadata points at blobs that no longer exist.
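
The invariant can be written down as two small checks, one on configuration and one per blob. This is a minimal sketch under stated assumptions: the names `window_is_consistent` and `gc_may_reclaim`, and the idea of a single `blob_guarantee` duration, are hypothetical and do not come from storage/11.

```python
from datetime import datetime, timedelta, timezone


def window_is_consistent(pitr_window, blob_guarantee):
    """Config-time check: the PITR window must not outlive the blob guarantee.

    `blob_guarantee` is an assumed single duration for which GC promises an
    unreferenced blob still exists (grace period plus version retention).
    """
    return pitr_window <= blob_guarantee


def gc_may_reclaim(unreferenced_since, now, gc_grace, pitr_window):
    """Per-blob check: reclaim only if no restorable snapshot can reference it.

    A restore to any T earlier than `unreferenced_since` would reference the
    blob, so the blob must have been unreferenced for at least the whole PITR
    window, and also for the normal GC grace period.
    """
    age = now - unreferenced_since
    return age >= max(gc_grace, pitr_window)


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
pitr = timedelta(days=14)
grace = timedelta(days=7)

# Unreferenced 10 days ago: past GC grace, but a restore to day -12 would still need it.
recent = gc_may_reclaim(now - timedelta(days=10), now, grace, pitr)
# Unreferenced 15 days ago: outside every restorable point in time.
old = gc_may_reclaim(now - timedelta(days=15), now, grace, pitr)
```

In this example `recent` is false and `old` is true: the first blob is past the GC grace period but still inside the restorable window, which is exactly the case the rule exists to catch.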


5. Self-host backups

Self-host ships the same primitives at a smaller scale: CloudNativePG PITR (or pg_dump for lite), object-storage versioning, and a documented restore runbook — so a self-hoster’s data is as recoverable as SaaS, proportional to their tier (ADR-0012).
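
For the lite tier, a dump-then-restore check might look like the sketch below. It only assembles standard `pg_dump` / `pg_restore` invocations; the database names, the dump path, and the helper names are assumptions for illustration, and the scratch database is assumed to exist already. The actual self-host runbook may differ.

```python
import subprocess


def pg_dump_cmd(dbname, dump_path):
    """Build a pg_dump call that writes a custom-format archive."""
    return ["pg_dump", "--format=custom", f"--file={dump_path}", f"--dbname={dbname}"]


def pg_restore_cmd(scratch_db, dump_path):
    """Build a pg_restore call that loads the archive into a scratch database."""
    return [
        "pg_restore",
        "--exit-on-error",  # stop at the first failure instead of continuing
        "--no-owner",       # do not require the original roles to exist
        f"--dbname={scratch_db}",
        dump_path,
    ]


def backup_and_verify(dbname, scratch_db, dump_path):
    """Dump, then prove the dump restores; raises CalledProcessError on failure."""
    subprocess.run(pg_dump_cmd(dbname, dump_path), check=True)
    subprocess.run(pg_restore_cmd(scratch_db, dump_path), check=True)


# Hypothetical names, shown only to illustrate the resulting argv lists.
dump_argv = pg_dump_cmd("appdb", "/backups/appdb.dump")
restore_argv = pg_restore_cmd("appdb_restore_check", "/backups/appdb.dump")
```

Restoring into a scratch database rather than trusting the dump file alone keeps the lite tier in line with the rest of the section: a backup counts only once a restore has been demonstrated.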


6. Tradeoffs / Alternatives / Scaling

Tradeoffs. Immutable, multi-region, frequently-tested backups cost storage + egress + operational effort. For a data-custody product this is the cost of being trustworthy; retention tiering + cold storage bound the bill.

Alternatives considered.

Scaling concerns.

References