07 — File Versioning

Topic: file versioning. Versioning in a content-addressed store is almost free for storage and almost entirely a metadata and retention problem. This doc designs the version model, restore, retention, immutability/WORM, snapshots, and the interaction with dedup and GC. The adjacent decision context is ADR-0020 (placement); the version model itself is recorded here and in 08 (data model).


1. Model: immutable, content-addressed versions (copy-on-write at chunk level)

A Version (owned by the File & Metadata context, 08) binds a node to one Object (content_hash + manifest, 02). Versions are immutable and form an append-only chain per node; the node carries a current_version_id pointer.
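As a rough illustration of this model, the sketch below uses an immutable record for a version and an append-only list for the chain. The class names, the `commit` helper, and the integer version ids are assumptions made for the example, not the schema defined in 08.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)  # frozen: a version cannot be mutated once created
class Version:
    version_id: int
    node_id: str
    content_hash: str  # identifies the Object (content hash + manifest)


@dataclass
class Node:
    node_id: str
    versions: List[Version] = field(default_factory=list)  # append-only chain
    current_version_id: Optional[int] = None

    def commit(self, content_hash: str) -> Version:
        """Append a new version and advance the current pointer (hypothetical helper)."""
        version = Version(len(self.versions) + 1, self.node_id, content_hash)
        self.versions.append(version)
        self.current_version_id = version.version_id
        return version


node = Node("doc-1")
node.commit("hash-v1")
node.commit("hash-v2")
print(node.current_version_id)  # 2
```

Existing versions are never rewritten in this sketch; only the chain grows and the pointer moves.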

flowchart LR
    classDef v fill:#dbeafe,stroke:#1e40af,color:#111827;
    classDef m fill:#fde68a,stroke:#b45309,color:#111827;
    classDef c fill:#bbf7d0,stroke:#15803d,color:#111827;
    v1["v1"]:::v --> m1["manifest1<br/>[A,B,C]"]:::m
    v2["v2"]:::v --> m2["manifest2<br/>[A,B,C2]"]:::m
    v3["v3 (current)"]:::v --> m3["manifest3<br/>[A,B2,C2]"]:::m
    m1 --> A["chunk A"]:::c
    m1 --> B["chunk B"]:::c
    m1 --> C["chunk C"]:::c
    m2 --> C2["chunk C2"]:::c
    m3 --> B2["chunk B2"]:::c
    m2 -. shares .-> A
    m2 -. shares .-> B
    m3 -. shares .-> A
    m3 -. shares .-> C2

The win: editing one region of a file changes only the CDC chunks in that region (02); the new version’s manifest shares every unchanged chunk with prior versions. A 1-byte edit to a 1 GiB file stores roughly one new 1 MiB chunk, not 1 GiB. The storage cost of a version is therefore its delta, achieved implicitly by chunk dedup, with no explicit diff format needed.
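The three manifests from the diagram can be replayed against a toy chunk store to see how few chunks each save adds. This is a minimal sketch: `store_version`, the in-memory `set`, and the fixed 1 MiB `CHUNK_SIZE` are illustrative assumptions (real CDC chunks vary in size).

```python
CHUNK_SIZE = 1024 * 1024  # assumed average chunk size: 1 MiB


def store_version(manifest, chunk_store):
    """Add a manifest's chunks to the store and return the hashes that were new."""
    # dict.fromkeys de-duplicates within the manifest while preserving order.
    added = [h for h in dict.fromkeys(manifest) if h not in chunk_store]
    chunk_store.update(added)
    return added


chunk_store = set()
history = {
    "v1": ["A", "B", "C"],
    "v2": ["A", "B", "C2"],
    "v3": ["A", "B2", "C2"],
}
new_per_version = {v: store_version(m, chunk_store) for v, m in history.items()}

for version, added in new_per_version.items():
    print(version, added, f"{len(added) * CHUNK_SIZE} new bytes")
```

Under these assumptions v1 pays for three chunks, while v2 and v3 each pay for one.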


2. Restore, copy, and conflict copies (all cheap metadata ops)


3. Retention policies (the part that actually costs)

Infinite history is infinite metadata (and eventually storage, as old unique chunks pile up). Retention bounds it. Policies are per-tenant/per-folder:

| Policy | Rule | Use |
| --- | --- | --- |
| Keep-last-N | retain newest N versions | default user files |
| Time-based | retain versions for D days | compliance windows |
| Tiered thinning | keep all for 7d, daily for 30d, weekly for 1y | space-efficient long history |
| Significant-only | keep versions marked/large/explicit | reduce churn noise |
| Legal hold / WORM | retain immutably until released | compliance (§5) |
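One possible reading of the tiered thinning policy is sketched below. It assumes "daily" and "weekly" mean keeping the newest version per calendar day and per ISO week, and that versions older than a year are pruned; `tiered_keep` and the sample timestamps are hypothetical. A real policy would also need to exempt the current version and anything under legal hold.

```python
from datetime import datetime, timedelta


def tiered_keep(timestamps, now):
    """Return the version timestamps retained under tiered thinning."""
    keep = set()
    seen_buckets = set()
    for ts in sorted(timestamps, reverse=True):  # newest first
        age = now - ts
        if age <= timedelta(days=7):
            keep.add(ts)  # first tier: keep every version
            continue
        if age <= timedelta(days=30):
            bucket = ("day", ts.date())
        elif age <= timedelta(days=365):
            iso = ts.isocalendar()
            bucket = ("week", iso[0], iso[1])  # ISO year and week number
        else:
            continue  # assumed: older than one year is pruned
        if bucket not in seen_buckets:
            seen_buckets.add(bucket)  # the newest version in each bucket wins
            keep.add(ts)
    return keep


now = datetime(2024, 6, 30, 12)
history = [
    datetime(2024, 6, 29, 12), datetime(2024, 6, 28, 12),  # within 7 days
    datetime(2024, 6, 20, 9), datetime(2024, 6, 20, 15),   # same day
    datetime(2024, 3, 4, 10), datetime(2024, 3, 6, 10),    # same ISO week
    datetime(2022, 1, 1),                                  # older than 1 year
]
kept = tiered_keep(history, now)
print(sorted(kept))
```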

When a version is pruned by policy, its manifest is dereferenced; chunks unique to it become unreferenced and are reclaimed by GC after the grace period (11). Chunks still shared with retained versions survive — pruning never corrupts a kept version.
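The pruning rule can be sketched as a set difference: chunks of the pruned manifest minus chunks still referenced by any retained manifest. The `prune` function and the in-memory `manifests` dict are assumptions for illustration; the result is only a list of GC candidates, since actual reclamation waits for the grace period (11).

```python
def prune(manifests, version_id):
    """Dereference one version's manifest; return chunks no retained version uses."""
    pruned_chunks = set(manifests.pop(version_id))
    # Union of every chunk still referenced by the versions that remain.
    still_referenced = set().union(*manifests.values())
    return pruned_chunks - still_referenced


manifests = {
    "v1": ["A", "B", "C"],
    "v2": ["A", "B", "C2"],
    "v3": ["A", "B2", "C2"],
}
first = prune(manifests, "v1")
second = prune(manifests, "v2")
print(first)   # {'C'}: A and B are still used by v2
print(second)  # {'B'}: A and C2 are still used by v3
```

Every chunk of the kept version v3 survives both prunes, which is the safety property stated above.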


4. Snapshots (point-in-time, tenant/folder scope)

A snapshot is a named, immutable set of (node → version) bindings at an instant — essentially a manifest-of-manifests at the namespace level. Because everything is content-addressed, a snapshot is pure metadata: it pins existing versions (refs++), storing no new bytes. Snapshots enable “restore my whole folder to last Tuesday” and are the unit a retention policy or legal hold can target.
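A minimal sketch of that idea follows, assuming a snapshot is a frozen copy of the live (node → version) map plus a pin count per version. `take_snapshot`, the `pins` counter, and the file names are hypothetical.

```python
from collections import Counter
from types import MappingProxyType


def take_snapshot(name, current_versions, pins):
    """Freeze the current (node -> version) bindings and pin each version (refs++)."""
    bindings = MappingProxyType(dict(current_versions))  # read-only copy
    for version_id in bindings.values():
        pins[version_id] += 1
    return name, bindings


pins = Counter()
current = {"report.docx": "v3", "notes.txt": "v1"}
name, snapshot = take_snapshot("last-tuesday", current, pins)

current["report.docx"] = "v4"   # a later edit moves the live pointer
print(snapshot["report.docx"])  # v3
print(pins["v3"])               # 1
```

No chunk or manifest is copied here; the snapshot only holds references that keep the pinned versions from being pruned.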

Non-goal reminder: snapshots are not a system backup/DR product (NG in 02 product goals); they are user-facing point-in-time recovery of namespace state. System DR is operational backup of Postgres + object storage (08 metadata).



6. Tradeoffs / Alternatives / Scaling

Tradeoffs. Content-addressed manifests make version storage cheap but shift the cost to metadata (a version row + a manifest per save) and to GC (pruning must compute which chunks became unreferenced). We accept metadata growth and bound it with retention + manifest nesting for huge files.

Alternatives considered.

Scaling concerns.

References