11 — Garbage Collection

Topic: garbage collection. The hardest correctness problem in the subsystem. Dedup means deleting is not the inverse of writing: you may only delete bytes that no manifest anywhere references — and a new reference can appear (a dedup hit) at the exact moment you decide to delete. Get GC wrong and you either leak storage forever or delete live data. Decision in ADR-0019.


1. What GC must reclaim

Constraints it must respect: WORM retain_until floors (07), the grace period (SI-2), and tenant isolation.


2. Why refcounting alone is unsafe (the core race)

The naive design — keep a per-chunk refcount, delete at zero — has a fatal race:

GC: reads refcount(chunk X) == 0  ───────────────┐
                                                  │  (concurrent)
Commit: dedup hit on X → refcount++ , manifest M references X
GC: deletes X's bytes  ◄──────────────────────────┘
Result: M references deleted bytes → dangling reference → corruption

Distributed refcounts also drift under crashes, retries, and partial failures. So: refcounts are a hint/optimization, never the deletion authority. Deletion is authorized by reachability (mark-and-sweep) plus a grace period plus a re-check under compare-and-swap.


3. The chunk lifecycle state machine (deletion is gated, reversible until the last step)

stateDiagram-v2
    [*] --> Writing: presigned PUT to staging
    Writing --> Committed: commit verifies + manifest refs (SI-1)
    Writing --> Unreferenced: abandoned (staging TTL)
    Committed --> Packed: packer coalesces into a pack
    Committed --> Unreferenced: last reference removed (edge count 0)
    Packed --> Unreferenced: last reference removed
    Unreferenced --> Committed: re-referenced by a dedup hit<br/>(reset unreferenced_at) 
    Unreferenced --> Deleting: unreferenced_at older than GRACE<br/>AND still 0 refs (CAS)
    Deleting --> Committed: re-referenced before byte delete<br/>(CAS abort, restore)
    Deleting --> [*]: bytes deleted / pack space reclaimed
    note right of Deleting
        retain_until (WORM) blocks
        the transition into Deleting.
    end note

The two safety hinges:

  1. Unreferenced → Deleting requires the grace period to have elapsed and a re-confirmation of zero references in the same CAS. A dedup hit during the window flips the chunk back to Committed (resets unreferenced_at), so the race in §2 cannot delete live data.
  2. Commit treats a Deleting chunk as missing. If a dedup negotiation matches a chunk that is mid-deletion, the client is told to re-upload it (cheap, safe) rather than referencing bytes that may vanish. Correctness over a rare extra PUT.

4. The GC algorithm: hybrid incremental + periodic full mark-sweep

A full mark-and-sweep over billions of chunks/manifests every cycle is prohibitively expensive and is why tools like restic lock the repository during prune — unacceptable at our scale and uptime. We use a hybrid, online design:

(a) Incremental candidate-based GC (the common path).

(b) Periodic full mark-sweep (the backstop).

flowchart TB
    classDef m fill:#fde68a,stroke:#b45309,color:#111827;
    classDef s fill:#fed7aa,stroke:#c2410c,color:#111827;
    classDef d fill:#bbf7d0,stroke:#15803d,color:#111827;
    prune["Version prune / trash purge / tenant delete ([07],[10])"]:::m --> edges["Remove REF_EDGEs"]:::m
    edges --> cand{"chunk edge-count == 0?"}:::m
    cand -- yes --> mark["Stamp unreferenced_at, enqueue candidate"]:::s
    mark --> wait["Grace period elapses"]:::s
    wait --> recheck{"still 0 refs AND retain_until passed? (CAS)"}:::s
    recheck -- no --> resurrect["Back to Committed"]:::d
    recheck -- yes --> del["Delete bytes / mark pack dead space"]:::d
    full["Periodic full mark-sweep (online, partitioned)"]:::s -. reconciles drift .-> recheck

5. Pack compaction (reclaiming dead space)

You cannot delete one chunk from inside an immutable pack object. So:

Tradeoff: lower repack threshold = less wasted space but more IO/cost; higher = cheaper but more slack. Tunable; default leaves some slack (à la restic’s 5–10% allowance) because rewriting packs is expensive.


6. Packing vs presigned direct-upload (the reconciliation)

ADR-0011 says bytes bypass our compute; packing reads/writes bytes through compute. Reconciliation:

So the presigned property holds for everything users observe; packing is internal, asynchronous, and bounded.


7. Crash safety, idempotency, isolation


8. Tradeoffs / Alternatives / Scaling

Tradeoffs. The grace period is the dial between safety (longer = more sure no live data is deleted, tolerates clock skew/lag) and space efficiency (longer = dead bytes linger). We default conservative (e.g. 24–72h) because deleting live data is unrecoverable while leaking space is merely costly — an asymmetric risk that should always favor safety.

Alternatives considered.

Scaling concerns.

References