11 — Garbage Collection
Topic: garbage collection. The hardest correctness problem in the subsystem. Dedup means deleting is not the inverse of writing: you may only delete bytes that no manifest anywhere references — and a new reference can appear (a dedup hit) at the exact moment you decide to delete. Get GC wrong and you either leak storage forever or delete live data. Decision in ADR-0019.
1. What GC must reclaim
- Unreferenced chunks — no manifest references them after version pruning (07), trash purge, or tenant deletion.
- Dead space inside packs — a pack keeps live bytes even after some member chunks die; reclaimed by repacking.
- Abandoned staging — incomplete uploads / aborted multipart (05).
- Redundant standalone objects — a chunk that has been copied into a pack; the standalone object is deleted after the index points at the pack.
Constraints it must respect: WORM retain_until floors (07),
the grace period (SI-2), and tenant isolation.
2. Why refcounting alone is unsafe (the core race)
The naive design — keep a per-chunk refcount, delete at zero — has a fatal race:
GC: reads refcount(chunk X) == 0 ───────────────┐
│ (concurrent)
Commit: dedup hit on X → refcount++ , manifest M references X
GC: deletes X's bytes ◄──────────────────────────┘
Result: M references deleted bytes → dangling reference → corruption
Distributed refcounts also drift under crashes, retries, and partial failures. So: refcounts are a hint/optimization, never the deletion authority. Deletion is authorized by reachability (mark-and-sweep) plus a grace period plus a re-check under compare-and-swap.
3. The chunk lifecycle state machine (deletion is gated, reversible until the last step)
stateDiagram-v2
[*] --> Writing: presigned PUT to staging
Writing --> Committed: commit verifies + manifest refs (SI-1)
Writing --> Unreferenced: abandoned (staging TTL)
Committed --> Packed: packer coalesces into a pack
Committed --> Unreferenced: last reference removed (edge count 0)
Packed --> Unreferenced: last reference removed
Unreferenced --> Committed: re-referenced by a dedup hit<br/>(reset unreferenced_at)
Unreferenced --> Deleting: unreferenced_at older than GRACE<br/>AND still 0 refs (CAS)
Deleting --> Committed: re-referenced before byte delete<br/>(CAS abort, restore)
Deleting --> [*]: bytes deleted / pack space reclaimed
note right of Deleting
retain_until (WORM) blocks
the transition into Deleting.
end note
The two safety hinges:
Unreferenced → Deletingrequires the grace period to have elapsed and a re-confirmation of zero references in the same CAS. A dedup hit during the window flips the chunk back toCommitted(resetsunreferenced_at), so the race in §2 cannot delete live data.- Commit treats a
Deletingchunk as missing. If a dedup negotiation matches a chunk that is mid-deletion, the client is told to re-upload it (cheap, safe) rather than referencing bytes that may vanish. Correctness over a rare extra PUT.
4. The GC algorithm: hybrid incremental + periodic full mark-sweep
A full mark-and-sweep over billions of chunks/manifests every cycle is prohibitively expensive and is why tools like restic lock the repository during prune — unacceptable at our scale and uptime. We use a hybrid, online design:
(a) Incremental candidate-based GC (the common path).
- When a manifest is pruned/deleted, the affected
REF_EDGErows are removed; any chunk whose edge count hits zero is stampedunreferenced_at = nowand enqueued as a candidate. - The sweeper processes candidates whose grace has elapsed, re-checks zero refs under
CAS, respects
retain_until, then deletes (or marks pack dead space). - This touches only recently dead data — no global scan.
(b) Periodic full mark-sweep (the backstop).
- Because refcounts/edges can drift (crashes, bugs), a low-frequency full pass recomputes the live set from all manifests (per tenant, partitioned) and reconciles: anything live-but-marked-dead is rescued; anything dead-but-still- present beyond grace is reclaimed. Runs online (no lock): references created after the mark snapshot are treated as live (a write-barrier via edge-insert timestamps / mark epoch), so concurrent uploads are never collected.
flowchart TB
classDef m fill:#fde68a,stroke:#b45309,color:#111827;
classDef s fill:#fed7aa,stroke:#c2410c,color:#111827;
classDef d fill:#bbf7d0,stroke:#15803d,color:#111827;
prune["Version prune / trash purge / tenant delete ([07],[10])"]:::m --> edges["Remove REF_EDGEs"]:::m
edges --> cand{"chunk edge-count == 0?"}:::m
cand -- yes --> mark["Stamp unreferenced_at, enqueue candidate"]:::s
mark --> wait["Grace period elapses"]:::s
wait --> recheck{"still 0 refs AND retain_until passed? (CAS)"}:::s
recheck -- no --> resurrect["Back to Committed"]:::d
recheck -- yes --> del["Delete bytes / mark pack dead space"]:::d
full["Periodic full mark-sweep (online, partitioned)"]:::s -. reconciles drift .-> recheck
5. Pack compaction (reclaiming dead space)
You cannot delete one chunk from inside an immutable pack object. So:
- The Pack Index tracks
live_bytes / sizeper pack. - When live fraction drops below a threshold (e.g. <70%, cf. restic’s repack heuristic), the repacker copies the live chunks into a new pack, updates the index (CAS), and deletes the old pack (after grace).
- Compaction is batched and throttled (it reads/writes bytes through compute, co-located to minimize egress) and is the same machinery used for temperature repacking (10 §4).
Tradeoff: lower repack threshold = less wasted space but more IO/cost; higher = cheaper but more slack. Tunable; default leaves some slack (à la restic’s 5–10% allowance) because rewriting packs is expensive.
6. Packing vs presigned direct-upload (the reconciliation)
ADR-0011 says bytes bypass our compute; packing reads/writes bytes through compute. Reconciliation:
- User-facing transfer stays direct/presigned. Chunks land as individual staging objects (Mode A) or a whole staged object (Mode B) — no compute on the hot path.
- Packing is async maintenance, co-located with storage, on the committed long tail — not on the user’s path. It coalesces many small objects into packs, then GC deletes the redundant standalone objects after grace.
- Hot/recent/large chunks may stay standalone (skip packing) so the most-read data needs no reconstruction indirection.
So the presigned property holds for everything users observe; packing is internal, asynchronous, and bounded.
7. Crash safety, idempotency, isolation
- Idempotent & resumable: GC/repack use watermarks and CAS; re-running after a crash repeats no destructive step and completes partial ones. Deleting an already- deleted object is a no-op.
- Tombstones record intended deletes so a crashed sweep is auditable and resumable, and so eventual provider-list lag can’t resurrect a deleted chunk.
- Per-tenant, parallel: GC is partitioned by tenant (08) → scales horizontally and contains blast radius. Tenant deletion is a bulk GC of the tenant’s prefix.
- Never blocks the hot path: GC runs on replicas/partitions, throttled to an IO budget; uploads/downloads proceed during GC (unlike stop-the-world prune).
8. Tradeoffs / Alternatives / Scaling
Tradeoffs. The grace period is the dial between safety (longer = more sure no live data is deleted, tolerates clock skew/lag) and space efficiency (longer = dead bytes linger). We default conservative (e.g. 24–72h) because deleting live data is unrecoverable while leaking space is merely costly — an asymmetric risk that should always favor safety.
Alternatives considered.
- Pure refcounting: simplest, O(1) per op, but the §2 race + drift make it unsafe as the authority. We keep refcounts/edges as the fast path/candidate generator and add mark-sweep as the authority — the hybrid.
- Pure periodic full mark-sweep (restic-style): correct but expensive and historically requires locking. We keep it only as the online backstop; incremental candidate GC does the day-to-day work.
- Stop-the-world GC: impossible at our uptime requirements → online/concurrent GC with a mark epoch / write barrier.
- Immediate delete on last unref (no grace): exposes the dedup race directly. Rejected.
Scaling concerns.
- Mark cost of the full backstop is O(manifests) → partition by tenant, run on replicas, low frequency, incremental candidate GC handles the rest.
- Delete throughput → batch deletes where the provider supports it (01); pack-level deletes reclaim many chunks per op.
- Candidate queue can spike on mass deletion (tenant offboarding) → backpressured, throttled, resumable.
- Clock skew / list lag across providers → grace period + tombstones + CAS re-check absorb it; GC never trusts a single list snapshot.
References
- restic prune (mark-sweep over CDC chunks, repack threshold, repo lock): https://restic.readthedocs.io/en/stable/060_forget.html
- Tri-color / concurrent GC concepts (write barriers, mark epoch): https://en.wikipedia.org/wiki/Tracing_garbage_collection
- S3 batch delete (DeleteObjects): https://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-multiple-objects.html
- S3 abort incomplete multipart (staging reclamation): https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpu-abort-incomplete-mpu-lifecycle-config.html