ADR-0017 — Content-defined chunking (FastCDC) + packing
- Status: Deferred
- Date: 2026-06-11
- Related: storage/02 content-addressing, storage/03 dedup, storage/11 GC, storage/08 metadata
V1 Freeze (2026-06-12): Deferred. V1 uses whole-object blobs (ADR-0018); no content-defined chunking or packing. Re-opens when large-file delta/dedup efficiency is a demonstrated need (post-V1).
Context
To dedup and delta-sync, files are split into chunks named by content hash. The split strategy determines dedup quality, and the chunk size determines metadata index cardinality — which is the binding constraint at millions of files / billions of chunks (storage/08). Separately, storing each ~1 MiB chunk as its own provider object is untenable at scale (per-object overhead, request/list cost).
Decision
- Content-defined chunking via FastCDC (gear rolling hash, normalized chunking level 2), parameters min 256 KiB / avg ~1 MiB / max 4 MiB.
- Do not chunk small files (≤ ~1–4 MiB): store whole (one chunk = one object) — trivial downloads, no index bloat.
- Pack committed chunks into ~256 MiB–1 GiB pack objects, with a Pack Index mapping chunk_hash → (pack_id, offset, len); hot/large chunks may stay standalone. Packing is async, co-located maintenance; the user-facing transfer stays direct/presigned (ADR-0011, storage/11 §6).
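The chunking rule in the first bullet can be sketched roughly as follows. This is a minimal illustration of a FastCDC-style cut-point search, not a reference implementation: the gear table seed, the top-bits mask layout, and the names `cut_point` and `chunk_spans` are assumptions made for the example, and a real deployment would use a native implementation rather than a pure-Python loop.

```python
import random

MASK64 = (1 << 64) - 1

# Gear table: one pseudo-random 64-bit value per byte value.
# The seed is arbitrary (an assumption); it only has to be fixed so that
# every client derives the same boundaries from the same bytes.
_rng = random.Random(17)
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def _top_bits_mask(bits: int) -> int:
    """Mask selecting the top `bits` bits of the 64-bit rolling hash."""
    return ((1 << bits) - 1) << (64 - bits)


def cut_point(data: bytes, start: int, min_size: int, avg_size: int,
              max_size: int, nc_level: int = 2) -> int:
    """Return the exclusive end offset of the chunk beginning at `start`."""
    remaining = len(data) - start
    if remaining <= min_size:
        return len(data)                      # short tail: emit as one chunk
    bits = avg_size.bit_length() - 1          # log2(avg_size); power of two assumed
    hard = _top_bits_mask(bits + nc_level)    # harder to match before avg_size
    easy = _top_bits_mask(bits - nc_level)    # easier to match after avg_size
    normal = start + min(remaining, avg_size)
    end = start + min(remaining, max_size)
    h = 0
    for i in range(start + min_size, end):    # never cut inside the first min_size bytes
        h = ((h << 1) + GEAR[data[i]]) & MASK64
        if (h & (hard if i < normal else easy)) == 0:
            return i + 1
    return end                                # forced cut at max_size or end of data


def chunk_spans(data: bytes, min_size: int = 256 * 1024,
                avg_size: int = 1024 * 1024, max_size: int = 4 * 1024 * 1024):
    """Split `data` into contiguous (start, end) spans."""
    spans, start = [], 0
    while start < len(data):
        end = cut_point(data, start, min_size, avg_size, max_size)
        spans.append((start, end))
        start = end
    return spans
```

The two masks are what "normalized chunking level 2" refers to in this sketch: two extra mask bits before the average size and two fewer after it, which pulls the chunk-size distribution toward the average. Each span would then be hashed and named by content (storage/02).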
Consequences
Positive
- CDC is robust to insert/delete shifts, whereas fixed-size chunking loses dedup for every block after a 1-byte insert. This is what enables real delta-sync and cross-version dedup (storage/07).
- ~1 MiB average keeps the chunk index ~64× smaller than 16 KiB chunking while retaining strong dedup — the central cardinality tradeoff.
- Packing collapses billions of chunk objects into millions of packs → manageable request cost, listing, and GC.
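The cardinality and packing figures above can be checked with a rough worked example. The 1 PiB corpus size is an assumption chosen only to make the numbers concrete (this ADR does not state one), and the calculation ignores dedup savings and unchunked small files.

```python
KIB = 1 << 10
MIB = 1 << 20
GIB = 1 << 30
PIB = 1 << 50


def object_count(total_bytes: int, avg_object_bytes: int) -> int:
    """Number of objects needed to hold total_bytes (ceiling division)."""
    return -(-total_bytes // avg_object_bytes)


total = 1 * PIB  # assumed amount of unique stored data

coarse_chunks = object_count(total, 1 * MIB)    # ~1 MiB CDC (this ADR)
fine_chunks = object_count(total, 16 * KIB)     # ~16 KiB CDC (restic/borg style)
index_ratio = fine_chunks // coarse_chunks

packs_small = object_count(total, 256 * MIB)    # lower end of the pack size range
packs_large = object_count(total, 1 * GIB)      # upper end of the pack size range

print(f"chunk index entries at 1 MiB:  {coarse_chunks:,}")
print(f"chunk index entries at 16 KiB: {fine_chunks:,}")
print(f"index size ratio:              {index_ratio}x")
print(f"pack objects: {packs_large:,} to {packs_small:,}")
```

Under that assumption the chunk index holds about 1.07 billion entries at 1 MiB versus about 68.7 billion at 16 KiB, and packing reduces the provider object count to roughly 1 to 4 million.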
Negative / costs
- Chunking costs CPU on every byte: the gear rolling hash for boundary detection, plus the content hash (cheap with gear and BLAKE3, but nonzero).
- Coarser chunks dedup slightly worse than fine chunks — accepted for cardinality.
- Packing adds a maintenance worker, a Pack Index, and pack compaction/repack logic (storage/11).
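To make the Pack Index concrete, here is a minimal in-memory sketch of the packing step and the lookup path. The names `PackEntry`, `pack_chunks`, and `read_chunk` and the pack ID format are hypothetical, `hashlib.sha256` stands in for whatever content hash storage/02 specifies, and the slice in `read_chunk` stands in for a ranged read against the provider. Compaction and repacking are out of scope here.

```python
import hashlib
from typing import Dict, Iterable, NamedTuple, Tuple


class PackEntry(NamedTuple):
    pack_id: str
    offset: int
    length: int


def pack_chunks(chunks: Iterable[bytes],
                target_size: int) -> Tuple[Dict[str, bytes], Dict[str, PackEntry]]:
    """Append chunks into packs of roughly target_size bytes and build the index."""
    packs: Dict[str, bytes] = {}        # pack_id -> pack bytes
    index: Dict[str, PackEntry] = {}    # chunk_hash -> location inside a pack
    buf = bytearray()
    pack_no = 0

    def seal() -> None:
        nonlocal buf, pack_no
        if buf:
            packs[f"pack-{pack_no:06d}"] = bytes(buf)
            pack_no += 1
            buf = bytearray()

    for chunk in chunks:
        chunk_hash = hashlib.sha256(chunk).hexdigest()
        if chunk_hash in index:
            continue                    # already stored: dedup hit
        if buf and len(buf) + len(chunk) > target_size:
            seal()                      # current pack is full, start a new one
        index[chunk_hash] = PackEntry(f"pack-{pack_no:06d}", len(buf), len(chunk))
        buf += chunk
    seal()
    return packs, index


def read_chunk(packs: Dict[str, bytes], index: Dict[str, PackEntry],
               chunk_hash: str) -> bytes:
    """Resolve a chunk through the index and verify it against its hash."""
    entry = index[chunk_hash]
    data = packs[entry.pack_id][entry.offset:entry.offset + entry.length]
    if hashlib.sha256(data).hexdigest() != chunk_hash:
        raise ValueError(f"corrupt chunk {chunk_hash}")
    return data
```

The index is the extra state this bullet refers to: every read of a packed chunk needs one index lookup before the ranged read, and GC has to rewrite packs (and their index entries) once enough of a pack's chunks are dead (storage/11).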
Alternatives considered
- Fixed-size blocks (e.g. Dropbox 4 MiB): simpler, bounded cardinality, but loses insert-robustness. We get most of the cardinality benefit at 1 MiB CDC and keep insert-robustness.
- Fine CDC (~16 KiB, restic/borg): best dedup ratio, but 64× index cardinality → the index becomes the bottleneck at our scale. Rejected.
- Whole-file only: trivial, but no delta-sync and dedup only on identical files. Used as the small-file and degenerate-deployment case, not the default for large files.
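The insert-robustness gap between the fixed-size alternative and CDC can be demonstrated with a small experiment. This sketch uses a deliberately simplified gear chunker (single mask, no normalization, tiny chunk sizes so it runs quickly on 64 KiB of random data); the function names and all parameters are assumptions for illustration, not the parameters chosen in this ADR.

```python
import hashlib
import random

MASK64 = (1 << 64) - 1
_rng = random.Random(17)                      # arbitrary fixed seed (assumption)
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def fixed_chunks(data: bytes, size: int = 256):
    """Fixed-size blocks: boundaries depend only on byte offsets."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data: bytes, min_size: int = 64, avg_bits: int = 8,
               max_size: int = 1024):
    """Simplified gear CDC: boundaries depend on the preceding bytes' content."""
    mask = ((1 << avg_bits) - 1) << (64 - avg_bits)
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & MASK64
        size = i + 1 - start
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])           # trailing partial chunk
    return chunks


def reused_fraction(old: bytes, new: bytes, chunker) -> float:
    """Fraction of the new version's chunks already present in the old version."""
    known = {hashlib.sha256(c).digest() for c in chunker(old)}
    new_chunks = chunker(new)
    hits = sum(hashlib.sha256(c).digest() in known for c in new_chunks)
    return hits / len(new_chunks)


data_rng = random.Random(1)
old = bytes(data_rng.getrandbits(8) for _ in range(64 * 1024))
new = b"\x00" + old                           # 1-byte insert at the front

fixed_reuse = reused_fraction(old, new, fixed_chunks)
cdc_reuse = reused_fraction(old, new, cdc_chunks)
print(f"fixed-size reuse: {fixed_reuse:.2%}")
print(f"CDC reuse:        {cdc_reuse:.2%}")
```

With fixed-size blocks every block after the insert is shifted by one byte, so none of them match the old version. The CDC boundaries realign once the chunker passes the edit, so nearly all chunks are reused.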
Scaling
Chunk size (coarse CDC) governs index cardinality, and packing governs provider object count. These are the two levers that let metadata (storage/08) and request cost (storage/01) scale.