02 — Content-Addressed Storage & Chunking
Topics: content-addressed storage, chunked uploads (the data model side). Decisions crystallized in ADR-0016 (hashing) and ADR-0017 (chunking).
Content-addressed storage (CAS) is the foundation everything else stands on. Get the hash and the chunk boundaries right and dedup, integrity, caching, resumability, and safe concurrency all fall out almost for free. Get them wrong and you pay in metadata explosion forever.
1. Why content addressing (the properties we are buying)
Naming a unit by hash(content) gives, for free:
- Deduplication — identical content has one name (→ 03).
- Integrity — the name is the checksum; verification is intrinsic (→ 04).
- Idempotent, lock-free writes — re-writing an existing hash is a no-op; racing writers converge (01 §4).
- Immutability — a name binds to exactly one byte sequence forever; “edits” create new names, never mutate. This eliminates update-consistency from the design.
- Cheap caching & sharing — a hash is a perfect cache key and a perfect cross-device “do you already have this?” token (→ delta sync, 05).
Tradeoff: CAS forfeits locality (content order ≠ storage order) and makes deletion indirect (you can’t delete content someone else references → 11 GC). Both are worth it; both are designed for explicitly.
2. Hash function: BLAKE3 (see ADR-0016)
Decision: address chunks and objects with BLAKE3-256; offer SHA-256 as a per-deployment compliance mode.
| BLAKE3 | SHA-256 (alt) | |
|---|---|---|
| Throughput (1 core) | ~8 GB/s (AVX2) | ~3 GB/s (SHA-NI) |
| Multicore | linear (Merkle tree) — ~90 GB/s/16 cores | sequential, fixed |
| Verified streaming of ranges | yes (internal Merkle tree) | no |
| Incremental update | yes | no |
| Cryptanalysis maturity | ~5 yrs | ~23 yrs |
| FIPS / regulatory | not FIPS | FIPS 180-4 |
Why it matters here specifically: BLAKE3 is internally a Merkle tree over 1 KiB leaves, so we can verify an arbitrary byte range of a chunk without reading the whole chunk (04) and hash huge files in parallel on ingest. That property is uniquely valuable for a storage system doing range reads and background scrubbing.
Tradeoffs / Alternatives. BLAKE3’s shorter public-cryptanalysis history is the real cost; SHA-256’s maturity and FIPS status are why we keep it selectable for regulated deployments. We reject MD5/SHA-1 (collision-broken → a dedup system using a broken hash is a data-corruption vector, not just a security one: an attacker who finds a collision can make chunk A resolve to chunk B’s bytes). We reject non-cryptographic hashes (xxHash) for addressing — collision resistance is load-bearing for dedup correctness — though we do use them as cheap pre-filters.
Scaling. Hashing is on the ingest path for every byte; at PB scale the 2–3× throughput edge of BLAKE3 is real CPU/cost savings, and parallel hashing keeps large-file ingest from being single-core-bound.
3. Chunking strategy: content-defined chunking (see ADR-0017)
A file’s bytes must be split into chunks. Three strategies, escalating in power and cost:
| Strategy | Boundary rule | Dedup quality | Cost |
|---|---|---|---|
| Whole-file | none (file = 1 chunk) | only identical files dedup | cheapest; trivial |
| Fixed-size | every N bytes | breaks on insert/delete shift (one byte inserted → every later boundary moves → zero dedup) | cheap |
| Content-defined (CDC) | boundary where rolling hash hits a pattern | robust to insert/delete; near-optimal | rolling hash over every byte |
Decision: FastCDC (gear-based rolling hash, normalized chunking level 2), parameters min 256 KiB, average ~1 MiB, max 4 MiB.
The boundary-shift problem (why not fixed-size)
Fixed-size chunking fails the most common real edit: prepend/insert a byte and every subsequent fixed boundary shifts, so no later chunk matches the prior version — dedup and delta-sync collapse to zero. CDC places boundaries based on content, so an insert only disturbs the one or two chunks around it; the rest still match. This is why CDC is the standard for backup/sync systems (restic, borg, Tarsnap).
Why ~1 MiB average and not 16 KiB (the cardinality argument)
FastCDC research targets ~8–16 KiB averages for maximum dedup ratio. We deliberately choose a coarser ~1 MiB average. The reason is index cardinality, which is the binding constraint at our scale:
1 PB logical data ÷ chunk size = number of chunk-index rows
@ 16 KiB avg → ~64,000,000,000 rows (≈ 6.4 TB index @ ~100 B/row) — untenable
@ 1 MiB avg → ~1,000,000,000 rows (≈ 100 GB index) — manageable
@ 4 MiB avg → ~250,000,000 rows — Dropbox-like
The chunk index is hot, sharded, and joined on every upload/download; making it 64× smaller is worth giving up some marginal dedup ratio. This is the central chunking tradeoff: finer chunks = better dedup ratio but a metadata index that becomes the system bottleneck. (Dropbox lands at fixed 4 MiB for the same reason; we use CDC at ~1 MiB to keep insert-robustness while bounding cardinality.)
Normalized chunking (NC level 2) keeps the size distribution tight around the average so we don’t get a long tail of tiny chunks (more rows) or huge chunks (worse dedup / bad range granularity).
Where chunking happens (two modes)
- Smart client (sync): chunks client-side, hashes, asks the server which chunk hashes are new, uploads only those → delta sync + bandwidth savings. (This is where the side-channel discipline of 03/ADR-0018 applies.)
- Simple client / REST / 3rd-party: uploads the whole file (resumable/multipart, 05); the server chunks asynchronously during ingest post-processing, or stores whole-file when chunking is disabled.
4. The manifest model (object = ordered chunks)
An Object (a version’s content) is represented by a Manifest: an ordered
list of {chunk_hash, offset, length}. The manifest is itself content-addressed
(manifest_hash), and the Object’s content_hash = BLAKE3 root over the whole
content, computed on ingest.
flowchart LR
classDef o fill:#fde68a,stroke:#b45309,color:#111827;
classDef c fill:#bbf7d0,stroke:#15803d,color:#111827;
file["file v3 bytes"]:::o --> cdc{{"FastCDC + BLAKE3"}}:::o
cdc --> c1["chunk a1b2…"]:::c
cdc --> c2["chunk c3d4…"]:::c
cdc --> c3["chunk e5f6…"]:::c
c1 & c2 & c3 --> man["Manifest<br/>[a1b2@0, c3d4@1MiB, e5f6@2MiB]<br/>content_hash = root"]:::o
- Ranges & partial reads: offset/length in the manifest let a download resolve exactly which chunks cover a requested byte range (06).
- Delta between versions: v3’s manifest shares most chunk hashes with v2’s; the diff is the set of new chunks. Versioning becomes nearly free (07).
- Huge files → manifest of manifests. A 5 TB file at 1 MiB chunks = ~5M chunk refs ≈ a large manifest. We cap manifest size and nest (a tree of manifests), so manifest size stays bounded and partial reads don’t load 5M refs. (This mirrors Git’s tree objects and IPFS’s DAG.)
Tradeoffs. The manifest indirection costs one extra metadata read on download (mitigated by caching, 06) and storage for manifests (kilobytes — negligible). Alternative (store the chunk list inline on the version row) is simpler but loses content-addressed manifest dedup (identical files → identical manifest → stored once) and bloats the hot namespace table.
5. Chunk → Pack (forward reference)
Individual ~1 MiB chunks are still too many small provider objects at the long tail.
A background Packer aggregates committed chunks into ~256 MiB–1 GiB pack
objects, and the Pack Index maps chunk_hash → (pack_id, offset, length).
Hot/recent and large chunks may stay standalone. Full rationale, the packing/GC
interaction, and the presigned-vs-packing reconciliation are in
11 Garbage Collection and
05 Uploads.
6. Scaling summary
- Index cardinality is the master constraint → coarse-ish CDC (~1 MiB) + packing.
- Hashing throughput on ingest → BLAKE3 parallel.
- Manifest size for huge files → nested manifests.
- Address stability across a hash migration (if BLAKE3 were ever deprecated): the chunk index stores the algorithm id per row; a migration re-hashes lazily on access + a background job, with both names coexisting during transition. Designed for, not built speculatively.
References
- FastCDC paper (boundaries, NC, sizes): https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf
- BLAKE3 (Merkle tree, verified streaming): https://github.com/BLAKE3-team/BLAKE3
- restic CDC + pack files: https://restic.readthedocs.io/en/stable/
- Dropbox 4 MiB blocks: https://dropbox.tech/infrastructure/inside-the-magic-pocket