02 — Content-Addressed Storage & Chunking

Topics: content-addressed storage, chunked uploads (the data model side). Decisions crystallized in ADR-0016 (hashing) and ADR-0017 (chunking).

Content-addressed storage (CAS) is the foundation everything else stands on. Get the hash and the chunk boundaries right and dedup, integrity, caching, resumability, and safe concurrency all fall out almost for free. Get them wrong and you pay in metadata explosion forever.


1. Why content addressing (the properties we are buying)

Naming a unit by hash(content) gives, for free:

Tradeoff: CAS forfeits locality (content order ≠ storage order) and makes deletion indirect (you can’t delete content someone else references → 11 GC). Both are worth it; both are designed for explicitly.


2. Hash function: BLAKE3 (see ADR-0016)

Decision: address chunks and objects with BLAKE3-256; offer SHA-256 as a per-deployment compliance mode.

  BLAKE3 SHA-256 (alt)
Throughput (1 core) ~8 GB/s (AVX2) ~3 GB/s (SHA-NI)
Multicore linear (Merkle tree) — ~90 GB/s/16 cores sequential, fixed
Verified streaming of ranges yes (internal Merkle tree) no
Incremental update yes no
Cryptanalysis maturity ~5 yrs ~23 yrs
FIPS / regulatory not FIPS FIPS 180-4

Why it matters here specifically: BLAKE3 is internally a Merkle tree over 1 KiB leaves, so we can verify an arbitrary byte range of a chunk without reading the whole chunk (04) and hash huge files in parallel on ingest. That property is uniquely valuable for a storage system doing range reads and background scrubbing.

Tradeoffs / Alternatives. BLAKE3’s shorter public-cryptanalysis history is the real cost; SHA-256’s maturity and FIPS status are why we keep it selectable for regulated deployments. We reject MD5/SHA-1 (collision-broken → a dedup system using a broken hash is a data-corruption vector, not just a security one: an attacker who finds a collision can make chunk A resolve to chunk B’s bytes). We reject non-cryptographic hashes (xxHash) for addressing — collision resistance is load-bearing for dedup correctness — though we do use them as cheap pre-filters.

Scaling. Hashing is on the ingest path for every byte; at PB scale the 2–3× throughput edge of BLAKE3 is real CPU/cost savings, and parallel hashing keeps large-file ingest from being single-core-bound.


3. Chunking strategy: content-defined chunking (see ADR-0017)

A file’s bytes must be split into chunks. Three strategies, escalating in power and cost:

Strategy Boundary rule Dedup quality Cost
Whole-file none (file = 1 chunk) only identical files dedup cheapest; trivial
Fixed-size every N bytes breaks on insert/delete shift (one byte inserted → every later boundary moves → zero dedup) cheap
Content-defined (CDC) boundary where rolling hash hits a pattern robust to insert/delete; near-optimal rolling hash over every byte

Decision: FastCDC (gear-based rolling hash, normalized chunking level 2), parameters min 256 KiB, average ~1 MiB, max 4 MiB.

The boundary-shift problem (why not fixed-size)

Fixed-size chunking fails the most common real edit: prepend/insert a byte and every subsequent fixed boundary shifts, so no later chunk matches the prior version — dedup and delta-sync collapse to zero. CDC places boundaries based on content, so an insert only disturbs the one or two chunks around it; the rest still match. This is why CDC is the standard for backup/sync systems (restic, borg, Tarsnap).

Why ~1 MiB average and not 16 KiB (the cardinality argument)

FastCDC research targets ~8–16 KiB averages for maximum dedup ratio. We deliberately choose a coarser ~1 MiB average. The reason is index cardinality, which is the binding constraint at our scale:

1 PB logical data ÷ chunk size = number of chunk-index rows
   @ 16 KiB avg → ~64,000,000,000 rows   (≈ 6.4 TB index @ ~100 B/row) — untenable
   @  1 MiB avg →    ~1,000,000,000 rows   (≈ 100 GB index)            — manageable
   @  4 MiB avg →      ~250,000,000 rows                               — Dropbox-like

The chunk index is hot, sharded, and joined on every upload/download; making it 64× smaller is worth giving up some marginal dedup ratio. This is the central chunking tradeoff: finer chunks = better dedup ratio but a metadata index that becomes the system bottleneck. (Dropbox lands at fixed 4 MiB for the same reason; we use CDC at ~1 MiB to keep insert-robustness while bounding cardinality.)

Normalized chunking (NC level 2) keeps the size distribution tight around the average so we don’t get a long tail of tiny chunks (more rows) or huge chunks (worse dedup / bad range granularity).

Where chunking happens (two modes)


4. The manifest model (object = ordered chunks)

An Object (a version’s content) is represented by a Manifest: an ordered list of {chunk_hash, offset, length}. The manifest is itself content-addressed (manifest_hash), and the Object’s content_hash = BLAKE3 root over the whole content, computed on ingest.

flowchart LR
    classDef o fill:#fde68a,stroke:#b45309,color:#111827;
    classDef c fill:#bbf7d0,stroke:#15803d,color:#111827;
    file["file v3 bytes"]:::o --> cdc{{"FastCDC + BLAKE3"}}:::o
    cdc --> c1["chunk a1b2…"]:::c
    cdc --> c2["chunk c3d4…"]:::c
    cdc --> c3["chunk e5f6…"]:::c
    c1 & c2 & c3 --> man["Manifest<br/>[a1b2@0, c3d4@1MiB, e5f6@2MiB]<br/>content_hash = root"]:::o

Tradeoffs. The manifest indirection costs one extra metadata read on download (mitigated by caching, 06) and storage for manifests (kilobytes — negligible). Alternative (store the chunk list inline on the version row) is simpler but loses content-addressed manifest dedup (identical files → identical manifest → stored once) and bloats the hot namespace table.


5. Chunk → Pack (forward reference)

Individual ~1 MiB chunks are still too many small provider objects at the long tail. A background Packer aggregates committed chunks into ~256 MiB–1 GiB pack objects, and the Pack Index maps chunk_hash → (pack_id, offset, length). Hot/recent and large chunks may stay standalone. Full rationale, the packing/GC interaction, and the presigned-vs-packing reconciliation are in 11 Garbage Collection and 05 Uploads.


6. Scaling summary

References