03 — Deduplication
Topic: deduplication. Decision in ADR-0018. Builds on content addressing (02). Dedup is where storage savings, a security boundary, and refcount concurrency collide.
1. Two orthogonal choices
Deduplication has two independent axes; pick a point on each.
| Axis | Options | BitVault choice |
|---|---|---|
| Granularity | whole-file · fixed-block · content-defined chunk | chunk (CDC, 02) |
| Scope | per-user · per-tenant · global (cross-tenant) | per-tenant (ADR-0018) |
Granularity is settled by CAS chunking (02). This doc is mostly about scope, because that choice is a security decision, not just an efficiency one.
2. Scope: per-tenant, server-side (the security argument)
Decision: dedup within a tenant; never across tenants. Globally identical chunks belonging to two tenants are stored twice.
Why not global dedup (the side channel)
Cross-user/cross-tenant deduplication creates a side channel (Harnik, Pinkas & Shulman-Peleg, 2010): if the system tells an uploader “we already have that chunk, skip the upload,” the uploader learns the chunk exists somewhere else. Attacks:
- Existence/confirmation: confirm a specific file is stored by someone by trying to upload it and observing whether the upload is skipped.
- Form-filling / low-entropy guessing: recover a secret field (salary on a tax form) by enumerating low-entropy variants and seeing which dedups.
Per-tenant scope eliminates this across the security boundary: a dedup hit only ever reveals existence to members of the same tenant, who already share access to that tenant’s data. Inside a tenant the side channel leaks nothing new; across tenants it cannot occur because dedup never crosses the boundary.
Why server-side (not client-trusted)
Even within a tenant, the chunk index is the authority, and a new reference is only created after the bytes are confirmed present (uploaded now, or already committed in this tenant and verified). A client cannot acquire a reference to data by merely presenting a hash it does not possess — dedup collapses storage, it never grants access. Access is always via namespace ACLs (04 contexts), never via chunk possession. (This neutralizes the “hash-as-capability” attack that plagues naive client-side dedup; cf. proof-of-ownership schemes, Halevi et al.)
Tradeoffs. Per-tenant scope loses the cross-tenant savings that public multi-tenant clouds enjoy (e.g. one copy of a popular installer shared by everyone). We accept that: the savings are real but the cross-tenant leak and the tenant- isolation violation (ADR-0007) are unacceptable defaults. Within a tenant — especially enterprise tenants with many users sharing documents and versions — the dedup ratio is still large (shared attachments, re-saved docs, versions with small edits).
Alternatives considered.
- Global dedup with the random-threshold mitigation (Harnik RT): randomizes when the dedup-skip is exposed; reduces but does not eliminate leakage, and complicates the upload path. Rejected as default; could be a future opt-in for a “public shared library” feature with eyes open.
- Global dedup with server-side-only dedup (client always uploads, server dedups): removes the client-observable channel but forfeits the bandwidth savings (the main reason to want global dedup) and still co-mingles tenant bytes. Rejected.
- Convergent encryption for cross-tenant dedup of encrypted data: enables dedup on encrypted chunks but is subject to confirmation-of-file and learn-the-remaining-information attacks, and conflicts with our deferral of E2E (ADR-0014). Out of scope.
3. Mechanism (the negotiate-then-upload protocol)
For the smart (sync) client:
- Client CDC-chunks the file, computes each
chunk_hash. - Client calls
NegotiateChunks([hashes])→ server checks the tenant’s chunk index and returns the subset that are missing. - Client uploads only missing chunks (direct-to-storage, presigned, 05).
- Client
Commit(manifest)→ server verifies presence + checksums of all chunks, then writes the manifest and increments references (SI-1).
sequenceDiagram
autonumber
participant C as Smart Client
participant S as Storage Coordinator
participant IX as Chunk Index (tenant-scoped)
participant O as Object Store
C->>C: CDC chunk + BLAKE3 hash
C->>S: NegotiateChunks([h1..hn])
S->>IX: which of [h1..hn] exist in THIS tenant?
IX-->>S: missing = [h2,h5]
S-->>C: upload only [h2,h5] (presigned PUT urls)
C->>O: PUT h2, h5 (direct)
C->>S: Commit(manifest)
S->>O: Head/verify h2,h5 present + checksum
S->>IX: ref++ for [h1..hn] (CAS state machine)
S-->>C: committed
Existing chunks (h1,h3,h4,…) are not re-uploaded — that is the dedup + delta-
sync win, scoped safely to the tenant.
4. Refcounts, contention & the “popular chunk” problem
Dedup means a chunk can be referenced by many manifests. Reference accounting is the concurrency hotspot of the whole subsystem.
- Hot counter contention: a popular chunk (e.g. a shared logo) gets many
concurrent
ref++/ref--. A single counter row is a write hotspot.- Mitigation: sharded/striped counters (N sub-counters per chunk summed on
read) or an append-only reference-edge table (
(chunk_hash, manifest_hash)rows) that is counted, not mutated — turning contention into inserts. The edge table doubles as the GC mark set (11).
- Mitigation: sharded/striped counters (N sub-counters per chunk summed on
read) or an append-only reference-edge table (
- Refcount is a hint, not GC truth. Distributed refcounts drift under crashes and races. We treat the count as an optimization and let mark-and-sweep be the authority for deletion (11, ADR-0019). This is the key robustness decision: never delete bytes on the strength of a refcount alone.
5. Compression & encryption interplay (order matters)
- Hash on plaintext, then compress, then encrypt at rest. Dedup keys off the
plaintext
chunk_hash(so logically-identical content dedups even if compressed differently); the stored chunk is compressed (zstd) and encrypted with server- managed keys (ADR-0014). The chunk index recordscompressionandenc_key_id. - Why this order: compressing/encrypting before hashing would defeat dedup (nondeterministic ciphertext/streams → different hashes for identical content). Server-side encryption after dedup preserves dedup while keeping bytes encrypted at rest — the deliberate trade vs client-side E2E (which would forbid server-side dedup; see ADR-0014/NG3).
6. Scaling concerns
- Index lookup per chunk on every upload → the chunk index must be low-latency and sharded by tenant (08); negotiate calls are batched (one round trip for N hashes).
- Bloom/cuckoo pre-filter in front of the index to answer “definitely missing” cheaply and cut index reads for the common new-chunk case.
- Refcount writes scale via edge-table inserts + striping, not hot counters.
- Dedup ratio monitoring per tenant (a metric, not a guess) to validate the ~1 MiB chunk choice and detect pathological low-ratio tenants (e.g. all-unique encrypted blobs) where chunking is wasted CPU → such tenants can be switched to whole-file mode.
References
- Harnik et al., side channels in dedup: https://eprint.iacr.org/2016/977.pdf
- Halevi et al., Proofs of Ownership: https://eprint.iacr.org/2011/207.pdf
- restic/borg per-repo (≈per-tenant) dedup model: https://restic.readthedocs.io/en/stable/