03 — Deduplication

Topic: deduplication. Decision in ADR-0018. Builds on content addressing (02). Dedup is where storage savings, a security boundary, and refcount concurrency collide.

1. Two orthogonal choices

Deduplication has two independent axes; pick a point on each.

Axis	Options	BitVault choice
Granularity	whole-file · fixed-block · content-defined chunk	chunk (CDC, 02)
Scope	per-user · per-tenant · global (cross-tenant)	per-tenant (ADR-0018)

Granularity is settled by CAS chunking (02). This doc is mostly about scope, because that choice is a security decision, not just an efficiency one.

2. Scope: per-tenant, server-side (the security argument)

Decision: dedup within a tenant; never across tenants. Globally identical chunks belonging to two tenants are stored twice.

Why not global dedup (the side channel)

Cross-user/cross-tenant deduplication creates a side channel (Harnik, Pinkas & Shulman-Peleg, 2010): if the system tells an uploader “we already have that chunk, skip the upload,” the uploader learns the chunk exists somewhere else. Attacks:

Existence/confirmation: confirm a specific file is stored by someone by trying to upload it and observing whether the upload is skipped.
Form-filling / low-entropy guessing: recover a secret field (salary on a tax form) by enumerating low-entropy variants and seeing which dedups.

Per-tenant scope eliminates this across the security boundary: a dedup hit only ever reveals existence to members of the same tenant, who already share access to that tenant’s data. Inside a tenant the side channel leaks nothing new; across tenants it cannot occur because dedup never crosses the boundary.

Why server-side (not client-trusted)

Even within a tenant, the chunk index is the authority, and a new reference is only created after the bytes are confirmed present (uploaded now, or already committed in this tenant and verified). A client cannot acquire a reference to data by merely presenting a hash it does not possess — dedup collapses storage, it never grants access. Access is always via namespace ACLs (04 contexts), never via chunk possession. (This neutralizes the “hash-as-capability” attack that plagues naive client-side dedup; cf. proof-of-ownership schemes, Halevi et al.)

Tradeoffs. Per-tenant scope loses the cross-tenant savings that public multi-tenant clouds enjoy (e.g. one copy of a popular installer shared by everyone). We accept that: the savings are real but the cross-tenant leak and the tenant- isolation violation (ADR-0007) are unacceptable defaults. Within a tenant — especially enterprise tenants with many users sharing documents and versions — the dedup ratio is still large (shared attachments, re-saved docs, versions with small edits).

Alternatives considered.

Global dedup with the random-threshold mitigation (Harnik RT): randomizes when the dedup-skip is exposed; reduces but does not eliminate leakage, and complicates the upload path. Rejected as default; could be a future opt-in for a “public shared library” feature with eyes open.
Global dedup with server-side-only dedup (client always uploads, server dedups): removes the client-observable channel but forfeits the bandwidth savings (the main reason to want global dedup) and still co-mingles tenant bytes. Rejected.
Convergent encryption for cross-tenant dedup of encrypted data: enables dedup on encrypted chunks but is subject to confirmation-of-file and learn-the-remaining-information attacks, and conflicts with our deferral of E2E (ADR-0014). Out of scope.

3. Mechanism (the negotiate-then-upload protocol)

For the smart (sync) client:

Client CDC-chunks the file, computes each chunk_hash.
Client calls NegotiateChunks([hashes]) → server checks the tenant’s chunk index and returns the subset that are missing.
Client uploads only missing chunks (direct-to-storage, presigned, 05).
Client Commit(manifest) → server verifies presence + checksums of all chunks, then writes the manifest and increments references (SI-1).

sequenceDiagram
    autonumber
    participant C as Smart Client
    participant S as Storage Coordinator
    participant IX as Chunk Index (tenant-scoped)
    participant O as Object Store
    C->>C: CDC chunk + BLAKE3 hash
    C->>S: NegotiateChunks([h1..hn])
    S->>IX: which of [h1..hn] exist in THIS tenant?
    IX-->>S: missing = [h2,h5]
    S-->>C: upload only [h2,h5] (presigned PUT urls)
    C->>O: PUT h2, h5 (direct)
    C->>S: Commit(manifest)
    S->>O: Head/verify h2,h5 present + checksum
    S->>IX: ref++ for [h1..hn] (CAS state machine)
    S-->>C: committed

Existing chunks (h1,h3,h4,…) are not re-uploaded — that is the dedup + delta- sync win, scoped safely to the tenant.

4. Refcounts, contention & the “popular chunk” problem

Dedup means a chunk can be referenced by many manifests. Reference accounting is the concurrency hotspot of the whole subsystem.

Hot counter contention: a popular chunk (e.g. a shared logo) gets many concurrent ref++/ref--. A single counter row is a write hotspot.
- Mitigation: sharded/striped counters (N sub-counters per chunk summed on read) or an append-only reference-edge table ((chunk_hash, manifest_hash) rows) that is counted, not mutated — turning contention into inserts. The edge table doubles as the GC mark set (11).
Refcount is a hint, not GC truth. Distributed refcounts drift under crashes and races. We treat the count as an optimization and let mark-and-sweep be the authority for deletion (11, ADR-0019). This is the key robustness decision: never delete bytes on the strength of a refcount alone.

5. Compression & encryption interplay (order matters)

Hash on plaintext, then compress, then encrypt at rest. Dedup keys off the plaintext chunk_hash (so logically-identical content dedups even if compressed differently); the stored chunk is compressed (zstd) and encrypted with server- managed keys (ADR-0014). The chunk index records compression and enc_key_id.
Why this order: compressing/encrypting before hashing would defeat dedup (nondeterministic ciphertext/streams → different hashes for identical content). Server-side encryption after dedup preserves dedup while keeping bytes encrypted at rest — the deliberate trade vs client-side E2E (which would forbid server-side dedup; see ADR-0014/NG3).

6. Scaling concerns

Index lookup per chunk on every upload → the chunk index must be low-latency and sharded by tenant (08); negotiate calls are batched (one round trip for N hashes).
Bloom/cuckoo pre-filter in front of the index to answer “definitely missing” cheaply and cut index reads for the common new-chunk case.
Refcount writes scale via edge-table inserts + striping, not hot counters.
Dedup ratio monitoring per tenant (a metric, not a guess) to validate the ~1 MiB chunk choice and detect pathological low-ratio tenants (e.g. all-unique encrypted blobs) where chunking is wasted CPU → such tenants can be switched to whole-file mode.

References

Harnik et al., side channels in dedup: https://eprint.iacr.org/2016/977.pdf
Halevi et al., Proofs of Ownership: https://eprint.iacr.org/2011/207.pdf
restic/borg per-repo (≈per-tenant) dedup model: https://restic.readthedocs.io/en/stable/