03 — Deduplication

Topic: deduplication. Decision in ADR-0018. Builds on content addressing (02). Dedup is where storage savings, a security boundary, and refcount concurrency collide.


1. Two orthogonal choices

Deduplication has two independent axes; pick a point on each.

| Axis | Options | BitVault choice |
|---|---|---|
| Granularity | whole-file · fixed-block · content-defined chunk | chunk (CDC, 02) |
| Scope | per-user · per-tenant · global (cross-tenant) | per-tenant (ADR-0018) |

Granularity is settled by CAS chunking (02). This doc is mostly about scope, because that choice is a security decision, not just an efficiency one.


2. Scope: per-tenant, server-side (the security argument)

Decision: dedup within a tenant; never across tenants. Byte-identical chunks that belong to two tenants are stored twice, once per tenant.

Why not global dedup (the side channel)

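Cross-user/cross-tenant deduplication creates a side channel (Harnik, Pinkas & Shulman-Peleg, 2010): if the system tells an uploader “we already have that chunk, skip the upload,” the uploader learns the chunk exists somewhere else. The attacks described in that paper are: confirming that a specific known file is stored by someone; learning file contents by brute-forcing the low-entropy parts of a known template; and using dedup hits as a covert channel.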

Per-tenant scope eliminates this across the security boundary: a dedup hit only ever reveals existence to members of the same tenant, who already share access to that tenant’s data. Inside a tenant the side channel leaks nothing new; across tenants it cannot occur because dedup never crosses the boundary.
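A minimal sketch of what "dedup never crosses the boundary" can mean at the index level, assuming the tenant identifier is part of every lookup key. The class name `TenantScopedIndex` and its methods are hypothetical illustrations, not BitVault's actual index API.

```python
class TenantScopedIndex:
    """Hypothetical chunk index whose key always includes the tenant."""

    def __init__(self):
        # Set of (tenant_id, chunk_hash) pairs; there is no global view.
        self._known = set()

    def add(self, tenant_id, chunk_hash):
        self._known.add((tenant_id, chunk_hash))

    def missing(self, tenant_id, hashes):
        # Only this tenant's entries are consulted, so the answer cannot
        # depend on what any other tenant has stored.
        return [h for h in hashes if (tenant_id, h) not in self._known]


index = TenantScopedIndex()
index.add("tenant-a", "h1")
print(index.missing("tenant-a", ["h1", "h2"]))  # ['h2']
print(index.missing("tenant-b", ["h1", "h2"]))  # ['h1', 'h2']
```

Tenant B is told to upload `h1` even though tenant A already stores it, so the response carries no information about tenant A's data.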

Why server-side (not client-trusted)

Even within a tenant, the chunk index is the authority, and a new reference is only created after the bytes are confirmed present (uploaded now, or already committed in this tenant and verified). A client cannot acquire a reference to data by merely presenting a hash it does not possess — dedup collapses storage, it never grants access. Access is always via namespace ACLs (04 contexts), never via chunk possession. (This neutralizes the “hash-as-capability” attack that plagues naive client-side dedup; cf. proof-of-ownership schemes, Halevi et al.)
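The rule above can be sketched as follows, under stated assumptions: one store per tenant, BLAKE2b from the standard library standing in for the real chunk hash, and a plain set of user names standing in for namespace ACLs. All names (`TenantChunks`, `add_reference`, `read_chunk`) are hypothetical.

```python
import hashlib


def chunk_hash(data):
    # Stand-in hash for illustration only; the real system's hash may differ.
    return hashlib.blake2b(data, digest_size=32).hexdigest()


class TenantChunks:
    """Hypothetical per-tenant chunk store with reference counts."""

    def __init__(self):
        self.blobs = {}  # chunk_hash -> bytes
        self.refs = {}   # chunk_hash -> reference count

    def add_reference(self, claimed_hash, uploaded=None):
        if uploaded is not None:
            # Bytes were sent: recompute the hash server-side before trusting it.
            if chunk_hash(uploaded) != claimed_hash:
                raise ValueError("uploaded bytes do not match claimed hash")
            self.blobs.setdefault(claimed_hash, uploaded)
        elif claimed_hash not in self.blobs:
            # A bare hash is not enough unless this tenant already holds the bytes.
            raise LookupError("chunk not present in this tenant; upload required")
        self.refs[claimed_hash] = self.refs.get(claimed_hash, 0) + 1
        return self.refs[claimed_hash]


def read_chunk(store, acl, user, h):
    # Reads are gated by the ACL, never by knowledge of the hash.
    if user not in acl:
        raise PermissionError(user)
    return store.blobs[h]
```

Note that `add_reference` only changes accounting; nothing in it returns bytes to the caller, which is the sense in which dedup collapses storage without granting access.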

Tradeoffs. Per-tenant scope loses the cross-tenant savings that public multi-tenant clouds enjoy (e.g. one copy of a popular installer shared by everyone). We accept that: the savings are real, but the cross-tenant leak and the tenant-isolation violation (ADR-0007) are unacceptable defaults. Within a tenant, especially an enterprise tenant with many users sharing documents and versions, the dedup ratio is still large (shared attachments, re-saved docs, versions with small edits).

Alternatives considered.


3. Mechanism (the negotiate-then-upload protocol)

For the smart (sync) client:

  1. Client CDC-chunks the file, computes each chunk_hash.
  2. Client calls NegotiateChunks([hashes]) → server checks the tenant’s chunk index and returns the subset that are missing.
  3. Client uploads only missing chunks (direct-to-storage, presigned, 05).
  4. Client calls Commit(manifest) → server verifies presence + checksums of all chunks, then writes the manifest and increments references (SI-1).

The same exchange as a Mermaid sequence diagram:

sequenceDiagram
    autonumber
    participant C as Smart Client
    participant S as Storage Coordinator
    participant IX as Chunk Index (tenant-scoped)
    participant O as Object Store
    C->>C: CDC chunk + BLAKE3 hash
    C->>S: NegotiateChunks([h1..hn])
    S->>IX: which of [h1..hn] exist in THIS tenant?
    IX-->>S: missing = [h2,h5]
    S-->>C: upload only [h2,h5] (presigned PUT urls)
    C->>O: PUT h2, h5 (direct)
    C->>S: Commit(manifest)
    S->>O: Head/verify h2,h5 present + checksum
    S->>IX: ref++ for [h1..hn] (CAS state machine)
    S-->>C: committed

Existing chunks (h1,h3,h4,…) are not re-uploaded: that is the dedup + delta-sync win, scoped safely to the tenant.
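The four steps can be sketched end to end with an in-memory stand-in for one tenant's coordinator. This is an illustration under assumptions, not the real service: `Coordinator`, `put`, and `sync_file` are hypothetical names, BLAKE2b from the standard library stands in for the chunk hash, chunking is assumed to have already happened, and one reference per distinct chunk per manifest is assumed.

```python
import hashlib


def chunk_hash(data):
    # Stand-in hash for illustration only.
    return hashlib.blake2b(data, digest_size=32).hexdigest()


class Coordinator:
    """Hypothetical in-memory coordinator for a single tenant."""

    def __init__(self):
        self.object_store = {}  # chunk_hash -> bytes (uploaded, maybe uncommitted)
        self.index = {}         # chunk_hash -> reference count (committed only)
        self.manifests = []

    def negotiate_chunks(self, hashes):
        # Step 2: return the subset not yet committed in this tenant.
        return [h for h in hashes if h not in self.index]

    def put(self, h, data):
        # Step 3: stands in for a presigned PUT straight to the object store.
        self.object_store[h] = data

    def commit(self, manifest):
        # Step 4: verify every chunk before any reference is created.
        for h in manifest:
            data = self.object_store.get(h)
            if data is None or chunk_hash(data) != h:
                raise ValueError("chunk missing or corrupt: " + h)
        for h in set(manifest):
            self.index[h] = self.index.get(h, 0) + 1
        self.manifests.append(list(manifest))


def sync_file(coord, chunks):
    # Step 1: hash each (already chunked) piece, then negotiate, upload, commit.
    by_hash = {chunk_hash(c): c for c in chunks}
    manifest = [chunk_hash(c) for c in chunks]
    missing = coord.negotiate_chunks(list(by_hash))
    for h in missing:
        coord.put(h, by_hash[h])
    coord.commit(manifest)
    return len(missing)  # number of chunks actually uploaded


coord = Coordinator()
print(sync_file(coord, [b"aaa", b"bbb", b"ccc"]))  # 3
print(sync_file(coord, [b"aaa", b"BBB", b"ccc"]))  # 1
```

The second call models a new version with one edited chunk: only that chunk is uploaded, while the two unchanged chunks just gain a reference.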


4. Reference accounting

Dedup means a chunk can be referenced by many manifests. Reference accounting is the concurrency hotspot of the whole subsystem.
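To illustrate why this is a hotspot: each commit performs a read-modify-write on a shared counter, and two commits touching the same chunk must not interleave. The sketch below uses a single in-process lock purely for illustration; the actual mechanism is not described in this section, and the `RefCounts` class is a hypothetical stand-in.

```python
import threading


class RefCounts:
    """Hypothetical reference counter; one lock guards every update."""

    def __init__(self):
        self._lock = threading.Lock()
        self._refs = {}  # chunk_hash -> count

    def incref(self, h):
        with self._lock:
            self._refs[h] = self._refs.get(h, 0) + 1
            return self._refs[h]

    def decref(self, h):
        with self._lock:
            n = self._refs.get(h, 0)
            if n <= 0:
                raise KeyError(h)
            n -= 1
            if n == 0:
                # No manifest references the chunk any more.
                del self._refs[h]
            else:
                self._refs[h] = n
            return n

    def count(self, h):
        with self._lock:
            return self._refs.get(h, 0)


refs = RefCounts()
workers = [
    threading.Thread(target=lambda: [refs.incref("h1") for _ in range(1000)])
    for _ in range(8)
]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(refs.count("h1"))  # 8000
```

Without the lock, the get-then-set in `incref` could lose updates when threads interleave; an undercounted chunk could then be reclaimed while a manifest still points at it.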


5. Compression & encryption interplay (order matters)


6. Scaling concerns

References