ADR-0018 — Per-tenant, whole-object server-side deduplication
- Status: Accepted
- Date: 2026-06-11 · Revised: 2026-06-12 (Architecture Freeze V1)
- Related: ADR-0007, ADR-0014, ADR-0016, ADR-0019, 08 §2 I2, 02 product goals NG6
V1 Freeze (2026-06-12): Accepted, whole-object. Blocker-1 resolution: V1 deduplicates at whole-object granularity (a file’s content is one content-addressed blob), not at chunk level. Chunk/CDC dedup is deferred with ADR-0017. The per-tenant scope (the security decision) is unchanged and binding.
Context
Deduplication scope (per-user, per-tenant, or global cross-tenant) is a security decision, not just an efficiency one. Cross-user or cross-tenant dedup creates a side channel (Harnik, Pinkas & Shulman-Peleg, 2010) that leaks file existence via observable upload-skips (probing, low-entropy form-filling attacks), and it co-mingles bytes across the tenant isolation boundary (ADR-0007).
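The upload-skip side channel can be illustrated with a toy model. The sketch below is not the service's code: the class names, the `probe` helper, and the sample payloads are all hypothetical, and `hashlib.blake2b` is used only as a standard-library stand-in for a content hash. It shows how an attacker who can observe whether an upload was skipped can confirm a low-entropy guess under global dedup, and why the same probe learns nothing under per-tenant dedup.

```python
import hashlib


def content_hash(data: bytes) -> str:
    # Stand-in hash for illustration only (stdlib BLAKE2b).
    return hashlib.blake2b(data, digest_size=32).hexdigest()


class GlobalDedupStore:
    """Toy store that dedups across all tenants (the rejected design)."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def upload(self, tenant_id: str, data: bytes) -> bool:
        """Return True if the upload was skipped because the bytes exist."""
        digest = content_hash(data)
        if digest in self._blobs:
            return True
        self._blobs[digest] = data
        return False


class PerTenantDedupStore:
    """Toy store that dedups only within one tenant."""

    def __init__(self) -> None:
        self._blobs: dict[tuple[str, str], bytes] = {}

    def upload(self, tenant_id: str, data: bytes) -> bool:
        key = (tenant_id, content_hash(data))
        if key in self._blobs:
            return True
        self._blobs[key] = data
        return False


def probe(store, attacker_tenant: str, candidates: list[bytes]) -> list[bytes]:
    """Return the candidates whose upload was skipped (already present)."""
    return [c for c in candidates if store.upload(attacker_tenant, c)]


# Hypothetical low-entropy document: only the salary field varies.
victim_doc = b"salary: 52000"
guesses = [b"salary: 50000", b"salary: 52000", b"salary: 54000"]

global_store = GlobalDedupStore()
global_store.upload("victim", victim_doc)
leaked = probe(global_store, "attacker", guesses)

tenant_store = PerTenantDedupStore()
tenant_store.upload("victim", victim_doc)
not_leaked = probe(tenant_store, "attacker", guesses)
```

In this toy run, `leaked` contains the victim's exact document, while `not_leaked` is empty because the attacker's uploads are only ever compared against the attacker's own tenant.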
Independently, granularity (whole-object vs chunk) is an efficiency/complexity decision. The pre-freeze review (review §3.1) found the data model described whole-object blobs while ADRs 0017/0018/0019/0020 described a chunk/pack/manifest store — two contradictory architectures, both “Accepted.” V1 picks one.
Decision
- Granularity (V1): whole-object. A file’s content is a single content-addressed blob, identified by the BLAKE3 hash of its bytes (ADR-0016). Identical bytes uploaded again within the same tenant create a new reference (refcount++) instead of storing twice. No content-defined chunking in V1 (deferred, ADR-0017), so a changed file is a new blob (whole re-upload), and dedup hits only on byte-identical files/versions.
- Scope (binding): within a tenant only, never across tenants. Dedup is server-side authoritative: a reference is created only after the bytes are confirmed present in this tenant, and dedup never grants access (access is via namespace ACLs, not blob possession). Global cross-tenant dedup is a non-goal (NG6).
- Identity is per-tenant. The blob is keyed by (tenant_id, content_hash) (08); object keys are tenant-prefixed ({tenant_id}/{content_hash}). Two tenants holding identical bytes are two independent blobs.
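The decision above can be sketched as a small in-memory model. This is an illustration under stated assumptions, not the service's implementation: `BlobIndex`, `commit`, and the two dictionaries are hypothetical names, and `hashlib.blake2b` stands in for BLAKE3 (ADR-0016), which is not in the Python standard library. The server hashes the bytes it actually received, looks up (tenant_id, content_hash), and either bumps the refcount or stores a new object under the tenant-prefixed key.

```python
import hashlib
from dataclasses import dataclass, field


def content_hash(data: bytes) -> str:
    # Stand-in for BLAKE3 (ADR-0016); BLAKE2b is used only because it
    # ships with the standard library.
    return hashlib.blake2b(data, digest_size=32).hexdigest()


@dataclass
class BlobIndex:
    """Hypothetical in-memory model of the per-tenant blob table."""

    # (tenant_id, content_hash) -> reference count
    refcounts: dict = field(default_factory=dict)
    # "{tenant_id}/{content_hash}" -> stored bytes
    objects: dict = field(default_factory=dict)

    def commit(self, tenant_id: str, data: bytes) -> tuple[str, bool]:
        """Commit an upload; return (object_key, dedup_hit)."""
        digest = content_hash(data)  # computed server-side from the bytes
        identity = (tenant_id, digest)
        object_key = f"{tenant_id}/{digest}"
        if identity in self.refcounts:
            # Same bytes already present in THIS tenant: add a reference.
            self.refcounts[identity] += 1
            return object_key, True
        # First time this tenant stores these bytes: write a new blob.
        self.objects[object_key] = data
        self.refcounts[identity] = 1
        return object_key, False


index = BlobIndex()
key_a1, hit_a1 = index.commit("tenant-a", b"quarterly report")
key_a2, hit_a2 = index.commit("tenant-a", b"quarterly report")
key_b1, hit_b1 = index.commit("tenant-b", b"quarterly report")
key_a3, hit_a3 = index.commit("tenant-a", b"quarterly report!")
```

Here the second upload in tenant-a is a dedup hit on the same key, tenant-b's identical bytes become an independent blob, and the one-byte change in tenant-a produces a new blob rather than a delta.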
Consequences
Positive
- The cross-tenant side channel is eliminated by construction: a dedup hit only reveals existence to members of the same tenant, who already share access.
- Honors tenant isolation (ADR-0007) at the storage layer; clean tenant deletion (crypto-shred the tenant DEK, ADR-0014 — no shared bytes to untangle).
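- Neutralizes the “hash-as-capability” attack: presenting a known content hash never yields the bytes, because a reference is created only after the bytes are confirmed present in the tenant, and access is governed by namespace ACLs, not blob possession.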
- Whole-object dedup is trivial to implement and reason about (a lookup on (tenant_id, content_hash) at commit), which is right for a solo V1.
Negative / costs
- Whole-object granularity yields no cross-version delta savings: a 1-byte edit to a large file stores a whole new blob. Accepted for V1; recovered when ADR-0017 (CDC chunking) is un-deferred.
- Forfeits cross-tenant savings (one global copy of a popular file) — accepted; the leak and isolation violation are unacceptable defaults.
- Residual existence oracle (documented): the plaintext content_hash is stored in a shared table, so a reader of Postgres (e.g. a rogue DBA) could correlate identical files across tenants even though the bytes are per-tenant-encrypted (ADR-0014). Closing this fully (per-tenant HMAC-keyed identity) is a Deferred hardening; accepted as residual risk for V1.
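The residual oracle, and the deferred hardening that would close it, can be compared in a few lines. This is a sketch only: the ADR does not specify the keyed construction, so applying HMAC-SHA256 directly to the content, the function names, and the example keys are all assumptions made for illustration. The point is that a plain content hash is identical across tenants, whereas a per-tenant keyed identity is stable within a tenant (so dedup still works) but uncorrelated between tenants.

```python
import hashlib
import hmac


def plain_identity(data: bytes) -> str:
    # V1 behaviour in spirit: an unkeyed content hash (stand-in for BLAKE3).
    return hashlib.blake2b(data, digest_size=32).hexdigest()


def keyed_identity(tenant_key: bytes, data: bytes) -> str:
    # Hypothetical hardening: identity derived with a per-tenant secret key.
    return hmac.new(tenant_key, data, hashlib.sha256).hexdigest()


# Hypothetical per-tenant secrets; real keys would come from key management.
key_a = b"tenant-a-secret-key"
key_b = b"tenant-b-secret-key"
document = b"identical bytes held by two tenants"

# Plain hashes match across tenants, so a table reader can correlate them.
correlatable = plain_identity(document) == plain_identity(document)

# Keyed identities differ across tenants but repeat within one tenant.
cross_tenant_match = keyed_identity(key_a, document) == keyed_identity(key_b, document)
same_tenant_match = keyed_identity(key_a, document) == keyed_identity(key_a, document)
```

With keyed identities, someone reading the shared table without the tenant keys can no longer tell that two tenants hold the same file, at the cost of managing one more secret per tenant.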
Alternatives considered
- Chunk-level dedup (CDC, ADR-0017): better savings and real sub-file delta, but adds a chunk index, packing, and a far more complex GC (the pre-freeze contradiction). Deferred to a post-V1 forcing function (large-file efficiency).
- Global dedup + random-threshold mitigation (Harnik RT): reduces but doesn’t eliminate leakage; complicates uploads. Possible future opt-in for an explicit “public shared library,” never a default.
- Convergent encryption for encrypted cross-tenant dedup: subject to confirmation-of-file attacks and conflicts with the encryption model (ADR-0014). Out of scope.
Scaling
Per-tenant scope aligns dedup with the tenant-sharded metadata model, so dedup lookups and refcounts shard naturally and never become a global hotspot.
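As a rough illustration of why the lookup shards naturally, the sketch below routes every dedup lookup by tenant_id alone. The sharding scheme itself is an assumption: this ADR does not say how tenants map to shards, and `shard_for_tenant`, `dedup_lookup_shard`, and the shard count are hypothetical.

```python
import hashlib


def shard_for_tenant(tenant_id: str, num_shards: int) -> int:
    """Map a tenant to a shard with a stable hash (hypothetical scheme)."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def dedup_lookup_shard(tenant_id: str, content_hash: str, num_shards: int) -> int:
    """Shard that holds the (tenant_id, content_hash) row.

    content_hash is deliberately not part of the routing decision: all of a
    tenant's blobs and refcounts live on that tenant's shard.
    """
    return shard_for_tenant(tenant_id, num_shards)


NUM_SHARDS = 16  # assumed value for the example
shard_1 = dedup_lookup_shard("tenant-a", "hash-of-file-1", NUM_SHARDS)
shard_2 = dedup_lookup_shard("tenant-a", "hash-of-file-2", NUM_SHARDS)
```

Both lookups for tenant-a land on the same shard, so a dedup check never needs a cross-shard or global index, which a cross-tenant design keyed on content_hash alone would require.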
References
- Harnik et al., Side Channels in Cloud Storage Deduplication: https://eprint.iacr.org/2016/977.pdf