BitVault Storage Subsystem — Design

Audience: distributed-systems / storage engineers. Scope: the deep design of BitVault’s storage subsystem — the part that turns bytes into durable, deduplicated, content-addressed, verifiable, tiered storage across many cloud providers, at millions of files / billions of chunks / PB scale.

This set refines and elaborates the system-level decisions already recorded: ADR-0004 Postgres as truth, ADR-0005 storage abstraction, ADR-0011 presigned data plane, 04 bounded contexts, 05 service boundaries, 08 data model. Where this set deepens the high-level BLOB abstraction into Object → Chunk → Pack, that refinement is called out explicitly.

No implementation code — architecture and documentation only. Every decision states Tradeoffs / Alternatives / Scaling; the discrete contested choices are crystallized as ADRs 0016–0021.


0. Reading order

| # | Doc | Topic(s) from the brief |
|---|-----|-------------------------|
| | this README | data model, service boundaries, data ownership, tenets |
| 01 | Object Storage Abstraction | object storage abstraction |
| 02 | Content-Addressed Storage & Chunking | content-addressed storage, chunked model |
| 03 | Deduplication | deduplication |
| 04 | Integrity & Checksums | checksums, data integrity verification |
| 05 | Uploads: Multipart / Chunked / Resumable | multipart, chunked, resumable uploads |
| 06 | Downloads & Reconstruction | download flows |
| 07 | Versioning | file versioning |
| 08 | Metadata Architecture | metadata architecture |
| 09 | Federation & Placement | storage federation |
| 10 | Tiering & Lifecycle | tiered storage, lifecycle policies |
| 11 | Garbage Collection | garbage collection |

1. Design tenets (the rules everything else obeys)

  1. Metadata is truth; bytes are replaceable. Postgres records what exists and where; object storage holds opaque, immutable, content-addressed bytes. A byte object with no metadata reference does not exist and is reclaimable (ADR-0004).
  2. Immutability by content address. Stored units are named by the hash of their content. Identical content has one name, is written once, and never mutated — only referenced, dereferenced, or copied. This is what makes dedup, caching, integrity, and safe concurrency tractable.
  3. The data plane never flows through the control plane. User bytes move directly between client and object storage via scoped presigned URLs (ADR-0011). Our compute touches bytes only for asynchronous maintenance (packing, scrubbing, tier migration), never on the user’s hot path.
  4. Separate the dedup unit from the transfer unit from the storage unit. These are three different sizes for three different reasons (see §3). Conflating them is the most common storage-design mistake.
  5. Everything derivable is rebuildable. Indexes, caches, and placement maps are reconstructible from manifests + object listings. Only the chunk bytes and the manifests are irreplaceable.
  6. Tenant isolation is a storage boundary, not just a DB boundary. Dedup, placement, and keys are tenant-scoped by default (ADR-0007, ADR-0018) — closing the cross-tenant dedup side channel (Harnik 2010).
  7. Design for the long tail. Millions of files means billions of small chunks. Per-object overhead, request cost, and index cardinality — not raw capacity — are the binding constraints. Pack small things into big things.
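
To make tenets 2 and 6 concrete, here is a minimal illustrative sketch (not implementation code) of a write-once, tenant-scoped content-addressed store. The class name `TenantScopedCAS`, its methods, and the in-memory dict are hypothetical, and SHA-256 stands in for BLAKE3 only because it is available in the Python standard library.

```python
import hashlib


class TenantScopedCAS:
    """In-memory stand-in for a content-addressed chunk store (hypothetical)."""

    def __init__(self) -> None:
        # (tenant_id, chunk_hash) -> immutable bytes
        self._objects: dict[tuple[str, str], bytes] = {}

    @staticmethod
    def address(data: bytes) -> str:
        # Stand-in hash: the design specifies BLAKE3; SHA-256 is used here
        # only because it ships with the standard library.
        return hashlib.sha256(data).hexdigest()

    def put(self, tenant_id: str, data: bytes) -> tuple[str, bool]:
        """Store data under its content hash; return (hash, newly_written)."""
        key = (tenant_id, self.address(data))
        if key in self._objects:
            return key[1], False  # identical content: reference it, never rewrite
        self._objects[key] = data
        return key[1], True

    def get(self, tenant_id: str, chunk_hash: str) -> bytes:
        data = self._objects[(tenant_id, chunk_hash)]
        # Integrity check falls out of the naming scheme: re-hash and compare.
        if self.address(data) != chunk_hash:
            raise ValueError("content does not match its address")
        return data
```

In this sketch, a second `put` of the same bytes by the same tenant is a no-op, while another tenant writing the same bytes gets its own copy, so the existence of a chunk never leaks across tenants.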

2. The layered data model (terminology — used everywhere)

A file’s bytes pass through three layers (namespace, logical, physical) made up of six named entities. Get this vocabulary straight or the rest of the docs are ambiguous.

```mermaid
flowchart TB
    classDef ns fill:#dbeafe,stroke:#1e40af,color:#111827;
    classDef log fill:#fde68a,stroke:#b45309,color:#111827;
    classDef phys fill:#bbf7d0,stroke:#15803d,color:#111827;

    subgraph NS["Namespace context (File & Metadata), see 08-data-model"]
      node["Node (file/folder)"]:::ns
      ver["Version"]:::ns
    end

    subgraph LOGICAL["Storage logical model (content-addressed)"]
      obj["Object = content of one version<br/>id = content_hash (BLAKE3 root)"]:::log
      man["Manifest = ordered list of chunk refs<br/>{chunk_hash, offset, length}"]:::log
      chunk["Chunk = unit of dedup & CAS<br/>id = chunk_hash (BLAKE3), variable size (CDC)"]:::log
    end

    subgraph PHYSICAL["Storage physical model (provider objects)"]
      pack["Pack = large container object<br/>(~256 MiB–1 GiB) aggregating many chunks"]:::phys
      pobj["Provider object<br/>(MinIO/S3/R2/GCS/Azure key)"]:::phys
    end

    node --> ver --> obj --> man --> chunk
    chunk -->|"packed (long tail)"| pack --> pobj
    chunk -->|"standalone (hot/large)"| pobj
```


| Layer | Name | Identity | Size | Why it exists | Owner context |
|-------|------|----------|------|---------------|---------------|
| Namespace | Node / Version | UUID | | the user-facing file & its history | File & Metadata (08) |
| Logical | Object | content_hash | = file size | one version’s content; the dedup root for whole-file dup | Storage |
| Logical | Manifest | manifest_hash | KB | maps an object to its ordered chunks; enables ranges & delta | Storage |
| Logical | Chunk | chunk_hash | ~256 KiB–4 MiB (CDC, avg ~1 MiB) | the dedup & integrity unit | Storage |
| Physical | Pack | pack_id | ~256 MiB–1 GiB | amortize per-object cost over many chunks | Storage |
| Physical | Provider object | provider key | chunk or pack | the actual bytes in a bucket | Storage (via adapter) |

Refinement note: 08-data-model modeled a single content-addressed BLOB. The storage subsystem refines BLOB into Object (manifest) + Chunks + Packs. A whole-file-only deployment (no chunking) is the degenerate case where each Object = one Chunk = one provider object — supported as a config profile for small self-host installs.
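
As an illustration of why the manifest "enables ranges", the sketch below maps a byte range of an object onto the chunk slices that cover it. It is illustrative only, not implementation code: the `ChunkRef` fields mirror the `{chunk_hash, offset, length}` entries in the diagram, while `Manifest`, `resolve_range`, and the assumption that chunk refs are contiguous and ordered by offset are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRef:
    chunk_hash: str  # id of the chunk (BLAKE3 in the design)
    offset: int      # byte offset of this chunk within the object
    length: int      # chunk length in bytes


@dataclass(frozen=True)
class Manifest:
    content_hash: str
    chunks: tuple[ChunkRef, ...]  # assumed ordered by offset and contiguous

    @property
    def size(self) -> int:
        return sum(c.length for c in self.chunks)


def resolve_range(manifest: Manifest, start: int, end: int) -> list[tuple[str, int, int]]:
    """Map object byte range [start, end) to (chunk_hash, lo, hi) slices."""
    if not 0 <= start <= end <= manifest.size:
        raise ValueError("range outside object")
    slices = []
    for ref in manifest.chunks:
        lo = max(start, ref.offset)
        hi = min(end, ref.offset + ref.length)
        if lo < hi:  # this chunk overlaps the requested range
            slices.append((ref.chunk_hash, lo - ref.offset, hi - ref.offset))
    return slices
```

A range read then needs only the chunks returned here, each fetched with a sub-range of its own; the degenerate whole-file profile is simply a manifest with a single `ChunkRef`.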


3. Three sizes, three reasons (the tenet that prevents the classic mistake)

| Unit | Typical size | Chosen to optimize | Constraint that sets it |
|------|--------------|--------------------|-------------------------|
| Chunk (dedup) | ~1 MiB avg (CDC) | dedup ratio vs index cardinality | smaller → better dedup but more index rows (02, 03) |
| Transfer part (multipart) | 8–64 MiB | throughput, retry granularity, provider limits | S3: 5 MiB–5 GiB, ≤10,000 parts (05) |
| Pack (storage) | ~256 MiB–1 GiB | per-object overhead, request/list cost | provider object overhead & GC repack cost (11) |

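A 64 MiB multipart part may contain dozens of chunks; a 1 GiB pack aggregates on the order of a thousand at the ~1 MiB average (a few hundred only when chunks sit at the 4 MiB upper bound). The three sizes are chosen independently of one another. (Grounding: Dropbox stores 4 MiB blocks aggregated into 1 GiB buckets; restic packs CDC chunks into pack files.)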
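
The relationship between the three sizes is simple arithmetic, sketched below for illustration only. The constants come from the table above; the helper names `chunks_per` and `min_part_size` are hypothetical.

```python
MiB, GiB = 1024**2, 1024**3

AVG_CHUNK = 1 * MiB      # CDC average chunk size
S3_MIN_PART = 5 * MiB    # S3 multipart limits quoted in the table
S3_MAX_PART = 5 * GiB
S3_MAX_PARTS = 10_000


def chunks_per(container_bytes: int, avg_chunk: int = AVG_CHUNK) -> int:
    """Approximate number of average-size chunks that fit in a container."""
    return container_bytes // avg_chunk


def min_part_size(file_bytes: int) -> int:
    """Smallest part size that keeps a file within the part-count limit."""
    needed = -(-file_bytes // S3_MAX_PARTS)  # integer ceiling division
    part = max(S3_MIN_PART, needed)
    if part > S3_MAX_PART:
        raise ValueError("file exceeds what 10,000 parts can carry")
    return part


print(chunks_per(64 * MiB))             # chunks in one 64 MiB transfer part
print(chunks_per(1 * GiB))              # chunks in one 1 GiB pack
print(min_part_size(1024 * GiB) / MiB)  # part size (MiB) needed for a 1 TiB file
```

Under these assumptions a 1 TiB upload needs parts of roughly 105 MiB to stay within 10,000 parts, which is one reason the transfer part size cannot be fixed once for all file sizes.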


4. Internal service boundaries

The Storage bounded context (05) decomposes into these components. In v1 they are modules inside bitvaultd (ADR-0001); the workers are the first extraction candidates because their load profile (bulk, bursty, CPU/IO-bound) differs sharply from the control-plane API.

```mermaid
flowchart TB
    classDef api fill:#c7d2fe,stroke:#3730a3,color:#111827;
    classDef idx fill:#fde68a,stroke:#b45309,color:#111827;
    classDef wk fill:#fed7aa,stroke:#c2410c,color:#111827;
    classDef store fill:#bbf7d0,stroke:#15803d,color:#111827;
    classDef ext fill:#e5e7eb,stroke:#6b7280,color:#111827;

    client["Client (smart/simple)"]:::ext

    subgraph CP["Control plane (sync, stateless)"]
      coord["Storage Coordinator API (gRPC)<br/>init · negotiate chunks · presign · commit · resolve"]:::api
      place["Placement Service<br/>provider/region/tier/bucket policy"]:::api
      adpt["Provider Adapters<br/>MinIO·S3·R2·GCS·Azure (capability-flagged)"]:::api
    end

    subgraph META["Storage metadata (Postgres, sharded by tenant)"]
      cidx[("Chunk Index<br/>chunk_hash → refcount, location, size, ckSum, tier, state")]:::idx
      mstore[("Manifest Store<br/>content_hash → ordered chunk refs")]:::idx
      pidx[("Pack Index<br/>pack_id → provider object, members")]:::idx
    end

    subgraph WK["Async maintenance workers (extractable)"]
      packer["Packer / Compactor"]:::wk
      gc["Garbage Collector"]:::wk
      scrub["Integrity Scrubber"]:::wk
      life["Lifecycle Engine"]:::wk
      mig["Tier / Migration Worker"]:::wk
    end

    subgraph DP["Data plane (bytes; bypasses compute)"]
      obj[("Object Storage (per provider)<br/>chunk objects · pack objects")]:::store
    end

    client -->|"REST/gRPC: control"| coord
    client -. "presigned PUT/GET: bytes" .-> obj
    coord --> cidx & mstore & pidx
    coord --> place --> adpt --> obj
    packer --> cidx & pidx & obj
    gc --> cidx & mstore & pidx & obj
    scrub --> cidx & obj
    life --> mstore & cidx
    mig --> place & cidx & pidx & obj
```

Component responsibilities

| Component | Responsibility | Sync/Async | State it owns |
|-----------|----------------|------------|---------------|
| Storage Coordinator API | upload init, chunk-existence negotiation, presigned URL issuance, commit (verify + write manifest), download resolution | sync | none (stateless) |
| Placement Service | choose provider/region/bucket/tier for new chunks & packs from policy (09, 10) | sync | placement policy (config); not data |
| Provider Adapters | capability-flagged drivers per provider (01) | sync | none |
| Chunk Index | dedup index + location map + refcount + checksum + tier + state per chunk | store | authoritative chunk metadata |
| Manifest Store | object content_hash → ordered chunk list | store | authoritative object→chunk mapping |
| Pack Index | pack_id → provider object + member chunks + free space | store | authoritative pack layout |
| Packer / Compactor | coalesce small chunks into packs; repack sparse packs (11) | async | none (mutates indexes transactionally) |
| Garbage Collector | mark-and-sweep unreferenced chunks/packs after grace (11) | async | GC run state / watermarks |
| Integrity Scrubber | background read-verify checksums; detect & repair bitrot/loss (04) | async | scrub cursors, error log |
| Lifecycle Engine | evaluate lifecycle policies → emit tier/expiry/purge actions (10) | async | policy eval cursors |
| Tier / Migration Worker | move chunks/packs between tiers/providers (09, 10) | async | migration job state |

5. Data ownership (one writer per datum — the rule that enables extraction)

| Datum | Sole writer | Readers (via API/events) | Store |
|-------|-------------|--------------------------|-------|
| Node, Version (namespace) | File & Metadata context | Storage (read content_hash on commit/resolve) | Postgres |
| Object / Manifest | Storage Coordinator (on commit) | Coordinator, GC, Lifecycle | Postgres (Manifest Store) |
| Chunk Index row (refcount, location, state) | Coordinator (create/ref), Packer (relocate), GC (delete); serialized via row-level state machine, not free-for-all | Coordinator, Scrubber, Migration | Postgres (Chunk Index) |
| Pack Index | Packer (create), GC (delete), Migration (relocate) | Coordinator (download resolve) | Postgres (Pack Index) |
| Chunk/Pack provider object (bytes) | Coordinator/client (write-once), Packer (write packs), GC (delete), Migration (copy) | download (presigned GET) | Object storage |
| Placement policy | Admin/Platform | Placement Service | Config/Postgres |
| Refcount | a derived, reconciled counter; see 11 GC §"refcount vs mark-sweep" for why it is a hint, not the GC authority | | Postgres |

Critical ownership nuance: the Chunk Index row has multiple legitimate writers (create, relocate, delete) at different lifecycle stages. We serialize them through an explicit per-chunk state machine (writing → committed → packed → unreferenced → deleting) guarded by compare-and-swap on the row, not by “one service owns the table.” This is the single most concurrency-sensitive datum in the subsystem and is treated as such in 11.
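
The guarded transition can be pictured with the following sketch, which is illustrative only and not implementation code. The state names come from the chain above; the `ChunkRow` class, the `ALLOWED` set, and the direct `committed → unreferenced` edge (for standalone chunks that are never packed) are assumptions of this sketch. A lock stands in for the row-level atomicity that Postgres would provide, for example through a conditional `UPDATE ... WHERE state = <expected>` whose affected-row count tells the caller whether it won.

```python
import threading
from enum import Enum


class ChunkState(Enum):
    WRITING = "writing"
    COMMITTED = "committed"
    PACKED = "packed"
    UNREFERENCED = "unreferenced"
    DELETING = "deleting"


# Legal transitions. The forward chain is from the text; the direct
# committed -> unreferenced edge is an assumption of this sketch.
ALLOWED = {
    (ChunkState.WRITING, ChunkState.COMMITTED),
    (ChunkState.COMMITTED, ChunkState.PACKED),
    (ChunkState.COMMITTED, ChunkState.UNREFERENCED),
    (ChunkState.PACKED, ChunkState.UNREFERENCED),
    (ChunkState.UNREFERENCED, ChunkState.DELETING),
}


class ChunkRow:
    """Stand-in for one Chunk Index row; the lock models row-level atomicity."""

    def __init__(self) -> None:
        self.state = ChunkState.WRITING
        self._lock = threading.Lock()

    def compare_and_swap(self, expected: ChunkState, new: ChunkState) -> bool:
        """Move expected -> new only if the row is still in `expected`."""
        if (expected, new) not in ALLOWED:
            raise ValueError(f"illegal transition {expected.value} -> {new.value}")
        with self._lock:
            if self.state is not expected:
                return False  # another writer got there first; caller re-reads
            self.state = new
            return True
```

A writer that loses the race (for example, GC trying to mark a chunk unreferenced after the Packer has already moved it to `packed`) gets `False` back and must re-read the row before deciding what to do next, which is what keeps the multiple legitimate writers from overwriting each other.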


6. Key invariants (correctness — tested, not hoped)


| ADR | Decision |
|-----|----------|
| 0016 | BLAKE3 for content & chunk addressing (SHA-256 compliance mode) |
| 0017 | Content-defined chunking (FastCDC) + packing |
| 0018 | Per-tenant, server-side deduplication |
| 0019 | Online mark-and-sweep GC with grace period |
| 0020 | Policy-driven placement & federation |
| 0021 | Dual resumable protocol (tus + provider multipart) |

Plus inherited: 0004, 0005, 0011, 0007, 0014.

References (research grounding)