BitVault Storage Subsystem — Design

Audience: distributed-systems / storage engineers. Scope: the deep design of BitVault’s storage subsystem — the part that turns bytes into durable, deduplicated, content-addressed, verifiable, tiered storage across many cloud providers, at millions of files / billions of chunks / PB scale.

This set refines and elaborates the system-level decisions already recorded: ADR-0004 Postgres as truth, ADR-0005 storage abstraction, ADR-0011 presigned data plane, 04 bounded contexts, 05 service boundaries, 08 data model. Where this set deepens the high-level BLOB abstraction into Object → Chunk → Pack, that refinement is called out explicitly.

No implementation code — architecture and documentation only. Every decision states Tradeoffs / Alternatives / Scaling; the discrete contested choices are crystallized as ADRs 0016–0021.

0. Reading order

#	Doc	Topic(s) from the brief
—	this README	data model, service boundaries, data ownership, tenets
01	Object Storage Abstraction	object storage abstraction
02	Content-Addressed Storage & Chunking	content-addressed storage, chunked model
03	Deduplication	deduplication
04	Integrity & Checksums	checksums, data integrity verification
05	Uploads: Multipart / Chunked / Resumable	multipart, chunked, resumable uploads
06	Downloads & Reconstruction	download flows
07	Versioning	file versioning
08	Metadata Architecture	metadata architecture
09	Federation & Placement	storage federation
10	Tiering & Lifecycle	tiered storage, lifecycle policies
11	Garbage Collection	garbage collection

1. Design tenets (the rules everything else obeys)

Metadata is truth; bytes are replaceable. Postgres records what exists and where; object storage holds opaque, immutable, content-addressed bytes. A byte object with no metadata reference does not exist and is reclaimable (ADR-0004).
Immutability by content address. Stored units are named by the hash of their content. Identical content has one name, is written once, and never mutated — only referenced, dereferenced, or copied. This is what makes dedup, caching, integrity, and safe concurrency tractable.
The data plane never flows through the control plane. User bytes move directly between client and object storage via scoped presigned URLs (ADR-0011). Our compute touches bytes only for asynchronous maintenance (packing, scrubbing, tier migration), never on the user’s hot path.
Separate the dedup unit from the transfer unit from the storage unit. These are three different sizes chosen for three different reasons (see §3). Conflating them is a classic storage-design mistake.
Everything derivable is rebuildable. Indexes, caches, and placement maps are reconstructible from manifests + object listings. Only the chunk bytes and the manifests are irreplaceable.
Tenant isolation is a storage boundary, not just a DB boundary. Dedup, placement, and keys are tenant-scoped by default (ADR-0007, ADR-0018), which closes the cross-tenant dedup side channel (Harnik et al., 2010).
Design for the long tail. Millions of files means billions of small chunks. Per-object overhead, request cost, and index cardinality — not raw capacity — are the binding constraints. Pack small things into big things.
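
As a purely illustrative sketch of the "immutability by content address" and "tenant isolation" tenets, the snippet below names a unit by the hash of its bytes and derives a tenant-prefixed provider key. It is not part of the design: SHA-256 (the compliance mode in ADR-0016) stands in for BLAKE3 because it ships with the Python standard library, and the `provider_key` layout is a hypothetical assumption.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Name a stored unit by the hash of its content.

    SHA-256 is a stand-in here; the design's default is BLAKE3.
    """
    return hashlib.sha256(data).hexdigest()


def provider_key(tenant_id: str, chunk_hash: str) -> str:
    """Hypothetical key layout: tenant prefix first, so dedup stays tenant-scoped."""
    return f"tenants/{tenant_id}/chunks/{chunk_hash[:2]}/{chunk_hash}"


a = content_address(b"hello world")
b = content_address(b"hello world")
c = content_address(b"hello world!")

# Identical content has one name; different content gets a different name.
print(a == b, a == c)

# The same content stored by two tenants lands under two distinct keys.
key_t1 = provider_key("t1", a)
key_t2 = provider_key("t2", a)
print(key_t1 != key_t2)
```

Because the name is a pure function of the bytes, a writer that finds the key already present can skip the write, and a reader can verify what it fetched by rehashing.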

2. The layered data model (terminology — used everywhere)

A file’s bytes pass through three layers (namespace, logical, physical) and six named units. Get this vocabulary straight or the rest of the docs are ambiguous.

flowchart TB
    classDef ns fill:#dbeafe,stroke:#1e40af,color:#111827;
    classDef log fill:#fde68a,stroke:#b45309,color:#111827;
    classDef phys fill:#bbf7d0,stroke:#15803d,color:#111827;

    subgraph NS["Namespace context (File & Metadata) — see 08-data-model"]
      node["Node (file/folder)"]:::ns
      ver["Version"]:::ns
    end

    subgraph LOGICAL["Storage logical model (content-addressed)"]
      obj["Object = content of one version<br/>id = content_hash (BLAKE3 root)"]:::log
      man["Manifest = ordered list of chunk refs<br/>{chunk_hash, offset, length}"]:::log
      chunk["Chunk = unit of dedup & CAS<br/>id = chunk_hash (BLAKE3), variable size (CDC)"]:::log
    end

    subgraph PHYSICAL["Storage physical model (provider objects)"]
      pack["Pack = large container object<br/>(~256 MiB–1 GiB) aggregating many chunks"]:::phys
      pobj["Provider object<br/>(MinIO/S3/R2/GCS/Azure key)"]:::phys
    end

    node --> ver --> obj --> man --> chunk
    chunk -->|"packed (long tail)"| pack --> pobj
    chunk -->|"standalone (hot/large)"| pobj

Layer	Name	Identity	Size	Why it exists	Owner context
Namespace	Node / Version	UUID	—	the user-facing file & its history	File & Metadata (08)
Logical	Object	`content_hash`	= file size	one version’s content; the dedup root for whole-file duplicates	Storage
Logical	Manifest	`manifest_hash`	KB	maps an object to its ordered chunks; enables ranges & delta	Storage
Logical	Chunk	`chunk_hash`	~256 KiB–4 MiB (CDC, avg ~1 MiB)	the dedup & integrity unit	Storage
Physical	Pack	`pack_id`	~256 MiB–1 GiB	amortize per-object cost over many chunks	Storage
Physical	Provider object	provider key	chunk or pack	the actual bytes in a bucket	Storage (via adapter)

Refinement note: 08-data-model modeled a single content-addressed BLOB. The storage subsystem refines BLOB into Object (manifest) + Chunks + Packs. A whole-file-only deployment (no chunking) is the degenerate case where each Object = one Chunk = one provider object — supported as a config profile for small self-host installs.
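
To make the logical layer concrete, here is a minimal sketch of a manifest as an ordered list of `{chunk_hash, offset, length}` refs, plus the lookup that lets a manifest serve byte ranges. The class and function names (`ChunkRef`, `Manifest`, `chunks_for_range`) are hypothetical and chosen only for illustration; the real schema is defined in 08.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ChunkRef:
    chunk_hash: str
    offset: int  # byte offset of this chunk within the object
    length: int  # chunk length in bytes


@dataclass(frozen=True)
class Manifest:
    content_hash: str
    chunks: List[ChunkRef]  # ordered by offset

    def size(self) -> int:
        """Object size is the sum of its chunk lengths."""
        return sum(c.length for c in self.chunks)


def chunks_for_range(m: Manifest, start: int, end: int) -> List[ChunkRef]:
    """Return the chunk refs overlapping the half-open byte range [start, end)."""
    return [c for c in m.chunks if c.offset < end and c.offset + c.length > start]


manifest = Manifest(
    "obj-hash",
    [ChunkRef("c0", 0, 100), ChunkRef("c1", 100, 250), ChunkRef("c2", 350, 50)],
)

# A read of bytes 90..119 straddles the boundary between c0 and c1.
needed = chunks_for_range(manifest, 90, 120)
print([c.chunk_hash for c in needed])
```

In the degenerate whole-file profile the manifest would hold a single ref covering the entire object, and the same lookup still applies.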

3. Three sizes, three reasons (the tenet that prevents the classic mistake)

Unit	Typical size	Chosen to optimize	Constraint that sets it
Chunk (dedup)	~1 MiB avg (CDC)	dedup ratio vs index cardinality	smaller → better dedup but more index rows (02, 03)
Transfer part (multipart)	8–64 MiB	throughput, retry granularity, provider limits	S3: 5 MiB–5 GiB, ≤10,000 parts (05)
Pack (storage)	~256 MiB–1 GiB	per-object overhead, request/list cost	provider object overhead & GC repack cost (11)

A 64 MiB multipart part may contain dozens of chunks; a 1 GiB pack aggregates on the order of a thousand at the ~1 MiB average. The three sizes can be tuned independently. (Grounding: Dropbox stores 4 MiB blocks aggregated into 1 GiB buckets; restic packs CDC chunks into pack files.)
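
The arithmetic behind these sizes can be checked with a short calculation. This is a back-of-the-envelope sketch using the typical values from the table (1 MiB average chunk, 64 MiB part, 1 GiB pack) and the S3 limit of 10,000 parts; the 1 PiB corpus is an assumed example size, and dedup savings are ignored.

```python
KiB, MiB, GiB, TiB, PiB = (1024 ** n for n in range(1, 6))

AVG_CHUNK = 1 * MiB      # dedup unit (CDC average)
PART = 64 * MiB          # transfer unit (upper end of the 8-64 MiB range)
PACK = 1 * GiB           # storage unit (upper end of the pack range)
S3_MAX_PARTS = 10_000    # S3 multipart limit

chunks_per_part = PART // AVG_CHUNK              # 64
chunks_per_pack = PACK // AVG_CHUNK              # 1024
chunks_per_pib = PiB // AVG_CHUNK                # 2**30, about 1.07 billion
packs_per_pib = PiB // PACK                      # 2**20, about 1.05 million
max_object_bytes = S3_MAX_PARTS * PART           # 625 GiB per multipart upload

print(f"chunks per 64 MiB part: {chunks_per_part}")
print(f"chunks per 1 GiB pack:  {chunks_per_pack}")
print(f"chunks per PiB:         {chunks_per_pib:,}")
print(f"packs per PiB:          {packs_per_pib:,}")
print(f"max object at 64 MiB parts: {max_object_bytes // GiB} GiB")
```

Packing cuts the provider object count per PiB by about three orders of magnitude, which is the per-object overhead argument in the table above.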

4. Internal service boundaries

The Storage bounded context (system-level doc 05, service boundaries) decomposes into these components. In v1 they are modules inside bitvaultd (ADR-0001); the workers are the first extraction candidates because their load profile (bulk, bursty, CPU/IO-bound) differs sharply from the control-plane API.

flowchart TB
    classDef api fill:#c7d2fe,stroke:#3730a3,color:#111827;
    classDef idx fill:#fde68a,stroke:#b45309,color:#111827;
    classDef wk fill:#fed7aa,stroke:#c2410c,color:#111827;
    classDef store fill:#bbf7d0,stroke:#15803d,color:#111827;
    classDef ext fill:#e5e7eb,stroke:#6b7280,color:#111827;

    client["Client (smart/simple)"]:::ext

    subgraph CP["Control plane (sync, stateless)"]
      coord["Storage Coordinator API (gRPC)<br/>init · negotiate chunks · presign · commit · resolve"]:::api
      place["Placement Service<br/>provider/region/tier/bucket policy"]:::api
      adpt["Provider Adapters<br/>MinIO·S3·R2·GCS·Azure (capability-flagged)"]:::api
    end

    subgraph META["Storage metadata (Postgres, sharded by tenant)"]
      cidx[("Chunk Index<br/>chunk_hash → refcount, location, size, checksum, tier, state")]:::idx
      mstore[("Manifest Store<br/>content_hash → ordered chunk refs")]:::idx
      pidx[("Pack Index<br/>pack_id → provider object, members")]:::idx
    end

    subgraph WK["Async maintenance workers (extractable)"]
      packer["Packer / Compactor"]:::wk
      gc["Garbage Collector"]:::wk
      scrub["Integrity Scrubber"]:::wk
      life["Lifecycle Engine"]:::wk
      mig["Tier / Migration Worker"]:::wk
    end

    subgraph DP["Data plane (bytes; bypasses compute)"]
      obj[("Object Storage (per provider)<br/>chunk objects · pack objects")]:::store
    end

    client -->|"REST/gRPC: control"| coord
    client -. "presigned PUT/GET: bytes" .-> obj
    coord --> cidx & mstore & pidx
    coord --> place --> adpt --> obj
    packer --> cidx & pidx & obj
    gc --> cidx & mstore & pidx & obj
    scrub --> cidx & obj
    life --> mstore & cidx
    mig --> place & cidx & pidx & obj

Component responsibilities

Component	Responsibility	Sync/Async	State it owns
Storage Coordinator API	upload init, chunk-existence negotiation, presigned URL issuance, commit (verify + write manifest), download resolution	sync	none (stateless)
Placement Service	choose provider/region/bucket/tier for new chunks & packs from policy (09, 10)	sync	placement policy (config); not data
Provider Adapters	capability-flagged drivers per provider (01)	sync	none
Chunk Index	dedup index + location map + refcount + checksum + tier + state per chunk	store	authoritative chunk metadata
Manifest Store	object content_hash → ordered chunk list	store	authoritative object→chunk mapping
Pack Index	pack_id → provider object + member chunks + free space	store	authoritative pack layout
Packer / Compactor	coalesce small chunks into packs; repack sparse packs (11)	async	none (mutates indexes transactionally)
Garbage Collector	mark-and-sweep unreferenced chunks/packs after a grace period (11)	async	GC run state / watermarks
Integrity Scrubber	background read-verify checksums; detect & repair bitrot/loss (04)	async	scrub cursors, error log
Lifecycle Engine	evaluate lifecycle policies → emit tier/expiry/purge actions (10)	async	policy eval cursors
Tier / Migration Worker	move chunks/packs between tiers/providers (09, 10)	async	migration job state
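
The chunk-existence negotiation listed for the Storage Coordinator API could look roughly like the sketch below: the client offers the chunk hashes of an upload and the coordinator answers with the ones it still needs. This is an illustration under assumptions, not the actual API: `negotiate_chunks` and the in-memory `ChunkIndex` mapping are hypothetical stand-ins for the gRPC call and the Postgres Chunk Index.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical in-memory stand-in for the per-tenant Chunk Index:
# tenant_id -> set of chunk hashes already stored for that tenant.
ChunkIndex = Dict[str, Set[str]]


def negotiate_chunks(index: ChunkIndex, tenant_id: str, offered: Iterable[str]) -> List[str]:
    """Return the hashes the client still has to upload, in offer order, without repeats.

    The lookup is scoped to the caller's tenant, so a chunk held only by
    another tenant is reported as missing (per-tenant dedup).
    """
    known = index.get(tenant_id, set())
    missing: List[str] = []
    seen: Set[str] = set()
    for h in offered:
        if h not in known and h not in seen:
            missing.append(h)
            seen.add(h)
    return missing


index: ChunkIndex = {"tenant-a": {"h1", "h2"}, "tenant-b": {"h3"}}

# h1 is already stored for tenant-a; h3 exists only for tenant-b; h4 is new.
to_upload = negotiate_chunks(index, "tenant-a", ["h1", "h3", "h4", "h3"])
print(to_upload)
```

Only the missing hashes would then receive presigned PUT URLs, so already-stored bytes never cross the wire again.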

5. Data ownership (one writer per datum — the rule that enables extraction)

Datum	Sole writer	Readers (via API/events)	Store
Node, Version (namespace)	File & Metadata context	Storage (read content_hash on commit/resolve)	Postgres
Object / Manifest	Storage Coordinator (on commit)	Coordinator, GC, Lifecycle	Postgres (Manifest Store)
Chunk Index row (refcount, location, state)	Coordinator (create/ref), Packer (relocate), GC (delete) — serialized via row-level state machine, not free-for-all	Coordinator, Scrubber, Migration	Postgres (Chunk Index)
Pack Index	Packer (create), GC (delete), Migration (relocate)	Coordinator (download resolve)	Postgres (Pack Index)
Chunk/Pack provider object (bytes)	Coordinator/client (write-once), Packer (write packs), GC (delete), Migration (copy)	download (presigned GET)	Object storage
Placement policy	Admin/Platform	Placement Service	Config/Postgres
Refcount	a derived, reconciled counter; see 11 GC §“refcount vs mark-sweep” for why it is a hint, not the GC authority	—	Postgres

Critical ownership nuance: the Chunk Index row has multiple legitimate writers (create, relocate, delete) at different lifecycle stages. We serialize them through an explicit per-chunk state machine (writing → committed → packed → unreferenced → deleting) guarded by compare-and-swap on the row, not by “one service owns the table.” This is the single most concurrency-sensitive datum in the subsystem and is treated as such in 11.
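
A minimal sketch of the compare-and-swap guard follows, using an in-memory dictionary in place of the Chunk Index table. Only the linear lifecycle named above is encoded; any further edges are left to 11. The class name `ChunkRows`, its methods, and the SQL in the docstring are hypothetical, and in a real deployment the atomicity would come from the database's conditional update, not from this single-threaded stand-in.

```python
from typing import Dict

# Linear lifecycle from the text: writing -> committed -> packed -> unreferenced -> deleting.
NEXT_STATE = {
    "writing": "committed",
    "committed": "packed",
    "packed": "unreferenced",
    "unreferenced": "deleting",
}


class ChunkRows:
    """Hypothetical in-memory stand-in for Chunk Index rows (chunk_hash -> state)."""

    def __init__(self) -> None:
        self._state: Dict[str, str] = {}

    def insert(self, chunk_hash: str) -> None:
        # A new row starts in "writing"; an existing row is left untouched.
        self._state.setdefault(chunk_hash, "writing")

    def state(self, chunk_hash: str) -> str:
        return self._state[chunk_hash]

    def cas(self, chunk_hash: str, expected: str, new: str) -> bool:
        """Move expected -> new only if the row is still in `expected`.

        SQL-shaped equivalent (assumed schema):
          UPDATE chunk_index SET state = :new
          WHERE chunk_hash = :h AND state = :expected
        succeeding only when exactly one row was updated.
        """
        if NEXT_STATE.get(expected) != new:
            raise ValueError(f"illegal transition {expected} -> {new}")
        if self._state.get(chunk_hash) != expected:
            return False  # another writer moved the row first
        self._state[chunk_hash] = new
        return True


rows = ChunkRows()
rows.insert("c1")
first = rows.cas("c1", "writing", "committed")   # wins
second = rows.cas("c1", "writing", "committed")  # loses: row already moved on
print(first, second, rows.state("c1"))
```

The loser of a race learns about it from the `False` return and must re-read the row before deciding what to do, which is what keeps create, relocate, and delete from interleaving unsafely.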

6. Key invariants (correctness — tested, not hoped)

SI-1 (no orphan reference): a manifest is committed only after all its chunks are durably present and checksum-verified (05).
SI-2 (no premature delete): a chunk’s bytes are deleted only after it is unreferenced by every manifest and a grace period has elapsed — closing the dedup-vs-GC race (11).
SI-3 (content address is honest): stored bytes always hash to their name; violated bytes are quarantined and repaired (04).
SI-4 (tenant containment): a chunk is only ever referenced by manifests in the tenant that created it (per-tenant dedup) (03, ADR-0018).
SI-5 (rebuildable indexes): the Pack/Chunk location indexes can be rebuilt by scanning manifests + provider listings; loss of an index is an availability event, not a durability event.
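
SI-1 and SI-2 can be read as two guard predicates, sketched below. This is an illustration under stated assumptions rather than the production checks: the function names are hypothetical, the seven-day grace period is an arbitrary example value, and `live_refs` stands for whatever reference count the mark phase in 11 produces.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional, Set

GRACE = timedelta(days=7)  # hypothetical grace period, chosen only for the example


def may_commit(manifest_chunks: Iterable[str], verified_chunks: Set[str]) -> bool:
    """SI-1: commit a manifest only if every chunk it names is present and verified."""
    return all(h in verified_chunks for h in manifest_chunks)


def may_delete(
    live_refs: int,
    unreferenced_since: Optional[datetime],
    now: datetime,
    grace: timedelta = GRACE,
) -> bool:
    """SI-2: delete bytes only when unreferenced and the grace period has elapsed."""
    if live_refs > 0 or unreferenced_since is None:
        return False
    return now - unreferenced_since >= grace


now = datetime(2024, 1, 10, tzinfo=timezone.utc)

commit_ok = may_commit(["c1", "c2"], {"c1", "c2", "c3"})
commit_blocked = may_commit(["c1", "c9"], {"c1", "c2"})  # c9 was never verified

delete_too_early = may_delete(0, now - timedelta(days=1), now)
delete_ok = may_delete(0, now - timedelta(days=8), now)
delete_referenced = may_delete(2, now - timedelta(days=30), now)

print(commit_ok, commit_blocked, delete_too_early, delete_ok, delete_referenced)
```

The grace period is what closes the dedup-vs-GC race: an upload that re-references a chunk shortly after it became unreferenced finds the bytes still present.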

ADR	Decision
0016	BLAKE3 for content & chunk addressing (SHA-256 compliance mode)
0017	Content-defined chunking (FastCDC) + packing
0018	Per-tenant, server-side deduplication
0019	Online mark-and-sweep GC with grace period
0020	Policy-driven placement & federation
0021	Dual resumable protocol (tus + provider multipart)

Plus inherited: 0004, 0005, 0007, 0011, 0014.

References (research grounding)

FastCDC (USENIX ATC ’16; content-defined chunking): https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf
BLAKE3 (parallel Merkle-tree hash & verified streaming): https://github.com/BLAKE3-team/BLAKE3
Harnik, Pinkas, Shulman-Peleg, “Side Channels in Cloud Services: Deduplication in Cloud Storage” (2010): https://eprint.iacr.org/2016/977.pdf
AWS S3 multipart upload limits: https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html
tus resumable upload protocol: https://tus.io/protocols/resumable-upload
restic prune (mark-sweep over CDC chunks): https://restic.readthedocs.io/en/stable/060_forget.html
Dropbox Magic Pocket (content-addressed blocks + 1 GiB buckets): https://dropbox.tech/infrastructure/inside-the-magic-pocket