BitVault Storage Subsystem — Design
Audience: distributed-systems / storage engineers. Scope: the deep design of BitVault’s storage subsystem — the part that turns bytes into durable, deduplicated, content-addressed, verifiable, tiered storage across many cloud providers, at millions of files / billions of chunks / PB scale.
This set refines and elaborates the system-level decisions already recorded: ADR-0004 Postgres as truth, ADR-0005 storage abstraction, ADR-0011 presigned data plane, 04 bounded contexts, 05 service boundaries, 08 data model. Where this set deepens the high-level BLOB abstraction into Object → Chunk → Pack, that refinement is called out explicitly.
No implementation code: architecture and documentation only. Every decision states Tradeoffs / Alternatives / Scaling; the discrete contested choices are crystallized as ADRs 0016–0021.
0. Reading order
| # | Doc | Topic(s) from the brief |
|---|---|---|
| — | this README | data model, service boundaries, data ownership, tenets |
| 01 | Object Storage Abstraction | object storage abstraction |
| 02 | Content-Addressed Storage & Chunking | content-addressed storage, chunked model |
| 03 | Deduplication | deduplication |
| 04 | Integrity & Checksums | checksums, data integrity verification |
| 05 | Uploads: Multipart / Chunked / Resumable | multipart, chunked, resumable uploads |
| 06 | Downloads & Reconstruction | download flows |
| 07 | Versioning | file versioning |
| 08 | Metadata Architecture | metadata architecture |
| 09 | Federation & Placement | storage federation |
| 10 | Tiering & Lifecycle | tiered storage, lifecycle policies |
| 11 | Garbage Collection | garbage collection |
1. Design tenets (the rules everything else obeys)
- Metadata is truth; bytes are replaceable. Postgres records what exists and where; object storage holds opaque, immutable, content-addressed bytes. A byte object with no metadata reference does not exist and is reclaimable (ADR-0004).
- Immutability by content address. Stored units are named by the hash of their content. Identical content has one name, is written once, and never mutated — only referenced, dereferenced, or copied. This is what makes dedup, caching, integrity, and safe concurrency tractable.
- The data plane never flows through the control plane. User bytes move directly between client and object storage via scoped presigned URLs (ADR-0011). Our compute touches bytes only for asynchronous maintenance (packing, scrubbing, tier migration), never on the user’s hot path.
- Separate the dedup unit from the transfer unit from the storage unit. These are three different sizes for three different reasons (see §3). Conflating them is the most common storage-design mistake.
- Everything derivable is rebuildable. Indexes, caches, and placement maps are reconstructible from manifests + object listings. Only the chunk bytes and the manifests are irreplaceable.
- Tenant isolation is a storage boundary, not just a DB boundary. Dedup, placement, and keys are tenant-scoped by default (ADR-0007, ADR-0018) — closing the cross-tenant dedup side channel (Harnik 2010).
- Design for the long tail. Millions of files means billions of small chunks. Per-object overhead, request cost, and index cardinality — not raw capacity — are the binding constraints. Pack small things into big things.
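The "immutability by content address" tenet can be illustrated with a minimal sketch. It is not BitVault code: `WriteOnceStore` and `content_address` are hypothetical names, the store is an in-memory dict standing in for a bucket, and SHA-256 (the compliance-mode hash in ADR-0016) is used because BLAKE3 is not in the Python standard library.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Name a stored unit by the hash of its content (SHA-256 as a stand-in for BLAKE3)."""
    return hashlib.sha256(data).hexdigest()


class WriteOnceStore:
    """Hypothetical in-memory stand-in for a bucket of immutable, content-addressed objects."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}
        self.physical_writes = 0

    def put(self, data: bytes) -> str:
        key = content_address(data)
        if key not in self._objects:  # identical content has one name and is written once
            self._objects[key] = data
            self.physical_writes += 1
        return key

    def get(self, key: str) -> bytes:
        data = self._objects[key]
        # Integrity follows from the naming scheme: re-hash and compare to the key.
        if content_address(data) != key:
            raise ValueError(f"content address mismatch for {key}")
        return data


store = WriteOnceStore()
k1 = store.put(b"hello bitvault")
k2 = store.put(b"hello bitvault")  # duplicate content: same key, no second physical write
```

Because a key can only ever map to one byte sequence, concurrent writers of the same content cannot conflict, and any reader can verify what it fetched without trusting the store.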
2. The layered data model (terminology — used everywhere)
A file’s bytes pass through four named layers. Get this vocabulary straight or the rest of the docs are ambiguous.
flowchart TB
classDef ns fill:#dbeafe,stroke:#1e40af,color:#111827;
classDef log fill:#fde68a,stroke:#b45309,color:#111827;
classDef phys fill:#bbf7d0,stroke:#15803d,color:#111827;
subgraph NS["Namespace context (File & Metadata) — see 08-data-model"]
node["Node (file/folder)"]:::ns
ver["Version"]:::ns
end
subgraph LOGICAL["Storage logical model (content-addressed)"]
obj["Object = content of one version<br/>id = content_hash (BLAKE3 root)"]:::log
man["Manifest = ordered list of chunk refs<br/>{chunk_hash, offset, length}"]:::log
chunk["Chunk = unit of dedup & CAS<br/>id = chunk_hash (BLAKE3), variable size (CDC)"]:::log
end
subgraph PHYSICAL["Storage physical model (provider objects)"]
pack["Pack = large container object<br/>(~256MiB–1GiB) aggregating many chunks"]:::phys
pobj["Provider object<br/>(MinIO/S3/R2/GCS/Azure key)"]:::phys
end
node --> ver --> obj --> man --> chunk
chunk -->|"packed (long tail)"| pack --> pobj
chunk -->|"standalone (hot/large)"| pobj
| Layer | Name | Identity | Size | Why it exists | Owner context |
|---|---|---|---|---|---|
| Namespace | Node / Version | UUID | — | the user-facing file & its history | File & Metadata (08) |
| Logical | Object | content_hash | = file size | one version’s content; the dedup root for whole-file dup | Storage |
| Logical | Manifest | manifest_hash | KB | maps an object to its ordered chunks; enables ranges & delta | Storage |
| Logical | Chunk | chunk_hash | ~256 KiB–4 MiB (CDC, avg ~1 MiB) | the dedup & integrity unit | Storage |
| Physical | Pack | pack_id | ~256 MiB–1 GiB | amortize per-object cost over many chunks | Storage |
| Physical | Provider object | provider key | chunk or pack | the actual bytes in a bucket | Storage (via adapter) |
Refinement note: 08-data-model modeled a single content-addressed BLOB. The storage subsystem refines BLOB into Object (manifest) + Chunks + Packs. A whole-file-only deployment (no chunking) is the degenerate case where each Object = one Chunk = one provider object; it is supported as a config profile for small self-host installs.
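To make the Manifest layer concrete, here is a minimal sketch of how an ordered list of `{chunk_hash, offset, length}` refs can serve a byte-range read. The field names follow the diagram above; the `Manifest` and `ChunkRef` classes and the `chunks_for_range` helper are illustrative assumptions, not the subsystem's actual types.

```python
from dataclasses import dataclass
from typing import List

MiB = 1 << 20


@dataclass(frozen=True)
class ChunkRef:
    chunk_hash: str
    offset: int  # byte offset of this chunk within the object
    length: int  # chunk length in bytes


@dataclass(frozen=True)
class Manifest:
    content_hash: str
    chunks: List[ChunkRef]  # ordered by offset and contiguous

    @property
    def size(self) -> int:
        return sum(c.length for c in self.chunks)


def chunks_for_range(manifest: Manifest, start: int, end: int) -> List[ChunkRef]:
    """Return the chunk refs that overlap the half-open byte range [start, end)."""
    if not 0 <= start <= end <= manifest.size:
        raise ValueError("range outside object")
    if start == end:
        return []
    return [c for c in manifest.chunks if c.offset < end and c.offset + c.length > start]


# Three variable-size chunks, as content-defined chunking would produce.
manifest = Manifest(
    content_hash="obj-hash",
    chunks=[
        ChunkRef("a", 0, 1 * MiB),
        ChunkRef("b", 1 * MiB, MiB // 2),
        ChunkRef("c", 1 * MiB + MiB // 2, 2 * MiB),
    ],
)
# A 20-byte read straddling the first chunk boundary touches only chunks a and b.
needed = chunks_for_range(manifest, 1 * MiB - 10, 1 * MiB + 10)
```

In the degenerate whole-file profile, the manifest simply holds a single `ChunkRef` covering the entire object.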
3. Three sizes, three reasons (the tenet that prevents the classic mistake)
| Unit | Typical size | Chosen to optimize | Constraint that sets it |
|---|---|---|---|
| Chunk (dedup) | ~1 MiB avg (CDC) | dedup ratio vs index cardinality | smaller → better dedup but more index rows (02, 03) |
| Transfer part (multipart) | 8–64 MiB | throughput, retry granularity, provider limits | S3: 5 MiB–5 GiB, ≤10,000 parts (05) |
| Pack (storage) | ~256 MiB–1 GiB | per-object overhead, request/list cost | provider object overhead & GC repack cost (11) |
A 64 MiB multipart part may contain dozens of chunks; at the ~1 MiB average chunk size, a 256 MiB pack aggregates hundreds of chunks and a 1 GiB pack about a thousand. These layers move independently. (Grounding: Dropbox stores 4 MiB blocks aggregated into 1 GiB buckets; restic packs CDC chunks into pack files.)
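The ratios between the three units can be checked with a few lines of arithmetic. The sizes are taken from the table above (using the ~1 MiB average chunk and the upper ends of the part and pack ranges); the variable names are illustrative only.

```python
KiB = 1 << 10
MiB = 1 << 20
GiB = 1 << 30
PiB = 1 << 50

AVG_CHUNK = 1 * MiB        # dedup unit (CDC average)
PART = 64 * MiB            # transfer unit (upper end of the 8-64 MiB range)
PACK_SMALL = 256 * MiB     # storage unit, lower end
PACK_LARGE = 1 * GiB       # storage unit, upper end
S3_MAX_PARTS = 10_000      # S3 multipart limit cited in the table

chunks_per_part = PART // AVG_CHUNK              # 64
chunks_per_small_pack = PACK_SMALL // AVG_CHUNK  # 256
chunks_per_large_pack = PACK_LARGE // AVG_CHUNK  # 1024

# Index cardinality: chunk rows needed to describe 1 PiB of unique data.
chunk_rows_per_pib = PiB // AVG_CHUNK            # 2**30, about 1.07 billion

# Provider objects for the same 1 PiB if every chunk lives in a 1 GiB pack.
pack_objects_per_pib = PiB // PACK_LARGE         # 2**20, about 1.05 million

# Largest object a single multipart upload can carry at 64 MiB parts.
max_upload_gib = PART * S3_MAX_PARTS // GiB      # 625
```

The last two figures show why the units are kept separate: packing cuts the provider-object count by three orders of magnitude without changing dedup granularity, and the part size, not the chunk size, bounds how large a single multipart upload can be.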
4. Internal service boundaries
The Storage bounded context (05) decomposes into
these components. In v1 they are modules inside bitvaultd (ADR-0001); the
workers are the first extraction candidates because their load profile (bulk,
bursty, CPU/IO-bound) differs sharply from the control-plane API.
flowchart TB
classDef api fill:#c7d2fe,stroke:#3730a3,color:#111827;
classDef idx fill:#fde68a,stroke:#b45309,color:#111827;
classDef wk fill:#fed7aa,stroke:#c2410c,color:#111827;
classDef store fill:#bbf7d0,stroke:#15803d,color:#111827;
classDef ext fill:#e5e7eb,stroke:#6b7280,color:#111827;
client["Client (smart/simple)"]:::ext
subgraph CP["Control plane (sync, stateless)"]
coord["Storage Coordinator API (gRPC)<br/>init · negotiate chunks · presign · commit · resolve"]:::api
place["Placement Service<br/>provider/region/tier/bucket policy"]:::api
adpt["Provider Adapters<br/>MinIO·S3·R2·GCS·Azure (capability-flagged)"]:::api
end
subgraph META["Storage metadata (Postgres, sharded by tenant)"]
cidx[("Chunk Index<br/>chunk_hash → refcount, location, size, ckSum, tier, state")]:::idx
mstore[("Manifest Store<br/>content_hash → ordered chunk refs")]:::idx
pidx[("Pack Index<br/>pack_id → provider object, members")]:::idx
end
subgraph WK["Async maintenance workers (extractable)"]
packer["Packer / Compactor"]:::wk
gc["Garbage Collector"]:::wk
scrub["Integrity Scrubber"]:::wk
life["Lifecycle Engine"]:::wk
mig["Tier / Migration Worker"]:::wk
end
subgraph DP["Data plane (bytes; bypasses compute)"]
obj[("Object Storage (per provider)<br/>chunk objects · pack objects")]:::store
end
client -->|"REST/gRPC: control"| coord
client -. "presigned PUT/GET: bytes" .-> obj
coord --> cidx & mstore & pidx
coord --> place --> adpt --> obj
packer --> cidx & pidx & obj
gc --> cidx & mstore & pidx & obj
scrub --> cidx & obj
life --> mstore & cidx
mig --> place & cidx & pidx & obj
Component responsibilities
| Component | Responsibility | Sync/Async | State it owns |
|---|---|---|---|
| Storage Coordinator API | upload init, chunk-existence negotiation, presigned URL issuance, commit (verify + write manifest), download resolution | sync | none (stateless) |
| Placement Service | choose provider/region/bucket/tier for new chunks & packs from policy (09, 10) | sync | placement policy (config); not data |
| Provider Adapters | capability-flagged drivers per provider (01) | sync | none |
| Chunk Index | dedup index + location map + refcount + checksum + tier + state per chunk | store | authoritative chunk metadata |
| Manifest Store | object content_hash → ordered chunk list | store | authoritative object→chunk mapping |
| Pack Index | pack_id → provider object + member chunks + free space | store | authoritative pack layout |
| Packer / Compactor | coalesce small chunks into packs; repack sparse packs (11) | async | none (mutates indexes transactionally) |
| Garbage Collector | mark-and-sweep unreferenced chunks/packs after grace (11) | async | GC run state / watermarks |
| Integrity Scrubber | background read-verify checksums; detect & repair bitrot/loss (04) | async | scrub cursors, error log |
| Lifecycle Engine | evaluate lifecycle policies → emit tier/expiry/purge actions (10) | async | policy eval cursors |
| Tier / Migration Worker | move chunks/packs between tiers/providers (09, 10) | async | migration job state |
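The Coordinator's chunk-existence negotiation can be sketched as a set lookup scoped by tenant. This is an assumption-laden illustration: `negotiate_chunks` is a hypothetical function, and a Python set of `(tenant_id, chunk_hash)` pairs stands in for the Postgres-backed Chunk Index.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical stand-in for the Chunk Index: the set of (tenant_id, chunk_hash) pairs known.
ChunkIndex = Set[Tuple[str, str]]


def negotiate_chunks(index: ChunkIndex, tenant_id: str, offered: Iterable[str]) -> List[str]:
    """Return the offered chunk hashes this tenant still has to upload.

    Lookups are scoped to the caller's tenant, so a chunk held by another
    tenant is reported as missing and reveals nothing about that tenant's data.
    """
    missing: List[str] = []
    seen: Set[str] = set()
    for chunk_hash in offered:
        if chunk_hash in seen:  # repeated chunk within one upload: ask for it once
            continue
        seen.add(chunk_hash)
        if (tenant_id, chunk_hash) not in index:
            missing.append(chunk_hash)
    return missing


index: ChunkIndex = {("tenant-1", "aaa"), ("tenant-2", "bbb")}
to_upload = negotiate_chunks(index, "tenant-1", ["aaa", "bbb", "ccc", "ccc"])
# "aaa" is already stored for tenant-1; "bbb" exists only for tenant-2, so it must be uploaded.
```

The client would then request presigned URLs only for the hashes in `to_upload`; everything else is referenced from the new manifest without moving bytes.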
5. Data ownership (one writer per datum — the rule that enables extraction)
| Datum | Sole writer | Readers (via API/events) | Store |
|---|---|---|---|
| Node, Version (namespace) | File & Metadata context | Storage (read content_hash on commit/resolve) | Postgres |
| Object / Manifest | Storage Coordinator (on commit) | Coordinator, GC, Lifecycle | Postgres (Manifest Store) |
| Chunk Index row (refcount, location, state) | Coordinator (create/ref), Packer (relocate), GC (delete) — serialized via row-level state machine, not free-for-all | Coordinator, Scrubber, Migration | Postgres (Chunk Index) |
| Pack Index | Packer (create), GC (delete), Migration (relocate) | Coordinator (download resolve) | Postgres (Pack Index) |
| Chunk/Pack provider object (bytes) | Coordinator/client (write-once), Packer (write packs), GC (delete), Migration (copy) | download (presigned GET) | Object storage |
| Placement policy | Admin/Platform | Placement Service | Config/Postgres |
| Refcount | a derived, reconciled counter; see 11 GC §“refcount vs mark-sweep” for why it is a hint, not the GC authority | — | Postgres |
Critical ownership nuance: the Chunk Index row has multiple legitimate writers (create, relocate, delete) at different lifecycle stages. We serialize them through an explicit per-chunk state machine (writing → committed → packed → unreferenced → deleting) guarded by compare-and-swap on the row, not by “one service owns the table.” This is the single most concurrency-sensitive datum in the subsystem and is treated as such in 11.
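One common way to get compare-and-swap on a database row is a conditional `UPDATE` that names the expected current state and checks the affected row count. The sketch below shows that pattern under stated assumptions: SQLite stands in for Postgres, the `chunk_index` table is reduced to two columns, `cas_transition` is a hypothetical helper, and only the linear transitions listed above are modeled (any additional edges are left to doc 11).

```python
import sqlite3

# Only the linear chain named in the text; other legal edges are out of scope here.
NEXT_STATE = {
    "writing": "committed",
    "committed": "packed",
    "packed": "unreferenced",
    "unreferenced": "deleting",
}


def open_index() -> sqlite3.Connection:
    """Create a toy, in-memory chunk index (illustrative schema, not the real one)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE chunk_index (chunk_hash TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    return conn


def cas_transition(conn: sqlite3.Connection, chunk_hash: str, expected: str, new: str) -> bool:
    """Move a chunk from `expected` to `new`; return False if another writer got there first."""
    if NEXT_STATE.get(expected) != new:
        raise ValueError(f"illegal transition {expected} -> {new}")
    cur = conn.execute(
        "UPDATE chunk_index SET state = ? WHERE chunk_hash = ? AND state = ?",
        (new, chunk_hash, expected),
    )
    conn.commit()
    return cur.rowcount == 1  # zero rows updated means the row was not in `expected`


conn = open_index()
conn.execute("INSERT INTO chunk_index VALUES (?, ?)", ("abc123", "writing"))
first = cas_transition(conn, "abc123", "writing", "committed")   # wins the swap
second = cas_transition(conn, "abc123", "writing", "committed")  # stale expectation, loses
```

A writer that loses the swap re-reads the row and decides again, which is what lets the Coordinator, Packer, and GC share the table without a single owning service.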
6. Key invariants (correctness — tested, not hoped)
- SI-1 (no orphan reference): a manifest is committed only after all its chunks are durably present and checksum-verified (05).
- SI-2 (no premature delete): a chunk’s bytes are deleted only after it is unreferenced by every manifest and a grace period has elapsed — closing the dedup-vs-GC race (11).
- SI-3 (content address is honest): stored bytes always hash to their name; violated bytes are quarantined and repaired (04).
- SI-4 (tenant containment): a chunk is only ever referenced by manifests in the tenant that created it (per-tenant dedup) (03, ADR-0018).
- SI-5 (rebuildable indexes): the Pack/Chunk location indexes can be rebuilt by scanning manifests + provider listings; loss of an index is an availability event, not a durability event.
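Since the invariants are meant to be tested, SI-1 through SI-3 can be phrased as small predicates. The following is a sketch under assumptions, not the commit or GC implementation: `can_commit` and `can_delete` are hypothetical helpers, a dict stands in for object storage, SHA-256 stands in for BLAKE3, and the "referenced" flag is assumed to come from the mark phase rather than from the refcount hint.

```python
import hashlib
from typing import Dict, List, Optional


def chunk_name(data: bytes) -> str:
    """Content address of a chunk (SHA-256 as a stand-in for BLAKE3)."""
    return hashlib.sha256(data).hexdigest()


def can_commit(manifest_chunks: List[str], stored: Dict[str, bytes]) -> bool:
    """SI-1 and SI-3: every referenced chunk is present and hashes to its own name."""
    return all(
        h in stored and chunk_name(stored[h]) == h
        for h in manifest_chunks
    )


def can_delete(
    referenced_by_any_manifest: bool,
    unreferenced_since: Optional[float],
    now: float,
    grace_seconds: float,
) -> bool:
    """SI-2: delete only when unreferenced and the grace period has fully elapsed."""
    if referenced_by_any_manifest or unreferenced_since is None:
        return False
    return now - unreferenced_since >= grace_seconds


payload = b"chunk bytes"
name = chunk_name(payload)
stored = {name: payload}

ok = can_commit([name], stored)                       # all chunks present and verified
dangling = can_commit([name, "not-uploaded"], stored) # would create an orphan reference
too_early = can_delete(False, unreferenced_since=100.0, now=200.0, grace_seconds=3600.0)
```

Writing the invariants as pure predicates keeps them easy to property-test independently of the storage backend.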
Related ADRs
| ADR | Decision |
|---|---|
| 0016 | BLAKE3 for content & chunk addressing (SHA-256 compliance mode) |
| 0017 | Content-defined chunking (FastCDC) + packing |
| 0018 | Per-tenant, server-side deduplication |
| 0019 | Online mark-and-sweep GC with grace period |
| 0020 | Policy-driven placement & federation |
| 0021 | Dual resumable protocol (tus + provider multipart) |
Plus inherited: 0004, 0005, 0011, 0007, 0014.
References (research grounding)
- FastCDC (USENIX ATC ‘16) — content-defined chunking: https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf
- BLAKE3 — parallel Merkle-tree hash & verified streaming: https://github.com/BLAKE3-team/BLAKE3
- Harnik, Pinkas, Shulman-Peleg, “Side Channels in Cloud Services: Deduplication in Cloud Storage” (2010): https://eprint.iacr.org/2016/977.pdf
- AWS S3 multipart upload limits: https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html
- tus resumable upload protocol: https://tus.io/protocols/resumable-upload
- restic prune (mark-sweep over CDC chunks): https://restic.readthedocs.io/en/stable/060_forget.html
- Dropbox Magic Pocket (content-addressed blocks + 1 GiB buckets): https://dropbox.tech/infrastructure/inside-the-magic-pocket