08 — Metadata Architecture
Topic: metadata architecture. At millions of files the bytes are the easy part — object stores scale capacity effortlessly. The hard part is the metadata: the chunk index, manifests, and pack index that map logical files to physical bytes. This is the component that decides whether BitVault scales. Lose it and the bytes are unreadable noise; bottleneck it and the whole system stalls.
1. The metadata stores and their cardinality
| Store | Row ≈ | Rows @ 1 PB logical | Access pattern | Hotness |
|---|---|---|---|---|
| Chunk Index | one per unique chunk | ~10^9 (@1 MiB chunks) | point lookup by (tenant, chunk_hash); range scan for GC/scrub |
🔥🔥🔥 hottest |
| Reference edges | one per (chunk, manifest) ref | ≥ chunk count (fan-out) | insert on commit; scan/aggregate for GC | 🔥🔥 insert-heavy |
| Manifest Store | one per unique object | ~ #versions | point lookup by content_hash; immutable |
🔥 read, cacheable |
| Pack Index | one per pack + members | ~10^6 packs | lookup on download resolve | 🔥 read |
| Namespace (nodes/versions) | one per file/version | ~10^7–10^8 | tree/listing (File context, 08) | separate concern |
Metadata durability must be ≥ data durability. Bytes without manifests are unrecoverable. Postgres runs with synchronous replication + PITR; backups are tested by restore drills. (This is the one place we are more paranoid than the object store.)
2. Schema sketch (chunk index — the hot path)
erDiagram
CHUNK ||--o{ REF_EDGE : "referenced-by"
MANIFEST ||--o{ REF_EDGE : "references"
PACK ||--o{ CHUNK : "contains"
MANIFEST {
bytea content_hash PK
uuid tenant_id FK
int chunk_count
bigint logical_size
timestamptz created_at
}
CHUNK {
bytea chunk_hash PK
uuid tenant_id FK
int size
uuid pack_id FK "null = standalone object"
bigint pack_offset
string provider_key "if standalone"
string tier
string state "writing|committed|packed|unref|deleting"
string hash_algo
string compression
uuid enc_key_id
timestamptz last_verified_at
timestamptz unreferenced_at "grace clock for GC"
}
REF_EDGE {
uuid tenant_id FK
bytea chunk_hash FK
bytea manifest_hash FK
}
PACK {
uuid pack_id PK
uuid tenant_id FK
string provider_key
string tier
bigint size
bigint live_bytes "for repack decisions"
}
REF_EDGEinstead of a mutable counter turns reference accounting into inserts (no hot-counter contention, 03 §4); the edge set is also the GC mark input (11).state+unreferenced_atdrive the per-chunk lifecycle state machine and the GC grace clock (SI-2).- Composite PK / partition key is
(tenant_id, chunk_hash)→ tenant isolation + even hash distribution.
3. Access patterns (design for these, ignore the rest)
| Operation | Query | Frequency | Optimization |
|---|---|---|---|
| Dedup negotiate (03) | SELECT exists chunk_hash IN (…) per tenant |
very high | Bloom/cuckoo pre-filter + batched IN + Redis cache |
| Commit refs (05) | insert manifest + edges + outbox in one TX | high | batch insert; one TX |
| Download resolve (06) | manifest → chunks → pack locations | high | cache immutable manifests in Redis (infinite TTL) |
| GC mark/scan (11) | range scan by tenant/state | periodic | partition pruning; index on (state, unreferenced_at) |
| Scrub walk (04) | range scan by last_verified_at |
continuous | walks index range, never provider List |
The negotiate point-lookup and commit insert are the hot pair; everything is optimized so those stay fast under upload load.
4. Scaling roadmap (the honest growth path)
The chunk index is the thing that forces architectural change as data grows. We escalate only on demonstrated need (ADR-0001 discipline):
flowchart LR
classDef s fill:#bbf7d0,stroke:#15803d,color:#111827;
classDef m fill:#fde68a,stroke:#b45309,color:#111827;
classDef l fill:#fed7aa,stroke:#c2410c,color:#111827;
classDef x fill:#fecaca,stroke:#b91c1c,color:#111827;
a["Stage 1: single Postgres<br/>(+ read replicas)"]:::s --> b["Stage 2: declarative partitioning<br/>by hash(tenant_id)"]:::m
b --> c["Stage 3: sharded Postgres<br/>(Citus / app-level shards)"]:::l
c --> d["Stage 4: scale-out KV for chunk index<br/>(FoundationDB / ScyllaDB)"]:::x
| Stage | Trigger | What changes | Cost |
|---|---|---|---|
| 1. Single PG + replicas | start | one cluster; replicas absorb reads | low ops |
| 2. Declarative partitioning | one table too large to vacuum/index well | hash-partition by tenant; partition pruning | medium |
| 3. Sharded Postgres | write IOPS / connection ceiling on one primary | Citus or app-routed shards keyed by tenant | higher ops, keeps SQL/TX |
| 4. Dedicated KV store | chunk index outgrows shareable Postgres | move only the chunk index to FoundationDB (has ACID txns) or ScyllaDB | big change; isolate to one component |
Why this order: Postgres carries us far (partitioned + sharded) while keeping the transactional commit protocol (manifest + edges + outbox atomic, ADR-0004/0006). Stage 4 moves only the chunk index — the one table whose cardinality (10^9+) eventually outgrows even sharded Postgres — to a store built for it; FoundationDB is preferred over Cassandra/Scylla precisely because it keeps ACID transactions, so the commit invariant survives. The manifest/pack stores stay in Postgres.
5. Caching & filters (keeping the hot path off the database)
- Redis cache for manifest resolution (immutable → safe, infinite TTL) and for recent negotiate answers (short TTL).
- Per-tenant Bloom/cuckoo filter of chunk hashes, maintained incrementally, to answer “definitely new” without an index read — cutting the dominant negotiate read load. False positives fall through to the index (correctness preserved); false negatives impossible (so we never wrongly skip an upload).
- Connection discipline: the chunk index is touched on every chunk; pooling + prepared statements + batched IN are mandatory, not optional.
6. Tradeoffs / Alternatives / Scaling concerns
Tradeoffs. Keeping metadata in Postgres buys transactions, relational integrity, RLS isolation, and operational familiarity, at the cost of an eventual sharding/ migration project for the chunk index. We judge that worth it — transactions are what make the commit + GC invariants tractable (the alternative is hand-rolled distributed consistency, which is where storage systems get data-loss bugs).
Alternatives considered.
- Store metadata in the object store itself (sidecar files / tags): no transactions, no efficient point lookup or range scan, weak/again-eventual — a non-starter for a hot index (and the dual-write problem returns). Rejected.
- Go straight to Cassandra/DynamoDB for everything: scales writes but loses multi-row transactions → the atomic commit (manifest + refs + outbox) and safe GC become very hard, and RLS isolation is lost. Rejected as the default; reserved for the chunk index at Stage 4, and even then we prefer transactional FoundationDB.
- Embed chunk lists on the version row (no separate manifest/chunk index): simple at toy scale, but the namespace table becomes enormous and dedup of identical manifests is lost. Rejected.
Scaling concerns (summary).
- Cardinality (10^9 chunks) is the master constraint → coarse CDC + packing reduce row count; partitioning/sharding distribute it; FoundationDB is the escape hatch.
- Write amplification on commit (manifest + N edges + outbox) → batch in one TX; edges are append-only.
- GC/scrub scans must not contend with the hot path → run on replicas where possible, partition-pruned, throttled.
- Metadata loss is catastrophic → synchronous replication, PITR, restore drills, and SI-5 (indexes rebuildable from manifests + listings as the last-resort backstop).
References
- PostgreSQL declarative partitioning: https://www.postgresql.org/docs/current/ddl-partitioning.html
- Citus distributed Postgres: https://www.citusdata.com/
- FoundationDB (ACID at scale): https://apple.github.io/foundationdb/
- Dropbox metadata scaling (Edgestore/Panda): https://dropbox.tech/infrastructure