08 — Metadata Architecture

Topic: metadata architecture. At millions of files the bytes are the easy part — object stores scale capacity effortlessly. The hard part is the metadata: the chunk index, manifests, and pack index that map logical files to physical bytes. This is the component that decides whether BitVault scales. Lose it and the bytes are unreadable noise; bottleneck it and the whole system stalls.

1. The metadata stores and their cardinality

Store	Row ≈	Rows @ 1 PB logical	Access pattern	Hotness
Chunk Index	one per unique chunk	~10^9 (@1 MiB chunks)	point lookup by `(tenant, chunk_hash)`; range scan for GC/scrub	🔥🔥🔥 hottest
Reference edges	one per (chunk, manifest) ref	≥ chunk count (fan-out)	insert on commit; scan/aggregate for GC	🔥🔥 insert-heavy
Manifest Store	one per unique object	~ #versions	point lookup by `content_hash`; immutable	🔥 read, cacheable
Pack Index	one per pack + members	~10^6 packs	lookup on download resolve	🔥 read
Namespace (nodes/versions)	one per file/version	~10^7–10^8	tree/listing (File context, 08)	separate concern

Metadata durability must be ≥ data durability. Bytes without manifests are unrecoverable. Postgres runs with synchronous replication + PITR; backups are tested by restore drills. (This is the one place we are more paranoid than the object store.)

2. Schema sketch (chunk index — the hot path)

erDiagram
    CHUNK ||--o{ REF_EDGE : "referenced-by"
    MANIFEST ||--o{ REF_EDGE : "references"
    PACK ||--o{ CHUNK : "contains"
    MANIFEST {
        bytea content_hash PK
        uuid tenant_id FK
        int chunk_count
        bigint logical_size
        timestamptz created_at
    }
    CHUNK {
        bytea chunk_hash PK
        uuid tenant_id FK
        int size
        uuid pack_id FK "null = standalone object"
        bigint pack_offset
        string provider_key "if standalone"
        string tier
        string state "writing|committed|packed|unref|deleting"
        string hash_algo
        string compression
        uuid enc_key_id
        timestamptz last_verified_at
        timestamptz unreferenced_at "grace clock for GC"
    }
    REF_EDGE {
        uuid tenant_id FK
        bytea chunk_hash FK
        bytea manifest_hash FK
    }
    PACK {
        uuid pack_id PK
        uuid tenant_id FK
        string provider_key
        string tier
        bigint size
        bigint live_bytes "for repack decisions"
    }

REF_EDGE instead of a mutable counter turns reference accounting into inserts (no hot-counter contention, 03 §4); the edge set is also the GC mark input (11).
state + unreferenced_at drive the per-chunk lifecycle state machine and the GC grace clock (SI-2).
Composite PK / partition key is (tenant_id, chunk_hash) → tenant isolation + even hash distribution.

3. Access patterns (design for these, ignore the rest)

Operation	Query	Frequency	Optimization
Dedup negotiate (03)	`SELECT exists chunk_hash IN (…)` per tenant	very high	Bloom/cuckoo pre-filter + batched IN + Redis cache
Commit refs (05)	insert manifest + edges + outbox in one TX	high	batch insert; one TX
Download resolve (06)	manifest → chunks → pack locations	high	cache immutable manifests in Redis (infinite TTL)
GC mark/scan (11)	range scan by tenant/state	periodic	partition pruning; index on `(state, unreferenced_at)`
Scrub walk (04)	range scan by `last_verified_at`	continuous	walks index range, never provider `List`

The negotiate point-lookup and commit insert are the hot pair; everything is optimized so those stay fast under upload load.

4. Scaling roadmap (the honest growth path)

The chunk index is the thing that forces architectural change as data grows. We escalate only on demonstrated need (ADR-0001 discipline):

flowchart LR
    classDef s fill:#bbf7d0,stroke:#15803d,color:#111827;
    classDef m fill:#fde68a,stroke:#b45309,color:#111827;
    classDef l fill:#fed7aa,stroke:#c2410c,color:#111827;
    classDef x fill:#fecaca,stroke:#b91c1c,color:#111827;
    a["Stage 1: single Postgres<br/>(+ read replicas)"]:::s --> b["Stage 2: declarative partitioning<br/>by hash(tenant_id)"]:::m
    b --> c["Stage 3: sharded Postgres<br/>(Citus / app-level shards)"]:::l
    c --> d["Stage 4: scale-out KV for chunk index<br/>(FoundationDB / ScyllaDB)"]:::x

Stage	Trigger	What changes	Cost
1. Single PG + replicas	start	one cluster; replicas absorb reads	low ops
2. Declarative partitioning	one table too large to vacuum/index well	hash-partition by tenant; partition pruning	medium
3. Sharded Postgres	write IOPS / connection ceiling on one primary	Citus or app-routed shards keyed by tenant	higher ops, keeps SQL/TX
4. Dedicated KV store	chunk index outgrows shareable Postgres	move only the chunk index to FoundationDB (has ACID txns) or ScyllaDB	big change; isolate to one component

Why this order: Postgres carries us far (partitioned + sharded) while keeping the transactional commit protocol (manifest + edges + outbox atomic, ADR-0004/0006). Stage 4 moves only the chunk index — the one table whose cardinality (10^9+) eventually outgrows even sharded Postgres — to a store built for it; FoundationDB is preferred over Cassandra/Scylla precisely because it keeps ACID transactions, so the commit invariant survives. The manifest/pack stores stay in Postgres.

5. Caching & filters (keeping the hot path off the database)

Redis cache for manifest resolution (immutable → safe, infinite TTL) and for recent negotiate answers (short TTL).
Per-tenant Bloom/cuckoo filter of chunk hashes, maintained incrementally, to answer “definitely new” without an index read — cutting the dominant negotiate read load. False positives fall through to the index (correctness preserved); false negatives impossible (so we never wrongly skip an upload).
Connection discipline: the chunk index is touched on every chunk; pooling + prepared statements + batched IN are mandatory, not optional.

6. Tradeoffs / Alternatives / Scaling concerns

Tradeoffs. Keeping metadata in Postgres buys transactions, relational integrity, RLS isolation, and operational familiarity, at the cost of an eventual sharding/ migration project for the chunk index. We judge that worth it — transactions are what make the commit + GC invariants tractable (the alternative is hand-rolled distributed consistency, which is where storage systems get data-loss bugs).

Alternatives considered.

Store metadata in the object store itself (sidecar files / tags): no transactions, no efficient point lookup or range scan, weak/again-eventual — a non-starter for a hot index (and the dual-write problem returns). Rejected.
Go straight to Cassandra/DynamoDB for everything: scales writes but loses multi-row transactions → the atomic commit (manifest + refs + outbox) and safe GC become very hard, and RLS isolation is lost. Rejected as the default; reserved for the chunk index at Stage 4, and even then we prefer transactional FoundationDB.
Embed chunk lists on the version row (no separate manifest/chunk index): simple at toy scale, but the namespace table becomes enormous and dedup of identical manifests is lost. Rejected.

Scaling concerns (summary).

Cardinality (10^9 chunks) is the master constraint → coarse CDC + packing reduce row count; partitioning/sharding distribute it; FoundationDB is the escape hatch.
Write amplification on commit (manifest + N edges + outbox) → batch in one TX; edges are append-only.
GC/scrub scans must not contend with the hot path → run on replicas where possible, partition-pruned, throttled.
Metadata loss is catastrophic → synchronous replication, PITR, restore drills, and SI-5 (indexes rebuildable from manifests + listings as the last-resort backstop).

References

PostgreSQL declarative partitioning: https://www.postgresql.org/docs/current/ddl-partitioning.html
Citus distributed Postgres: https://www.citusdata.com/
FoundationDB (ACID at scale): https://apple.github.io/foundationdb/
Dropbox metadata scaling (Edgestore/Panda): https://dropbox.tech/infrastructure