08 — Metadata Architecture

Topic: metadata architecture. At millions of files the bytes are the easy part — object stores scale capacity effortlessly. The hard part is the metadata: the chunk index, manifests, and pack index that map logical files to physical bytes. This is the component that decides whether BitVault scales. Lose it and the bytes are unreadable noise; bottleneck it and the whole system stalls.


1. The metadata stores and their cardinality

| Store | Row granularity | Rows @ 1 PB logical | Access pattern | Hotness |
|---|---|---|---|---|
| Chunk Index | one per unique chunk | ~10^9 (@1 MiB chunks) | point lookup by (tenant, chunk_hash); range scan for GC/scrub | 🔥🔥🔥 hottest |
| Reference edges | one per (chunk, manifest) ref | ≥ chunk count (fan-out) | insert on commit; scan/aggregate for GC | 🔥🔥 insert-heavy |
| Manifest Store | one per unique object | ~ #versions | point lookup by content_hash; immutable | 🔥 read, cacheable |
| Pack Index | one per pack + members | ~10^6 packs | lookup on download resolve | 🔥 read |
| Namespace (nodes/versions) | one per file/version | ~10^7–10^8 | tree/listing (File context, 08) | separate concern |
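
As a rough sanity check on the row counts above, the chunk-index cardinality follows from dividing logical size by chunk size. The sketch below assumes binary units (1 PB taken as 2^50 bytes), no dedup savings, and that every chunk lives in a pack; the helper names are illustrative and not part of BitVault.

```python
# Back-of-envelope cardinality check. All inputs are assumptions for illustration.
PIB = 2**50  # 1 PB of logical data, taken here as 2^50 bytes
MIB = 2**20  # 1 MiB chunk size, as in the table


def chunk_rows(logical_bytes: int, chunk_bytes: int) -> int:
    """Rows in the chunk index if every chunk is unique (no dedup savings)."""
    return -(-logical_bytes // chunk_bytes)  # ceiling division


def avg_pack_bytes(logical_bytes: int, pack_count: int) -> float:
    """Average pack size implied by a pack count, if all chunks are packed."""
    return logical_bytes / pack_count


rows = chunk_rows(PIB, MIB)
print(f"chunk rows: {rows:.2e}")  # about 1.07e+09
print(f"avg pack: {avg_pack_bytes(PIB, 10**6) / 2**30:.2f} GiB")  # about 1.05 GiB
```

Real deployments would see fewer chunk rows than this upper bound, since deduplication collapses identical chunks into one row.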

Metadata durability must be ≥ data durability. Bytes without manifests are unrecoverable. Postgres runs with synchronous replication + PITR; backups are tested by restore drills. (This is the one place we are more paranoid than the object store.)


2. Schema sketch (chunk index — the hot path)

erDiagram
    CHUNK ||--o{ REF_EDGE : "referenced-by"
    MANIFEST ||--o{ REF_EDGE : "references"
    PACK ||--o{ CHUNK : "contains"
    MANIFEST {
        bytea content_hash PK
        uuid tenant_id FK
        int chunk_count
        bigint logical_size
        timestamptz created_at
    }
    CHUNK {
        bytea chunk_hash PK
        uuid tenant_id FK
        int size
        uuid pack_id FK "null = standalone object"
        bigint pack_offset
        string provider_key "if standalone"
        string tier
        string state "writing|committed|packed|unref|deleting"
        string hash_algo
        string compression
        uuid enc_key_id
        timestamptz last_verified_at
        timestamptz unreferenced_at "grace clock for GC"
    }
    REF_EDGE {
        uuid tenant_id FK
        bytea chunk_hash FK
        bytea manifest_hash FK
    }
    PACK {
        uuid pack_id PK
        uuid tenant_id FK
        string provider_key
        string tier
        bigint size
        bigint live_bytes "for repack decisions"
    }
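
The `state` and `unreferenced_at` columns in CHUNK suggest how a GC pass could find deletable chunks: select rows in the `unref` state whose grace clock has expired. The following is a minimal sketch only. It uses an in-memory SQLite database as a stand-in for Postgres, a trimmed-down `chunk` table, and a hypothetical 7-day grace period (the source does not state a value); `gc_candidates` and `chunk_gc_idx` are illustrative names.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

GRACE = timedelta(days=7)  # hypothetical grace period, not taken from the source


def gc_candidates(conn: sqlite3.Connection, now: datetime) -> list:
    """Return hashes of chunks that have been unreferenced for longer than GRACE."""
    # Timestamps are stored as ISO-8601 UTC strings, which sort chronologically.
    cutoff = (now - GRACE).isoformat()
    rows = conn.execute(
        "SELECT chunk_hash FROM chunk "
        "WHERE state = 'unref' AND unreferenced_at < ? "
        "ORDER BY unreferenced_at",
        (cutoff,),
    ).fetchall()
    return [r[0] for r in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunk (chunk_hash BLOB PRIMARY KEY, tenant_id TEXT NOT NULL, "
    "state TEXT NOT NULL, unreferenced_at TEXT)"
)
# Composite index so the scan touches only unreferenced rows, oldest first.
conn.execute("CREATE INDEX chunk_gc_idx ON chunk (state, unreferenced_at)")

now = datetime(2024, 1, 31, tzinfo=timezone.utc)
conn.executemany(
    "INSERT INTO chunk VALUES (?, ?, ?, ?)",
    [
        (b"\x01", "t1", "committed", None),
        (b"\x02", "t1", "unref", (now - timedelta(days=10)).isoformat()),
        (b"\x03", "t1", "unref", (now - timedelta(days=1)).isoformat()),
    ],
)
print(gc_candidates(conn, now))  # only the chunk past the grace window
```

A chunk that was unreferenced one day ago is skipped, which leaves room for a concurrent commit to re-reference it before deletion.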

3. Access patterns (design for these, ignore the rest)

| Operation | Query | Frequency | Optimization |
|---|---|---|---|
| Dedup negotiate (03) | SELECT exists chunk_hash IN (…) per tenant | very high | Bloom/cuckoo pre-filter + batched IN + Redis cache |
| Commit refs (05) | insert manifest + edges + outbox in one TX | high | batch insert; one TX |
| Download resolve (06) | manifest → chunks → pack locations | high | cache immutable manifests in Redis (infinite TTL) |
| GC mark/scan (11) | range scan by tenant/state | periodic | partition pruning; index on (state, unreferenced_at) |
| Scrub walk (04) | range scan by last_verified_at | continuous | walks index range, never provider List |
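
To illustrate the dedup-negotiate optimization in the table, here is a minimal sketch of a Bloom pre-filter in front of a batched existence lookup. The `BloomFilter` class, the `negotiate` function, the `db_lookup` callback, and the batch size are all hypothetical; a production filter would be sized from the expected chunk count and target false-positive rate.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def negotiate(tenant_filter, offered, db_lookup, batch_size=500):
    """Return the offered chunk hashes that the client still has to upload."""
    # Hashes the filter rejects are definitely new, so they never reach the database.
    maybe_present = [h for h in offered if tenant_filter.might_contain(h)]
    present = set()
    for i in range(0, len(maybe_present), batch_size):
        # One batched IN-style lookup per slice; the database stays authoritative.
        present |= db_lookup(maybe_present[i:i + batch_size])
    return [h for h in offered if h not in present]


stored = {b"chunk-a", b"chunk-b"}  # stand-in for one tenant's chunk index
bloom = BloomFilter()
for h in stored:
    bloom.add(h)

missing = negotiate(bloom, [b"chunk-a", b"chunk-c"], lambda batch: stored & set(batch))
print(missing)  # [b'chunk-c']
```

Because the database lookup confirms every filter hit, a false positive costs one extra row in a batch but never causes a chunk to be skipped. A plain Bloom filter cannot remove entries, which is one reason a cuckoo filter may be the better fit once GC deletes chunks.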

The negotiate point-lookup and commit insert are the hot pair; everything is optimized so those stay fast under upload load.
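
The commit half of that pair relies on a single transaction covering the manifest row, its reference edges, and the outbox event. The sketch below shows the all-or-nothing behaviour using SQLite as a stand-in for Postgres; the table layouts are simplified, and `commit_manifest`, the `outbox` columns, and the event name are assumptions for illustration.

```python
import sqlite3


def commit_manifest(conn, tenant_id, manifest_hash, chunk_hashes):
    """Insert manifest, reference edges, and an outbox event atomically."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute(
            "INSERT INTO manifest (content_hash, tenant_id, chunk_count) VALUES (?, ?, ?)",
            (manifest_hash, tenant_id, len(chunk_hashes)),
        )
        conn.executemany(  # batch insert of all edges
            "INSERT INTO ref_edge (tenant_id, chunk_hash, manifest_hash) VALUES (?, ?, ?)",
            [(tenant_id, h, manifest_hash) for h in chunk_hashes],
        )
        conn.execute(
            "INSERT INTO outbox (tenant_id, event, manifest_hash) "
            "VALUES (?, 'manifest_committed', ?)",
            (tenant_id, manifest_hash),
        )


def count(conn, table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE manifest (content_hash BLOB PRIMARY KEY, tenant_id TEXT, chunk_count INT);
    CREATE TABLE ref_edge (tenant_id TEXT, chunk_hash BLOB, manifest_hash BLOB,
                           PRIMARY KEY (tenant_id, chunk_hash, manifest_hash));
    CREATE TABLE outbox (id INTEGER PRIMARY KEY, tenant_id TEXT, event TEXT,
                         manifest_hash BLOB);
    """
)

commit_manifest(conn, "t1", b"m1", [b"c1", b"c2"])

try:
    # A duplicate edge is used here only to force a mid-transaction failure.
    commit_manifest(conn, "t1", b"m2", [b"c3", b"c3"])
except sqlite3.IntegrityError:
    pass  # the whole transaction rolled back, including the m2 manifest row

print(count(conn, "manifest"), count(conn, "ref_edge"), count(conn, "outbox"))  # 1 2 1
```

The failed second commit leaves no manifest row, no partial edges, and no outbox event behind, which is the property GC and downstream consumers depend on.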


4. Scaling roadmap (the honest growth path)

The chunk index is the thing that forces architectural change as data grows. We escalate only on demonstrated need (ADR-0001 discipline):

flowchart LR
    classDef s fill:#bbf7d0,stroke:#15803d,color:#111827;
    classDef m fill:#fde68a,stroke:#b45309,color:#111827;
    classDef l fill:#fed7aa,stroke:#c2410c,color:#111827;
    classDef x fill:#fecaca,stroke:#b91c1c,color:#111827;
    a["Stage 1: single Postgres<br/>(+ read replicas)"]:::s --> b["Stage 2: declarative partitioning<br/>by hash(tenant_id)"]:::m
    b --> c["Stage 3: sharded Postgres<br/>(Citus / app-level shards)"]:::l
    c --> d["Stage 4: scale-out KV for chunk index<br/>(FoundationDB / ScyllaDB)"]:::x

| Stage | Trigger | What changes | Cost |
|---|---|---|---|
| 1. Single PG + replicas | start | one cluster; replicas absorb reads | low ops |
| 2. Declarative partitioning | one table too large to vacuum/index well | hash-partition by tenant; partition pruning | medium |
| 3. Sharded Postgres | write IOPS / connection ceiling on one primary | Citus or app-routed shards keyed by tenant | higher ops, keeps SQL/TX |
| 4. Dedicated KV store | chunk index outgrows sharded Postgres | move only the chunk index to FoundationDB (has ACID txns) or ScyllaDB | big change; isolate to one component |

Why this order: Postgres carries us far (partitioned + sharded) while keeping the transactional commit protocol (manifest + edges + outbox atomic, ADR-0004/0006). Stage 4 moves only the chunk index — the one table whose cardinality (10^9+) eventually outgrows even sharded Postgres — to a store built for it; FoundationDB is preferred over Cassandra/Scylla precisely because it keeps ACID transactions, so the commit invariant survives. The manifest/pack stores stay in Postgres.
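
For the app-routed variant of Stage 3, the routing rule can be as small as a stable hash of the tenant id. The sketch below is one possible shape under stated assumptions: `shard_for_tenant` and the shard count are hypothetical, and this is not the hash function Postgres uses internally for declarative hash partitioning.

```python
import hashlib
import uuid


def shard_for_tenant(tenant_id: uuid.UUID, num_shards: int) -> int:
    """Map a tenant to a shard with a hash that is stable across processes."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(tenant_id.bytes).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


tenant = uuid.UUID("12345678-1234-5678-1234-567812345678")
shard = shard_for_tenant(tenant, num_shards=8)

# Every row for this tenant (manifest, edges, outbox) routes to the same shard,
# so the commit transaction never has to span shards.
assert shard == shard_for_tenant(tenant, num_shards=8)
print(f"tenant {tenant} -> shard {shard}")
```

Keying on tenant keeps the manifest + edges + outbox commit on a single primary. The trade-off of plain modulo routing is that changing `num_shards` remaps most tenants, so a real deployment would likely add a lookup table or consistent hashing on top.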


5. Caching & filters (keeping the hot path off the database)


6. Tradeoffs / Alternatives / Scaling concerns

Tradeoffs. Keeping metadata in Postgres buys transactions, relational integrity, RLS isolation, and operational familiarity, at the cost of an eventual sharding/migration project for the chunk index. We judge that worth it: transactions are what make the commit + GC invariants tractable (the alternative is hand-rolled distributed consistency, which is where storage systems get data-loss bugs).

Alternatives considered.

Scaling concerns (summary).

References