Data Model

The schema encodes three decisions that underpin every correctness guarantee:

  1. Node vs Blob separation: the namespace (NODE/VERSION) and the bytes (BLOB) are distinct. A namespace mutation never touches object storage. This is the dual-write firewall.
  2. Content-addressed, ref-counted blobs: BLOB.content_hash is the identity. Deduplication is free; GC is safe (refcount = 0 → eligible for deletion). Per-tenant scope prevents cross-tenant side-channels (ADR-0018).
  3. tenant_id on every owned row: Row-Level Security policies in Postgres filter by session tenant context. A forgotten WHERE tenant_id = ? in application code cannot leak across tenants; RLS is the boundary, app-layer scoping is defense-in-depth (ADR-0007).
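
As a rough illustration of the second decision, the sketch below models content-addressed, ref-counted blobs in memory. It is not the actual implementation: `BlobIndex`, `add_reference`, and `drop_reference` are hypothetical names, the composite `(tenant_id, content_hash)` key is an assumption about how per-tenant scope is enforced, and `hashlib.blake2b` stands in for BLAKE3, which is not in the Python standard library.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class BlobRow:
    content_hash: str
    size: int
    refcount: int = 0


class BlobIndex:
    """In-memory stand-in for the BLOB table (hypothetical helper)."""

    def __init__(self) -> None:
        # Assumption: rows are keyed per tenant, so identical bytes uploaded
        # by two tenants never share a row.
        self.rows = {}

    @staticmethod
    def digest(data: bytes) -> str:
        # Stand-in hash: the schema specifies BLAKE3; blake2b is used here
        # only because it ships with the standard library.
        return hashlib.blake2b(data, digest_size=32).hexdigest()

    def add_reference(self, tenant_id: str, data: bytes) -> BlobRow:
        """Record one more VERSION pointing at these bytes."""
        key = (tenant_id, self.digest(data))
        row = self.rows.setdefault(key, BlobRow(key[1], len(data)))
        row.refcount += 1
        return row

    def drop_reference(self, tenant_id: str, content_hash: str) -> bool:
        """Decrement the refcount; return True when the blob is GC-eligible."""
        row = self.rows[(tenant_id, content_hash)]
        if row.refcount <= 0:
            raise ValueError("refcount would go negative")
        row.refcount -= 1
        return row.refcount == 0


index = BlobIndex()
first = index.add_reference("tenant-a", b"same bytes")
second = index.add_reference("tenant-a", b"same bytes")  # deduplicated
other = index.add_reference("tenant-b", b"same bytes")   # separate tenant scope
```

Within one tenant the second upload reuses the existing row and only bumps `refcount`; the other tenant gets its own row, so the existence of a blob is never observable across tenants.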

See System Overview for how these entities move through the upload, sync, and event flows.


Entity-Relationship Diagram

erDiagram
    TENANT ||--o{ USER : "has"
    TENANT ||--o{ NODE : "scopes"
    TENANT ||--o{ BLOB : "scopes"
    TENANT ||--o{ CHANGE : "scopes"
    TENANT ||--|| QUOTA : "has"

    USER ||--o{ MEMBERSHIP : "has"
    USER ||--o{ API_TOKEN : "owns"
    USER ||--o{ DEVICE : "registers"
    USER ||--o{ NODE : "owns"

    NODE ||--o{ NODE : "parent-of"
    NODE ||--o{ VERSION : "has"
    NODE ||--o{ NODE_METADATA : "tagged-by"
    NODE ||--o{ SHARE : "shared-via"
    NODE ||--o{ CHANGE : "journaled-in"

    VERSION }o--|| BLOB : "references"
    BLOB ||--o{ MULTIPART_UPLOAD : "assembled-by"

    SHARE }o--|| USER : "granted-to"
    DEVICE ||--o{ CHANGE : "consumes-via-cursor"

    TENANT {
        uuid id PK
        string name
        string plan
        string status
        timestamptz created_at
    }
    USER {
        uuid id PK
        uuid tenant_id FK
        string email
        string password_hash
        string status
    }
    MEMBERSHIP {
        uuid id PK
        uuid tenant_id FK
        uuid user_id FK
        string role
    }
    API_TOKEN {
        uuid id PK
        uuid tenant_id FK
        uuid user_id FK
        string token_hash
        timestamptz expires_at
        timestamptz revoked_at
    }
    NODE {
        uuid id PK
        uuid tenant_id FK
        uuid parent_id FK
        string type "file|folder"
        string name
        string path "materialized path"
        uuid owner_id FK
        uuid current_version_id FK
        timestamptz deleted_at "soft-delete"
    }
    VERSION {
        uuid id PK
        uuid tenant_id FK
        uuid node_id FK
        string content_hash FK
        bigint size
        string mime
        uuid created_by FK
        timestamptz created_at
    }
    BLOB {
        string content_hash PK "BLAKE3"
        uuid tenant_id FK
        string provider "minio|s3|r2|gcs|azure"
        string bucket
        string object_key
        bigint size
        int refcount
        string state "staging|committed"
    }
    MULTIPART_UPLOAD {
        uuid id PK
        uuid tenant_id FK
        string upload_id "provider id"
        string staging_key
        bigint declared_size
        timestamptz expires_at
    }
    NODE_METADATA {
        uuid id PK
        uuid tenant_id FK
        uuid node_id FK
        string key
        string value
    }
    SHARE {
        uuid id PK
        uuid tenant_id FK
        uuid node_id FK
        string kind "grant|link"
        uuid grantee_id FK
        string permission "view|edit|manage"
        string link_token
        string password_hash
        timestamptz expires_at
        int max_downloads
    }
    DEVICE {
        uuid id PK
        uuid tenant_id FK
        uuid user_id FK
        string name
        bigint sync_cursor
    }
    CHANGE {
        bigint seq PK "monotonic per tenant"
        uuid tenant_id FK
        uuid node_id FK
        string op "create|update|move|delete"
        uuid version_id FK
        uuid actor_id FK
        timestamptz at
    }
    QUOTA {
        uuid tenant_id PK
        bigint bytes_limit
        bigint bytes_used
    }
    OUTBOX {
        uuid id PK
        uuid tenant_id FK
        string aggregate
        string event_type
        jsonb payload
        timestamptz published_at "null until published"
    }

OUTBOX is intentionally not linked to the aggregate tables by foreign key. It is written in the same transaction as the aggregate change (e.g. a VERSION insert) and drained asynchronously, which makes it the bridge from strong consistency to the event plane.


Key Invariants

These are correctness properties, not style guidelines. Each should be covered by a test.

| ID | Invariant | Consequence if violated |
|----|-----------|-------------------------|
| I1 | A VERSION may reference a BLOB only after bytes are HEAD-verified (size/etag/BLAKE3 hash). The VERSION insert, BLOB.refcount++, and the OUTBOX row are one transaction. | Committed metadata with no bytes → phantom file. Transaction failure leaves no version; staging blob GC’d. |
| I2 | BLOB.content_hash is the identity (BLAKE3, ADR-0016). Uploading identical bytes within a tenant increments refcount, not storage. GC deletes the object only when refcount = 0. | Storing the same bytes twice wastes storage. Premature deletion with live references corrupts files. |
| I3 | Every owned row carries tenant_id. Postgres RLS policies filter by session tenant context. App-layer WHERE tenant_id = ? is defense-in-depth, not the boundary. | Cross-tenant data leak. RLS is the last-resort guard (ADR-0007). |
| I4 | CHANGE.seq is a per-tenant monotonic counter. Device delta pull is WHERE tenant_id = ? AND seq > cursor ORDER BY seq. | Non-monotonic sequence breaks delta sync and cursor advancement. Journal is the sync spine (ADR-0024). |
| I5 | Move/rename/copy mutate NODE rows and path, never BLOB/objects. Copy creates a new NODE/VERSION referencing the same BLOB (refcount++). | Byte copies on rename waste storage and defeat content addressing. |
| I6 | OpenSearch documents, notification state, and usage meters are projections of CHANGE/events. They are never the source of truth and can be rebuilt by replaying the journal. | Treating a derived store as authoritative leads to split-brain when it lags or is rebuilt. |
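
The delta pull in I4 can be sketched as follows. This is a minimal illustration using SQLite in place of Postgres; `append_change` and `pull_delta` are hypothetical helper names, the table carries only a subset of the CHANGE columns, and allocating `seq` with `MAX(seq) + 1` is an assumption that is only safe here because a single connection serializes writes. A real deployment would need a per-tenant counter or lock.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE change (
    tenant_id TEXT    NOT NULL,
    seq       INTEGER NOT NULL,
    node_id   TEXT    NOT NULL,
    op        TEXT    NOT NULL,
    PRIMARY KEY (tenant_id, seq)
);
""")


def append_change(conn, tenant_id, node_id, op):
    """Append a journal row with the next per-tenant sequence number."""
    with conn:  # one transaction: read the high-water mark, then insert
        (next_seq,) = conn.execute(
            "SELECT COALESCE(MAX(seq), 0) + 1 FROM change WHERE tenant_id = ?",
            (tenant_id,),
        ).fetchone()
        conn.execute(
            "INSERT INTO change VALUES (?, ?, ?, ?)",
            (tenant_id, next_seq, node_id, op),
        )
    return next_seq


def pull_delta(conn, tenant_id, cursor, limit=100):
    """Return (rows, new_cursor) for changes after the device's cursor."""
    rows = conn.execute(
        "SELECT seq, node_id, op FROM change "
        "WHERE tenant_id = ? AND seq > ? ORDER BY seq LIMIT ?",
        (tenant_id, cursor, limit),
    ).fetchall()
    # The cursor only advances to the last row actually delivered.
    new_cursor = rows[-1][0] if rows else cursor
    return rows, new_cursor


append_change(conn, "t1", "n1", "create")
append_change(conn, "t2", "n9", "create")  # other tenant, independent counter
append_change(conn, "t1", "n1", "update")
append_change(conn, "t1", "n2", "create")

page1, cursor = pull_delta(conn, "t1", cursor=0, limit=2)
page2, cursor = pull_delta(conn, "t1", cursor=cursor, limit=2)
```

Because the cursor is just the highest `seq` the device has seen, a device that crashes mid-sync can safely repeat the same pull and receive the same rows again.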

:::warning Outbox Transactional Rule
OUTBOX rows must be written in the same transaction as the aggregate change. Never write aggregate + outbox in separate transactions. Splitting them creates a window where the event is lost if the process crashes after the aggregate commit but before the outbox insert. This is the classic dual-write bug this design is built to prevent.
:::
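
To make the rule concrete, here is a small sketch of a single-transaction write, using SQLite as a stand-in for Postgres. The function `commit_version`, the `fail_before_outbox` flag, the `VersionCreated` event name, and the reduced column sets are all assumptions made for illustration; they are not taken from the actual codebase.

```python
import json
import sqlite3
import uuid

SCHEMA = """
CREATE TABLE blob    (content_hash TEXT PRIMARY KEY, refcount INTEGER NOT NULL);
CREATE TABLE version (id TEXT PRIMARY KEY, node_id TEXT NOT NULL,
                      content_hash TEXT NOT NULL);
CREATE TABLE outbox  (id TEXT PRIMARY KEY, event_type TEXT NOT NULL,
                      payload TEXT NOT NULL, published_at TEXT);
"""


def commit_version(conn, node_id, content_hash, fail_before_outbox=False):
    """Insert VERSION, bump BLOB.refcount, and write OUTBOX atomically."""
    version_id = str(uuid.uuid4())
    with conn:  # commits on success, rolls back if the body raises
        conn.execute(
            "INSERT INTO version VALUES (?, ?, ?)",
            (version_id, node_id, content_hash),
        )
        conn.execute(
            "UPDATE blob SET refcount = refcount + 1 WHERE content_hash = ?",
            (content_hash,),
        )
        if fail_before_outbox:
            # Simulates a crash between the aggregate write and the outbox write.
            raise RuntimeError("simulated crash before the outbox insert")
        conn.execute(
            "INSERT INTO outbox (id, event_type, payload) VALUES (?, ?, ?)",
            (
                str(uuid.uuid4()),
                "VersionCreated",  # hypothetical event name
                json.dumps({"version_id": version_id, "node_id": node_id}),
            ),
        )
    return version_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
with conn:
    conn.execute("INSERT INTO blob VALUES ('abc123', 0)")

commit_version(conn, "node-1", "abc123")
try:
    commit_version(conn, "node-1", "abc123", fail_before_outbox=True)
except RuntimeError:
    pass  # the whole transaction was rolled back
```

After the simulated crash, the second VERSION row and its refcount increment are gone along with the missing outbox row, so there is never a committed aggregate change without its event.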


Storage Strategy per Layer

| Store | Role | Authoritative? | Rebuildable? |
|-------|------|----------------|--------------|
| PostgreSQL | Metadata source of truth: NODE, VERSION, BLOB refs, CHANGE journal, OUTBOX, quotas, audit log. Forward-only migrations (expand/contract, ADR-0004). | Yes | N/A (primary) |
| Redis | Ephemeral hot state: sessions, authz cache (short TTL), distributed locks, rate-limit counters, BFF aggregation cache. Never authoritative. | No | Yes (reconstructible from Postgres) |
| NATS JetStream | Event transport. Receives events from the outbox drainer; delivers to consumers. At-least-once; consumer acks advance the subject cursor. Durable subjects with DLQ. (ADR-0006) | No (outbox is the durable source) | Yes (replay from outbox) |
| OpenSearch | Derived full-text and metadata search index. Postgres-FTS is the fallback when OpenSearch is unavailable (ADR-0009). | No | Yes (replay NodeChanged events) |
| Object Storage | Blob bytes: MinIO / S3 / R2 / GCS / Azure behind a provider interface (ADR-0005). Addressed by BLAKE3(content), tenant-prefixed keys. Never traverses compute (ADR-0011). | Yes (for bytes) | No (source data) |
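
The "Rebuildable?" column rests on derived stores being pure functions of the journal. The sketch below shows that idea for a toy search projection. The `Change` record, its `name` field, and `rebuild_index` are hypothetical: the real CHANGE row carries a version_id rather than a name, so treat the payload shape as an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Change:
    seq: int
    node_id: str
    op: str                      # "create" | "update" | "move" | "delete"
    name: Optional[str] = None   # assumed payload field, for illustration only


def rebuild_index(journal):
    """Rebuild a node_id -> name projection from scratch by replaying changes."""
    index = {}
    for change in sorted(journal, key=lambda c: c.seq):
        if change.op == "delete":
            index.pop(change.node_id, None)
        else:
            index[change.node_id] = change.name
    return index


journal = [
    Change(1, "n1", "create", "a.txt"),
    Change(2, "n2", "create", "b.txt"),
    Change(3, "n1", "move", "docs/a.txt"),
    Change(4, "n2", "delete"),
]
index = rebuild_index(journal)
```

Replaying the same journal always yields the same projection regardless of delivery order, which is what allows a lagging or corrupted index to be dropped and rebuilt rather than repaired.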

Multi-Tenancy

BitVault uses a shared-database, shared-schema model in v1, with RLS as the isolation boundary (ADR-0007).

Escalation path (documented, not built in v1):

| Load | Strategy |
|------|----------|
| Standard tenants | Shared DB + shared schema + RLS (v1 default) |
| Noisy / high-volume tenant | Schema-per-tenant (routing change only; tenant_id-everywhere makes this transparent to queries) |
| Enterprise / data-residency tenant | Database-per-tenant (same routing change; same query semantics) |

The tenant_id-on-every-row invariant means the escalation path is a routing change, not a rewrite.
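
One way to picture the routing change is a lookup that maps a tenant to the database and schema its queries run against. Since the escalation path is documented but not built in v1, everything below is hypothetical: the `Isolation` tiers, the `Placement` record, the DSN strings, and the schema naming scheme are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Isolation(Enum):
    SHARED = "shared"
    SCHEMA = "schema-per-tenant"
    DATABASE = "database-per-tenant"


@dataclass(frozen=True)
class Placement:
    dsn: str
    schema: str


SHARED_DSN = "postgresql://db-shared/bitvault"  # hypothetical connection string


def resolve_placement(tenant_id: str, tiers: dict) -> Placement:
    """Map a tenant to the database and schema its queries should target."""
    tier = tiers.get(tenant_id, Isolation.SHARED)
    if tier is Isolation.SCHEMA:
        return Placement(SHARED_DSN, f"tenant_{tenant_id}")
    if tier is Isolation.DATABASE:
        return Placement(f"postgresql://db-{tenant_id}/bitvault", "public")
    return Placement(SHARED_DSN, "public")


tiers = {"acme": Isolation.SCHEMA, "bank": Isolation.DATABASE}
standard = resolve_placement("small-co", tiers)
noisy = resolve_placement("acme", tiers)
resident = resolve_placement("bank", tiers)
```

Only the placement changes between tiers; the queries themselves keep filtering on tenant_id, which is why moving a tenant does not require rewriting them.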

