Data Model
The schema encodes three decisions that underpin every correctness guarantee:
- Node vs Blob separation: the namespace (NODE/VERSION) and the bytes (BLOB) are distinct. A namespace mutation never touches object storage. This is the dual-write firewall.
- Content-addressed, ref-counted blobs: BLOB.content_hash is the identity. Deduplication is free; GC is safe (refcount = 0 → eligible for deletion). Per-tenant scope prevents cross-tenant side-channels (ADR-0018).
- tenant_id on every owned row: Row-Level Security policies in Postgres filter by session tenant context. A forgotten WHERE tenant_id = ? in application code cannot leak across tenants; RLS is the boundary, app-layer scoping is defense-in-depth (ADR-0007).
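The content-addressed, ref-counted blob decision can be illustrated with a small in-memory sketch. It is not the actual implementation: `BlobIndex`, `add_reference`, `drop_reference`, and `gc_candidates` are hypothetical names, the dictionary stands in for the BLOB table, and BLAKE2b from the standard library stands in for BLAKE3.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Blob:
    size: int
    refcount: int = 0


class BlobIndex:
    """In-memory stand-in for the BLOB table, keyed by (tenant_id, content_hash)."""

    def __init__(self) -> None:
        self.blobs: dict[tuple[str, str], Blob] = {}

    @staticmethod
    def content_hash(data: bytes) -> str:
        # Assumption: BLAKE2b stands in for BLAKE3, which is not in the stdlib.
        return hashlib.blake2b(data, digest_size=32).hexdigest()

    def add_reference(self, tenant_id: str, data: bytes) -> str:
        """Record one more VERSION pointing at these bytes and return the hash."""
        digest = self.content_hash(data)
        # Identical bytes within a tenant reuse the existing row.
        blob = self.blobs.setdefault((tenant_id, digest), Blob(size=len(data)))
        blob.refcount += 1
        return digest

    def drop_reference(self, tenant_id: str, digest: str) -> None:
        """Record that one VERSION no longer points at the blob."""
        blob = self.blobs[(tenant_id, digest)]
        if blob.refcount == 0:
            raise ValueError("refcount would go negative")
        blob.refcount -= 1

    def gc_candidates(self) -> list[tuple[str, str]]:
        """Only blobs with refcount = 0 are eligible for deletion."""
        return sorted(key for key, blob in self.blobs.items() if blob.refcount == 0)


index = BlobIndex()
h1 = index.add_reference("tenant-a", b"hello")
h2 = index.add_reference("tenant-a", b"hello")  # same tenant: refcount goes to 2
h3 = index.add_reference("tenant-b", b"hello")  # other tenant: separate row
```

Keying by tenant as well as hash is what keeps deduplication per-tenant: the same bytes uploaded by two tenants produce two rows, so neither can infer the other's content from dedupe behaviour.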
See System Overview for how these entities move through the upload, sync, and event flows.
Entity-Relationship Diagram
erDiagram
TENANT ||--o{ USER : "has"
TENANT ||--o{ NODE : "scopes"
TENANT ||--o{ BLOB : "scopes"
TENANT ||--o{ CHANGE : "scopes"
TENANT ||--|| QUOTA : "has"
USER ||--o{ MEMBERSHIP : "has"
USER ||--o{ API_TOKEN : "owns"
USER ||--o{ DEVICE : "registers"
USER ||--o{ NODE : "owns"
NODE ||--o{ NODE : "parent-of"
NODE ||--o{ VERSION : "has"
NODE ||--o{ NODE_METADATA : "tagged-by"
NODE ||--o{ SHARE : "shared-via"
NODE ||--o{ CHANGE : "journaled-in"
VERSION }o--|| BLOB : "references"
BLOB ||--o{ MULTIPART_UPLOAD : "assembled-by"
SHARE }o--|| USER : "granted-to"
DEVICE ||--o{ CHANGE : "consumes-via-cursor"
TENANT {
uuid id PK
string name
string plan
string status
timestamptz created_at
}
USER {
uuid id PK
uuid tenant_id FK
string email
string password_hash
string status
}
MEMBERSHIP {
uuid id PK
uuid tenant_id FK
uuid user_id FK
string role
}
API_TOKEN {
uuid id PK
uuid tenant_id FK
uuid user_id FK
string token_hash
timestamptz expires_at
timestamptz revoked_at
}
NODE {
uuid id PK
uuid tenant_id FK
uuid parent_id FK
string type "file|folder"
string name
string path "materialized path"
uuid owner_id FK
uuid current_version_id FK
timestamptz deleted_at "soft-delete"
}
VERSION {
uuid id PK
uuid tenant_id FK
uuid node_id FK
string content_hash FK
bigint size
string mime
uuid created_by FK
timestamptz created_at
}
BLOB {
string content_hash PK "BLAKE3"
uuid tenant_id FK
string provider "minio|s3|r2|gcs|azure"
string bucket
string object_key
bigint size
int refcount
string state "staging|committed"
}
MULTIPART_UPLOAD {
uuid id PK
uuid tenant_id FK
string upload_id "provider id"
string staging_key
bigint declared_size
timestamptz expires_at
}
NODE_METADATA {
uuid id PK
uuid tenant_id FK
uuid node_id FK
string key
string value
}
SHARE {
uuid id PK
uuid tenant_id FK
uuid node_id FK
string kind "grant|link"
uuid grantee_id FK
string permission "view|edit|manage"
string link_token
string password_hash
timestamptz expires_at
int max_downloads
}
DEVICE {
uuid id PK
uuid tenant_id FK
uuid user_id FK
string name
bigint sync_cursor
}
CHANGE {
bigint seq PK "monotonic per tenant"
uuid tenant_id FK
uuid node_id FK
string op "create|update|move|delete"
uuid version_id FK
uuid actor_id FK
timestamptz at
}
QUOTA {
uuid tenant_id PK
bigint bytes_limit
bigint bytes_used
}
OUTBOX {
uuid id PK
uuid tenant_id FK
string aggregate
string event_type
jsonb payload
timestamptz published_at "null until published"
}
OUTBOX is intentionally not linked to any aggregate by foreign key (tenant_id is its only reference). It is written in the same transaction as the aggregate change (e.g. a VERSION insert) and drained asynchronously. It is the bridge from strong consistency to the event plane.
Key Invariants
These are correctness properties, not style guidelines. Each should be covered by a test.
| ID | Invariant | Consequence if violated |
|---|---|---|
| I1 | A VERSION may reference a BLOB only after bytes are HEAD-verified (size/etag/BLAKE3 hash). The VERSION insert, BLOB.refcount++, and the OUTBOX row are one transaction. | Committed metadata with no bytes → phantom file. Transaction failure leaves no version; staging blob GC’d. |
| I2 | BLOB.content_hash is the identity (BLAKE3, ADR-0016). Uploading identical bytes within a tenant increments refcount, not storage. GC deletes the object only when refcount = 0. | Storing the same bytes twice wastes storage. Premature deletion with live references corrupts files. |
| I3 | Every owned row carries tenant_id. Postgres RLS policies filter by session tenant context. App-layer WHERE tenant_id = ? is defense-in-depth, not the boundary. | Cross-tenant data leak. RLS is the last-resort guard (ADR-0007). |
| I4 | CHANGE.seq is a per-tenant monotonic counter. Device delta pull is WHERE tenant_id = ? AND seq > cursor ORDER BY seq. | Non-monotonic sequence breaks delta sync and cursor advancement. Journal is the sync spine (ADR-0024). |
| I5 | Move/rename/copy mutate NODE rows and path, never BLOB/objects. Copy creates a new NODE/VERSION referencing the same BLOB (refcount++). | Byte copies on rename waste storage and defeat content addressing. |
| I6 | OpenSearch documents, notification state, and usage meters are projections of CHANGE/events. They are never the source of truth and can be rebuilt by replaying the journal. | Treating a derived store as authoritative leads to split-brain when it lags or is rebuilt. |
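The delta pull in I4 can be sketched with an in-memory SQLite database standing in for Postgres. The table name `change_journal`, the `pull_delta` helper, and the `limit` parameter are assumptions made for illustration; only the WHERE/ORDER BY shape comes from the invariant.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE change_journal ("
    "tenant_id TEXT NOT NULL, seq INTEGER NOT NULL, node_id TEXT, op TEXT, "
    "PRIMARY KEY (tenant_id, seq))"
)
conn.executemany(
    "INSERT INTO change_journal VALUES (?, ?, ?, ?)",
    [
        ("t1", 1, "n1", "create"),
        ("t1", 2, "n1", "update"),
        ("t2", 1, "n9", "create"),  # another tenant's journal, never returned for t1
        ("t1", 3, "n2", "create"),
    ],
)


def pull_delta(conn, tenant_id, cursor, limit=100):
    """Return journal rows after `cursor` for one tenant, plus the new cursor."""
    rows = conn.execute(
        "SELECT seq, node_id, op FROM change_journal "
        "WHERE tenant_id = ? AND seq > ? ORDER BY seq LIMIT ?",
        (tenant_id, cursor, limit),
    ).fetchall()
    # The cursor only moves forward, and only as far as the last row delivered.
    new_cursor = rows[-1][0] if rows else cursor
    return rows, new_cursor


rows, cursor = pull_delta(conn, "t1", cursor=0)
# rows holds seq 1, 2, 3 for tenant t1; cursor is now 3
```

Because the new cursor is derived from the last row actually returned, a device that pulls in pages (a smaller `limit`) resumes exactly where it stopped, which is why a non-monotonic seq would break sync.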
:::warning Outbox Transactional Rule
OUTBOX rows must be written in the same transaction as the aggregate change. Never write aggregate + outbox in separate transactions. Splitting them creates a window where the event is lost if the process crashes after the aggregate commit but before the outbox insert — the classic dual-write bug this design is built to prevent.
:::
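A minimal sketch of the rule, again using in-memory SQLite in place of Postgres. The plural table names, the `commit_version` helper, and the `VersionCreated` event type are hypothetical; the point is only that the VERSION insert, the refcount increment, and the OUTBOX row share one transaction.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE versions (id TEXT PRIMARY KEY, node_id TEXT, content_hash TEXT);
    CREATE TABLE blobs (content_hash TEXT PRIMARY KEY, refcount INTEGER NOT NULL);
    CREATE TABLE outbox (id TEXT PRIMARY KEY, event_type TEXT NOT NULL,
                         payload TEXT NOT NULL, published_at TEXT);
    INSERT INTO blobs VALUES ('abc', 0);
    """
)


def commit_version(conn, node_id, content_hash, event_type="VersionCreated"):
    """Insert the version, bump the refcount, and enqueue the event atomically."""
    version_id = str(uuid.uuid4())
    with conn:  # commits on success, rolls back everything on any exception
        conn.execute(
            "INSERT INTO versions VALUES (?, ?, ?)",
            (version_id, node_id, content_hash),
        )
        updated = conn.execute(
            "UPDATE blobs SET refcount = refcount + 1 WHERE content_hash = ?",
            (content_hash,),
        )
        if updated.rowcount != 1:
            # No verified blob row: abort so no phantom version is committed.
            raise LookupError(f"no blob with hash {content_hash}")
        payload = json.dumps({"version_id": version_id, "node_id": node_id})
        conn.execute(
            "INSERT INTO outbox VALUES (?, ?, ?, NULL)",  # published_at stays NULL
            (str(uuid.uuid4()), event_type, payload),
        )
    return version_id


version_id = commit_version(conn, "node-1", "abc")
```

If the blob lookup fails, the exception leaves the `with conn:` block and the version insert is rolled back with it, so there is never a committed aggregate without its outbox row or the reverse.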
Storage Strategy per Layer
| Store | Role | Authoritative? | Rebuildable? |
|---|---|---|---|
| PostgreSQL | Metadata source of truth: NODE, VERSION, BLOB refs, CHANGE journal, OUTBOX, quotas, audit log. Forward-only migrations (expand/contract, ADR-0004). | Yes | N/A (primary) |
| Redis | Ephemeral hot state: sessions, authz cache (short TTL), distributed locks, rate-limit counters, BFF aggregation cache. Never authoritative. | No | Yes (reconstructible from Postgres) |
| NATS JetStream | Event transport. Receives events from the outbox drainer; delivers to consumers. At-least-once; consumer acks advance the subject cursor. Durable subjects with DLQ. (ADR-0006) | No (outbox is the durable source) | Yes (replay from outbox) |
| OpenSearch | Derived full-text and metadata search index. Postgres-FTS is the fallback when OpenSearch is unavailable (ADR-0009). | No | Yes (replay NodeChanged events) |
| Object Storage | Blob bytes: MinIO / S3 / R2 / GCS / Azure behind a provider interface (ADR-0005). Addressed by BLAKE3(content), tenant-prefixed keys. Never traverses compute (ADR-0011). | Yes (for bytes) | No (source data) |
Multi-Tenancy
BitVault uses a shared database, shared schema model in v1 with RLS as the isolation boundary (ADR-0007).
- Every owned row carries tenant_id (enforced at insert via triggers; readable via RLS policy).
- The Postgres session variable app.current_tenant_id is set by the connection pool on checkout; RLS policies filter all reads and writes by this value.
- Application-layer WHERE tenant_id = ? clauses are defense-in-depth, not the primary guard.
- Object storage keys are tenant-prefixed (/{tenant_id}/{content_hash}) so storage-side isolation and per-tenant lifecycle/quota policies work regardless of DB topology.
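The tenant-prefixed key layout from the last bullet might be built and parsed as follows. This is a sketch under assumptions: `object_key` and `tenant_of` are hypothetical helpers, and the hash is assumed to be stored as a 64-character lowercase hex string (BLAKE3's default 256-bit output).

```python
import re
import uuid

# Assumption: content hashes are 64 lowercase hex characters.
_HEX_DIGEST = re.compile(r"[0-9a-f]{64}")


def object_key(tenant_id: uuid.UUID, content_hash: str) -> str:
    """Build the /{tenant_id}/{content_hash} key for a blob."""
    if not _HEX_DIGEST.fullmatch(content_hash):
        raise ValueError("content_hash must be 64 lowercase hex characters")
    return f"/{tenant_id}/{content_hash}"


def tenant_of(key: str) -> uuid.UUID:
    """Recover the owning tenant from a key without consulting the database."""
    return uuid.UUID(key.split("/")[1])


tenant = uuid.UUID("00000000-0000-0000-0000-000000000001")
key = object_key(tenant, "a" * 64)
```

Since the tenant is recoverable from the key alone, storage-side lifecycle or quota rules can be scoped by prefix regardless of how the database is laid out.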
Escalation path (documented, not built in v1):
| Load | Strategy |
|---|---|
| Standard tenants | Shared DB + shared schema + RLS (v1 default) |
| Noisy / high-volume tenant | Schema-per-tenant (routing change only; tenant_id-everywhere makes this transparent to queries) |
| Enterprise / data-residency tenant | Database-per-tenant (same routing change; same query semantics) |
The tenant_id-on-every-row invariant means the escalation path is a routing change, not a rewrite.
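To make the "routing change" concrete, here is one way a resolver could map a tenant's isolation tier to a connection target. Everything in it is hypothetical (the `Isolation` enum, the `Route` type, the DSNs, and the schema naming); v1 only ever takes the shared branch.

```python
from dataclasses import dataclass
from enum import Enum


class Isolation(Enum):
    SHARED = "shared-schema"
    SCHEMA = "schema-per-tenant"
    DATABASE = "database-per-tenant"


@dataclass(frozen=True)
class Route:
    dsn: str
    schema: str


SHARED_DSN = "postgresql://db-shared/bitvault"  # hypothetical DSN


def resolve_route(tenant_id: str, isolation: Isolation) -> Route:
    """Pick where a tenant's queries run; the queries themselves do not change."""
    if isolation is Isolation.SHARED:
        return Route(SHARED_DSN, "public")
    if isolation is Isolation.SCHEMA:
        return Route(SHARED_DSN, f"tenant_{tenant_id}")
    return Route(f"postgresql://db-{tenant_id}/bitvault", "public")


route = resolve_route("acme", Isolation.SCHEMA)
```

Only the connection target varies between tiers; because every row already carries tenant_id, the same statements and RLS policies apply after a tenant is moved.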