ADR-0006 — NATS JetStream event backbone + transactional outbox
V1 Freeze (2026-06-12): Accepted (tiered). The transactional outbox + in-process bus interface are V1. The NATS JetStream implementation is deferred to P3 (swap behind the bus interface, no rewrite).
Context
BitVault is event-driven: namespace changes must fan out to the sync journal, search index, notifications, metering, and audit (04). There are two hazards. (1) The dual write between the DB and the broker: if we commit the DB transaction and then publish, a crash in between loses the event, while publish-then-commit can emit phantom events for changes that never committed (R2/R6). (2) Premature async: putting a broker in front of what are still in-process calls in the v1 monolith is overengineering (Ledger).
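The dual-write hazard can be made concrete with a toy model. The sketch below is not BitVault code: the `system` type, the injected `crash` flag, and the `node.created` subject name are all assumptions made for illustration. It only shows that either ordering of "commit" and "publish" leaves the DB and the broker out of step when the process dies between the two writes.

```go
package main

import (
	"errors"
	"fmt"
)

// errCrash stands in for a process crash between the two writes.
var errCrash = errors.New("simulated crash")

// system is a toy model: rows is committed DB state, events is what the
// broker has received. Both are hypothetical stand-ins.
type system struct {
	rows   []string
	events []string
}

// commitThenPublish writes the DB first. A crash after the commit means
// the event is never published: a lost event.
func (s *system) commitThenPublish(node string, crash bool) error {
	s.rows = append(s.rows, node) // DB commit succeeds
	if crash {
		return errCrash // process dies before the publish
	}
	s.events = append(s.events, "node.created:"+node)
	return nil
}

// publishThenCommit publishes first. A crash before the commit leaves an
// event for a change that never happened: a phantom event.
func (s *system) publishThenCommit(node string, crash bool) error {
	s.events = append(s.events, "node.created:"+node)
	if crash {
		return errCrash // process dies before the commit
	}
	s.rows = append(s.rows, node)
	return nil
}

// hazards runs both orderings with a crash injected and reports whether
// each one diverged in the expected way.
func hazards() (lostEvent, phantomEvent bool) {
	a := &system{}
	_ = a.commitThenPublish("n1", true)
	lostEvent = len(a.rows) == 1 && len(a.events) == 0

	b := &system{}
	_ = b.publishThenCommit("n1", true)
	phantomEvent = len(b.rows) == 0 && len(b.events) == 1
	return lostEvent, phantomEvent
}

func main() {
	lost, phantom := hazards()
	fmt.Println("commit-then-publish lost the event:", lost)
	fmt.Println("publish-then-commit emitted a phantom:", phantom)
}
```

Neither ordering is safe on its own, because the two writes are not covered by a single atomic commit.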
Decision
- Transactional outbox is the only event source. Domain mutations write an outbox row in the same Postgres transaction as the state change (ADR-0004). A drainer publishes unpublished rows and marks them published. This guarantees at-least-once delivery with no lost/phantom events.
- An event-bus interface (internal/platform/bus) abstracts publish/subscribe.
  - v1 (monolith): an in-process implementation; the drainer dispatches to in-process subscribers. No external broker required (Ledger).
  - P3 onward: a NATS JetStream implementation; the drainer publishes to subjects (node.*, blob.*, share.*), and consumers are durable JetStream consumers. Swapping is a config change, not a rewrite (ADR-0001).
- Consumers are idempotent (dedup on event id), per-aggregate ordered, and have dead-letter handling for poison messages. Derived stores are reconstructible by replay (I6).
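The drainer and the bus interface described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual contents of internal/platform/bus: the `Event` fields, the `Bus` method set, `InProcBus`, `MemOutbox` (an in-memory stand-in for the Postgres outbox table), and the `node.created` subject are all hypothetical. The in-process bus here matches subjects exactly and does not implement wildcard matching.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Event mirrors one outbox row. The field names are assumptions.
type Event struct {
	ID      string // unique event id; consumers dedup on it
	Subject string // e.g. "node.created" (hypothetical name under node.*)
}

// Handler processes one event; a non-nil error means "not delivered".
type Handler func(ctx context.Context, e Event) error

// Bus is a guess at the shape internal/platform/bus might expose.
type Bus interface {
	Publish(ctx context.Context, e Event) error
	Subscribe(subject string, h Handler)
}

// InProcBus is the v1 tier: synchronous dispatch with exact subject match.
// It is not safe for concurrent use; a real one would add locking.
type InProcBus struct{ subs map[string][]Handler }

func NewInProcBus() *InProcBus { return &InProcBus{subs: map[string][]Handler{}} }

func (b *InProcBus) Subscribe(subject string, h Handler) {
	b.subs[subject] = append(b.subs[subject], h)
}

func (b *InProcBus) Publish(ctx context.Context, e Event) error {
	for _, h := range b.subs[e.Subject] {
		if err := h(ctx, e); err != nil {
			return err
		}
	}
	return nil
}

// MemOutbox is an in-memory stand-in for the Postgres outbox table.
type MemOutbox struct {
	rows      []Event
	published map[string]bool
}

// Unpublished returns up to limit pending rows in insertion order.
func (o *MemOutbox) Unpublished(limit int) []Event {
	var out []Event
	for _, e := range o.rows {
		if !o.published[e.ID] && len(out) < limit {
			out = append(out, e)
		}
	}
	return out
}

// DrainOnce publishes one batch. It stops at the first failure so later
// events cannot overtake an earlier one, and it marks a row only after a
// successful publish, which is what makes delivery at-least-once.
func DrainOnce(ctx context.Context, ob *MemOutbox, bus Bus, limit int) (int, error) {
	n := 0
	for _, e := range ob.Unpublished(limit) {
		if err := bus.Publish(ctx, e); err != nil {
			return n, err // row stays unpublished and is retried next pass
		}
		ob.published[e.ID] = true
		n++
	}
	return n, nil
}

// demo drains two rows through a subscriber that fails once on evt-2.
func demo() (counts []int, seen []string) {
	bus := NewInProcBus()
	failed := false
	bus.Subscribe("node.created", func(_ context.Context, e Event) error {
		if e.ID == "evt-2" && !failed {
			failed = true
			return errors.New("transient failure")
		}
		seen = append(seen, e.ID)
		return nil
	})
	ob := &MemOutbox{
		rows:      []Event{{ID: "evt-1", Subject: "node.created"}, {ID: "evt-2", Subject: "node.created"}},
		published: map[string]bool{},
	}
	for i := 0; i < 3; i++ {
		n, _ := DrainOnce(context.Background(), ob, bus, 10)
		counts = append(counts, n)
	}
	return counts, seen
}

func main() {
	counts, seen := demo()
	fmt.Println(counts, seen) // per-pass publish counts, then delivery order
}
```

Because `DrainOnce` depends only on the `Bus` interface, a JetStream-backed implementation could be substituted without touching the drainer. Stopping at the first failure gives global ordering in this sketch, which is stricter than the per-aggregate ordering the decision requires.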
Consequences
Positive
- Closes the DB⇄broker dual-write hole; exactly the same correctness story as the object-store commit protocol (R2).
- The async plane can fail/scale independently without blocking the control plane (NFR-5).
- v1 stays simple (no broker to operate) while the code is already event-shaped, so introducing NATS later is low-risk.
Negative / costs
- At-least-once forces idempotent consumers everywhere (real work, enforced by design + tests).
- Eventual consistency leaks to the UI (a file is committed before it’s searchable) — the UI must show “processing…” states (NFR-5).
- A poison-message DLQ + redrive tooling must exist.
- The outbox drainer is a component to monitor (lag = consumer-lag metric, NFR-7).
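The idempotency and dead-letter costs listed above might look like this on the consumer side. It is a sketch only: the `Consumer` type, the `maxAttempts` threshold, and the in-memory maps are assumptions. A durable consumer would persist processed event ids, ideally in the same transaction as its side effect, and would hand dead-lettered events to whatever redrive tooling exists.

```go
package main

import (
	"errors"
	"fmt"
)

// Event carries the unique id that consumers dedup on.
type Event struct {
	ID      string
	Payload string
}

// Consumer wraps a handler with dedup on event id and dead-lettering.
// All state is in memory here purely for illustration.
type Consumer struct {
	handle      func(Event) error
	maxAttempts int
	processed   map[string]bool
	attempts    map[string]int
	DLQ         []Event
}

func NewConsumer(maxAttempts int, handle func(Event) error) *Consumer {
	return &Consumer{
		handle:      handle,
		maxAttempts: maxAttempts,
		processed:   map[string]bool{},
		attempts:    map[string]int{},
	}
}

// Deliver returns nil when the event can be acknowledged (processed,
// duplicate, or dead-lettered) and an error when it should be redelivered.
func (c *Consumer) Deliver(e Event) error {
	if c.processed[e.ID] {
		return nil // duplicate delivery: ack without repeating the side effect
	}
	if err := c.handle(e); err != nil {
		c.attempts[e.ID]++
		if c.attempts[e.ID] < c.maxAttempts {
			return err // ask for redelivery
		}
		c.DLQ = append(c.DLQ, e) // poison message: park it for redrive
		delete(c.attempts, e.ID) // a later redrive starts with a fresh budget
		return nil
	}
	c.processed[e.ID] = true
	return nil
}

// demo delivers one good event twice and one poison event until it is
// acknowledged, then reports how often the side effect ran and the DLQ size.
func demo() (applied int, deadLettered int) {
	c := NewConsumer(3, func(e Event) error {
		if e.Payload == "poison" {
			return errors.New("cannot decode payload")
		}
		applied++ // the side effect, e.g. a metering increment
		return nil
	})
	good := Event{ID: "evt-1", Payload: "ok"}
	_ = c.Deliver(good)
	_ = c.Deliver(good) // at-least-once redelivery of the same event

	bad := Event{ID: "evt-2", Payload: "poison"}
	for c.Deliver(bad) != nil {
		// the bus keeps redelivering until the consumer acknowledges
	}
	return applied, len(c.DLQ)
}

func main() {
	applied, deadLettered := demo()
	fmt.Println("side effect applied:", applied, "dead-lettered:", deadLettered)
}
```

Dead-lettered events are deliberately not marked as processed, so a redrive can run them through the handler again once the underlying fault is fixed.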
Alternatives considered
- Publish directly from app code (no outbox): rejected — reintroduces the lost/phantom-event dual-write bug.
- Change Data Capture (e.g. Debezium on the WAL): powerful but operationally heavy and deferred for v1; the outbox is simpler, self-host-friendly, and sufficient. Revisit if we need DB-agnostic capture at scale.
- Kafka instead of NATS: rejected for this scale/footprint. NATS JetStream is lighter to operate and self-host, fits the tiered-dependency goal (ADR-0012), and meets ordering/durability needs. Revisit only if the Kafka-specific ecosystem is needed.
- Broker from day one (no in-proc bus): rejected — overengineering for the v1 monolith (Ledger); the interface gives us the swap for free.