ADR-0013 — OpenTelemetry for traces, metrics, and logs from day one

Status: Accepted
Date: 2026-06-11
Related: 01 R10, 03 NFR-7, 09 P0

Context

An event-driven system that will become distributed is undebuggable without correlation across REST → gRPC → NATS → workers (R10). Observability is frequently bolted on late, by which point the patterns that make it useful are missing and costly to retrofit. For BitVault it is also an explicit portfolio competency. The async/eventual-consistency design (ADR-0006) makes “where did this file’s indexing go?” a routine question that only tracing can answer cheaply.

Decision

Adopt OpenTelemetry (OTel) from commit #1 (P0) — vendor-neutral SDK for traces, metrics, and logs across every component.
One trace per user action. A trace/correlation ID is minted at the gateway and propagated through gRPC metadata and NATS message headers, so a single upload is one connected trace spanning gateway → File → Storage → bus → indexer (NFR-7 acceptance).
Metrics: RED (Rate/Errors/Duration) per endpoint and gRPC method; USE (Utilization/Saturation/Errors) for infra; domain gauges for outbox lag, consumer lag, search-index lag, and GC backlog, which are the eventual-consistency health signals (ADR-0006/0009).
Logs: structured JSON, correlated by trace ID; no PII in logs.
Health: /healthz (liveness) and /readyz (readiness, incl. dependency checks) on every component; graceful shutdown drains in-flight work.
Export to an OTel Collector; backends are deployment choices (e.g. Prometheus + Tempo/Jaeger + Loki, or a vendor) — not hard-coded.
SLOs + error budgets defined for the NFR-3 latency and NFR-1 availability targets, with alerts on burn rate.
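
To make "alerts on burn rate" concrete, here is a minimal sketch of the arithmetic. Burn rate is the observed failure ratio divided by the error budget (1 minus the SLO target), so 1.0 means the budget lasts exactly the SLO period. The 99.9% target, the 14.4 threshold, the window sizes, and all names (Window, BurnRate, ShouldPage) are illustrative assumptions; this ADR does not fix those values.

```go
package main

import "fmt"

// Window holds request counts observed over one lookback window.
type Window struct {
	Total  float64 // all requests seen in the window
	Failed float64 // requests that violated the SLO (errors or too slow)
}

// BurnRate reports how fast the error budget is being spent.
// 1.0 means the budget lasts exactly the SLO period; higher burns faster.
func BurnRate(w Window, sloTarget float64) float64 {
	if w.Total == 0 {
		return 0 // no traffic, nothing burned
	}
	budget := 1 - sloTarget // allowed failure ratio
	return (w.Failed / w.Total) / budget
}

// ShouldPage fires only when both a long and a short window exceed the
// threshold: the long window shows the burn is significant, the short
// window shows it is still happening.
func ShouldPage(long, short Window, sloTarget, threshold float64) bool {
	return BurnRate(long, sloTarget) >= threshold &&
		BurnRate(short, sloTarget) >= threshold
}

func main() {
	const slo = 0.999      // hypothetical availability target
	const threshold = 14.4 // hypothetical fast-burn threshold

	lastHour := Window{Total: 100000, Failed: 2000}
	lastFiveMin := Window{Total: 8000, Failed: 200}

	fmt.Printf("1h burn rate: %.1f\n", BurnRate(lastHour, slo))
	fmt.Printf("5m burn rate: %.1f\n", BurnRate(lastFiveMin, slo))
	fmt.Println("page:", ShouldPage(lastHour, lastFiveMin, slo, threshold))
}
```

In a real deployment this evaluation would more likely live in the metrics backend's alert rules than in application code; the sketch only shows what such a rule computes.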

Consequences

Positive

Distributed debugging is possible before the system is distributed — extraction (P4) is safe because we can prove behavior with traces before and after.
Eventual-consistency lag is measured, not guessed (outbox/consumer/index lag).
Vendor-neutral: self-host and SaaS pick their own backends behind the Collector.
Satisfies a portfolio goal with a real, load-bearing implementation (not a demo).

Negative / costs

Instrumentation costs upfront effort on every code path and adds a small runtime overhead (tunable through trace sampling). Justified: it is cheaper than debugging blind later.
Trace context propagation through NATS requires disciplined header handling (centralized in internal/platform/bus).
An observability backend is another thing to run; for lite self-host it can be minimal or optional (logs + basic metrics), with the full stack reserved for SaaS.
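
The header discipline mentioned above can be sketched with the standard library alone. The example writes and reads a W3C Trace Context traceparent value (version, 32-hex trace ID, 16-hex span ID, 2-hex flags) on a string-keyed multimap, which is the shape shared by gRPC metadata and NATS message headers. It is an illustration of the mechanism, not the project's code: Headers, SpanContext, NewRoot, Inject, and Extract are hypothetical names, validation is simplified, and a real implementation would most likely delegate to the OTel SDK's propagators instead of parsing by hand.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"strings"
)

// Lowercase key: works for gRPC metadata and for case-sensitive NATS headers.
const traceparentKey = "traceparent"

// Headers stands in for gRPC metadata or NATS message headers.
type Headers map[string][]string

// SpanContext is the minimal state that must cross a process boundary.
type SpanContext struct {
	TraceID string // 32 hex chars, constant for the whole user action
	SpanID  string // 16 hex chars, identifies the current hop
	Sampled bool
}

func randomHex(nBytes int) string {
	b := make([]byte, nBytes)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// NewRoot mints a new trace, as the gateway would for each user action.
func NewRoot() SpanContext {
	return SpanContext{TraceID: randomHex(16), SpanID: randomHex(8), Sampled: true}
}

// Child keeps the trace ID and allocates a fresh span ID for the next hop.
func (sc SpanContext) Child() SpanContext {
	return SpanContext{TraceID: sc.TraceID, SpanID: randomHex(8), Sampled: sc.Sampled}
}

// Inject writes the context into outgoing headers.
func Inject(sc SpanContext, h Headers) {
	flags := "00"
	if sc.Sampled {
		flags = "01"
	}
	h[traceparentKey] = []string{"00-" + sc.TraceID + "-" + sc.SpanID + "-" + flags}
}

// Extract parses incoming headers; ok is false when the header is absent
// or malformed, in which case the caller should start a new trace.
func Extract(h Headers) (sc SpanContext, ok bool) {
	vals := h[traceparentKey]
	if len(vals) == 0 {
		return SpanContext{}, false
	}
	p := strings.Split(vals[0], "-")
	if len(p) != 4 || p[0] != "00" || len(p[1]) != 32 || len(p[2]) != 16 || len(p[3]) != 2 {
		return SpanContext{}, false
	}
	for _, field := range p[1:3] {
		if _, err := hex.DecodeString(field); err != nil {
			return SpanContext{}, false
		}
	}
	flags, err := hex.DecodeString(p[3])
	if err != nil {
		return SpanContext{}, false
	}
	return SpanContext{TraceID: p[1], SpanID: p[2], Sampled: flags[0]&1 == 1}, true
}

func main() {
	// Gateway: mint the trace and call the File service over gRPC.
	root := NewRoot()
	grpcMD := Headers{}
	Inject(root, grpcMD)

	// File service: continue the trace, then publish to the bus.
	in, ok := Extract(grpcMD)
	if !ok {
		panic("missing trace context")
	}
	natsHdr := Headers{}
	Inject(in.Child(), natsHdr)

	// Indexer worker: same trace ID, different span ID.
	got, _ := Extract(natsHdr)
	fmt.Println("same trace:", got.TraceID == root.TraceID)
	fmt.Println("new span:", got.SpanID != root.SpanID)
}
```

Keeping Inject and Extract in one package means publishers and subscribers cannot forget the header or disagree on its format, which is the reason for centralizing this in a single place.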

Alternatives considered

Add observability later: rejected — the correlation patterns must exist from the start or retrofitting is painful and the P4 extraction is unsafe.
Vendor-specific agents/APM: rejected as the primary layer — couples us to a vendor and hurts self-host portability; OTel keeps the backend swappable.
Logs only: rejected — insufficient to trace async, cross-component flows.