ADR-0013 — OpenTelemetry for traces, metrics, and logs from day one
Context
An event-driven system that will become distributed is undebuggable without correlation across REST → gRPC → NATS → workers (R10). Observability is frequently bolted on late, by which point the patterns that make it useful are missing. For BitVault it is also an explicit portfolio competency. The async/eventual-consistency design (ADR-0006) makes “where did this file’s indexing go?” a routine question that only tracing can answer cheaply.
Decision
- Adopt OpenTelemetry (OTel) from commit #1 (P0) — vendor-neutral SDK for traces, metrics, and logs across every component.
- One trace per user action. A trace/correlation ID is minted at the gateway and propagated through gRPC metadata and NATS message headers, so a single upload is one connected trace spanning gateway → File → Storage → bus → indexer (NFR-7 acceptance).
- Metrics: RED (Rate/Errors/Duration) per endpoint and gRPC method; USE for infra; domain gauges for outbox lag, consumer lag, search-index lag, GC backlog — the eventual-consistency health signals (ADR-0006/0009).
- Logs: structured JSON, correlated by trace ID; no PII in logs.
- Health: /healthz (liveness) and /readyz (readiness, incl. dependency checks) on every component; graceful shutdown drains in-flight work.
- Export to an OTel Collector; backends are deployment choices (e.g. Prometheus + Tempo/Jaeger + Loki, or a vendor), not hard-coded.
- SLOs + error budgets defined for the NFR-3 latency and NFR-1 availability targets, with alerts on burn rate.
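
As a rough illustration of burn-rate alerting, the sketch below computes how fast an error budget is being consumed and applies a two-window paging rule. Everything concrete in it is an assumption made for the example: the 99.9% target (the actual NFR-1 and NFR-3 numbers are defined elsewhere), the 14.4 threshold (a common starting point, equivalent to spending 2% of a 30-day budget in one hour), and the helper names burnRate and shouldPage.

```go
package main

import (
	"fmt"
	"math"
)

// burnRate returns how fast the error budget is being consumed.
// errorRatio is failed/total requests in the observation window;
// sloTarget is the objective as a fraction, e.g. 0.999 (assumed value).
// A result of 1.0 means the budget lasts exactly the SLO period.
func burnRate(errorRatio, sloTarget float64) float64 {
	budget := 1 - sloTarget
	if budget <= 0 {
		// A 100% target leaves no budget, so any burn is unbounded.
		return math.Inf(1)
	}
	return errorRatio / budget
}

// shouldPage applies a two-window rule: page only when both the long and
// the short window exceed the threshold, so a spike that has already
// recovered does not wake anyone up.
func shouldPage(longWindow, shortWindow, threshold float64) bool {
	return longWindow >= threshold && shortWindow >= threshold
}

func main() {
	const slo = 0.999 // hypothetical availability target

	long := burnRate(0.02, slo)   // 2% errors over the last hour
	short := burnRate(0.018, slo) // 1.8% errors over the last 5 minutes

	fmt.Printf("1h burn=%.1f 5m burn=%.1f page=%v\n",
		long, short, shouldPage(long, short, 14.4))
}
```

In a real deployment this arithmetic would more likely live in the alerting rules of whichever metrics backend sits behind the Collector; the Go form is only meant to pin down the definition.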
Consequences
Positive
- Distributed debugging is possible before the system is distributed — extraction (P4) is safe because we can prove behavior with traces before and after.
- Eventual-consistency lag is measured, not guessed (outbox/consumer/index lag).
- Vendor-neutral: self-host and SaaS pick their own backends behind the Collector.
- Satisfies a portfolio goal with a real, load-bearing implementation (not a demo).
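
To make "measured, not guessed" concrete: one reasonable definition of outbox lag is the age of the oldest event that has been written but not yet published. The sketch below assumes a hypothetical outboxRow shape and hypothetical helper names; in a real service the value would more plausibly come from a query against the outbox table and be reported through an OTel asynchronous gauge, not computed over an in-memory slice.

```go
package main

import (
	"fmt"
	"time"
)

// outboxRow is a hypothetical shape for a transactional-outbox record.
type outboxRow struct {
	CreatedAt   time.Time
	PublishedAt *time.Time // nil until the relay has published the event
}

// outboxLag returns the age of the oldest unpublished row,
// or 0 when every row has been published.
func outboxLag(rows []outboxRow, now time.Time) time.Duration {
	var lag time.Duration
	for _, r := range rows {
		if r.PublishedAt != nil {
			continue // already on the bus, contributes no lag
		}
		if age := now.Sub(r.CreatedAt); age > lag {
			lag = age
		}
	}
	return lag
}

// sampleRows builds illustrative data: one published row and two pending.
func sampleRows() ([]outboxRow, time.Time) {
	now := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
	published := now.Add(-30 * time.Second)
	rows := []outboxRow{
		{CreatedAt: now.Add(-2 * time.Minute), PublishedAt: &published},
		{CreatedAt: now.Add(-45 * time.Second)},
		{CreatedAt: now.Add(-5 * time.Second)},
	}
	return rows, now
}

func main() {
	rows, now := sampleRows()
	fmt.Println("outbox lag:", outboxLag(rows, now)) // prints "outbox lag: 45s"
}
```

Consumer lag and search-index lag can be defined the same way, as the age of the oldest item the downstream stage has not yet acknowledged.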
Negative / costs
- Instrumentation costs upfront effort on every code path and adds a small runtime overhead (tunable via sampling). Justified: it is cheaper than debugging blind later.
- Trace context propagation through NATS requires disciplined header handling (centralized in internal/platform/bus).
- An observability backend is another thing to run; for lite self-host it can be minimal/optional (logs + basic metrics), full stack in SaaS.
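
The header discipline mentioned above comes down to two operations: inject the trace context when publishing, extract it when consuming. The sketch below shows that round trip using only the standard library and the W3C traceparent format (version-traceid-spanid-flags). It is a simplified stand-in, not the OTel SDK: the Headers and SpanContext types and the NewRoot, Child, Inject, and Extract names are all hypothetical. The Headers type is modelled on map[string][]string because, as far as I know, both gRPC metadata and NATS message headers in the common Go clients have that underlying shape.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"strings"
)

// Headers stands in for gRPC metadata or NATS message headers (assumed shape).
type Headers map[string][]string

const traceparentKey = "traceparent" // W3C Trace Context header name

// SpanContext is a simplified stand-in for an SDK span context.
type SpanContext struct {
	TraceID string // 32 lowercase hex characters
	SpanID  string // 16 lowercase hex characters
}

func randomHex(nBytes int) string {
	b := make([]byte, nBytes)
	if _, err := rand.Read(b); err != nil {
		panic(err) // no sensible recovery if the system RNG fails
	}
	return hex.EncodeToString(b)
}

// NewRoot mints a fresh trace, as the gateway would for a user action.
func NewRoot() SpanContext {
	return SpanContext{TraceID: randomHex(16), SpanID: randomHex(8)}
}

// Child keeps the trace ID and allocates a new span ID for the next hop.
func (sc SpanContext) Child() SpanContext {
	return SpanContext{TraceID: sc.TraceID, SpanID: randomHex(8)}
}

// Inject writes sc as a traceparent header (version 00, sampled flag set).
func Inject(sc SpanContext, h Headers) {
	h[traceparentKey] = []string{fmt.Sprintf("00-%s-%s-01", sc.TraceID, sc.SpanID)}
}

// Extract parses traceparent; ok is false if it is missing or malformed.
func Extract(h Headers) (sc SpanContext, ok bool) {
	vals := h[traceparentKey]
	if len(vals) == 0 {
		return SpanContext{}, false
	}
	parts := strings.Split(vals[0], "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return SpanContext{}, false
	}
	return SpanContext{TraceID: parts[1], SpanID: parts[2]}, true
}

func main() {
	root := NewRoot() // minted at the gateway

	// Publisher side: attach the context to the outgoing message headers.
	msg := Headers{}
	Inject(root.Child(), msg)

	// Consumer side (e.g. the indexer): continue the trace, or start a
	// new one if the publisher forgot to inject.
	parent, ok := Extract(msg)
	if !ok {
		parent = NewRoot()
	}
	span := parent.Child()

	fmt.Println("same trace:", span.TraceID == root.TraceID) // prints "same trace: true"
}
```

Keeping both calls inside the bus package means publishers and consumers cannot forget them. With the real SDK the same role would be played by its propagator and a small carrier adapter over the message headers.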
Alternatives considered
- Add observability later: rejected — the correlation patterns must exist from the start or retrofitting is painful and the P4 extraction is unsafe.
- Vendor-specific agents/APM: rejected as the primary layer — couples us to a vendor and hurts self-host portability; OTel keeps the backend swappable.
- Logs only: rejected — insufficient to trace async, cross-component flows.