Observability

BitVault adopts OpenTelemetry (ADR-0013) from the first commit. All three signal types — traces, metrics, and structured logs — are instrumented in every component using a vendor-neutral SDK. Backends are a deployment choice, not a code change.

:::note
Observability is not optional. The async/eventual-consistency design makes distributed tracing the only reliable way to answer “where did this file indexing go?” When a NodeChanged event travels from the commit transaction through NATS JetStream to the Search Indexer, no single log line tells the whole story; only a connected trace does.
:::
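To make "a connected trace" concrete, here is a minimal sketch of what has to cross the message broker for the consumer's span to join the producer's trace. It uses the W3C Trace Context `traceparent` header format with only the standard library. The function names (`inject`, `extract`, `new_span_id`) and the use of a plain dict for message headers are assumptions made for illustration; in BitVault the OpenTelemetry SDK's propagator would presumably do this work rather than hand-written code.

```python
import re
import secrets
from typing import Optional, Tuple

# W3C Trace Context "traceparent": version-traceid-parentid-flags, lowercase hex.
TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def new_span_id() -> str:
    """Return a random 8-byte span ID as 16 hex characters."""
    return secrets.token_hex(8)


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Producer side: write the current span's context into the message headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: dict) -> Optional[Tuple[str, str, bool]]:
    """Consumer side: return (trace_id, parent_span_id, sampled), or None if missing or invalid."""
    match = TRACEPARENT_RE.fullmatch(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid under the Trace Context specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


# Producer side (hypothetical publisher of a NodeChanged event).
trace_id = secrets.token_hex(16)
headers = {}
inject(headers, trace_id, new_span_id())

# Consumer side (hypothetical Search Indexer handler).
ctx = extract(headers)
if ctx is not None:
    parent_trace_id, parent_span_id, sampled = ctx
    child_span_id = new_span_id()  # a new span that stays inside the same trace
```

The consumer keeps the trace ID and records the extracted span ID as its parent, which is what lets a tracing backend stitch the commit, the broker hop, and the indexing step into a single trace.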

| Page | What it covers |
| --- | --- |
| OpenTelemetry Setup | SDK setup, context propagation, collector, health endpoints |
| Metrics | RED/USE/domain gauges, dashboards, alert thresholds |
| Distributed Tracing | Trace anatomy, key flows, sampling, debugging async events |
| SLOs & Error Budgets | SLO targets, error budgets, burn-rate alerts, canary gating |

Design principles

Instrument at the seam, not the leaf. Every gRPC handler, NATS consumer, and SQL repository emits spans. Leaf utilities do not.
One trace per user action. An upload, a sync pull, a share link resolve — each is one connected trace spanning every service that touched it.
No vendor lock-in. The OTel Collector is the only sink. Swapping Jaeger for Tempo, or Prometheus for a SaaS vendor, is an ops config change.
Async-plane canaries. Outbox lag and consumer lag metrics are first-class signals; persistent lag is an alert, not a debug clue found after the fact.
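The distinction between a lag spike and persistent lag in the last principle can be sketched as a small check over recent gauge samples. Everything here is an assumption for illustration: the `LagSample` type, the function names, and the 30-second threshold and 180-second hold time are hypothetical, not BitVault's actual alert thresholds. The idea is similar in spirit to a `for:` duration on a Prometheus alerting rule.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class LagSample:
    timestamp: float    # seconds since epoch when the gauge was read
    lag_seconds: float  # age of the oldest unprocessed outbox row or message


def breach_duration(samples: Iterable[LagSample], threshold_seconds: float) -> float:
    """Seconds for which lag has been continuously above the threshold,
    measured back from the newest sample. Returns 0.0 if the newest sample is healthy."""
    ordered = sorted(samples, key=lambda s: s.timestamp)
    if not ordered or ordered[-1].lag_seconds <= threshold_seconds:
        return 0.0
    start = ordered[-1].timestamp
    for sample in reversed(ordered):
        if sample.lag_seconds <= threshold_seconds:
            break  # the breach began after this healthy sample
        start = sample.timestamp
    return ordered[-1].timestamp - start


def should_alert(samples: Iterable[LagSample], threshold_seconds: float, hold_seconds: float) -> bool:
    """Alert only when the breach has lasted at least hold_seconds."""
    return breach_duration(samples, threshold_seconds) >= hold_seconds


# Hypothetical gauge readings taken once a minute.
readings = [1, 40, 45, 50, 60, 70]
samples = [LagSample(timestamp=60.0 * i, lag_seconds=lag) for i, lag in enumerate(readings)]
alerting = should_alert(samples, threshold_seconds=30.0, hold_seconds=180.0)
```

A single high reading followed by recovery yields a breach duration of zero, so only lag that stays above the threshold for the whole hold time pages anyone.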
