Observability
BitVault adopts OpenTelemetry (ADR-0013) from the first commit. All three signal types — traces, metrics, and structured logs — are instrumented in every component using a vendor-neutral SDK. Backends are a deployment choice, not a code change.
:::note
Observability is not optional. The async, eventually consistent design makes distributed tracing the only reliable way to answer “where did this file's indexing go?” When a NodeChanged event travels from the commit transaction through NATS JetStream to the Search Indexer, no single log line tells the whole story; only a connected trace does.
:::
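To illustrate what keeps a trace connected across an async hop, here is a minimal standard-library sketch of carrying a W3C Trace Context `traceparent` value in message headers. It is not BitVault's implementation: in practice the OpenTelemetry SDK's propagation API does this work, and the function names (`start_trace`, `inject`, `extract`) and the plain `dict` standing in for message headers are assumptions made for illustration.

```python
import re
import secrets
from typing import Optional

# W3C Trace Context header, version 00: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id() -> str:
    """Return a random 8-byte span id as 16 hex characters."""
    return secrets.token_hex(8)


def start_trace() -> dict:
    """Create a root span context (hypothetically, inside the commit transaction)."""
    return {"trace_id": secrets.token_hex(16), "span_id": new_span_id(), "sampled": True}


def inject(ctx: dict, headers: dict) -> None:
    """Publisher side: write the span context into the message headers."""
    flags = "01" if ctx["sampled"] else "00"
    headers["traceparent"] = f"00-{ctx['trace_id']}-{ctx['span_id']}-{flags}"


def extract(headers: dict) -> Optional[dict]:
    """Consumer side: read the publisher's context and derive a child span context."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        # No usable context: the consumer would start a new, disconnected trace.
        return None
    trace_id, parent_id, flags = match.groups()
    return {
        "trace_id": trace_id,         # same trace as the publisher
        "parent_span_id": parent_id,  # links the consumer span to the publisher span
        "span_id": new_span_id(),
        "sampled": int(flags, 16) & 1 == 1,
    }


# Usage: the headers dict stands in for the headers of a published message.
publisher_ctx = start_trace()
message_headers = {}
inject(publisher_ctx, message_headers)
consumer_ctx = extract(message_headers)
```

Because the consumer reuses the publisher's trace id and records the publisher's span id as its parent, a tracing backend can stitch both sides into one trace. If the header is missing or malformed, the link is lost and the event shows up as an unrelated trace.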
Contents
| Page | What it covers |
|---|---|
| OpenTelemetry Setup | SDK setup, context propagation, collector, health endpoints |
| Metrics | RED/USE/domain gauges, dashboards, alert thresholds |
| Distributed Tracing | Trace anatomy, key flows, sampling, debugging async events |
| SLOs & Error Budgets | SLO targets, error budgets, burn-rate alerts, canary gating |
Design principles
- Instrument at the seam, not the leaf. Every gRPC handler, NATS consumer, and SQL repository emits spans. Leaf utilities do not.
- One trace per user action. An upload, a sync pull, a share-link resolution: each is one connected trace spanning every service that touched it.
- No vendor lock-in. The OTel Collector is the only sink. Swapping Jaeger for Tempo, or Prometheus for a SaaS vendor, is an ops config change.
- Async-plane canaries. Outbox lag and consumer lag metrics are first-class signals; persistent lag is an alert, not a debug clue found after the fact.
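The last principle separates persistent lag from a momentary spike. One way to express that distinction is to fire only after several consecutive samples exceed a threshold. The sketch below assumes that approach; the class name `LagAlert`, the 30-second threshold, and the three-sample window are hypothetical values for illustration, not BitVault's actual alert rules.

```python
from dataclasses import dataclass


@dataclass
class LagAlert:
    """Fires only when lag stays above the threshold for consecutive samples."""

    threshold_seconds: float  # assumed threshold; lag above this counts as a breach
    required_breaches: int    # consecutive breaching samples needed before firing
    _streak: int = 0          # current run of breaching samples

    def observe(self, lag_seconds: float) -> bool:
        """Record one lag sample and return True while the alert is firing."""
        if lag_seconds > self.threshold_seconds:
            self._streak += 1
        else:
            self._streak = 0  # a healthy sample resets the streak
        return self._streak >= self.required_breaches


# Usage: a single spike (45 s) does not fire; three breaches in a row do.
alert = LagAlert(threshold_seconds=30, required_breaches=3)
samples = [5, 45, 10, 40, 50, 60]
firing = [alert.observe(s) for s in samples]
# firing == [False, False, False, False, False, True]
```

In a real deployment this logic would more likely live in the metrics backend's alerting rules than in application code; the sketch only shows the "persistent, not transient" condition.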