Metrics

BitVault follows three complementary metric strategies: RED for request-oriented endpoints, USE for infrastructure resources, and domain gauges that reflect the health of the eventual-consistency async plane.

RED Metrics (Request-oriented)

BitVault emits one RED triplet (rate, errors, duration) per gRPC method and per REST endpoint. Labels include a hashed tenant_id, so per-tenant SLO tracking is possible without exposing raw identifiers.
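How the tenant identifier is hashed is not specified here. As a minimal sketch, assuming a keyed HMAC-SHA256 truncated to 16 hex characters (the function name `tenantLabel`, the key, and the truncation length are all hypothetical), a label value could be derived like this:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
)

// tenantLabel derives a stable, non-reversible metric label from a tenant ID.
// Assumption: a keyed hash is used so label values cannot be matched against
// a list of known tenant IDs without the key. The 8-byte (16 hex character)
// truncation is an illustrative choice to keep label values short.
func tenantLabel(key []byte, tenantID string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(tenantID)) // hash.Hash.Write never returns an error
	sum := mac.Sum(nil)
	return hex.EncodeToString(sum[:8])
}
```

The same tenant always maps to the same label value, so per-tenant series stay continuous across restarts as long as the key does not change.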

| Metric | Labels | Description |
| --- | --- | --- |
| http_requests_total | method, path, status, tenant | Total HTTP requests at the gateway |
| http_request_duration_seconds | method, path, status | Histogram; SLO target buckets at 200 ms, 500 ms, 2 s |
| grpc_requests_total | service, method, status_code | Total gRPC calls (control-plane internal) |
| grpc_request_duration_seconds | service, method, status_code | Histogram; internal latency budget |
| upload_commit_duration_seconds | status | End-to-end commit protocol latency (init → commit OK) |
| presign_duration_seconds | provider, operation | Time to generate a presigned URL |
| sync_delta_pull_duration_seconds | status | Delta pull p99; NFR SLO target < 500 ms |

USE Metrics (Infrastructure-oriented)

USE (utilization, saturation, errors) covers the resources underneath the application. These metrics are collected via standard exporters (node-exporter, postgres-exporter, redis-exporter) and supplemented with Go runtime metrics from the OTel SDK.

| Category | Metrics |
| --- | --- |
| Utilization | CPU usage per container, JVM/Go heap utilization, Postgres connection pool utilization, object-store request concurrency |
| Saturation | Postgres connection pool wait time, Redis pipeline queue depth, NATS publish backpressure, goroutine count |
| Errors | Postgres query errors by error code, Redis connection failures, disk I/O errors (for local MinIO), OOM kills (via container restart counter) |

Domain Gauges (Eventual-Consistency Health)

These gauges are unique to BitVault’s architecture. Because derived stores (OpenSearch index, sync change journal, notifications) are updated asynchronously, a healthy system has low but non-zero lag. Sustained lag is the early warning for async-plane degradation — before users report stale search results or missed sync events.

| Metric | Description | Alert threshold |
| --- | --- | --- |
| outbox_undelivered_total | Rows in the outbox table not yet published to NATS | Alert if lag exceeds 60 s |
| nats_consumer_lag | Messages pending per NATS consumer (subject-level) | Alert if any subject lag > 1000 messages for > 5 min |
| search_index_lag_seconds | Time since the indexer last consumed a NodeChanged event | Alert if > 120 s |
| gc_backlog_total | Orphaned blobs (refcount = 0) pending GC worker pickup | Alert if growing monotonically for > 10 min |
| sync_journal_lag_events | Events projected into change journal vs events published | Should trend to zero; alert if diverging |
| notification_dispatch_lag_seconds | Age of oldest undelivered notification | Alert if > 300 s |

:::warning
Eventual-consistency gauges outbox_undelivered_total and nats_consumer_lag are the canaries for the async plane. Persistent lag means derived data (search results, sync journal, notifications) is falling behind. Alert before users notice stale results. A growing gc_backlog_total alongside rising storage costs means the finalizer worker is not keeping up; investigate DLQ for poison messages.
:::

Key Dashboards

The following dashboards are provisioned as Grafana JSON (in deploy/grafana/ for the full tier):

  1. Upload throughput + latency: commit rate, p50/p95/p99 commit duration, error rate. Primary SLO dashboard.
  2. Sync delta pull p99: sync latency distribution per tenant; cursor age.
  3. Search query p99: OpenSearch query latency; index staleness gauge.
  4. Outbox / consumer lag: outbox_undelivered_total + per-subject NATS lag time-series; alert annotation overlay.
  5. GC backlog: gc_backlog_total over time; object-store delete throughput.
  6. Tenant quota utilization: storage bytes per tenant vs quota; approaching quota highlighted.
  7. Infrastructure USE: CPU/memory/connection pool saturation per service.