Metrics

BitVault follows three complementary metric strategies: RED for request-oriented endpoints, USE for infrastructure resources, and domain gauges that reflect the health of the eventual-consistency async plane.

RED Metrics (Request-oriented)

One RED triplet (rate, errors, duration) is recorded per gRPC method and per REST endpoint. Labels include the hashed tenant_id (the `tenant` label in the table below), so per-tenant SLO tracking is possible without exposing raw identifiers.

Metric	Labels	Description
`http_requests_total`	`method`, `path`, `status`, `tenant`	Total HTTP requests at the gateway
`http_request_duration_seconds`	`method`, `path`, `status`	Histogram; SLO target buckets at 200 ms, 500 ms, 2 s
`grpc_requests_total`	`service`, `method`, `status_code`	Total gRPC calls (control-plane internal)
`grpc_request_duration_seconds`	`service`, `method`, `status_code`	Histogram; internal latency budget
`upload_commit_duration_seconds`	`status`	End-to-end commit protocol latency (init → commit OK)
`presign_duration_seconds`	`provider`, `operation`	Time to generate a presigned URL
`sync_delta_pull_duration_seconds`	`status`	Delta pull latency; NFR SLO target: p99 < 500 ms
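To make the RED triplet concrete, here is a minimal sketch of a recorder that tracks rate, errors, and a cumulative duration histogram per label set, with the tenant ID hashed before it becomes a label. It uses only the Go standard library rather than BitVault's actual OTel instrumentation. The names `Recorder`, `Series`, and `HashTenant`, the `/v1/files` path, the 8-byte SHA-256 truncation, and the "status >= 500 counts as an error" rule are all assumptions made for illustration. The bucket bounds are the 200 ms, 500 ms, and 2 s targets from the table.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// sloBuckets mirrors the bucket bounds from the table, in seconds (200 ms, 500 ms, 2 s).
var sloBuckets = []float64{0.2, 0.5, 2.0}

// HashTenant derives a stable label value so the raw tenant ID never reaches the
// metrics backend. Truncating SHA-256 to 8 bytes is an assumption of this sketch.
func HashTenant(tenantID string) string {
	sum := sha256.Sum256([]byte(tenantID))
	return hex.EncodeToString(sum[:8])
}

// Series is the RED triplet for one label set.
type Series struct {
	Requests uint64   // Rate: total requests seen
	Errors   uint64   // Errors: requests with status >= 500 (assumed definition)
	Buckets  []uint64 // Duration: cumulative counts per bound in sloBuckets, then +Inf
}

// Recorder keeps one Series per (method, path, hashed tenant) combination.
type Recorder struct {
	mu     sync.Mutex
	series map[string]*Series
}

func NewRecorder() *Recorder {
	return &Recorder{series: make(map[string]*Series)}
}

func seriesKey(method, path, tenantID string) string {
	return method + " " + path + " " + HashTenant(tenantID)
}

// Observe records one finished request with its status and duration in seconds.
func (r *Recorder) Observe(method, path, tenantID string, status int, seconds float64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	key := seriesKey(method, path, tenantID)
	s, ok := r.series[key]
	if !ok {
		s = &Series{Buckets: make([]uint64, len(sloBuckets)+1)}
		r.series[key] = s
	}
	s.Requests++
	if status >= 500 {
		s.Errors++
	}
	for i, upper := range sloBuckets {
		if seconds <= upper {
			s.Buckets[i]++ // cumulative: a fast request also counts in every larger bucket
		}
	}
	s.Buckets[len(sloBuckets)]++ // the +Inf bucket counts every observation
}

// Snapshot returns a copy of the series for a label set (zero value if unseen).
func (r *Recorder) Snapshot(method, path, tenantID string) Series {
	r.mu.Lock()
	defer r.mu.Unlock()
	s, ok := r.series[seriesKey(method, path, tenantID)]
	if !ok {
		return Series{}
	}
	return Series{Requests: s.Requests, Errors: s.Errors, Buckets: append([]uint64(nil), s.Buckets...)}
}

func main() {
	rec := NewRecorder()
	rec.Observe("GET", "/v1/files", "tenant-a", 200, 0.150) // hypothetical endpoint
	rec.Observe("GET", "/v1/files", "tenant-a", 503, 0.700)
	s := rec.Snapshot("GET", "/v1/files", "tenant-a")
	fmt.Println(HashTenant("tenant-a"), s.Requests, s.Errors, s.Buckets)
}
```

Keeping the buckets cumulative matches how Prometheus-style histograms are stored, so the fraction of requests under an SLO bound is simply that bucket's count divided by the +Inf count.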

USE Metrics (Infrastructure-oriented)

USE covers the resources underneath the application. These metrics are collected via standard exporters (node-exporter, postgres-exporter, redis-exporter) and supplemented with Go runtime metrics from the OTel SDK.

Category	Metrics
Utilization	CPU usage per container, JVM/Go heap utilization, Postgres connection pool utilization, object-store request concurrency
Saturation	Postgres connection pool wait time, Redis pipeline queue depth, NATS publish backpressure, goroutine count
Errors	Postgres query errors by error code, Redis connection failures, disk I/O errors (for local MinIO), OOM kills (via container restart counter)
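As an illustration of how the utilization and saturation rows translate into numbers, the sketch below derives connection pool utilization and mean pool wait time from Go's `database/sql` `DBStats`, and reads the goroutine count and heap usage from the `runtime` package. It assumes the service talks to Postgres through `database/sql`; a different pool library would expose similar counters under other names. It also reads the runtime directly instead of going through the OTel SDK. The helper names are hypothetical.

```go
package main

import (
	"database/sql"
	"fmt"
	"runtime"
	"time"
)

// Ratio returns used/capacity, or 0 when capacity is not positive
// (for example, a pool with no connection limit).
func Ratio(used, capacity float64) float64 {
	if capacity <= 0 {
		return 0
	}
	return used / capacity
}

// PoolUtilization is the fraction of the pool's connection limit currently in use.
func PoolUtilization(s sql.DBStats) float64 {
	return Ratio(float64(s.InUse), float64(s.MaxOpenConnections))
}

// MeanPoolWaitSeconds is the average time a caller waited for a connection,
// a saturation signal: it stays near zero until the pool runs out of connections.
func MeanPoolWaitSeconds(s sql.DBStats) float64 {
	return Ratio(s.WaitDuration.Seconds(), float64(s.WaitCount))
}

// RuntimeSnapshot returns the goroutine count and the fraction of heap memory
// obtained from the OS that is currently in use.
func RuntimeSnapshot() (goroutines int, heapUtilization float64) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return runtime.NumGoroutine(), Ratio(float64(m.HeapInuse), float64(m.HeapSys))
}

func main() {
	// Hand-built stats stand in for db.Stats() on a real *sql.DB.
	stats := sql.DBStats{
		MaxOpenConnections: 20,
		InUse:              5,
		WaitCount:          4,
		WaitDuration:       2 * time.Second,
	}
	fmt.Println("pool utilization:", PoolUtilization(stats))
	fmt.Println("mean pool wait (s):", MeanPoolWaitSeconds(stats))

	goroutines, heap := RuntimeSnapshot()
	fmt.Println("goroutines:", goroutines, "heap utilization:", heap)
}
```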

Domain Gauges (Eventual-Consistency Health)

These gauges are unique to BitVault’s architecture. Because derived stores (OpenSearch index, sync change journal, notifications) are updated asynchronously, a healthy system has low but non-zero lag. Sustained lag is the early warning for async-plane degradation — before users report stale search results or missed sync events.

Metric	Description	Alert threshold
`outbox_undelivered_total`	Rows in the outbox table not yet published to NATS	Alert if the oldest undelivered row is older than 60 s
`nats_consumer_lag`	Messages pending per NATS consumer (subject-level)	Alert if any subject lag > 1000 messages for > 5 min
`search_index_lag_seconds`	Time since the indexer last consumed a NodeChanged event	Alert if > 120 s
`gc_backlog_total`	Orphaned blobs (refcount = 0) pending GC worker pickup	Alert if growing monotonically for > 10 min
`sync_journal_lag_events`	Gap between events published and events projected into the change journal	Should trend to zero; alert if the gap keeps growing
`notification_dispatch_lag_seconds`	Age of oldest undelivered notification	Alert if > 300 s
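The time-based and count-based gauges in the table reduce to two small calculations: the age of a reference timestamp, and the gap between two counters. The sketch below shows both, under stated assumptions. `LagSeconds`, `JournalLag`, and `At` are hypothetical helpers, and how the reference timestamp or the counters are obtained (an outbox query, a NATS consumer info call) is left out. The two constants are the alert thresholds from the table.

```go
package main

import (
	"fmt"
	"time"
)

// Alert thresholds taken from the table above, in seconds.
const (
	searchIndexLagAlertSeconds  = 120.0
	notificationLagAlertSeconds = 300.0
)

// LagSeconds returns the age of a reference timestamp, for example the
// creation time of the oldest undelivered notification or the time the
// indexer last consumed an event. A zero or future timestamp yields 0.
func LagSeconds(now, reference time.Time) float64 {
	if reference.IsZero() || reference.After(now) {
		return 0
	}
	return now.Sub(reference).Seconds()
}

// JournalLag returns how many published events have not yet been projected.
// It clamps at zero so a briefly stale "published" counter cannot underflow.
func JournalLag(published, projected uint64) uint64 {
	if projected >= published {
		return 0
	}
	return published - projected
}

// At builds a timestamp from Unix seconds; it only keeps the example short.
func At(unixSeconds int64) time.Time {
	return time.Unix(unixSeconds, 0)
}

func main() {
	now := At(1000)

	searchLag := LagSeconds(now, At(850))
	fmt.Println("search_index_lag_seconds:", searchLag, "alert:", searchLag > searchIndexLagAlertSeconds)

	notifyLag := LagSeconds(now, At(900))
	fmt.Println("notification_dispatch_lag_seconds:", notifyLag, "alert:", notifyLag > notificationLagAlertSeconds)

	fmt.Println("sync_journal_lag_events:", JournalLag(500, 480))
}
```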

:::warning
Eventual-consistency gauges `outbox_undelivered_total` and `nats_consumer_lag` are the canaries for the async plane. Persistent lag means derived data (search results, sync journal, notifications) is falling behind. Alert before users notice stale results. A growing `gc_backlog_total` alongside rising storage costs means the finalizer worker is not keeping up; investigate the DLQ for poison messages.
:::
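The "persistent" and "growing" conditions are what separate real async-plane degradation from the normal, briefly non-zero lag. A minimal sketch of the two checks follows, assuming gauge samples are available as an ordered time series. In practice an alerting system would usually evaluate these rules; the `Sample` type and both function names are hypothetical.

```go
package main

import "fmt"

// Sample is one gauge reading at a point in time (Unix seconds).
type Sample struct {
	At    int64
	Value float64
}

// SustainedAbove reports whether the gauge stayed above the threshold for the
// whole trailing window. It walks back from the newest sample and fails on the
// first reading at or below the threshold, or if history is shorter than the window.
func SustainedAbove(samples []Sample, threshold float64, windowSeconds int64) bool {
	if len(samples) == 0 {
		return false
	}
	last := samples[len(samples)-1].At
	for i := len(samples) - 1; i >= 0; i-- {
		if samples[i].Value <= threshold {
			return false
		}
		if last-samples[i].At >= windowSeconds {
			return true
		}
	}
	return false
}

// GrowingMonotonically reports whether every reading in the trailing window
// is strictly larger than the one before it.
func GrowingMonotonically(samples []Sample, windowSeconds int64) bool {
	n := len(samples)
	if n < 2 {
		return false
	}
	last := samples[n-1].At
	for i := n - 1; i > 0; i-- {
		if samples[i].Value <= samples[i-1].Value {
			return false
		}
		if last-samples[i-1].At >= windowSeconds {
			return true
		}
	}
	return false
}

// Series builds evenly spaced samples starting at t=0; it only keeps the example short.
func Series(stepSeconds int64, values ...float64) []Sample {
	out := make([]Sample, len(values))
	for i, v := range values {
		out[i] = Sample{At: int64(i) * stepSeconds, Value: v}
	}
	return out
}

func main() {
	// nats_consumer_lag: above 1000 messages for 5 minutes, sampled every 60 s.
	lag := Series(60, 1500, 1600, 1550, 1700, 1800, 1900)
	fmt.Println("consumer lag alert:", SustainedAbove(lag, 1000, 300))

	// gc_backlog_total: strictly growing for 10 minutes, sampled every 60 s.
	backlog := Series(60, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110)
	fmt.Println("gc backlog alert:", GrowingMonotonically(backlog, 600))
}
```

Requiring the condition to hold across the whole window is what lets a healthy system carry short bursts of lag without paging anyone.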

Key Dashboards

The following dashboards are provisioned as Grafana JSON (in deploy/grafana/ for the full tier):

Upload throughput + latency — commit rate, p50/p95/p99 commit duration, error rate. Primary SLO dashboard.
Sync delta pull p99 — sync latency distribution per tenant; cursor age.
Search query p99 — OpenSearch query latency; index staleness gauge.
Outbox / consumer lag — outbox_undelivered_total + per-subject NATS lag time-series; alert annotation overlay.
GC backlog — gc_backlog_total over time; object-store delete throughput.
Tenant quota utilization — storage bytes per tenant vs quota; approaching quota highlighted.
Infrastructure USE — CPU/memory/connection pool saturation per service.
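The p50/p95/p99 panels are derived from histogram buckets rather than from raw samples. The sketch below shows one common way to estimate a quantile from cumulative buckets, using linear interpolation inside the bucket that contains the target rank, in the spirit of Prometheus's `histogram_quantile`. It is an illustration of the technique, not the query the provisioned dashboards use; the `Bucket` type, the function names, and the sample counts are assumptions.

```go
package main

import (
	"fmt"
	"math"
)

// Bucket is one cumulative histogram bucket: the number of observations
// less than or equal to UpperBound.
type Bucket struct {
	UpperBound      float64
	CumulativeCount float64
}

// PlusInf returns the upper bound of the final catch-all bucket.
func PlusInf() float64 { return math.Inf(1) }

// Quantile estimates the q-quantile (0 <= q <= 1) from buckets sorted by
// upper bound, where the last bucket is the +Inf bucket. It returns NaN for
// invalid input or an empty histogram.
func Quantile(q float64, buckets []Bucket) float64 {
	if len(buckets) == 0 || q < 0 || q > 1 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].CumulativeCount
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	for i, b := range buckets {
		if b.CumulativeCount < rank {
			continue
		}
		if math.IsInf(b.UpperBound, 1) {
			// The rank falls in the +Inf bucket: report the largest finite bound.
			if i == 0 {
				return math.NaN()
			}
			return buckets[i-1].UpperBound
		}
		lower, below := 0.0, 0.0
		if i > 0 {
			lower = buckets[i-1].UpperBound
			below = buckets[i-1].CumulativeCount
		}
		if b.CumulativeCount == below {
			return b.UpperBound
		}
		// Assume observations are spread evenly inside the bucket.
		return lower + (b.UpperBound-lower)*(rank-below)/(b.CumulativeCount-below)
	}
	return math.NaN()
}

func main() {
	// Illustrative counts using the 200 ms, 500 ms, and 2 s bucket bounds.
	buckets := []Bucket{{0.2, 50}, {0.5, 90}, {2, 99}, {PlusInf(), 100}}
	for _, q := range []float64{0.5, 0.95, 0.99} {
		fmt.Printf("p%.0f = %.3f s\n", q*100, Quantile(q, buckets))
	}
}
```

Because the estimate can never be finer than the bucket layout, bucket bounds placed at the SLO targets give the most accurate answer exactly where the SLO is judged.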