Metrics
BitVault follows three complementary metric strategies: RED for request-oriented endpoints, USE for infrastructure resources, and domain gauges that reflect the health of the eventual-consistency async plane.
RED Metrics (Request-oriented)
Each gRPC method and each REST endpoint gets one RED triplet (rate, errors, duration). Labels include `tenant_id`
(hashed) so per-tenant SLO tracking is possible without exposing identifiers.
| Metric | Labels | Description |
|---|---|---|
| `http_requests_total` | method, path, status, tenant | Total HTTP requests at the gateway |
| `http_request_duration_seconds` | method, path, status | Histogram; SLO target buckets at 200 ms, 500 ms, 2 s |
| `grpc_requests_total` | service, method, status_code | Total gRPC calls (control-plane internal) |
| `grpc_request_duration_seconds` | service, method, status_code | Histogram; internal latency budget |
| `upload_commit_duration_seconds` | status | End-to-end commit protocol latency (init → commit OK) |
| `presign_duration_seconds` | provider, operation | Time to generate a presigned URL |
| `sync_delta_pull_duration_seconds` | status | Delta pull p99; NFR SLO target < 500 ms |
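One way to produce a hashed tenant label is a keyed hash, so the value stays stable per tenant but cannot be recovered by hashing guessed IDs. The following is a minimal sketch using only the Go standard library; the function name `tenantLabel`, the HMAC key, and the 16-character truncation are illustrative assumptions, not BitVault's actual scheme.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// tenantLabel derives a stable, non-reversible label value from a tenant ID.
// Assumption: a keyed HMAC-SHA256 is acceptable and 8 bytes of output
// (16 hex characters) is enough to keep collisions unlikely.
func tenantLabel(key []byte, tenantID string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(tenantID))
	sum := mac.Sum(nil)
	return hex.EncodeToString(sum[:8])
}

func main() {
	// Hypothetical key; in practice it would come from a secret store.
	key := []byte("example-key-not-for-production")
	fmt.Println(tenantLabel(key, "tenant-a"))
	fmt.Println(tenantLabel(key, "tenant-b"))
}
```

The same tenant always maps to the same label value, so dashboards can group by it, while rotating the key changes every label and breaks continuity of existing series.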
USE Metrics (Infrastructure-oriented)
USE covers the resources underneath the application. These metrics are collected via standard exporters (node-exporter, postgres-exporter, redis-exporter) and supplemented with Go runtime metrics from the OTel SDK.
| Category | Metrics |
|---|---|
| Utilization | CPU usage per container, JVM/Go heap utilization, Postgres connection pool utilization, object-store request concurrency |
| Saturation | Postgres connection pool wait time, Redis pipeline queue depth, NATS publish backpressure, goroutine count |
| Errors | Postgres query errors by error code, Redis connection failures, disk I/O errors (for local MinIO), OOM kills (via container restart counter) |
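To make the utilization and saturation rows concrete for the Postgres connection pool, the sketch below derives both from Go's `database/sql` pool statistics. It assumes the services use `database/sql` (other drivers such as pgx expose their own pool stats), and the helper names `poolUtilization`, `avgWait`, and `exampleSnapshots` are hypothetical.

```go
package main

import (
	"database/sql"
	"fmt"
	"time"
)

// poolUtilization returns in-use connections as a fraction of the pool limit.
// MaxOpenConnections is 0 for an unlimited pool; 0 is reported in that case.
func poolUtilization(s sql.DBStats) float64 {
	if s.MaxOpenConnections <= 0 {
		return 0
	}
	return float64(s.InUse) / float64(s.MaxOpenConnections)
}

// avgWait returns the mean time callers waited for a connection between two
// snapshots. WaitCount and WaitDuration are cumulative, so deltas are used.
func avgWait(prev, cur sql.DBStats) time.Duration {
	waits := cur.WaitCount - prev.WaitCount
	if waits <= 0 {
		return 0
	}
	return (cur.WaitDuration - prev.WaitDuration) / time.Duration(waits)
}

// exampleSnapshots returns two made-up snapshots; a real service would call
// db.Stats() on its *sql.DB at each scrape interval instead.
func exampleSnapshots() (prev, cur sql.DBStats) {
	prev = sql.DBStats{MaxOpenConnections: 20, InUse: 5, WaitCount: 10, WaitDuration: 100 * time.Millisecond}
	cur = sql.DBStats{MaxOpenConnections: 20, InUse: 18, WaitCount: 30, WaitDuration: 500 * time.Millisecond}
	return prev, cur
}

func main() {
	prev, cur := exampleSnapshots()
	fmt.Printf("utilization=%.2f avg_wait=%s\n", poolUtilization(cur), avgWait(prev, cur))
}
```

High utilization alone can be healthy; it is a rising wait time that signals saturation, which is why the two are tracked as separate rows.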
Domain Gauges (Eventual-Consistency Health)
These gauges are unique to BitVault’s architecture. Because derived stores (OpenSearch index, sync change journal, notifications) are updated asynchronously, a healthy system has low but non-zero lag. Sustained lag is the early warning for async-plane degradation — before users report stale search results or missed sync events.
| Metric | Description | Alert threshold |
|---|---|---|
| `outbox_undelivered_total` | Rows in the outbox table not yet published to NATS | Alert if lag exceeds 60 s |
| `nats_consumer_lag` | Messages pending per NATS consumer (subject-level) | Alert if any subject lag > 1000 messages for > 5 min |
| `search_index_lag_seconds` | Time since the indexer last consumed a NodeChanged event | Alert if > 120 s |
| `gc_backlog_total` | Orphaned blobs (refcount = 0) pending GC worker pickup | Alert if growing monotonically for > 10 min |
| `sync_journal_lag_events` | Events projected into change journal vs events published | Should trend to zero; alert if diverging |
| `notification_dispatch_lag_seconds` | Age of oldest undelivered notification | Alert if > 300 s |
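Several thresholds above combine a level with a hold time, for example consumer lag above 1000 messages for more than 5 minutes. In practice this would normally be expressed as an alerting rule in the monitoring system; the Go sketch below only illustrates the logic of such a sustained-threshold check. The `Sample` type, the `sustainedAbove` and `series` helpers, and the one-minute sample spacing are assumptions made for the example.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative values taken from the nats_consumer_lag row.
const (
	lagThreshold = 1000
	holdFor      = 5 * time.Minute
)

// Sample is one scraped gauge value.
type Sample struct {
	At    time.Time
	Value float64
}

// sustainedAbove reports whether the gauge has stayed above threshold for at
// least hold, measured back from the newest sample. Samples must be sorted by
// time, oldest first.
func sustainedAbove(samples []Sample, threshold float64, hold time.Duration) bool {
	if len(samples) == 0 {
		return false
	}
	start := samples[len(samples)-1].At.Add(-hold)
	for i := len(samples) - 1; i >= 0; i-- {
		if samples[i].Value <= threshold {
			return false // the breach was interrupted inside the window
		}
		if !samples[i].At.After(start) {
			return true // every sample back to the window start is above threshold
		}
	}
	return false // not enough history to cover the window
}

// series builds samples spaced one minute apart (hypothetical scrape interval).
func series(values ...float64) []Sample {
	out := make([]Sample, len(values))
	for i, v := range values {
		out[i] = Sample{At: time.Unix(0, 0).Add(time.Duration(i) * time.Minute), Value: v}
	}
	return out
}

func main() {
	steady := series(1200, 1300, 1250, 1400, 1500, 1600)
	dipped := series(1200, 900, 1250, 1400, 1500, 1600)
	fmt.Println(sustainedAbove(steady, lagThreshold, holdFor))
	fmt.Println(sustainedAbove(dipped, lagThreshold, holdFor))
}
```

The first series stays above the threshold for the full five minutes and fires; the second dips to 900 within the window and does not. The hold time is what keeps a brief burst of lag from paging anyone.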
:::warning Eventual-consistency gauges
`outbox_undelivered_total` and `nats_consumer_lag` are the canaries for the
async plane. Persistent lag means derived data (search results, sync journal,
notifications) is falling behind. Alert before users notice stale results.
A growing `gc_backlog_total` alongside rising storage costs means the finalizer
worker is not keeping up; investigate the DLQ for poison messages.
:::
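The "growing monotonically" condition on `gc_backlog_total` differs from a plain threshold: a large but draining backlog is fine, while a small backlog that never shrinks is not. The sketch below shows one possible reading of that condition in Go, treating "monotonically" as strictly increasing between consecutive samples. That interpretation, the `Sample` type, the helper names, and the one-minute spacing are all assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// backlogHold mirrors the 10-minute window from the gc_backlog_total row.
const backlogHold = 10 * time.Minute

// Sample is one scraped gauge value.
type Sample struct {
	At    time.Time
	Value float64
}

// growingFor reports whether every consecutive pair of samples in the trailing
// window increased and the samples span at least hold. Samples must be sorted
// by time, oldest first.
func growingFor(samples []Sample, hold time.Duration) bool {
	if len(samples) < 2 {
		return false
	}
	start := samples[len(samples)-1].At.Add(-hold)
	for i := len(samples) - 1; i > 0; i-- {
		if samples[i].Value <= samples[i-1].Value {
			return false // backlog shrank or held steady inside the window
		}
		if !samples[i-1].At.After(start) {
			return true // growth held all the way back to the window start
		}
	}
	return false // not enough history to cover the window
}

// series builds samples spaced one minute apart (hypothetical scrape interval).
func series(values ...float64) []Sample {
	out := make([]Sample, len(values))
	for i, v := range values {
		out[i] = Sample{At: time.Unix(0, 0).Add(time.Duration(i) * time.Minute), Value: v}
	}
	return out
}

func main() {
	growing := series(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
	draining := series(1, 2, 3, 4, 5, 4, 7, 8, 9, 10, 11)
	fmt.Println(growingFor(growing, backlogHold))
	fmt.Println(growingFor(draining, backlogHold))
}
```

The first series grows at every step for ten minutes and fires; the second has one sample where the GC worker made progress, which resets the condition.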
Key Dashboards
The following dashboards are provisioned as Grafana JSON (in deploy/grafana/
for the full tier):
- Upload throughput + latency: commit rate, p50/p95/p99 commit duration, error rate. Primary SLO dashboard.
- Sync delta pull p99: sync latency distribution per tenant; cursor age.
- Search query p99: OpenSearch query latency; index staleness gauge.
- Outbox / consumer lag: `outbox_undelivered_total` + per-subject NATS lag time-series; alert annotation overlay.
- GC backlog: `gc_backlog_total` over time; object-store delete throughput.
- Tenant quota utilization: storage bytes per tenant vs quota; approaching quota highlighted.
- Infrastructure USE: CPU/memory/connection pool saturation per service.