Distributed Tracing

Every user-visible action in BitVault is instrumented as a single distributed trace. The trace ID stays the same as the request crosses process boundaries.

Trace ID Propagation

The API Gateway creates the trace ID for each inbound request, following W3C Trace Context (the traceparent header). Propagation path:

HTTP request (traceparent header)
  → Gateway (root span)
    → gRPC call (grpc-trace-bin / traceparent in gRPC metadata)
      → gRPC service (child span)
        → NATS publish (traceparent in NATS message header)
          → NATS consumer / async worker (child span)

Because the NATS header carries the trace context, asynchronous derived work (indexing, notifications, GC) is part of the same trace tree, not a separate trace you have to correlate manually.
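The hop-by-hop behaviour above can be sketched with nothing but the W3C traceparent format: each hop keeps the trace ID and flags and mints a fresh span ID. This is a minimal illustration, not BitVault's propagation code; the function names (new_traceparent, child_traceparent, inject) are hypothetical, and a real service would normally rely on an OpenTelemetry propagator instead.

```python
import re
import secrets

# W3C Trace Context, version 00: 32-hex trace ID, 16-hex parent span ID, 2-hex flags.
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace, as a gateway would for an inbound request."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def child_traceparent(parent: str) -> str:
    """Header for the next hop: same trace ID and flags, fresh span ID."""
    match = _TRACEPARENT_RE.match(parent)
    if match is None:
        raise ValueError(f"malformed traceparent: {parent!r}")
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def inject(carrier: dict[str, str], traceparent: str) -> dict[str, str]:
    """Copy of the carrier (HTTP headers, gRPC metadata, NATS headers) with context added."""
    return {**carrier, "traceparent": traceparent}


# Gateway root span -> gRPC call -> NATS publish.
root = new_traceparent()
grpc_metadata = inject({}, child_traceparent(root))
nats_headers = inject({}, child_traceparent(grpc_metadata["traceparent"]))
```

All three values share the same trace ID segment, which is what lets the async worker's spans attach to the tree that started at the gateway.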

Key Traces

Upload commit flow

The most important trace in the system. Spans:

  1. gateway.upload.init: validate request, call File & Metadata
  2. files.create_upload: create draft node, call Storage for presign
  3. storage.presign_put: generate presigned URL
  4. gateway.upload.commit: client has completed its PUT and POSTs the commit
  5. files.commit_upload: verify size/etag/hash via Storage HEAD
  6. storage.head_object: HEAD staging key against object store
  7. files.tx_commit: BEGIN TX; INSERT node version; INSERT outbox; COMMIT
  8. outbox.publish: drainer reads outbox row, publishes NodeChanged to NATS
  9. indexer.consume_node_changed: indexer consumes the event

The entire path from HTTP POST to search index update is one trace.
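For the commit and outbox steps to stay in one trace, the trace context has to survive the gap between the database transaction and the later publish. One way to do that, sketched here with SQLite, is to store the traceparent in the outbox row and replay it as a message header when the drainer publishes. The schema, the traceparent column, and the function names are assumptions for illustration; the source does not describe BitVault's actual tables.

```python
import json
import sqlite3

# isolation_level=None puts the connection in autocommit mode,
# so the BEGIN/COMMIT below are the only transaction boundaries.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
    CREATE TABLE node_versions (
        node_id TEXT NOT NULL,
        version INTEGER NOT NULL,
        PRIMARY KEY (node_id, version)
    );
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type TEXT NOT NULL,
        payload TEXT NOT NULL,
        traceparent TEXT NOT NULL,   -- hypothetical column carrying trace context
        delivered INTEGER NOT NULL DEFAULT 0
    );
""")


def tx_commit(node_id: str, version: int, traceparent: str) -> None:
    """Insert the node version and its outbox event atomically."""
    conn.execute("BEGIN")
    try:
        conn.execute(
            "INSERT INTO node_versions (node_id, version) VALUES (?, ?)",
            (node_id, version),
        )
        conn.execute(
            "INSERT INTO outbox (event_type, payload, traceparent) VALUES (?, ?, ?)",
            ("NodeChanged", json.dumps({"node_id": node_id, "version": version}), traceparent),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


def drain_outbox(publish) -> int:
    """Publish undelivered rows in order, passing the stored trace context as a header."""
    rows = conn.execute(
        "SELECT id, payload, traceparent FROM outbox WHERE delivered = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload, traceparent in rows:
        publish(payload, {"traceparent": traceparent})
        conn.execute("UPDATE outbox SET delivered = 1 WHERE id = ?", (row_id,))
    return len(rows)


published = []
example_tp = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
tx_commit("node-1", 1, example_tp)
first_drain = drain_outbox(lambda payload, headers: published.append((payload, headers)))
```

Because both inserts share one transaction, an event exists if and only if the node version was committed, and the consumer always sees the context of the request that caused it.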

Download flow

gateway.download → sharing.check_access → files.resolve_version → storage.presign_get → 302 redirect to object store.

Sync delta pull

gateway.sync.pull → sync.pull_deltas(cursor=N) → sync.journal.query → response with changes[N+1..M].
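The cursor semantics in this flow (send N, receive changes N+1..M) can be shown with a small in-memory journal. This is a sketch under assumptions: the Change fields, the limit parameter, and the returned next cursor are illustrative and not taken from the BitVault API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    seq: int      # position in the journal; strictly increasing
    node_id: str
    op: str


def pull_deltas(journal: list[Change], cursor: int, limit: int = 100) -> tuple[list[Change], int]:
    """Return changes with seq > cursor (up to limit) and the cursor to send next time.

    Assumes the journal is already ordered by seq.
    """
    changes = [c for c in journal if c.seq > cursor][:limit]
    next_cursor = changes[-1].seq if changes else cursor
    return changes, next_cursor


journal = [Change(1, "a", "create"), Change(2, "b", "create"), Change(3, "a", "update")]
changes, next_cursor = pull_deltas(journal, cursor=1)
```

A client that has already seen entry 1 receives entries 2 and 3 and stores 3 as its new cursor; pulling again with that cursor returns nothing and leaves the cursor unchanged.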

Conflict resolution

gateway.sync.push → sync.push(baseVersion) → files.compare_and_commit → on conflict: files.create_conflicted_copy → outbox event → NATS.

Required Span Attributes

Every span in the system carries:

| Attribute | Value | Note |
|---|---|---|
| tenant.id | Hashed tenant ID | Never the raw UUID in logs; hash preserves correlation |
| user.id | Hashed user ID | Same |
| node.id | File/folder node UUID | Where applicable |
| request.id | UUID per HTTP request | Returned in error responses for support |
| service.name | e.g. bitvaultd.files | Set via OTel resource |
| service.version | Semver from build | Set via OTel resource |
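A keyed hash is one way to get IDs that correlate across spans without exposing the raw UUID. The sketch below assumes HMAC-SHA256 truncated to 16 hex characters and a hypothetical span_attributes helper; the source does not say which hash BitVault uses. service.name and service.version are left out because they are set on the OTel resource rather than per span.

```python
import hashlib
import hmac
from typing import Optional


def hash_id(raw_id: str, key: bytes) -> str:
    """Keyed hash: stable for a given ID, so spans still correlate, but not the raw value."""
    digest = hmac.new(key, raw_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncation length is an assumption


def span_attributes(
    tenant_id: str,
    user_id: str,
    request_id: str,
    key: bytes,
    node_id: Optional[str] = None,
) -> dict[str, str]:
    """Build the per-span attributes from the table above."""
    attrs = {
        "tenant.id": hash_id(tenant_id, key),
        "user.id": hash_id(user_id, key),
        "request.id": request_id,
    }
    if node_id is not None:  # node.id only where applicable
        attrs["node.id"] = node_id
    return attrs


key = b"example-key-not-for-production"
attrs = span_attributes("tenant-uuid", "user-uuid", "req-uuid", key, node_id="node-uuid")
```

The same tenant always maps to the same hash under a given key, so an attribute search across spans still works while the raw tenant UUID never reaches the tracing backend.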

Sampling Strategy

| Traffic type | Sampling decision |
|---|---|
| Normal operations | Head-based sampling at gateway; configurable rate (default 10%) |
| Errors (any 5xx or gRPC error) | Always sampled |
| Slow requests (> 2× SLO target) | Always sampled via tail-based rule in OTel Collector |
| Canary deployments | 100% during canary analysis window (ADR-0029) |

Sampling configuration lives in the OTel Collector pipeline config, not in application code. Changing the rate does not require a redeploy.
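The sampling rules can be read as one predicate over a finished trace. The sketch below collapses the head-based and tail-based decisions into a single function purely to illustrate the rules; in the system described here those decisions are made in different places and configured in the Collector, not in application code. The function names and the use of the low 64 bits of the trace ID as the sampling bucket are assumptions.

```python
def head_sampled(trace_id: str, rate: float) -> bool:
    """Deterministic ratio sampling: the same trace ID always yields the same decision."""
    bucket = int(trace_id[-16:], 16)  # low 64 bits of the 128-bit trace ID
    return bucket < int(rate * (1 << 64))


def keep_trace(
    trace_id: str,
    *,
    rate: float = 0.10,
    is_error: bool = False,
    duration_ms: float = 0.0,
    slo_ms: float = float("inf"),
    canary: bool = False,
) -> bool:
    """Apply the rules in priority order: canary, errors, slow requests, then the base rate."""
    if canary or is_error:
        return True
    if duration_ms > 2 * slo_ms:
        return True
    return head_sampled(trace_id, rate)


# An error trace is kept even when the base rate is zero.
kept_error = keep_trace("f" * 32, rate=0.0, is_error=True)
```

Deriving the decision from the trace ID rather than from a random draw means any component that evaluates the base rate agrees on whether a given trace is in or out.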

Finding a Trace

Given a user complaint (“my file hasn’t appeared in search after 5 minutes”):

  1. By request ID: If the client received a request_id in any API response, search Jaeger/Tempo for spans with that request.id. Every span in the upload trace carries this ID.
  2. By attributes: Search tenant.id=<hash> AND node.id=<uuid> with a time range. Find the files.tx_commit span and follow children to see if indexer.consume_node_changed completed.
  3. NATS message header: If the trace ends at the outbox publish, find the NATS message for the NodeChanged event (by node.id subject) and read the traceparent header to get the full consumer span tree.

:::tip Debugging async flows
To trace why a file’s indexing is delayed, find the files.tx_commit span and check whether an outbox.publish span follows it. If not, the outbox drainer is lagging; check outbox_undelivered_total. If outbox.publish completed but there is no indexer.consume_node_changed child, the NATS consumer is behind or a DLQ has a poison message. The nats_consumer_lag metric for the node.changed subject will confirm.
:::
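The triage in the tip reduces to checking which span names are present in the trace. A minimal sketch, assuming the span names have already been fetched from the tracing backend into a set; the function name is hypothetical, and the first branch (no commit span at all) is an added case not covered by the tip.

```python
def diagnose_indexing_delay(span_names: set[str]) -> str:
    """Map the spans present in an upload trace to the most likely cause of delay."""
    if "files.tx_commit" not in span_names:
        # Added case: without a commit there is no outbox row to publish.
        return "upload never committed"
    if "outbox.publish" not in span_names:
        return "outbox drainer lagging: check outbox_undelivered_total"
    if "indexer.consume_node_changed" not in span_names:
        return "NATS consumer behind or DLQ poison message: check nats_consumer_lag"
    return "event consumed: inspect the indexer spans"


diagnosis = diagnose_indexing_delay({"files.tx_commit", "outbox.publish"})
```

Here the trace stops after outbox.publish, so the result points at the NATS consumer rather than the drainer.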