Distributed Tracing
Every user-visible action in BitVault is instrumented as a single distributed trace. The trace ID is preserved as the request crosses process boundaries.
Trace ID Propagation
The trace ID is created at the API Gateway on each inbound request, following the
W3C Trace Context specification (traceparent header).
Propagation path:
HTTP request (traceparent header)
→ Gateway (root span)
→ gRPC call (grpc-trace-bin / traceparent in gRPC metadata)
→ gRPC service (child span)
→ NATS publish (traceparent in NATS message header)
→ NATS consumer / async worker (child span)
Because the NATS header carries the trace context, async derivation (indexing, notifications, GC) is part of the same trace tree — not a separate trace you have to correlate manually.
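What each hop does with the header can be sketched as follows. This is a minimal illustration of the W3C traceparent format (version 00 only), not BitVault's actual propagation code; `parse_traceparent`, `child_traceparent`, and `nats_headers_for` are hypothetical names, and a real service would normally rely on an OpenTelemetry propagator rather than hand-rolled parsing.

```python
import re
import secrets

# W3C Trace Context, version "00":
# 32-hex trace ID, 16-hex parent span ID, 2-hex trace flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> tuple[str, str, str]:
    """Split a traceparent header into (trace_id, parent_span_id, flags)."""
    match = _TRACEPARENT.match(header.strip().lower())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    trace_id, span_id, flags = match.groups()
    # All-zero trace or span IDs are invalid per the specification.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        raise ValueError("traceparent carries an all-zero ID")
    return trace_id, span_id, flags


def child_traceparent(incoming: str) -> str:
    """Build the header a hop sends onward: same trace ID, fresh span ID."""
    trace_id, _parent_span_id, flags = parse_traceparent(incoming)
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex chars
    return f"00-{trace_id}-{new_span_id}-{flags}"


def nats_headers_for(incoming: str) -> dict[str, str]:
    """Hypothetical helper: headers to attach to an outgoing NATS message."""
    return {"traceparent": child_traceparent(incoming)}
```

The trace ID and flags are copied unchanged at every hop; only the span ID changes, which is what lets the consumer's spans attach to the same trace tree.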
Key Traces
Upload commit flow
The most important trace in the system. Spans:
- gateway.upload.init: validate request, call File & Metadata
- files.create_upload: create draft node, call Storage for presign
- storage.presign_put: generate presigned URL
- gateway.upload.commit: client-side PUT complete; client POSTs commit
- files.commit_upload: verify size/etag/hash via Storage HEAD
- storage.head_object: HEAD staging key against object store
- files.tx_commit: BEGIN TX; INSERT node version; INSERT outbox; COMMIT
- outbox.publish: drainer reads outbox row, publishes NodeChanged to NATS
- indexer.consume_node_changed: indexer spans consuming the event
The entire path from HTTP POST to search index update is one trace.
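The files.tx_commit and outbox.publish steps follow the transactional outbox pattern. The sketch below shows one way the trace context could travel with the outbox row so the drainer can re-attach it on publish. It uses SQLite purely for illustration; the table layout, column names, and the `publish` callback are assumptions, not BitVault's schema or NATS client.

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    # isolation_level=None selects autocommit mode, so BEGIN/COMMIT are explicit.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE node_versions (node_id TEXT, version INTEGER, etag TEXT)")
    conn.execute(
        "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT, traceparent TEXT, delivered INTEGER DEFAULT 0)"
    )
    return conn


def tx_commit(conn, node_id, version, etag, traceparent):
    """Write the node version and its outbox row in one transaction."""
    conn.execute("BEGIN")
    try:
        conn.execute(
            "INSERT INTO node_versions VALUES (?, ?, ?)", (node_id, version, etag)
        )
        payload = json.dumps(
            {"type": "NodeChanged", "node_id": node_id, "version": version}
        )
        conn.execute(
            "INSERT INTO outbox (payload, traceparent) VALUES (?, ?)",
            (payload, traceparent),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


def drain_outbox(conn, publish):
    """Publish undelivered rows in order; mark each delivered after publish returns."""
    rows = conn.execute(
        "SELECT id, payload, traceparent FROM outbox WHERE delivered = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload, traceparent in rows:
        publish(payload, {"traceparent": traceparent})
        conn.execute("UPDATE outbox SET delivered = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

Because the traceparent value is stored in the same transaction as the event, a drainer that runs seconds later can still publish with the original trace context.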
Download flow
gateway.download → sharing.check_access → files.resolve_version →
storage.presign_get → 302 redirect to object store.
Sync delta pull
gateway.sync.pull → sync.pull_deltas(cursor=N) → sync.journal.query →
response with changes[N+1..M].
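The cursor semantics of the pull can be illustrated with a small in-memory model. `JournalEntry`, the `limit` parameter, and the list-backed journal are assumptions made for this sketch; the real sync.journal.query presumably runs against a database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JournalEntry:
    seq: int        # monotonically increasing journal position
    node_id: str
    change: str     # e.g. "created", "modified", "deleted"


def pull_deltas(journal: list[JournalEntry], cursor: int, limit: int = 100):
    """Return (changes, new_cursor) for entries with seq greater than cursor."""
    newer = sorted((e for e in journal if e.seq > cursor), key=lambda e: e.seq)
    changes = newer[:limit]
    # With no new changes the cursor is returned unchanged, so the client can retry.
    new_cursor = changes[-1].seq if changes else cursor
    return changes, new_cursor
```

A client that stores the returned cursor and passes it to the next pull receives each change exactly once, in journal order.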
Conflict resolution
gateway.sync.push → sync.push(baseVersion) → files.compare_and_commit →
on conflict: files.create_conflicted_copy → outbox event → NATS.
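The compare-and-commit step amounts to an optimistic version check. The following is a simplified in-memory sketch; the `Node` and `Store` types, the conflicted-copy naming, and the event shape are all hypothetical and stand in for whatever the File & Metadata service actually does.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    version: int
    content: bytes


@dataclass
class Store:
    nodes: dict[str, Node] = field(default_factory=dict)
    outbox: list[dict] = field(default_factory=list)


def compare_and_commit(store: Store, node_id: str, base_version: int, content: bytes) -> str:
    """Commit if the client edited the current version; otherwise keep both copies."""
    node = store.nodes[node_id]
    if node.version == base_version:
        node.version += 1
        node.content = content
        store.outbox.append({"type": "NodeChanged", "node_id": node_id})
        return node_id
    # Conflict: the server copy moved on, so the client's edit becomes a new node.
    copy_id = f"{node_id}-conflict-{uuid.uuid4().hex[:8]}"
    store.nodes[copy_id] = Node(f"{node.name} (conflicted copy)", 1, content)
    store.outbox.append({"type": "NodeChanged", "node_id": copy_id})
    return copy_id
```

Either branch appends an outbox event, which matches the flow above where a conflicted copy still results in an event reaching NATS.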
Required Span Attributes
Every span in the system carries:
| Attribute | Value | Note |
|---|---|---|
| tenant.id | Hashed tenant ID | Never the raw UUID in logs; hash preserves correlation |
| user.id | Hashed user ID | Same |
| node.id | File/folder node UUID | Where applicable |
| request.id | UUID per HTTP request | Returned in error responses for support |
| service.name | e.g. bitvaultd.files | Set via OTel resource |
| service.version | Semver from build | Set via OTel resource |
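The document does not say which hash is used for tenant.id and user.id. As one possible approach, a keyed hash (HMAC-SHA256, truncated) gives stable values that still correlate across spans without exposing the raw UUID. The key handling, the truncation length, and the `span_attributes` helper are assumptions for illustration only.

```python
import hashlib
import hmac
import uuid


def hash_id(raw_id: str, key: bytes) -> str:
    """Keyed hash: the same input and key always yield the same 32-hex-char value."""
    return hmac.new(key, raw_id.encode("utf-8"), hashlib.sha256).hexdigest()[:32]


def span_attributes(tenant_id, user_id, key, node_id=None, request_id=None) -> dict[str, str]:
    """Build the per-span attributes; service.* come from the OTel resource instead."""
    attrs = {
        "tenant.id": hash_id(tenant_id, key),
        "user.id": hash_id(user_id, key),
        "request.id": request_id or str(uuid.uuid4()),
    }
    if node_id is not None:  # node.id is set only where applicable
        attrs["node.id"] = node_id
    return attrs
```

Using a secret key rather than a plain hash makes it harder to recover a tenant UUID by hashing a list of known IDs.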
Sampling Strategy
| Traffic type | Sampling decision |
|---|---|
| Normal operations | Head-based sampling at gateway; configurable rate (default 10%) |
| Errors (any 5xx or gRPC error) | Always sampled |
| Slow requests (> 2× SLO target) | Always sampled via tail-based rule in OTel Collector |
| Canary deployments | 100% during canary analysis window (ADR-0029) |
Sampling configuration lives in the OTel Collector pipeline config, not in application code. Changing the rate does not require a redeploy.
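The combined effect of the rules in the table can be modeled as a single decision function. This is a simplified model of the policy, not the Collector configuration itself and not OpenTelemetry's sampler implementation; `TraceSummary` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceSummary:
    trace_id: str          # 32 hex chars
    had_error: bool        # any 5xx or gRPC error in the trace
    duration_ms: float
    slo_target_ms: float
    canary: bool = False   # inside a canary analysis window


def ratio_sampled(trace_id: str, rate: float) -> bool:
    """Deterministic: the same trace ID always gets the same decision."""
    low_64_bits = int(trace_id[-16:], 16)
    return low_64_bits < rate * (1 << 64)


def keep_trace(t: TraceSummary, rate: float = 0.10) -> bool:
    if t.canary or t.had_error:
        return True                       # always sampled
    if t.duration_ms > 2 * t.slo_target_ms:
        return True                       # slow request rule
    return ratio_sampled(t.trace_id, rate)  # normal traffic at the configured rate
```

Deriving the ratio decision from the trace ID, rather than a random draw per service, means every hop that applies the same rate reaches the same verdict for a given trace.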
Finding a Trace
Given a user complaint (“my file hasn’t appeared in search after 5 minutes”):
- By trace ID: If the client received a request_id in any API response, look it up directly in Jaeger/Tempo. Every span in the upload trace carries this ID.
- By attributes: Search tenant.id=<hash> AND node.id=<uuid> with a time range. Find the files.tx_commit span and follow children to see if indexer.consume_node_changed completed.
- NATS message header: If the trace ends at the outbox publish, find the NATS message for the NodeChanged event (by node.id subject) and read the traceparent header to get the full consumer span tree.
:::tip Debugging async flows
To trace why a file’s indexing is delayed: find the files.tx_commit span and
check whether it has an outbox.publish child span. If not, the outbox drainer is
lagging; check outbox_undelivered_total. If outbox.publish completed but
there is no indexer.consume_node_changed child, the NATS consumer is behind
or a DLQ has a poison message. The nats_consumer_lag metric for the
node.changed subject will confirm.
:::
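The checks in the tip can be written as a small decision procedure over the span names found in a trace. This is only an illustrative sketch: the function name and the returned messages are made up here, and the first branch (no commit span at all) is an added assumption that the tip does not cover.

```python
def diagnose_indexing_delay(span_names: set[str]) -> str:
    """Map the spans present in a trace to the most likely point of delay."""
    if "files.tx_commit" not in span_names:
        # Assumption: without a commit span there is no outbox event to look for.
        return "no files.tx_commit span: the upload was not committed in this trace"
    if "outbox.publish" not in span_names:
        return "outbox drainer lagging: check outbox_undelivered_total"
    if "indexer.consume_node_changed" not in span_names:
        return "NATS consumer behind or DLQ poison message: check nats_consumer_lag"
    return "indexer consumed the event: the delay is inside the indexer"
```

For example, a trace containing files.tx_commit and outbox.publish but no indexer span points at the NATS consumer rather than the outbox drainer.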