SLOs & Error Budgets

Service Level Objectives (SLOs) are defined from the system requirements (NFR-1 and NFR-3) and enforced in two places: Prometheus alerting rules (runtime) and Argo Rollouts AnalysisTemplates (deployment gate).

SLO Targets

| SLO | Target | Measurement window | Error budget (monthly) |
|---|---|---|---|
| API availability | 99.9% | 30-day rolling | 43.8 minutes |
| Upload commit latency | p99 < 2 s (excluding transfer time) | 5-min windows | |
| Download redirect latency | p99 < 200 ms | 5-min windows | |
| Sync delta pull latency | p99 < 500 ms | 5-min windows | |
| Search query latency | p99 < 1 s | 5-min windows | |

Availability is measured as 1 - (5xx_rate / total_request_rate) at the gateway. Latency SLOs exclude client-to-object-store transfer time (bytes bypass compute by design, ADR-0011).
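The availability formula above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name and the choice to report a window with no traffic as fully available are assumptions, not part of the system described here.

```python
def availability(errors_5xx: float, total_requests: float) -> float:
    """Availability = 1 - (5xx rate / total request rate) over one window."""
    if total_requests <= 0:
        # Assumption: a window with no traffic counts as fully available.
        return 1.0
    return 1.0 - errors_5xx / total_requests


# Example: 120 5xx responses out of 1,000,000 gateway requests.
print(f"{availability(120, 1_000_000):.4%}")  # 99.9880%
```

In this example the window is below the 99.9% target, because 120 errors exceed the 1,000 errors per million that the target allows only if the count passes 1,000; here it does not, so the window is within target.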

Error Budget

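Error budget = 100% - SLO target. At 99.9% availability the budget is 0.1% of the window: 43.8 minutes in an average calendar month, or 43.2 minutes over a strict 30-day window.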

The budget is tracked in Prometheus as a ratio of bad minutes to total minutes in the rolling window. Dashboards show remaining budget as a percentage; alerts fire at burn-rate thresholds, not at absolute availability.
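As a rough illustration of the budget arithmetic, the sketch below computes the allowed bad minutes for a window and the fraction of budget remaining. The function names are hypothetical; the real tracking happens in Prometheus, not in application code.

```python
def budget_minutes(slo_target: float, window_days: float) -> float:
    """Allowed bad minutes in the window: (1 - SLO) * total minutes."""
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(bad_minutes: float, slo_target: float,
                     window_days: float) -> float:
    """Fraction of the budget left; negative once the budget is overspent."""
    return 1.0 - bad_minutes / budget_minutes(slo_target, window_days)


print(round(budget_minutes(0.999, 30), 1))           # 43.2 (strict 30 days)
print(round(budget_minutes(0.999, 365.25 / 12), 1))  # 43.8 (average month)
print(f"{budget_remaining(10.8, 0.999, 30):.0%}")    # 75%
```

The first two lines show why the same 99.9% target yields either 43.2 or 43.8 minutes, depending on whether the window is a strict 30 days or an average calendar month.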

Burn-Rate Alerts

Multiwindow, multi-burn-rate alerting detects both fast burns (page immediately) and slow burns (ticket within hours) without excessive false positives.

| Burn rate | Detection windows | Severity | Response |
|---|---|---|---|
| 14× (depletes the 30-day budget in ~2 days) | 1 h + 5 min | Critical (page) | Immediate investigation |
| 6× (depletes budget in ~5 days) | 6 h + 30 min | Warning | Within 1 hour |
| 3× (depletes budget in ~10 days) | 1 day + 2 h | Warning | Business hours |
| 1× (on pace to exhaust budget at end of window) | 3 days | Info | Backlog |

All burn-rate alerts are defined as PrometheusRule resources (the Prometheus Operator custom resource) and provisioned with the Helm chart.
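The multiwindow logic in the table can be sketched as follows. This is a minimal Python model, not the actual alerting rules: it assumes the usual definition of burn rate (observed error ratio divided by the budget ratio, 1 - SLO), and the `BurnRule` type, the window labels, and the sample error ratios are all hypothetical. A rule fires only when every one of its windows is above the threshold, which is what suppresses alerts for short spikes that have already ended.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

SLO_TARGET = 0.999
WINDOW_DAYS = 30


@dataclass(frozen=True)
class BurnRule:
    threshold: float              # burn-rate multiple
    long_window: str
    short_window: Optional[str]   # None for the single-window 1x rule
    severity: str


RULES = [
    BurnRule(14, "1h", "5m", "critical"),
    BurnRule(6, "6h", "30m", "warning"),
    BurnRule(3, "1d", "2h", "warning"),
    BurnRule(1, "3d", None, "info"),
]


def burn_rate(error_ratio: float, slo_target: float = SLO_TARGET) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    return error_ratio / (1.0 - slo_target)


def days_to_exhaustion(rate: float, window_days: float = WINDOW_DAYS) -> float:
    """Time for a constant burn rate to consume the whole window's budget."""
    return window_days / rate


def firing(error_ratios: Dict[str, float]) -> List[BurnRule]:
    """Rules whose long AND short windows both exceed the threshold."""
    result = []
    for rule in RULES:
        windows = [rule.long_window]
        if rule.short_window is not None:
            windows.append(rule.short_window)
        if all(burn_rate(error_ratios[w]) > rule.threshold for w in windows):
            result.append(rule)
    return result


# Hypothetical error ratios per lookback window during a sharp incident.
ratios = {"5m": 0.02, "30m": 0.02, "1h": 0.015, "2h": 0.01,
          "6h": 0.004, "1d": 0.002, "3d": 0.0012}
print([r.severity for r in firing(ratios)])  # ['critical', 'info']
print(round(days_to_exhaustion(14), 1))      # 2.1
```

The last line shows where the "~2 days" figure for the 14× row comes from: a 30-day budget burned 14 times too fast lasts 30 / 14 ≈ 2.1 days.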

SLO Dashboards

Grafana dashboards (provisioned under deploy/grafana/slo.json):

  1. Current availability — 30-day rolling availability, current error rate, SLO target line.
  2. Error budget remaining — percentage remaining this month; depletes left to right.
  3. Burn rate — current burn rate overlaid on all SLOs; highlights fast burns before the budget is gone.
  4. Latency SLOs — p50/p95/p99 for upload commit, download redirect, sync delta pull, and search query vs targets.

SLO Gates in Canary Deployments

:::note SLO gates canary
The Argo Rollouts AnalysisTemplate (ADR-0029) queries Prometheus for error rate, p99 latency, and saturation at each canary step. A failing metric auto-aborts the canary and restores the stable revision; no manual rollback is required.

SLO instrumentation is a prerequisite for safe progressive delivery. A deployment that cannot be measured cannot be safely promoted.
:::

The canary analysis window runs for the step duration (default 10 minutes per step). The metrics evaluated at each step are error rate, p99 latency, and saturation.

A single metric failure at any step aborts promotion and pages on-call.
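The gate behaviour can be sketched in Python as below. This is an illustrative model only; the real gate is the Argo Rollouts AnalysisTemplate querying Prometheus. The metric keys and every threshold value are assumptions made for the example (the actual thresholds are not stated here), and treating a missing metric as a failure is a design choice that follows the principle that an unmeasured deployment should not be promoted.

```python
from typing import Dict, List

# Hypothetical thresholds; the real values live in the AnalysisTemplate.
THRESHOLDS: Dict[str, float] = {
    "error_rate": 0.001,    # assumption: mirrors the 99.9% availability SLO
    "p99_latency_s": 0.2,   # assumption: the 200 ms download redirect target
    "saturation": 0.8,      # assumption: purely illustrative
}


def failed_metrics(observed: Dict[str, float]) -> List[str]:
    """Names of metrics over threshold; a missing metric counts as failed."""
    return [name for name, limit in THRESHOLDS.items()
            if observed.get(name, float("inf")) > limit]


def gate(step_observations: List[Dict[str, float]]) -> str:
    """Abort at the first step with any failing metric, else promote."""
    for step, observed in enumerate(step_observations, start=1):
        failed = failed_metrics(observed)
        if failed:
            return f"abort at step {step}: {', '.join(failed)}"
    return "promote"


steps = [
    {"error_rate": 0.0002, "p99_latency_s": 0.15, "saturation": 0.50},
    {"error_rate": 0.0004, "p99_latency_s": 0.31, "saturation": 0.55},
]
print(gate(steps))  # abort at step 2: p99_latency_s
```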