SLOs & Error Budgets
Service Level Objectives (SLOs) are defined from the system requirements (NFR-1 and NFR-3) and enforced in two places: Prometheus alerting rules (runtime) and Argo Rollouts AnalysisTemplates (deployment gate).
SLO Targets
| SLO | Target | Measurement window | Error budget (monthly) |
|---|---|---|---|
| API availability | 99.9% | 30-day rolling | 43.8 minutes |
| Upload commit latency p99 | < 2 s (excluding transfer time) | 5-min windows | — |
| Download redirect latency p99 | < 200 ms | 5-min windows | — |
| Sync delta pull latency p99 | < 500 ms | 5-min windows | — |
| Search query latency p99 | < 1 s | 5-min windows | — |
Availability is measured as 1 - (5xx_rate / total_request_rate) at the gateway.
Latency SLOs exclude client-to-object-store transfer time (bytes bypass compute
by design, ADR-0011).
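As a minimal sketch of the availability formula above, the calculation can be written out in Python. The function name, the sample counts, and the choice to treat a window with no traffic as fully available are illustrative assumptions, not part of the gateway's actual instrumentation.

```python
SLO_TARGET = 0.999  # API availability target from the table above


def availability(errors_5xx: float, total_requests: float) -> float:
    """Return availability as a fraction: 1 - (5xx / total)."""
    if total_requests <= 0:
        # Assumption: a window with no traffic counts as fully available.
        return 1.0
    return 1.0 - (errors_5xx / total_requests)


# Hypothetical counts for one measurement window.
observed = availability(errors_5xx=120, total_requests=200_000)
meets_slo = observed >= SLO_TARGET
print(f"availability={observed:.4%} meets_slo={meets_slo}")
```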
Error Budget
Error budget = 100% - SLO target. At 99.9% availability:
- Monthly budget: 43.8 minutes of allowed downtime (average calendar month of about 30.44 days; a strict 30-day window allows 43.2 minutes).
- Weekly budget: ~10.1 minutes.
- Daily budget: ~1.44 minutes.
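The budget figures follow directly from the formula: budget minutes = (1 - SLO target) × minutes in the window. The sketch below reproduces the arithmetic. The helper name is hypothetical, and the "average month" of 365.25 / 12 days is an assumption about how the 43.8-minute figure was derived.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo_target: float, window_days: float) -> float:
    """Allowed downtime in minutes for a window of the given length."""
    return (1.0 - slo_target) * window_days * MINUTES_PER_DAY


daily = error_budget_minutes(0.999, 1)
weekly = error_budget_minutes(0.999, 7)
rolling_30d = error_budget_minutes(0.999, 30)
avg_month = error_budget_minutes(0.999, 365.25 / 12)

for label, minutes in [
    ("daily", daily),
    ("weekly", weekly),
    ("30-day window", rolling_30d),
    ("average month", avg_month),
]:
    print(f"{label}: {minutes:.2f} min")
```

A strict 30-day window yields 43.2 minutes, slightly less than the average-month figure, so it is worth stating which window a quoted budget refers to.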
The budget is tracked in Prometheus as a ratio of bad minutes to total minutes in the rolling window. Dashboards show remaining budget as a percentage; alerts fire at burn-rate thresholds, not at absolute availability.
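The remaining-budget percentage shown on dashboards can be derived from the same ratio. The following is a sketch only: the function name and sample values are hypothetical, and the real value would come from a Prometheus query rather than application code.

```python
def budget_remaining_pct(
    bad_minutes: float, total_minutes: float, slo_target: float = 0.999
) -> float:
    """Percentage of the error budget still unspent in the window.

    The result goes negative once the budget is overspent.
    """
    allowed_bad = (1.0 - slo_target) * total_minutes
    if allowed_bad <= 0:
        # A 100% target (or an empty window) leaves no budget to spend.
        return 0.0
    consumed = bad_minutes / allowed_bad
    return (1.0 - consumed) * 100.0


# Hypothetical example: 10.8 bad minutes in a 30-day window (43,200 minutes).
remaining = budget_remaining_pct(bad_minutes=10.8, total_minutes=30 * 24 * 60)
print(f"{remaining:.1f}% of the budget remains")
```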
Burn-Rate Alerts
Multiwindow, multi-burn-rate alerting detects both fast burns (page immediately) and slow burns (ticket within hours) without excessive false positives.
| Burn rate | Detection windows | Severity | Response |
|---|---|---|---|
| 14× (depletes monthly budget in ~2 days) | 1 h + 5 min | Critical (page) | Immediate investigation |
| 6× (depletes budget in ~5 days) | 6 h + 30 min | Warning | Within 1 hour |
| 3× (depletes budget in ~10 days) | 1 day + 2 h | Warning | Business hours |
| 1× (on pace to exhaust budget at end of window) | 3 days | Info | Backlog |
All burn-rate alerts are defined as Prometheus Operator PrometheusRule resources
provisioned with the Helm chart.
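The multiwindow logic can be illustrated outside Prometheus. In the sketch below, a rule fires only when both its long and its short window exceed the burn-rate factor, which is what keeps a brief spike or an already-recovered incident from paging. The data structures, window labels, and sample error ratios are hypothetical. The production rules are PromQL expressions, not Python.

```python
from dataclasses import dataclass
from typing import Dict, Optional

SLO_TARGET = 0.999
BUDGET_FRACTION = 1.0 - SLO_TARGET  # allowed error ratio


@dataclass(frozen=True)
class BurnRule:
    factor: float
    long_window: str
    short_window: Optional[str]  # None: single-window rule
    severity: str


# Mirrors the table above, ordered from most to least severe.
RULES = [
    BurnRule(14, "1h", "5m", "critical"),
    BurnRule(6, "6h", "30m", "warning"),
    BurnRule(3, "1d", "2h", "warning"),
    BurnRule(1, "3d", None, "info"),
]


def burn_rate(error_ratio: float) -> float:
    """How many times faster than budget-neutral the budget is being spent."""
    return error_ratio / BUDGET_FRACTION


def firing_rule(error_ratios: Dict[str, float]) -> Optional[BurnRule]:
    """Return the most severe rule whose windows all exceed its factor.

    error_ratios maps a window label to the observed 5xx ratio over that
    window and must contain every window named in RULES.
    """
    for rule in RULES:
        windows = [rule.long_window]
        if rule.short_window is not None:
            windows.append(rule.short_window)
        if all(burn_rate(error_ratios[w]) > rule.factor for w in windows):
            return rule
    return None


def days_to_exhaustion(factor: float, window_days: float = 30) -> float:
    """Days until a full budget is gone at a constant burn rate."""
    return window_days / factor


# Hypothetical fast burn: 2% errors over 5 min, 1.6% over the last hour.
fast = {"5m": 0.02, "30m": 0.02, "1h": 0.016, "2h": 0.01,
        "6h": 0.007, "1d": 0.004, "3d": 0.002}
# Hypothetical slow burn: 0.4% errors that stopped within the last hour.
slow = {"5m": 0.0, "30m": 0.0, "1h": 0.0, "2h": 0.004,
        "6h": 0.004, "1d": 0.004, "3d": 0.004}

print(firing_rule(fast).severity, firing_rule(slow).factor)
print(round(days_to_exhaustion(14), 1))
```

Because exhaustion time is the window length divided by the burn rate, a sustained 14× burn uses up a 30-day budget in roughly two days.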
SLO Dashboards
Grafana dashboards (provisioned under deploy/grafana/slo.json):
- Current availability — 30-day rolling availability, current error rate, SLO target line.
- Error budget remaining — percentage remaining this month; depletes left to right.
- Burn rate — current burn rate overlaid on all SLOs; highlights fast burns before the budget is gone.
- Latency SLOs — p50/p95/p99 for upload commit, download redirect, sync delta pull, and search query vs targets.
SLO Gates in Canary Deployments
:::note SLO gates canary
The Argo Rollouts AnalysisTemplate (ADR-0029) queries Prometheus for error
rate, p99 latency, and saturation at each canary step. A failing metric
auto-aborts the canary and restores the stable revision — no manual
rollback required.
SLO instrumentation is a prerequisite for safe progressive delivery. A deployment that cannot be measured cannot be safely promoted.
:::
The canary analysis window runs for the step duration (default 10 minutes per step). Metrics evaluated:
- http_request_error_rate (5xx / total) < 1%
- http_request_duration_seconds{quantile="0.99"} within SLO target
- outbox_undelivered_total not growing (async plane still draining)
A single metric failure at any step aborts promotion and pages on-call.
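The gate's decision logic can be sketched as a pure function over the three metrics. This is an illustration, not the AnalysisTemplate itself: the class and function names are hypothetical, and reading "not growing" as "the last backlog sample is no higher than the first sample of the step" is an assumption.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class CanarySample:
    error_rate: float                    # 5xx / total, as a fraction
    p99_latency_s: float                 # observed p99 in seconds
    outbox_undelivered: Sequence[float]  # backlog samples, oldest first


def failed_checks(
    sample: CanarySample, p99_target_s: float, max_error_rate: float = 0.01
) -> List[str]:
    """Return the names of failing checks; an empty list means promote."""
    failures = []
    # "not x < limit" also fails on NaN, so missing data aborts the step.
    if not sample.error_rate < max_error_rate:
        failures.append("error_rate")
    if not sample.p99_latency_s < p99_target_s:
        failures.append("p99_latency")
    backlog = sample.outbox_undelivered
    # Assumption: "growing" means the step ended with a larger backlog.
    if len(backlog) >= 2 and backlog[-1] > backlog[0]:
        failures.append("outbox_growing")
    return failures


# Hypothetical steps, checked against the 200 ms download redirect target.
healthy = CanarySample(0.002, 0.15, [40, 35, 30])
degraded = CanarySample(0.03, 0.15, [10, 20, 45])

print(failed_checks(healthy, p99_target_s=0.2))
print(failed_checks(degraded, p99_target_s=0.2))
```

Any non-empty result corresponds to the abort path described above. Returning every failing check rather than stopping at the first gives on-call the full picture in a single page.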