SLOs & Error Budgets

Service Level Objectives (SLOs) are defined from the system requirements (NFR-1 and NFR-3) and enforced in two places: Prometheus alerting rules (runtime) and Argo Rollouts AnalysisTemplates (deployment gate).

SLO Targets

SLO	Target	Measurement window	Error budget (30-day window)
API availability	99.9%	30-day rolling	43.2 minutes
Upload commit latency p99	< 2 s (excluding transfer time)	5-min windows	—
Download redirect latency p99	< 200 ms	5-min windows	—
Sync delta pull latency p99	< 500 ms	5-min windows	—
Search query latency p99	< 1 s	5-min windows	—

Availability is measured as 1 - (5xx_rate / total_request_rate) at the gateway. Latency SLOs exclude client-to-object-store transfer time (bytes bypass compute by design, ADR-0011).
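The availability formula above can be sketched in a few lines. This is only an illustration: the function names, and the choice to treat a window with no traffic as fully available, are assumptions and are not taken from the gateway's actual instrumentation.

```python
def availability(total_requests: float, server_errors: float) -> float:
    """Return 1 - (5xx / total) for one measurement window."""
    if total_requests <= 0:
        # Assumption: a window with no traffic counts as fully available.
        return 1.0
    return 1.0 - server_errors / total_requests


def meets_availability_slo(total_requests: float,
                           server_errors: float,
                           target: float = 0.999) -> bool:
    """True when the measured availability is at or above the SLO target."""
    return availability(total_requests, server_errors) >= target


if __name__ == "__main__":
    # 500 server errors out of 1,000,000 requests is 99.95% available.
    print(meets_availability_slo(1_000_000, 500))    # True
    # 2,000 server errors out of 1,000,000 requests is 99.8% available.
    print(meets_availability_slo(1_000_000, 2_000))  # False
```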

Error Budget

Error budget = 100% - SLO target. At 99.9% availability:

Monthly budget: 43.2 minutes of allowed downtime (30-day window).
Weekly budget: ~10.1 minutes.
Daily budget: ~1.44 minutes.
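The budget arithmetic is simple enough to check by hand, and a short sketch makes the window dependence explicit. The helper name `error_budget_minutes` is hypothetical.

```python
def error_budget_minutes(slo_target: float, window_days: float) -> float:
    """Allowed downtime, in minutes, for an availability target over a window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    budget_fraction = 1.0 - slo_target           # e.g. 0.001 for 99.9%
    return budget_fraction * window_days * 24 * 60


if __name__ == "__main__":
    for label, days in (("30-day", 30), ("weekly", 7), ("daily", 1)):
        print(f"{label}: {error_budget_minutes(0.999, days):.2f} min")
    # 30-day: 43.20 min
    # weekly: 10.08 min
    # daily: 1.44 min
```

The result depends on how long a "month" is taken to be: a 30-day window gives 43.2 minutes, while an average calendar month (365.25 / 12 days) gives roughly 43.8 minutes.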

The budget is tracked in Prometheus as a ratio of bad minutes to total minutes in the rolling window. Dashboards show remaining budget as a percentage; alerts fire at burn-rate thresholds, not at absolute availability.
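As a rough illustration of the dashboard figure, remaining budget can be derived from the count of bad minutes in the window. This is a sketch under assumptions: the function name is hypothetical, and the real value comes from a Prometheus query rather than application code.

```python
def budget_remaining_pct(bad_minutes: float,
                         window_minutes: float,
                         slo_target: float = 0.999) -> float:
    """Percentage of the error budget still unspent in the rolling window."""
    allowed_bad_minutes = (1.0 - slo_target) * window_minutes
    consumed_fraction = bad_minutes / allowed_bad_minutes
    # Clamp at zero: an overspent budget is shown as 0% remaining.
    return max(0.0, 1.0 - consumed_fraction) * 100.0


if __name__ == "__main__":
    window = 30 * 24 * 60  # 30-day rolling window, in minutes
    # 10.8 bad minutes is a quarter of the 43.2-minute budget.
    print(round(budget_remaining_pct(10.8, window), 1))  # 75.0
```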

Burn-Rate Alerts

Multiwindow, multi-burn-rate alerting detects both fast burns (page immediately) and slow burns (ticket within hours) without excessive false positives.

Burn rate	Detection windows	Severity	Response
14× (depletes the 30-day budget in ~2 days)	1 h + 5 min	Critical (page)	Immediate investigation
6× (depletes budget in ~5 days)	6 h + 30 min	Warning	Within 1 hour
3× (depletes budget in ~10 days)	1 day + 2 h	Warning	Business hours
1× (on pace to exhaust budget at end of window)	3 days	Info	Backlog

All burn-rate alerts are defined as PrometheusRule resources (the Prometheus Operator custom resource) and provisioned with the Helm chart.
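The multiwindow logic can be sketched as follows. Burn rate is the observed error ratio divided by the budget fraction (1 - SLO target), and a rule fires only when every one of its windows is at or above the threshold, so an alert clears soon after the short window recovers. The class, function names, and window labels below are illustrative assumptions; the actual rules are PromQL expressions, not Python.

```python
from dataclasses import dataclass
from typing import Dict, Optional

SLO_TARGET = 0.999
WINDOW_DAYS = 30


@dataclass(frozen=True)
class BurnRateRule:
    factor: float                  # burn-rate threshold
    long_window: str               # e.g. "1h"
    short_window: Optional[str]    # e.g. "5m"; None for single-window rules
    severity: str


# Thresholds and windows copied from the table above.
RULES = [
    BurnRateRule(14, "1h", "5m", "critical"),
    BurnRateRule(6, "6h", "30m", "warning"),
    BurnRateRule(3, "1d", "2h", "warning"),
    BurnRateRule(1, "3d", None, "info"),
]


def burn_rate(error_ratio: float, slo_target: float = SLO_TARGET) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    return error_ratio / (1.0 - slo_target)


def days_to_exhaustion(factor: float, window_days: float = WINDOW_DAYS) -> float:
    """Days until a full budget is gone at a sustained burn rate."""
    return window_days / factor


def is_firing(rule: BurnRateRule, error_ratios: Dict[str, float]) -> bool:
    """A rule fires only if all of its windows exceed the threshold."""
    windows = [rule.long_window]
    if rule.short_window is not None:
        windows.append(rule.short_window)
    return all(burn_rate(error_ratios[w]) >= rule.factor for w in windows)


if __name__ == "__main__":
    # Hypothetical error ratios (5xx / total) per lookback window.
    observed = {"5m": 0.03, "30m": 0.001, "1h": 0.02, "2h": 0.002,
                "6h": 0.004, "1d": 0.002, "3d": 0.0005}
    for rule in RULES:
        print(rule.factor, rule.severity, is_firing(rule, observed))
```

At a sustained burn rate of 14 a 30-day budget lasts about 2.1 days, at 6 it lasts 5 days, and at 3 it lasts 10 days.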

SLO Dashboards

Grafana dashboards (provisioned under deploy/grafana/slo.json):

Current availability — 30-day rolling availability, current error rate, SLO target line.
Error budget remaining — percentage remaining this month; depletes left to right.
Burn rate — current burn rate overlaid on all SLOs; highlights fast burns before the budget is gone.
Latency SLOs — p50/p95/p99 for upload commit, download redirect, sync delta pull, and search query vs targets.

SLO Gates in Canary Deployments

:::note SLO gates canary
The Argo Rollouts AnalysisTemplate (ADR-0029) queries Prometheus for error rate, p99 latency, and saturation at each canary step. A failing metric auto-aborts the canary and restores the stable revision, so no manual rollback is required.

SLO instrumentation is a prerequisite for safe progressive delivery. A deployment that cannot be measured cannot be safely promoted.
:::

The canary analysis window runs for the step duration (default 10 minutes per step). Metrics evaluated:

http_request_error_rate (5xx / total) < 1%
http_request_duration_seconds{quantile="0.99"} within SLO target
outbox_undelivered_total not growing (async plane still draining)

A single metric failure at any step aborts promotion and pages on-call.
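The gate logic for one canary step can be sketched as below. This is not the AnalysisTemplate itself, which is declared in YAML and evaluated by Argo Rollouts; the `CanarySample` type, its field names, and the use of start and end readings to decide whether the outbox is growing are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CanarySample:
    """Metrics observed over one canary step (hypothetical shape)."""
    error_rate: float                 # 5xx / total, as a fraction
    p99_seconds: float                # p99 request duration
    outbox_undelivered_start: float   # outbox_undelivered_total at step start
    outbox_undelivered_end: float     # outbox_undelivered_total at step end


def failed_checks(sample: CanarySample,
                  p99_target_seconds: float,
                  max_error_rate: float = 0.01) -> List[str]:
    """Return the names of the metrics that failed for this step."""
    failures = []
    if sample.error_rate >= max_error_rate:
        failures.append("http_request_error_rate")
    if sample.p99_seconds >= p99_target_seconds:
        failures.append("http_request_duration_seconds")
    if sample.outbox_undelivered_end > sample.outbox_undelivered_start:
        failures.append("outbox_undelivered_total")
    return failures


def should_abort(sample: CanarySample, p99_target_seconds: float) -> bool:
    """Any single failing metric aborts promotion."""
    return bool(failed_checks(sample, p99_target_seconds))


if __name__ == "__main__":
    healthy = CanarySample(0.002, 0.15, 40, 12)
    backlog = CanarySample(0.002, 0.15, 40, 95)
    # Download redirect has a 200 ms p99 target.
    print(should_abort(healthy, p99_target_seconds=0.2))   # False
    print(failed_checks(backlog, p99_target_seconds=0.2))  # ['outbox_undelivered_total']
```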