04 — Deployment Strategy
Task 2: design the deployment strategy. This section covers zero-downtime, automatically verified rollouts for stateless services; safe, operator-driven handling for stateful data; and the migration discipline that makes both possible. The decision is recorded in ADR-0029.
1. Match the strategy to the workload
| Workload | Strategy | Why |
|---|---|---|
| gateway, bitvaultd (stateless, serves traffic) | Argo Rollouts canary + analysis | gradual, metric-verified, auto-rollback |
| bitvault-web | canary or blue-green | instant cutover + trivial rollback for SSR |
| workers (no inbound traffic) | rolling (maxSurge/maxUnavailable) | idempotent consumers (ADR-0006) make rolling safe |
| Postgres/Redis/NATS/OpenSearch | operator-managed rolling w/ failover | never canary a database; PITR safety net (12) |
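
The table reduces to a small decision rule, sketched below. This is an illustrative model only: the `Workload` fields and the returned labels are assumptions made for the example, and the blue-green option for bitvault-web is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    stateful: bool         # owns durable data (database, cache, queue, search)
    serves_traffic: bool   # receives inbound user or API requests


def pick_strategy(w: Workload) -> str:
    """Return the rollout strategy the table assigns to this kind of workload."""
    if w.stateful:
        # Data services are never canaried; their operator handles failover.
        return "operator-managed rolling"
    if w.serves_traffic:
        return "canary + analysis"
    # No inbound traffic to shift, so a plain rolling update is enough.
    return "rolling"


for w in (
    Workload("gateway", stateful=False, serves_traffic=True),
    Workload("workers", stateful=False, serves_traffic=False),
    Workload("postgres", stateful=True, serves_traffic=True),
):
    print(f"{w.name}: {pick_strategy(w)}")
```

The stateful check comes first on purpose: a database that also accepts connections must still never be routed into a canary.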
2. Canary with automated analysis (the prod default)
Argo Rollouts shifts traffic gradually and queries metrics at each step; a failing metric auto-aborts and restores the stable version with no human in the loop.
flowchart TB
classDef s fill:#bbf7d0,stroke:#15803d,color:#111827;
classDef c fill:#fde68a,stroke:#b45309,color:#111827;
classDef a fill:#c7d2fe,stroke:#3730a3,color:#111827;
classDef x fill:#fecaca,stroke:#b91c1c,color:#111827;
new["new digest synced by ArgoCD"]:::a --> c10["canary 10% traffic"]:::c
c10 --> an1{"AnalysisTemplate:<br/>error-rate, p99 latency, saturation OK?"}:::a
an1 -- fail --> abort["abort → 100% stable (auto-rollback)"]:::x
an1 -- pass --> c30["canary 30%"]:::c
c30 --> an2{"analysis OK?"}:::a
an2 -- fail --> abort
an2 -- pass --> c100["promote 100% → new is stable"]:::s
- Analysis metrics come from Prometheus/OTel (ADR-0013): success rate, p99 latency vs SLO (NFR-3), error budget burn, resource saturation. Each threshold evaluates to pass, fail, or inconclusive.
- Traffic shaping via the ingress controller or service mesh.
- Blue-green alternative (esp. web): bring up green fully, smoke-test, flip the service selector, keep blue for instant rollback. Simpler than canary, doubles resources briefly.
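
The canary loop above can be modelled in a few lines. This is a minimal sketch of the control flow, not Argo Rollouts itself: the threshold values, the `judge` and `run_canary` names, and the choice to pause on an inconclusive result are assumptions made for illustration.

```python
from enum import Enum
from typing import Callable, Sequence, Tuple


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    INCONCLUSIVE = "inconclusive"


def judge(error_rate: float, p99_ms: float, *,
          ok_error: float = 0.01, bad_error: float = 0.05,
          ok_p99_ms: float = 300.0, bad_p99_ms: float = 500.0) -> Verdict:
    """Classify one measurement; values between the ok and bad bounds are inconclusive."""
    if error_rate >= bad_error or p99_ms >= bad_p99_ms:
        return Verdict.FAIL
    if error_rate <= ok_error and p99_ms <= ok_p99_ms:
        return Verdict.PASS
    return Verdict.INCONCLUSIVE


def run_canary(steps: Sequence[int],
               measure: Callable[[int], Tuple[float, float]]) -> Tuple[str, int]:
    """Walk the traffic weights; return (outcome, resulting canary weight)."""
    for weight in steps:
        verdict = judge(*measure(weight))
        if verdict is Verdict.FAIL:
            return ("aborted", 0)        # all traffic back on the stable version
        if verdict is Verdict.INCONCLUSIVE:
            return ("paused", weight)    # hold at this weight for a human decision
    return ("promoted", 100)


# Healthy metrics at both the 10% and 30% steps lead to promotion.
print(run_canary([10, 30], lambda weight: (0.001, 120.0)))
```

Keeping a gap between the pass and fail bounds is one way to stop a noisy metric from flipping straight to an abort.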
3. Ordering with ArgoCD sync waves
Within an environment, resources apply in sync-wave order (06):
| Wave | Resources |
|---|---|
| negative | namespaces, CRDs, RBAC, operators |
| 0 | data services, caches, queues, DB migrations (PreSync hook Job) |
| positive | app workloads (gateway, bitvaultd, workers, web) |
This guarantees the database/migration is ready before the app that needs it starts — the canonical dependency that sync waves exist to manage.
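
As a rough illustration of the ordering, the sketch below groups resources by wave and applies the lowest wave first. The specific wave numbers and resource names are assumptions, and the model ignores hook phases and ArgoCD's health checks between waves.

```python
from itertools import groupby

# (resource, sync wave). The wave numbers are illustrative assumptions.
RESOURCES = [
    ("gateway", 1),
    ("postgres", 0),
    ("namespace", -2),
    ("db-migration", 0),
    ("bitvaultd", 1),
    ("operators", -1),
]


def apply_order(resources):
    """Group resources into waves, lowest wave first."""
    ordered = sorted(resources, key=lambda r: r[1])  # stable: keeps in-wave order
    return [
        (wave, [name for name, _ in group])
        for wave, group in groupby(ordered, key=lambda r: r[1])
    ]


for wave, names in apply_order(RESOURCES):
    print(wave, names)
```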
4. Database migrations: expand/contract (the zero-downtime keystone)
A rollout runs two code versions simultaneously (old + canary). Schema changes must be backward-compatible or the old pods break. The discipline:
- Expand — additive migration (new column/table/index), deployed before code that uses it. Old and new code both work against it.
- Migrate code — new version reads/writes the new shape; backfill async if needed.
- Contract — a later release removes the old shape, only after no running code references it.
Migrations run as ArgoCD PreSync hook Jobs (sync wave 0) — idempotent, forward-only (ADR-0004). Never a destructive change in the same deploy as the code that depends on it; never a down-migration against prod data (roll forward). This is what makes canary safe at the data layer.
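
The expand step can be demonstrated end to end with an in-memory database. This is a sketch under stated assumptions: SQLite stands in for Postgres, and the `files` table and `content_hash` column are hypothetical names chosen for the example.

```python
import sqlite3


def expand(conn: sqlite3.Connection) -> None:
    """Additive, idempotent migration: add a nullable column only if it is missing."""
    columns = {row[1] for row in conn.execute("PRAGMA table_info(files)")}
    if "content_hash" not in columns:
        conn.execute("ALTER TABLE files ADD COLUMN content_hash TEXT")


def old_insert(conn: sqlite3.Connection, name: str) -> None:
    """Old code version: unaware of the new column."""
    conn.execute("INSERT INTO files (name) VALUES (?)", (name,))


def new_insert(conn: sqlite3.Connection, name: str, content_hash: str) -> None:
    """New (canary) code version: writes the new shape."""
    conn.execute(
        "INSERT INTO files (name, content_hash) VALUES (?, ?)",
        (name, content_hash),
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

expand(conn)
expand(conn)  # re-running is a no-op, as a retried migration Job requires

old_insert(conn, "a.txt")            # stable pods keep working
new_insert(conn, "b.txt", "abc123")  # canary pods use the new column
rows = conn.execute("SELECT name, content_hash FROM files ORDER BY id").fetchall()
print(rows)
```

Because the new column is nullable and nothing was removed, both versions write successfully against the same schema. The contract step (dropping an old column) would ship in a later release.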
5. Zero-downtime mechanics
- Readiness probes + readiness gates: no traffic until truly ready.
- PodDisruptionBudgets: voluntary disruptions (node drains, upgrades) never take the service below its minimum healthy replica count.
- Graceful shutdown: SIGTERM → stop accepting, drain in-flight (gRPC/HTTP), finish/checkpoint work, exit; preStop + terminationGracePeriodSeconds tuned per service (workers checkpoint to the durable queue, sync/10).
- HPA (and KEDA on NATS queue depth for workers) for capacity during/after rollout.
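
The graceful-shutdown sequence for a worker might look like the sketch below. It is a simplified, single-threaded model: the `GracefulWorker` class and its method names are hypothetical, and a real service would also drain gRPC/HTTP connections and write its checkpoint back to the durable queue.

```python
import signal
import threading


class GracefulWorker:
    def __init__(self, handle):
        self._handle = handle               # callable that processes one job
        self._stopping = threading.Event()
        self.done = []
        self.remaining = []

    def install_signal_handler(self) -> None:
        # Must be called from the main thread of the process.
        signal.signal(signal.SIGTERM, self.request_stop)

    def request_stop(self, signum=None, frame=None) -> None:
        """Stop accepting new work; the job in flight is allowed to finish."""
        self._stopping.set()

    def run(self, jobs):
        queue = list(jobs)
        while queue and not self._stopping.is_set():
            job = queue.pop(0)
            self._handle(job)               # an in-flight job always completes
            self.done.append(job)
        # Checkpoint: unprocessed jobs stay queued for another consumer.
        self.remaining = queue
        return self.remaining


# Simulate SIGTERM arriving while job "b" is being processed.
worker = GracefulWorker(lambda job: worker.request_stop() if job == "b" else None)
left = worker.run(["a", "b", "c", "d"])
print(worker.done, left)
```

The grace period only needs to cover the longest single job plus the checkpoint write, which is why it is tuned per service.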
6. Rollback
| Situation | Rollback |
|---|---|
| Canary failing analysis | automatic abort → stable (seconds, no human) |
| Bad release already promoted | git revert the digest bump in GitOps → ArgoCD re-syncs prior digest |
| Schema regression | roll forward (expand/contract); PITR (12) only as last resort |
Rollback is first-class and boring because artifacts are immutable and state is in Git — revert the commit, the platform converges.
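
A toy model of the "bad release already promoted" row: the GitOps history is append-only, and a revert is a new commit that pins the previous digest again. The commit messages and digests below are made up for the example, and the functions are a simplification of what `git revert` plus an ArgoCD re-sync achieve.

```python
def revert_head(history):
    """Return a new history with a commit that restores the digest pinned before the head."""
    if len(history) < 2:
        raise ValueError("nothing to revert to")
    bad_message = history[-1][0]
    previous_digest = history[-2][1]
    return history + [(f'Revert "{bad_message}"', previous_digest)]


def desired_digest(history):
    """The platform converges on whatever the newest commit pins."""
    return history[-1][1]


history = [
    ("bump gateway to v41", "sha256:aaa"),
    ("bump gateway to v42", "sha256:bbb"),  # bad release
]
after = revert_head(history)
print(desired_digest(after))
```

Nothing is rewritten or deleted, so the bad release stays in the audit trail while the cluster returns to the earlier immutable artifact.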
7. Tradeoffs / Alternatives / Scaling
Tradeoffs. Canary + analysis is the safest but most complex (needs reliable metrics + traffic shaping + two running versions = transient extra cost). Worth it in prod; dev/staging use fast rolling.
Alternatives considered.
- Plain rolling update everywhere: simplest, but a bad version reaches 100% before metrics catch it. Kept for workers/nonprod; insufficient for prod user-facing.
- Recreate: downtime; only for singletons that can’t run two versions.
- Blue-green everywhere: simplest rollback but 2× resources and no gradual exposure; used selectively (web).
- Flagger instead of Argo Rollouts: equivalent capability; Argo Rollouts chosen for tight ArgoCD integration (ADR-0029).
Scaling concerns.
- Metric reliability gates safety → analysis depends on solid SLO instrumentation (ADR-0013); a noisy metric causes false aborts. Tune thresholds + inconclusive windows.
- Two-version cost during canary → bounded by short analysis windows + HPA.
- Migration on huge tables → online/concurrent index builds, batched backfills off the hot path (storage/08).
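
A batched backfill can be sketched as follows. The assumptions are the same as any small demo: SQLite stands in for Postgres, the `files` table and `content_hash` column are hypothetical, and the placeholder hash stands in for real computed values.

```python
import sqlite3


def backfill_in_batches(conn: sqlite3.Connection, batch_size: int = 2) -> int:
    """Fill content_hash for rows that lack it, one short transaction per batch."""
    total = 0
    while True:
        ids = [
            row[0]
            for row in conn.execute(
                "SELECT id FROM files WHERE content_hash IS NULL "
                "ORDER BY id LIMIT ?",
                (batch_size,),
            )
        ]
        if not ids:
            return total
        with conn:  # commits on success, so locks are held only briefly
            conn.executemany(
                "UPDATE files SET content_hash = ? WHERE id = ?",
                [(f"hash-{i}", i) for i in ids],  # placeholder value
            )
        total += len(ids)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT NOT NULL, content_hash TEXT)"
)
conn.executemany("INSERT INTO files (name) VALUES (?)", [(f"f{i}",) for i in range(5)])
conn.commit()

updated = backfill_in_batches(conn)
missing = conn.execute(
    "SELECT COUNT(*) FROM files WHERE content_hash IS NULL"
).fetchone()[0]
print(updated, missing)
```

Because each batch selects only rows that are still unfilled, the job can be interrupted and restarted without redoing work.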
References
- Argo Rollouts canary & analysis: https://argo-rollouts.readthedocs.io/en/stable/features/canary/
- ArgoCD sync waves: https://argo-cd.readthedocs.io/en/stable/user-guide/sync-waves/
- Expand/contract migrations: https://martinfowler.com/articles/evodb.html