04 — Deployment Strategy

Task 2: design the deployment strategy. It covers zero-downtime, automatically verified rollouts for stateless services; safe, operator-driven handling for stateful data; and the migration discipline that makes both possible. The decision is recorded in ADR-0029.

1. Match the strategy to the workload

Workload	Strategy	Why
`gateway`, `bitvaultd` (stateless, serves traffic)	Argo Rollouts canary + analysis	gradual, metric-verified, auto-rollback
`bitvault-web`	canary or blue-green	blue-green gives instant cutover + trivial rollback for SSR
workers (no inbound traffic)	rolling (maxSurge/maxUnavailable)	idempotent consumers (ADR-0006) make rolling safe
Postgres/Redis/NATS/OpenSearch	operator-managed rolling w/ failover	never canary a database; PITR safety net (12)

2. Canary with automated analysis (the prod default)

Argo Rollouts shifts traffic gradually and queries metrics at each step; a failing metric auto-aborts and restores the stable version with no human in the loop.

flowchart TB
    classDef s fill:#bbf7d0,stroke:#15803d,color:#111827;
    classDef c fill:#fde68a,stroke:#b45309,color:#111827;
    classDef a fill:#c7d2fe,stroke:#3730a3,color:#111827;
    classDef x fill:#fecaca,stroke:#b91c1c,color:#111827;
    new["new digest synced by ArgoCD"]:::a --> c10["canary 10% traffic"]:::c
    c10 --> an1{"AnalysisTemplate:<br/>error-rate, p99 latency, saturation OK?"}:::a
    an1 -- fail --> abort["abort → 100% stable (auto-rollback)"]:::x
    an1 -- pass --> c30["canary 30%"]:::c
    c30 --> an2{"analysis OK?"}:::a
    an2 -- fail --> abort
    an2 -- pass --> c100["promote 100% → new is stable"]:::s

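Analysis metrics come from Prometheus/OTel (ADR-0013): success rate, p99 latency vs SLO (NFR-3), error budget burn, resource saturation. Each metric is checked against thresholds and resolves to pass, fail, or inconclusive; an inconclusive result pauses the rollout at its current step for a human decision instead of promoting or aborting.
Traffic is shaped by the ingress controller or service mesh, which Argo Rollouts instructs to apply each step's canary weight.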
Blue-green alternative (esp. web): bring up green fully, smoke-test, flip the service selector, keep blue for instant rollback. Simpler than canary, doubles resources briefly.
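The canary flow above can be sketched as a small state machine. This is an illustrative model only, not Argo Rollouts code: the threshold values, the `classify` and `run_canary` names, and the outcome strings are all assumptions made for the example.

```python
from enum import Enum
from typing import Callable, Sequence, Tuple


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    INCONCLUSIVE = "inconclusive"


def classify(success_rate: float, p99_ms: float) -> Verdict:
    """Map one measurement to a verdict. Thresholds are hypothetical."""
    if success_rate < 0.95 or p99_ms > 800:
        return Verdict.FAIL          # clearly unhealthy
    if success_rate >= 0.99 and p99_ms <= 500:
        return Verdict.PASS          # clearly healthy
    return Verdict.INCONCLUSIVE      # neither condition met


def run_canary(
    weights: Sequence[int], analyse: Callable[[int], Verdict]
) -> Tuple[str, int]:
    """Walk the canary weights; return (outcome, final canary traffic %)."""
    for weight in weights:
        verdict = analyse(weight)    # analysis runs while `weight`% hits the canary
        if verdict is Verdict.FAIL:
            return ("aborted", 0)    # all traffic back on the stable version
        if verdict is Verdict.INCONCLUSIVE:
            return ("paused", weight)  # hold at this step for a human decision
    return ("promoted", 100)         # every step passed: new version becomes stable


if __name__ == "__main__":
    healthy = lambda weight: classify(success_rate=0.999, p99_ms=320)
    print(run_canary([10, 30], healthy))
```

Keeping the fail and pass thresholds apart leaves a band in which a measurement is neither, which is where the inconclusive outcome comes from.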

3. Ordering with ArgoCD sync waves

Within an environment, resources apply in sync-wave order (06):

Wave	Resources
negative	namespaces, CRDs, RBAC, operators
0	data services, caches, queues, DB migrations (PreSync hook Job)
positive	app workloads (gateway, bitvaultd, workers, web)

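ArgoCD applies a wave only after the resources of the previous wave are healthy. This guarantees the database and its migrations are ready before the app that needs them starts, which is the canonical dependency that sync waves exist to manage.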
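As a rough model of the ordering, ArgoCD sorts resources by phase first and by wave second. The sketch below reproduces that sort for a handful of resources; the resource names and wave numbers are hypothetical, and the further tie-breaking ArgoCD does by kind is left out.

```python
from collections import namedtuple

Resource = namedtuple("Resource", "name phase wave")

# Phase outranks wave: every PreSync hook runs before any Sync-phase resource.
PHASE_ORDER = {"PreSync": 0, "Sync": 1, "PostSync": 2}


def apply_order(resources):
    """Return resources in the order this simplified model would apply them."""
    return sorted(resources, key=lambda r: (PHASE_ORDER[r.phase], r.wave, r.name))


if __name__ == "__main__":
    resources = [
        Resource("gateway", "Sync", 1),
        Resource("postgres", "Sync", 0),
        Resource("db-migrate", "PreSync", 0),   # migration hook Job
        Resource("namespace", "Sync", -2),
        Resource("postgres-operator", "Sync", -1),
    ]
    for r in apply_order(resources):
        print(f"{r.phase:8} wave {r.wave:>2}  {r.name}")
```

One consequence the model makes visible: because a PreSync hook runs ahead of all Sync-phase waves, the database it migrates has to exist already, for example from an earlier sync.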

4. Database migrations: expand/contract (the zero-downtime keystone)

A rollout runs two code versions simultaneously (old + canary). Schema changes must be backward-compatible or the old pods break. The discipline:

Expand — additive migration (new column/table/index), deployed before code that uses it. Old and new code both work against it.
Migrate code — new version reads/writes the new shape; backfill async if needed.
Contract — a later release removes the old shape, only after no running code references it.

Migrations run as ArgoCD PreSync hook Jobs (sync wave 0) — idempotent, forward-only (ADR-0004). Never a destructive change in the same deploy as the code that depends on it; never a down-migration against prod data (roll forward). This is what makes canary safe at the data layer.
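The expand step and the backfill can be sketched against an in-memory SQLite database. This is a minimal illustration under assumptions: the `accounts` table, the `display_name` column, and all function names are hypothetical, and SQLite stands in for Postgres. The contract step (dropping the old shape in a later release) is deliberately not shown.

```python
import sqlite3


def column_exists(conn, table, column):
    # PRAGMA table_info yields one row per column; field 1 is the column name.
    return any(row[1] == column for row in conn.execute(f"PRAGMA table_info({table})"))


def expand(conn):
    """Expand: additive and idempotent, so re-running the hook Job is harmless."""
    if not column_exists(conn, "accounts", "display_name"):
        # Nullable with no default, so inserts from old code keep working.
        conn.execute("ALTER TABLE accounts ADD COLUMN display_name TEXT")
    conn.commit()


def old_insert(conn, name):
    """Old code version: unaware of display_name."""
    conn.execute("INSERT INTO accounts (name) VALUES (?)", (name,))
    conn.commit()


def new_insert(conn, name):
    """New code version: writes the new shape."""
    conn.execute(
        "INSERT INTO accounts (name, display_name) VALUES (?, ?)",
        (name, name.title()),
    )
    conn.commit()


def backfill(conn, batch_size=2):
    """Fill legacy rows in small batches; returns how many rows were updated."""
    updated = 0
    while True:
        rows = conn.execute(
            "SELECT id, name FROM accounts WHERE display_name IS NULL LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            return updated
        conn.executemany(
            "UPDATE accounts SET display_name = ? WHERE id = ?",
            [(name.title(), row_id) for row_id, name in rows],
        )
        conn.commit()
        updated += len(rows)


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    for name in ("alice", "bob", "carol"):
        old_insert(conn, name)
    expand(conn)
    expand(conn)              # second run is a no-op
    old_insert(conn, "dave")  # old pods still write during the canary
    new_insert(conn, "erin")
    return conn
```

Calling `backfill(demo())` fills the rows written by the old code path. Small batches with a commit per batch keep each transaction short, which is the property that matters on large tables.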

5. Zero-downtime mechanics

Readiness probes + readiness gates: no traffic reaches a pod until it is truly ready.
PodDisruptionBudgets: voluntary disruptions (node drains, upgrades) never take a stateless service below its minimum available replicas or a clustered data service below quorum.
Graceful shutdown: on SIGTERM, stop accepting new work, drain in-flight requests (gRPC/HTTP), finish or checkpoint work, then exit; preStop + terminationGracePeriodSeconds tuned per service (workers checkpoint to the durable queue, sync/10).
HPA (and KEDA on NATS queue depth for workers) for capacity during/after rollout.
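The graceful-shutdown sequence for a worker might look like the following sketch. The `Worker` class, its in-memory queue, and the `checkpointed` list are assumptions for illustration; a real worker would hand unstarted jobs back to the durable queue instead.

```python
import signal
import threading
from collections import deque


class Worker:
    def __init__(self, jobs, process):
        self.pending = deque(jobs)
        self.process = process            # callable doing the real work
        self.done = []
        self.checkpointed = []            # stand-in for the durable queue
        self.stopping = threading.Event()

    def handle_sigterm(self, signum=None, frame=None):
        # Only set a flag; the run loop decides when it is safe to stop.
        self.stopping.set()

    def install(self):
        """Register the handler (call from the main thread at startup)."""
        signal.signal(signal.SIGTERM, self.handle_sigterm)

    def run(self):
        # Stop taking new jobs once SIGTERM has been seen.
        while self.pending and not self.stopping.is_set():
            job = self.pending.popleft()
            self.process(job)             # the in-flight job always finishes
            self.done.append(job)
        # Checkpoint everything that was never started, then return (exit).
        self.checkpointed.extend(self.pending)
        self.pending.clear()
```

The time `run` needs to finish one in-flight job is the lower bound for the pod's termination grace period; if the grace period is shorter, the kubelet sends SIGKILL before the checkpoint happens.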

6. Rollback

Situation	Rollback
Canary failing analysis	automatic abort → stable (seconds, no human)
Bad release already promoted	`git revert` the digest bump in GitOps → ArgoCD re-syncs prior digest
Schema regression	roll forward (expand/contract); PITR (12) only as last resort

Rollback is first-class and boring because artifacts are immutable and state is in Git — revert the commit, the platform converges.
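A toy model of why the revert converges: the Git history holds the desired digest, a revert is a new commit that restores the previous value, and reconciliation copies the head of the history into the cluster. The `GitOpsRepo` class, the `reconcile` function, and the digest strings are hypothetical stand-ins, not ArgoCD or Git APIs.

```python
class GitOpsRepo:
    def __init__(self, digest):
        self.log = [digest]               # one desired image digest per commit

    def bump(self, digest):
        self.log.append(digest)

    def revert_head(self):
        """Model `git revert HEAD`: a new commit restoring the prior digest."""
        if len(self.log) < 2:
            raise ValueError("nothing to revert")
        self.log.append(self.log[-2])

    @property
    def desired(self):
        return self.log[-1]


def reconcile(repo, cluster):
    """Stand-in for an ArgoCD sync: make the cluster match the repo head."""
    cluster["digest"] = repo.desired
    return cluster


if __name__ == "__main__":
    repo, cluster = GitOpsRepo("sha256:aaa"), {}
    repo.bump("sha256:bbb")
    reconcile(repo, cluster)              # bad release is live
    repo.revert_head()
    print(reconcile(repo, cluster), len(repo.log))
```

History only grows: the bad digest stays in the log for audit, and because the old artifact is immutable, the reverted commit points at exactly the image that ran before.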

7. Tradeoffs / Alternatives / Scaling

Tradeoffs. Canary + analysis is the safest but most complex (needs reliable metrics + traffic shaping + two running versions = transient extra cost). Worth it in prod; dev/staging use fast rolling.

Alternatives considered.

Plain rolling update everywhere: simplest, but a bad version that passes its readiness probes can reach 100% before metrics catch it. Kept for workers/nonprod; insufficient for prod user-facing.
Recreate: downtime; only for singletons that can’t run two versions.
Blue-green everywhere: simplest rollback but 2× resources and no gradual exposure; used selectively (web).
Flagger instead of Argo Rollouts: equivalent capability; Argo Rollouts chosen for tight ArgoCD integration (ADR-0029).

Scaling concerns.

Metric reliability gates safety → analysis depends on solid SLO instrumentation (ADR-0013); a noisy metric causes false aborts. Tune thresholds + inconclusive windows.
Two-version cost during canary → bounded by short analysis windows + HPA.
Migration on huge tables → online/concurrent index builds, batched backfills off the hot path (storage/08).
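To make the noisy-metric concern concrete, here is a simplified sketch of how tolerance limits turn several measurements into one verdict. It is loosely modeled on the `failureLimit` and `inconclusiveLimit` fields of an Argo Rollouts metric, but the function and its string verdicts are assumptions for illustration, not the controller's actual logic.

```python
def run_verdict(measurements, failure_limit=0, inconclusive_limit=0):
    """Aggregate per-measurement results ("pass", "fail", "inconclusive").

    The run fails only when failures exceed failure_limit, so a single
    noisy sample does not have to abort the rollout.
    """
    failed = measurements.count("fail")
    inconclusive = measurements.count("inconclusive")
    if failed > failure_limit:
        return "fail"
    if inconclusive > inconclusive_limit:
        return "inconclusive"
    return "pass"


if __name__ == "__main__":
    noisy = ["pass", "fail", "pass", "pass", "pass"]
    print(run_verdict(noisy))                    # strict: one blip aborts
    print(run_verdict(noisy, failure_limit=1))   # tolerant: one blip allowed
```

Raising the limit trades false aborts for slower detection of a genuinely bad canary, so it is best tuned per metric rather than set globally.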

References

Argo Rollouts canary & analysis: https://argo-rollouts.readthedocs.io/en/stable/features/canary/
ArgoCD sync waves: https://argo-cd.readthedocs.io/en/stable/user-guide/sync-waves/
Expand/contract migrations: https://martinfowler.com/articles/evodb.html