04 — Deployment Strategy

Task 2: design the deployment strategy. This section covers zero-downtime, automatically verified rollouts for stateless services; safe, operator-driven handling for stateful data; and the migration discipline that makes both possible. The decision is recorded in ADR-0029.


1. Match the strategy to the workload

| Workload | Strategy | Why |
|---|---|---|
| gateway, bitvaultd (stateless, serves traffic) | Argo Rollouts canary + analysis | gradual, metric-verified, auto-rollback |
| bitvault-web | canary or blue-green | instant cutover + trivial rollback for SSR |
| workers (no inbound traffic) | rolling (maxSurge/maxUnavailable) | idempotent consumers (ADR-0006) make rolling safe |
| Postgres/Redis/NATS/OpenSearch | operator-managed rolling w/ failover | never canary a database; PITR safety net (12) |
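
The rolling strategy for workers depends on how maxSurge and maxUnavailable resolve to pod counts. As a rough sketch, assuming both are given as percentages: Kubernetes rounds maxSurge up and maxUnavailable down. The function name, the replica counts, and the 25% values are illustrative assumptions, not settings taken from this design.

```python
def rolling_bounds(replicas: int, max_surge_pct: int, max_unavailable_pct: int) -> tuple[int, int]:
    """Return (max total pods, min available pods) during a rolling update.

    Hypothetical helper for illustration only; integer math avoids float rounding.
    """
    surge = -(-replicas * max_surge_pct // 100)          # percentage rounds up
    unavailable = replicas * max_unavailable_pct // 100  # percentage rounds down
    return replicas + surge, replicas - unavailable


if __name__ == "__main__":
    # 10 replicas at 25%/25%: at most 13 pods exist, at least 8 stay available.
    print(rolling_bounds(10, 25, 25))
```

A worker pool sized this way never drops below the second number, which is the capacity the idempotent consumers have to live with during a rollout.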

2. Canary with automated analysis (the prod default)

Argo Rollouts shifts traffic gradually and queries metrics at each step; a failing metric auto-aborts and restores the stable version with no human in the loop.

flowchart TB
    classDef s fill:#bbf7d0,stroke:#15803d,color:#111827;
    classDef c fill:#fde68a,stroke:#b45309,color:#111827;
    classDef a fill:#c7d2fe,stroke:#3730a3,color:#111827;
    classDef x fill:#fecaca,stroke:#b91c1c,color:#111827;
    new["new digest synced by ArgoCD"]:::a --> c10["canary 10% traffic"]:::c
    c10 --> an1{"AnalysisTemplate:<br/>error-rate, p99 latency, saturation OK?"}:::a
    an1 -- fail --> abort["abort → 100% stable (auto-rollback)"]:::x
    an1 -- pass --> c30["canary 30%"]:::c
    c30 --> an2{"analysis OK?"}:::a
    an2 -- fail --> abort
    an2 -- pass --> c100["promote 100% → new is stable"]:::s
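
The flow above can be modelled as a small loop: shift traffic, run the analysis gate, and abort to the stable version on the first failure. This is a simplified sketch of the control flow only, not Argo Rollouts code; `run_canary` and the `analysis_ok` callback are hypothetical names, and the 10% and 30% steps come from the diagram.

```python
from typing import Callable, Sequence


def run_canary(weights: Sequence[int], analysis_ok: Callable[[int], bool]) -> tuple[str, list[int]]:
    """Walk the canary weights; return the outcome and the canary traffic share over time."""
    history: list[int] = []
    for weight in weights:
        history.append(weight)       # shift this share of traffic to the canary
        if not analysis_ok(weight):  # stands in for error-rate, p99 latency, saturation checks
            history.append(0)        # auto-rollback: all traffic returns to stable
            return "aborted", history
    history.append(100)              # every gate passed: the new version becomes stable
    return "promoted", history


if __name__ == "__main__":
    print(run_canary([10, 30], lambda w: True))    # healthy release
    print(run_canary([10, 30], lambda w: w < 30))  # metrics degrade at the 30% step
```

The point of the sketch is that the abort path needs no operator input: the decision is a pure function of the metrics at each step.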

3. Ordering with ArgoCD sync waves

Within an environment, resources apply in sync-wave order (06):

| Wave | Resources |
|---|---|
| negative | namespaces, CRDs, RBAC, operators |
| 0 | data services, caches, queues, DB migrations (PreSync hook Job) |
| positive | app workloads (gateway, bitvaultd, workers, web) |

This guarantees the database/migration is ready before the app that needs it starts — the canonical dependency that sync waves exist to manage.
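
To make the ordering concrete, here is a minimal model of wave-ordered apply, assuming each resource carries an integer wave. The specific wave numbers and resource names are illustrative assumptions. Argo CD also orders by sync phase before wave (hooks such as PreSync run ahead of the main sync phase) and by kind before name; this sketch models waves and names only.

```python
def apply_order(resources: list[tuple[str, int]]) -> list[str]:
    """Return resource names in apply order: lowest wave first, then by name."""
    return [name for name, _wave in sorted(resources, key=lambda r: (r[1], r[0]))]


if __name__ == "__main__":
    # Hypothetical wave assignments following the table: negative, 0, positive.
    resources = [
        ("gateway", 1),
        ("postgres", 0),
        ("namespaces", -2),
        ("crds", -1),
        ("workers", 1),
    ]
    print(apply_order(resources))
```

However the manifests are listed in Git, the app workloads always sort after the data services they depend on.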


4. Database migrations: expand/contract (the zero-downtime keystone)

A rollout runs two code versions simultaneously (old + canary). Schema changes must be backward-compatible or the old pods break. The discipline:

  1. Expand — additive migration (new column/table/index), deployed before code that uses it. Old and new code both work against it.
  2. Migrate code — new version reads/writes the new shape; backfill async if needed.
  3. Contract — a later release removes the old shape, only after no running code references it.
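
The expand step can be demonstrated end to end with an in-memory database. This is a sketch under stated assumptions: SQLite stands in for Postgres, and the `users` table, the `display_name` column, and both insert functions are hypothetical. It shows that after an additive, nullable column is added, writes from the old and the new code version both succeed, and an async backfill can close the gap afterwards.

```python
import sqlite3


def old_code_insert(db: sqlite3.Connection, email: str) -> None:
    # Old version: only knows the original column.
    db.execute("INSERT INTO users (email) VALUES (?)", (email,))


def new_code_insert(db: sqlite3.Connection, email: str, display_name: str) -> None:
    # New (canary) version: writes the new shape.
    db.execute("INSERT INTO users (email, display_name) VALUES (?, ?)", (email, display_name))


def demo() -> list[tuple[str, str]]:
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
    # Expand: additive and nullable, so existing INSERT statements keep working.
    db.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
    old_code_insert(db, "a@example.com")       # old pods still serving traffic
    new_code_insert(db, "b@example.com", "B")  # canary pods
    # Backfill rows written by the old version.
    db.execute("UPDATE users SET display_name = email WHERE display_name IS NULL")
    return db.execute("SELECT email, display_name FROM users ORDER BY id").fetchall()


if __name__ == "__main__":
    print(demo())
```

Had the new column been added as NOT NULL without a default, the old insert would fail, which is exactly the breakage the expand rule prevents.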

Migrations run as ArgoCD PreSync hook Jobs (sync wave 0) — idempotent, forward-only (ADR-0004). Never a destructive change in the same deploy as the code that depends on it; never a down-migration against prod data (roll forward). This is what makes canary safe at the data layer.
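
One common way to get idempotent, forward-only behaviour is a version table that records what has already been applied, so re-running the Job is a no-op. The sketch below assumes that approach; it is not necessarily the tooling chosen in ADR-0004. SQLite stands in for Postgres, and the `schema_migrations` table, the version names, and the SQL statements are hypothetical.

```python
import sqlite3

# Ordered, append-only list: there are no down-migrations to run.
MIGRATIONS = [
    ("0001_create_users", "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)"),
    ("0002_add_display_name", "ALTER TABLE users ADD COLUMN display_name TEXT"),
]


def migrate(db: sqlite3.Connection) -> list[str]:
    """Apply pending migrations in order; return the versions applied by this run."""
    db.execute("CREATE TABLE IF NOT EXISTS schema_migrations (version TEXT PRIMARY KEY)")
    applied = {row[0] for row in db.execute("SELECT version FROM schema_migrations")}
    ran: list[str] = []
    for version, sql in MIGRATIONS:
        if version in applied:
            continue  # already applied: skipping is what makes re-runs safe
        db.execute(sql)
        db.execute("INSERT INTO schema_migrations (version) VALUES (?)", (version,))
        ran.append(version)
    db.commit()
    return ran


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    print(migrate(db))  # first sync applies both migrations
    print(migrate(db))  # a repeated sync applies nothing
```

Because the Job may run on every sync, the second call returning an empty list is the property that matters.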


5. Zero-downtime mechanics


6. Rollback

| Situation | Rollback |
|---|---|
| Canary failing analysis | automatic abort → stable (seconds, no human) |
| Bad release already promoted | git revert the digest bump in GitOps → ArgoCD re-syncs prior digest |
| Schema regression | roll forward (expand/contract); PITR (12) only as last resort |

Rollback is first-class and boring because artifacts are immutable and state is in Git — revert the commit, the platform converges.
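
The "revert the commit" path can be pictured as an append-only history of desired digests, where a revert adds a new commit that restores the previous value. This is a toy model of the Git history, not Git or ArgoCD code; the function names and the digests are made up for illustration.

```python
def revert_last(history: list[str]) -> list[str]:
    """Model `git revert` of the latest digest bump.

    History is never rewritten: the revert is a new commit whose desired
    digest equals the one before the bad bump.
    """
    if len(history) < 2:
        raise ValueError("no prior digest to return to")
    return history + [history[-2]]


def desired_digest(history: list[str]) -> str:
    """The digest the platform converges to is whatever the latest commit declares."""
    return history[-1]


if __name__ == "__main__":
    history = ["sha256:aaa", "sha256:bbb"]  # hypothetical digests; bbb is the bad release
    history = revert_last(history)
    print(history, desired_digest(history))
```

Since the old image is immutable and still in the registry, converging to the earlier digest redeploys exactly the artifact that was running before.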


7. Tradeoffs / Alternatives / Scaling

Tradeoffs. Canary + analysis is the safest option but also the most complex: it needs reliable metrics, traffic shaping, and two versions running at once, which adds transient cost. That cost is worth it in prod; dev/staging use fast rolling updates.

Alternatives considered.

Scaling concerns.

References