Rollout Strategy
Strategy by Workload
Different workload types require different rollout strategies based on their statefulness and traffic sensitivity.
| Workload | Strategy | Why |
|---|---|---|
| bitvaultd (control plane API) | Argo Rollouts canary + automated analysis | Serves live user traffic; gradual shift detects regressions before full promotion |
| bitvault-web (Next.js SSR) | Canary or blue-green | Stateless; blue-green enables instant switch-back; canary for gradual UX rollout |
| bitvault-worker (async consumers) | Rolling update | NATS consumers are idempotent (at-least-once delivery); brief dual-version operation is safe |
| PostgreSQL | Operator-managed rolling with automated failover | CloudNativePG or equivalent promotes a replica; PDB prevents simultaneous replica loss |
| Redis | Operator-managed rolling | Sentinel/Cluster mode handles leader election during rolling update |
| NATS JetStream | Operator-managed rolling | JetStream cluster re-elects on member removal; PDB ensures quorum |
| OpenSearch | Operator-managed shard rebalancing | OpenSearch Operator drains shards before pod removal |
Canary with Automated Analysis
Argo Rollouts manages the canary lifecycle for bitvaultd and bitvault-web. Promotion between canary steps requires passing an AnalysisTemplate that queries the observability stack.
flowchart TD
sync["New digest synced\nby ArgoCD"]
canary10["Canary: 10% traffic\nto new version"]
analysis1["AnalysisRun:\nerror rate < 0.1%\np99 latency < 500ms\nCPU saturation < 80%"]
canary30["Canary: 30% traffic"]
analysis2["AnalysisRun:\nsame metrics\n(extended window)"]
promote["Promote: 100%\n(stable = new version)"]
abort["Abort + auto-rollback\nto stable version"]
sync --> canary10 --> analysis1
analysis1 -->|"pass"| canary30 --> analysis2
analysis2 -->|"pass"| promote
analysis1 -->|"fail"| abort
analysis2 -->|"fail"| abort
The AnalysisTemplate queries Prometheus for:
- Error rate: `sum(rate(http_requests_total{status=~"5.."}[2m])) / sum(rate(http_requests_total[2m])) < 0.001`
- p99 latency: `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[2m])) < 0.5`
- CPU saturation: pod CPU utilization against request < 80%
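The three checks above combine into a single pass/fail gate. The following is a minimal Python sketch of that gate, assuming the metric values have already been fetched from Prometheus. The `CanaryMetrics` type and the `analysis_passes` function are hypothetical names used for illustration; they are not part of Argo Rollouts.

```python
from dataclasses import dataclass

# Thresholds taken from the analysis criteria listed above.
MAX_ERROR_RATE = 0.001      # 0.1% of requests returning 5xx
MAX_P99_LATENCY_S = 0.5     # 500 ms
MAX_CPU_SATURATION = 0.80   # CPU usage as a fraction of the pod's CPU request


@dataclass(frozen=True)
class CanaryMetrics:
    """Hypothetical container for one analysis sample."""
    requests_5xx_per_s: float
    requests_total_per_s: float
    p99_latency_s: float
    cpu_usage_cores: float
    cpu_request_cores: float


def error_rate(m: CanaryMetrics) -> float:
    # Mirrors sum(rate(5xx)) / sum(rate(all)).
    # Assumption: a window with no traffic is treated as zero errors.
    if m.requests_total_per_s == 0:
        return 0.0
    return m.requests_5xx_per_s / m.requests_total_per_s


def analysis_passes(m: CanaryMetrics) -> bool:
    """All three conditions must hold for the canary step to pass."""
    cpu_saturation = m.cpu_usage_cores / m.cpu_request_cores
    return (
        error_rate(m) < MAX_ERROR_RATE
        and m.p99_latency_s < MAX_P99_LATENCY_S
        and cpu_saturation < MAX_CPU_SATURATION
    )
```

How an empty traffic window should be scored is a policy choice; this sketch treats it as a pass on error rate, which may not be appropriate for a low-traffic service.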
A failed analysis triggers an immediate abort: Argo Rollouts shifts 100% of traffic back to the stable version and sets the Rollout to Degraded. The GitOps repo is not automatically reverted — a human must investigate and open a revert PR or a forward-fix PR.
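The step-and-gate behavior described above can be modeled as a small loop. This Python sketch is a simplified model for reasoning about the flow, not how Argo Rollouts is implemented. The `run_canary` function and its callback are hypothetical, and the returned state strings are just labels in this sketch.

```python
from typing import Callable, Sequence


def run_canary(
    weights: Sequence[int],
    analysis_passes: Callable[[int], bool],
) -> tuple[str, list[int]]:
    """Walk the canary steps; return the final state and the traffic-weight history.

    `analysis_passes(weight)` stands in for the AnalysisRun at that step.
    """
    history: list[int] = []
    for weight in weights:
        history.append(weight)      # shift this share of traffic to the canary
        if not analysis_passes(weight):
            history.append(0)       # abort: all traffic returns to stable
            return "Degraded", history
    history.append(100)             # every gate passed: promote the new version
    return "Healthy", history
```

For the two-step flow in the diagram, `run_canary([10, 30], gate)` either ends with the canary at 100% or returns to 0% at the first failing gate. In neither case does the sketch touch the GitOps repo, which matches the manual revert described above.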
Database Migrations: Expand/Contract
Database schema changes follow the expand/contract pattern to ensure zero-downtime deployments. Forward-only migrations are enforced — no DOWN scripts exist in the migration files.
flowchart LR
e["1. EXPAND\nAdditive migration\n(new column, new table,\nnew index)\nDeployed BEFORE new code"]
c["2. MIGRATE CODE\nNew code reads/writes\nboth old and new shape\n(dual-read/dual-write\nif required)"]
x["3. CONTRACT\nRemove old shape\n(drop column/table)\nDeployed in a LATER\nrelease after old code\nis fully gone"]
e --> c --> x
Concrete rules:
- A migration that adds a column, table, or index may ship in the same release as the code that uses it.
- A migration that removes a column, table, or index must ship in a separate release after all code reading the old shape has been promoted to production and verified stable.
- Renaming a column is two releases: release 1 adds the new column and backfills; release 2 drops the old column after the code cutover is complete.
- Migrations run as ArgoCD PreSync hook Jobs (see GitOps). A failed migration Job blocks the code deployment.
- Migrations are idempotent and transactional where the database engine permits (DDL transactions in PostgreSQL).
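The first three rules above could be enforced with a lint step in CI. The following Python sketch classifies DDL statements as additive or destructive, assuming migrations are available as a list of SQL strings. The pattern list is illustrative rather than exhaustive, and `classify`, `may_ship_with_code`, and the table names in the usage note are hypothetical.

```python
import re

# Statement shapes treated as destructive ("contract") in this sketch.
# An in-place rename counts as destructive because old code stops working.
CONTRACT_PATTERNS = [
    re.compile(r"^\s*DROP\s+(TABLE|INDEX)\b", re.IGNORECASE),
    re.compile(r"^\s*ALTER\s+TABLE\b.*\bDROP\s+COLUMN\b", re.IGNORECASE | re.DOTALL),
    re.compile(r"^\s*ALTER\s+TABLE\b.*\bRENAME\b", re.IGNORECASE | re.DOTALL),
]


def classify(statement: str) -> str:
    """Return "contract" for destructive DDL, otherwise "expand"."""
    if any(p.search(statement) for p in CONTRACT_PATTERNS):
        return "contract"
    return "expand"


def may_ship_with_code(statements: list[str]) -> bool:
    """A migration may share a release with the code using it only if purely additive."""
    return all(classify(s) == "expand" for s in statements)
```

With this sketch, `may_ship_with_code(["ALTER TABLE vaults ADD COLUMN region text"])` is true. Adding `"ALTER TABLE vaults DROP COLUMN zone"` to the same list makes it false, which signals that the drop belongs in a later release.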
:::danger Schema rollbacks
Never roll back database migrations against production data. Rolling back a migration that has already been applied can destroy data written by the new schema. If a migration introduced a regression, roll forward with a corrective migration using the expand/contract pattern. Point-in-time recovery (PITR) from the PostgreSQL backup is a last resort reserved for catastrophic data corruption, not for schema design mistakes.
:::
Zero-Downtime Mechanics
The combination of the following mechanisms ensures deployments complete with no dropped requests:
| Mechanism | Role |
|---|---|
| Readiness probes | A pod is added to Service endpoints only after it has finished starting, and is removed once it begins failing its probe. No traffic is routed to an unready pod. |
| PodDisruptionBudgets | Cluster operations (node drain, cluster upgrade) cannot remove more replicas than the quorum can tolerate simultaneously. |
| Graceful shutdown | On SIGTERM the process stops accepting new connections and drains in-flight requests before exiting. A preStop hook adds a brief delay so endpoint removal can propagate before shutdown begins. |
| HPA | Horizontal Pod Autoscaler keeps the replica count sized to current load, so capacity is sufficient while the rolling update's surge policy starts new pods before old ones are removed. |
| Argo Rollouts canary | Traffic shifts gradually; full promotion only after analysis passes. Rollback is automatic on analysis failure. |
| Expand/contract migrations | Schema changes are always additive first; destructive removals deferred to a later release. Both old and new code can run simultaneously during rollout. |
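The graceful-shutdown row above relies on two behaviors: refusing new work once shutdown starts, and waiting for in-flight requests to finish. The following Python sketch shows that drain logic in isolation, using only the standard library. `DrainGate` is a hypothetical helper, and wiring it to a SIGTERM handler and an HTTP server is left out.

```python
import threading


class DrainGate:
    """Tracks in-flight requests so shutdown can wait for them to finish."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0
        self._draining = False

    def try_begin(self) -> bool:
        """Admit a request unless shutdown has started."""
        with self._cond:
            if self._draining:
                return False
            self._in_flight += 1
            return True

    def end(self) -> None:
        """Mark one admitted request as finished."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def shutdown(self, timeout: float) -> bool:
        """Stop admitting work, then wait for in-flight requests.

        Returns True if everything drained within `timeout` seconds.
        """
        with self._cond:
            self._draining = True
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout=timeout)
```

In a real service the timeout would need to stay below the pod's termination grace period, since a process that is still draining when that period ends is killed.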
Rollback
| Situation | Rollback Action |
|---|---|
| Canary failing automated analysis | Argo Rollouts aborts and shifts traffic back to stable automatically. No manual action required unless the analysis threshold itself is wrong. |
| Bad fully-promoted release (post-canary) | Open a PR in the GitOps repo reverting the image.digest bump. ArgoCD re-syncs to the previous digest, and replacement pods start from the prior image, which is still in the registry. |
| Schema migration regression (non-destructive) | Deploy a forward-fix migration using expand/contract. The prior application version must tolerate the schema during the fix rollout. |
| Schema migration regression (data corruption) | Invoke PostgreSQL PITR to restore to a point in time before the migration ran. This is a major incident procedure: all data written after the restore point is lost. |
:::warning
Code rollback and schema rollback are independent operations. Rolling back the application code does not roll back the database schema. Always verify that the prior code version is compatible with the current (expanded) schema before initiating a code rollback.
:::
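The compatibility check in the warning above can be made mechanical if each release declares the columns it reads and writes. The following Python sketch assumes such a declaration exists; the `required` mapping, the function names, and the table and column names in the usage note are all hypothetical.

```python
def missing_columns(
    required: dict[str, set[str]],
    current_schema: dict[str, set[str]],
) -> dict[str, set[str]]:
    """Return, per table, the columns the prior code needs that the schema lacks."""
    missing: dict[str, set[str]] = {}
    for table, columns in required.items():
        absent = columns - current_schema.get(table, set())
        if absent:
            missing[table] = absent
    return missing


def code_rollback_is_safe(
    required: dict[str, set[str]],
    current_schema: dict[str, set[str]],
) -> bool:
    """A code rollback is safe only if nothing the prior version needs has been removed."""
    return not missing_columns(required, current_schema)
```

If the prior version needs `{"vaults": {"id", "zone"}}` and the live schema still has `zone` alongside a newly added `region`, the rollback is safe. Once a contract migration has dropped `zone`, `missing_columns` reports it and the rollback should not proceed.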