Rollout Strategy

Strategy by Workload

Different workload types require different rollout strategies based on their statefulness and traffic sensitivity.

| Workload | Strategy | Why |
| --- | --- | --- |
| bitvaultd (control plane API) | Argo Rollouts canary + automated analysis | Serves live user traffic; gradual shift detects regressions before full promotion |
| bitvault-web (Next.js SSR) | Canary or blue-green | Stateless; blue-green enables instant switch-back; canary for gradual UX rollout |
| bitvault-worker (async consumers) | Rolling update | NATS consumers are idempotent (at-least-once delivery); brief dual-version operation is safe |
| PostgreSQL | Operator-managed rolling with automated failover | CloudNativePG or equivalent promotes a replica; PDB prevents simultaneous replica loss |
| Redis | Operator-managed rolling | Sentinel/Cluster mode handles leader election during rolling update |
| NATS JetStream | Operator-managed rolling | JetStream cluster re-elects on member removal; PDB ensures quorum |
| OpenSearch | Operator-managed shard rebalancing | OpenSearch Operator drains shards before pod removal |

Canary with Automated Analysis

Argo Rollouts manages the canary lifecycle for bitvaultd and bitvault-web. Promotion between canary steps requires passing an AnalysisTemplate that queries the observability stack.

flowchart TD
    sync["New digest synced\nby ArgoCD"]
    canary10["Canary: 10% traffic\nto new version"]
    analysis1["AnalysisRun:\nerror rate < 0.1%\np99 latency < 500ms\nCPU saturation < 80%"]
    canary30["Canary: 30% traffic"]
    analysis2["AnalysisRun:\nsame metrics\n(extended window)"]
    promote["Promote: 100%\n(stable = new version)"]
    abort["Abort + auto-rollback\nto stable version"]

    sync --> canary10 --> analysis1
    analysis1 -->|"pass"| canary30 --> analysis2
    analysis2 -->|"pass"| promote
    analysis1 -->|"fail"| abort
    analysis2 -->|"fail"| abort

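The AnalysisTemplate queries Prometheus for the three metrics shown in the flowchart: error rate (below 0.1%), p99 latency (below 500 ms), and CPU saturation (below 80%).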

A failed analysis triggers an immediate abort: Argo Rollouts shifts 100% of traffic back to the stable version and sets the Rollout to Degraded. The GitOps repo is not automatically reverted — a human must investigate and open a revert PR or a forward-fix PR.
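The gating logic in the flowchart can be modeled in a few lines of Python. This is only an illustrative sketch: in the real system the checks are performed by Argo Rollouts AnalysisRuns against Prometheus, and the names `Metrics`, `analysis_passes`, and `run_canary` are hypothetical. The thresholds and the 10%/30% steps are taken from the flowchart.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

# Thresholds copied from the flowchart; everything else here is illustrative.
MAX_ERROR_RATE = 0.001       # 0.1% expressed as a fraction
MAX_P99_LATENCY_MS = 500.0
MAX_CPU_SATURATION = 0.80    # 80% expressed as a fraction


@dataclass(frozen=True)
class Metrics:
    """One analysis window's worth of measurements (hypothetical shape)."""
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float
    cpu_saturation: float    # fraction, 0.0 to 1.0


def analysis_passes(m: Metrics) -> bool:
    """All three conditions must hold for the step to pass."""
    return (
        m.error_rate < MAX_ERROR_RATE
        and m.p99_latency_ms < MAX_P99_LATENCY_MS
        and m.cpu_saturation < MAX_CPU_SATURATION
    )


def run_canary(
    samples: Sequence[Metrics], weights: Sequence[int] = (10, 30)
) -> Tuple[str, int]:
    """Walk the canary steps in order.

    Returns (outcome, traffic percentage served by the new version).
    Any failed analysis aborts and returns all traffic to stable.
    """
    if len(samples) != len(weights):
        raise ValueError("expected one metrics sample per canary step")
    for _weight, metrics in zip(weights, samples):
        if not analysis_passes(metrics):
            return ("aborted", 0)
    return ("promoted", 100)
```

For example, `run_canary([healthy, slow])` with a second sample whose p99 latency is 750 ms aborts at the 30% step and returns `("aborted", 0)`, matching the "fail" edge from the second AnalysisRun.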

Database Migrations: Expand/Contract

Database schema changes follow the expand/contract pattern to ensure zero-downtime deployments. Forward-only migrations are enforced — no DOWN scripts exist in the migration files.

flowchart LR
    e["1. EXPAND\nAdditive migration\n(new column, new table,\nnew index)\nDeployed BEFORE new code"]
    c["2. MIGRATE CODE\nNew code reads/writes\nboth old and new shape\n(dual-read/dual-write\nif required)"]
    x["3. CONTRACT\nRemove old shape\n(drop column/table)\nDeployed in a LATER\nrelease after old code\nis fully gone"]

    e --> c --> x
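The three phases can be walked through end to end with an in-memory SQLite database. This is a minimal sketch under stated assumptions: the `users` table and its `full_name` and `display_name` columns are hypothetical, SQLite stands in for PostgreSQL, and the contract step rebuilds the table only because that works on every SQLite version (on PostgreSQL it would typically be an `ALTER TABLE ... DROP COLUMN`).

```python
import sqlite3


def expand(db: sqlite3.Connection) -> None:
    # EXPAND: additive only, so statements issued by old code keep working.
    db.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def old_code_insert(db: sqlite3.Connection, name: str) -> None:
    # Old application version: knows nothing about display_name.
    db.execute("INSERT INTO users (full_name) VALUES (?)", (name,))


def new_code_insert(db: sqlite3.Connection, name: str) -> None:
    # MIGRATE CODE: dual-write both the old and the new shape.
    db.execute(
        "INSERT INTO users (full_name, display_name) VALUES (?, ?)",
        (name, name),
    )


def new_code_read(db: sqlite3.Connection, user_id: int) -> str:
    # Dual-read: prefer the new column, fall back to the old one.
    row = db.execute(
        "SELECT COALESCE(display_name, full_name) FROM users WHERE id = ?",
        (user_id,),
    ).fetchone()
    return row[0]


def backfill(db: sqlite3.Connection) -> None:
    # Fill the new column for rows written by old code.
    db.execute(
        "UPDATE users SET display_name = full_name WHERE display_name IS NULL"
    )


def contract(db: sqlite3.Connection) -> None:
    # CONTRACT: remove the old shape, in a later release, once no running
    # code reads or writes full_name. Table rebuild is a portable stand-in
    # for dropping the column.
    db.executescript(
        """
        CREATE TABLE users_new (id INTEGER PRIMARY KEY, display_name TEXT NOT NULL);
        INSERT INTO users_new (id, display_name) SELECT id, display_name FROM users;
        DROP TABLE users;
        ALTER TABLE users_new RENAME TO users;
        """
    )


def demo():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)"
    )
    old_code_insert(db, "Ada")
    expand(db)                      # deployed before the new code
    old_code_insert(db, "Grace")    # old pods are still serving traffic
    new_code_insert(db, "Linus")    # new pods dual-write
    names_during_rollout = [new_code_read(db, i) for i in (1, 2, 3)]
    backfill(db)
    contract(db)                    # later release
    columns = [row[1] for row in db.execute("PRAGMA table_info(users)")]
    return names_during_rollout, columns
```

The point of the ordering is visible in `demo()`: the old insert succeeds both before and after `expand`, so old and new pods can coexist during the rollout, and the destructive step runs only after the backfill has made the old column redundant.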

Concrete rules:

:::danger Schema rollbacks
Never roll back database migrations against production data. Rolling back a migration that has already been applied can destroy data written by the new schema. If a migration introduced a regression, roll forward with a corrective migration using the expand/contract pattern. Point-in-time recovery (PITR) from the PostgreSQL backup is a last resort reserved for catastrophic data corruption, not for schema design mistakes.
:::

Zero-Downtime Mechanics

The combination of the following mechanisms ensures deployments complete with no dropped requests:

| Mechanism | Role |
| --- | --- |
| Readiness probes | A pod is added to Service endpoints only after it passes its readiness probe and is removed once the probe begins failing. No traffic is routed to an unready pod. |
| PodDisruptionBudgets | Voluntary disruptions (node drain, cluster upgrade) cannot evict more replicas at once than the budget allows, so quorum is preserved. |
| Graceful shutdown | On SIGTERM the process stops accepting new connections; in-flight requests drain before process exit. A preStop hook adds a brief delay to let endpoint propagation catch up. |
| HPA | The Horizontal Pod Autoscaler keeps the replica count matched to load during the rollout; the rolling-update surge policy starts new pods before old ones are removed. |
| Argo Rollouts canary | Traffic shifts gradually; full promotion happens only after analysis passes. Rollback is automatic on analysis failure. |
| Expand/contract migrations | Schema changes are always additive first; destructive removals are deferred to a later release. Both old and new code can run simultaneously during rollout. |
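The graceful-shutdown row can be illustrated with a small in-process model. This is a hedged sketch, not the services' actual implementation: `GracefulServer` and `install_sigterm_handler` are hypothetical names, and a real server would wire the same idea into its HTTP framework's shutdown hooks.

```python
import signal
import threading


class GracefulServer:
    """Tracks in-flight requests so shutdown can wait for them to drain."""

    def __init__(self) -> None:
        self._accepting = True
        self._in_flight = 0
        self._cond = threading.Condition()

    def try_begin(self) -> bool:
        """Admit a request, or refuse it if shutdown has started."""
        with self._cond:
            if not self._accepting:
                return False
            self._in_flight += 1
            return True

    def end(self) -> None:
        """Mark one admitted request as finished."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def shutdown(self, timeout: float) -> bool:
        """Stop admitting requests, then wait up to `timeout` seconds.

        Returns True if every in-flight request drained in time.
        """
        with self._cond:
            self._accepting = False
            return self._cond.wait_for(
                lambda: self._in_flight == 0, timeout
            )


def install_sigterm_handler(stop_event: threading.Event) -> None:
    # The handler only sets a flag; the main loop is expected to notice
    # the event and call GracefulServer.shutdown() outside signal context.
    signal.signal(signal.SIGTERM, lambda signum, frame: stop_event.set())
```

The drain timeout should be chosen to fit inside the pod's termination grace period, since the kubelet sends SIGKILL once that period expires regardless of what is still in flight.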

Rollback

| Situation | Rollback Action |
| --- | --- |
| Canary failing automated analysis | Argo Rollouts aborts and shifts traffic back to stable automatically. No manual action is required unless the analysis threshold itself is wrong. |
| Bad fully-promoted release (post-canary) | Open a PR in the GitOps repo reverting the image.digest bump. ArgoCD re-syncs to the previous digest, and pods are recreated from the prior image, which must still be in the registry. |
| Schema migration regression (non-destructive) | Deploy a forward-fix migration using expand/contract. The prior application version must tolerate the schema during the fix rollout. |
| Schema migration regression (data corruption) | Invoke PostgreSQL PITR to restore to a point before the migration. This is a major incident procedure: data written after the restore point is lost. |

:::warning
Code rollback and schema rollback are independent operations. Rolling back the application code does not roll back the database schema. Always verify that the prior code version is compatible with the current (expanded) schema before initiating a code rollback.
:::
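One way to make the compatibility check concrete is to compare the columns the prior code version depends on against the live schema. The sketch below is an assumption-laden illustration: the `Schema` shape, the function names, and the example tables are hypothetical, and it checks only for missing columns.

```python
from typing import Dict, Set

# Table name -> set of column names (hypothetical representation).
Schema = Dict[str, Set[str]]


def missing_columns(required: Schema, current: Schema) -> Schema:
    """Return the columns the prior code needs that the live schema lacks."""
    missing: Schema = {}
    for table, columns in required.items():
        absent = columns - current.get(table, set())
        if absent:
            missing[table] = absent
    return missing


def is_code_rollback_safe(required: Schema, current: Schema) -> bool:
    """A code rollback is considered safe when nothing required is missing."""
    return not missing_columns(required, current)


# What the prior application version reads and writes.
prior_code_needs = {"users": {"id", "full_name"}}

# After EXPAND: the old column is still present, so rollback is safe.
expanded = {"users": {"id", "full_name", "display_name"}}

# After CONTRACT: the old column is gone, so rollback would break.
contracted = {"users": {"id", "display_name"}}
```

With these inputs, `is_code_rollback_safe(prior_code_needs, expanded)` is true and the same call against `contracted` is false. A missing-column check is necessary but not sufficient: a new `NOT NULL` column without a default would also break inserts from the prior version, which is one reason expand migrations add columns as nullable or with defaults.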