GitOps with ArgoCD
BitVault uses a pull-based GitOps model (ADR-0028). The cluster’s desired state lives in Git. ArgoCD continuously reconciles the live cluster against it. GitHub Actions never holds a kubeconfig — CI’s authority ends at the registry and the GitOps repository.
The Deployment Loop
flowchart LR
push["Code push\nto main / tag"]
ci["GitHub Actions\nbuild + sign\n(OIDC auth)"]
registry["OCI Registry\n(signed image + SBOM\n+ provenance)"]
pr["GitOps PR\n(digest bump)"]
gitops["GitOps Repo\n(desired state)"]
argocd["ArgoCD\n(pull + render)"]
cluster["Kubernetes\nCluster"]
push --> ci --> registry
ci --> pr --> gitops
argocd -->|"watches"| gitops
argocd -->|"applies"| cluster
CI’s scope: build → sign → push → open PR. Cluster state changes are exclusively ArgoCD’s responsibility.
ArgoCD Application Structure
BitVault manages cluster resources through two ArgoCD object types:
Application — one per environment per service group (or one per environment for the monolith phase):
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: bitvault-prod
  namespace: argocd
spec:
  project: bitvault
  source:
    repoURL: https://github.com/bitvault/gitops
    targetRevision: main
    path: envs/prod
    helm:
      valueFiles:
        - values.prod.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: bitvault-prod
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
ApplicationSet — used for matrix generation across environments and services, and for ephemeral preview environments:
- Env × service matrix: Generates one Application per (environment, service) combination from a list generator.
- PR preview generator: Uses the ArgoCD pullRequest generator to create bitvault-preview-pr-N applications for each open PR automatically.
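The env × service matrix can be sketched as an ApplicationSet that combines two list generators under a matrix generator. This is a minimal illustration, not BitVault's actual manifest: the ApplicationSet name, the service list, the `envs/{{env}}/{{service}}` path layout, and the single in-cluster destination are all assumptions.

```yaml
# Sketch only: names, paths, and the destination below are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: bitvault-matrix            # hypothetical name
  namespace: argocd
spec:
  generators:
    - matrix:                      # cross product of the two child generators
        generators:
          - list:
              elements:
                - env: dev
                - env: staging
                - env: prod
          - list:
              elements:
                - service: bitvaultd
                - service: bitvault-worker
                - service: bitvault-web
  template:
    metadata:
      name: '{{service}}-{{env}}'  # one Application per (environment, service)
    spec:
      project: bitvault
      source:
        repoURL: https://github.com/bitvault/gitops
        targetRevision: main
        path: 'envs/{{env}}/{{service}}'   # assumed directory layout
      destination:
        server: https://kubernetes.default.svc   # assumes one cluster; use per-env servers otherwise
        namespace: 'bitvault-{{env}}'
```

With three environments and three services, this template would yield nine Applications. Adding a service or an environment is then a one-line change to the corresponding list.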
Sync waves control ordering within a single sync operation (see Sync Waves below).
Sync Waves
Resources within an ArgoCD Application sync in wave order. Resources in the same wave sync in parallel; the next wave does not start until all resources in the current wave are Healthy.
| Wave | Resource Types | Rationale |
|---|---|---|
| -3 | Namespaces, CRDs | Must exist before anything else can be created |
| -2 | RBAC (ClusterRole, RoleBinding), ServiceAccounts | Required by operators and workloads |
| -1 | Operators (external-secrets, cert-manager, KEDA controllers), ExternalSecret resources | Must be running before app config secrets can be materialized |
| 0 | Data services (PostgreSQL, Redis, NATS), DB migration PreSync hook Jobs | Migration Job runs as a PreSync hook and must succeed before wave 0 resources are created |
| 1 | bitvaultd, bitvault-worker, bitvault-web Deployments + Services | App workloads start after data services are ready and migrations complete |
| 2 | Ingress / HTTPRoute, HPA, KEDA ScaledObjects, PodDisruptionBudgets | Traffic routing and scaling activated after workloads are healthy |
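A resource is assigned to a wave with the `argocd.argoproj.io/sync-wave` annotation; resources without it default to wave 0. The fragment below is a sketch of how the first row of the table might look in a manifest. The Namespace name is taken from the Application example, and the rest is illustrative.

```yaml
# Sketch: placing a Namespace in wave -3 so it syncs before everything else.
apiVersion: v1
kind: Namespace
metadata:
  name: bitvault-prod
  annotations:
    # The value must be a quoted string; lower waves sync first, negatives are allowed.
    argocd.argoproj.io/sync-wave: "-3"
```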
The DB migration Job uses the ArgoCD PreSync hook annotation. PreSync hooks run before the Sync phase begins, so the Job must run (and pass) before ArgoCD applies the resources of any wave, wave 0 included:
annotations:
  argocd.argoproj.io/hook: PreSync
  argocd.argoproj.io/hook-delete-policy: HookSucceeded
A failed migration Job blocks the entire sync. This is intentional — a failed migration must not be followed by a code deployment.
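For context, a complete migration Job carrying these annotations might look like the sketch below. The Job name prefix, image reference, and `migrate up` arguments are hypothetical placeholders; only the two annotations come from this page.

```yaml
# Sketch only: name, image, and args are assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  generateName: bitvault-migrate-   # hypothetical; gives each sync a fresh Job name
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  backoffLimit: 0                   # fail fast so a broken migration blocks the sync quickly
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: registry.example.com/bitvault/bitvaultd@sha256:<digest>   # placeholder
          args: ["migrate", "up"]   # hypothetical migration entrypoint
```

With `HookSucceeded`, ArgoCD deletes the Job only after it succeeds, so a failed run stays in the cluster for inspection.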
Promotion Flow
flowchart TD
ci["CI builds signed\ndigest D on main merge"]
gpr_dev["GitOps PR:\nbump digest in envs/dev/\n(auto-merge bot)"]
dev["ArgoCD auto-syncs\ndev cluster"]
gpr_staging["GitOps PR:\nbump digest in envs/staging/\n(requires PR approval)"]
staging["ArgoCD syncs\nstaging cluster\n(abbreviated canary)"]
gpr_prod["GitOps PR:\nbump digest in envs/prod/\n(requires PR approval\n+ manual gate)"]
prod["ArgoCD syncs\nprod cluster\n(full canary + analysis)"]
ci --> gpr_dev --> dev --> gpr_staging
gpr_staging --> staging --> gpr_prod --> prod
- dev: The CI bot auto-approves and merges the GitOps PR immediately after the image is pushed. ArgoCD’s automated sync policy (syncPolicy.automated) syncs within seconds.
- staging: A human engineer reviews and approves the GitOps PR. No time-gating beyond PR review.
- prod: Two gates: PR approval and a separate manual approval step in the GitHub Actions workflow. After both gates, ArgoCD takes over with the configured Argo Rollouts canary strategy.
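In each environment, the promotion PR reduces to a one-line change in that environment's values file. The fragment below sketches what such a file might contain. The `image.repository` key and the registry path are assumptions, and the digest is a placeholder rather than a real value.

```yaml
# envs/prod/values.prod.yaml (fragment; key layout is an assumption)
image:
  repository: registry.example.com/bitvault/bitvaultd   # hypothetical registry path
  digest: sha256:<digest-D>   # the only line a promotion PR changes
```

Pinning by digest rather than by tag means the same bytes that passed dev and staging are the ones that reach prod.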
Drift Detection
ArgoCD continuously compares the desired state (rendered Helm chart from the GitOps repo commit) against the live state of the cluster:
- If a resource drifts (e.g., someone kubectl apply-ed a change directly), ArgoCD detects the diff on its next refresh cycle (default: 3 minutes; immediate on webhook push).
- With selfHeal: true, ArgoCD automatically reverts the drift by re-applying the desired state.
- Drift events are surfaced as ArgoCD SyncStatus warnings and trigger an alert via the observability stack (Prometheus → Alertmanager → PagerDuty/Slack).
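One way to wire the alert is a Prometheus rule on the `argocd_app_info` metric exported by the ArgoCD application controller. The sketch below assumes the Prometheus Operator (`PrometheusRule` CRD) is installed. The rule name, the 5-minute hold, and the severity label are illustrative choices, not BitVault's actual configuration.

```yaml
# Sketch only: assumes Prometheus Operator; names and thresholds are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: argocd-drift               # hypothetical name
  namespace: argocd
spec:
  groups:
    - name: argocd.drift
      rules:
        - alert: ArgoCDAppOutOfSync
          # argocd_app_info is 1 per application, labelled with its current sync status
          expr: argocd_app_info{sync_status="OutOfSync"} == 1
          for: 5m                  # ignore brief OutOfSync windows during normal syncs
          labels:
            severity: warning
          annotations:
            summary: "ArgoCD application {{ $labels.name }} is OutOfSync"
```

If this rule is shipped inside a Helm chart, the `{{ $labels.name }}` expression must be escaped so Helm does not try to render it.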
Direct kubectl edits to production resources are therefore both reverted and alerted on. This is a feature, not a bug — the GitOps repo is the single source of truth.
Rollback
Rollback is a Git operation, not a kubectl operation:
- Identify the last-known-good commit in the GitOps repo (the one with the previous image.digest value).
- Open a PR reverting the digest bump commit, or revert it and push directly to the branch ArgoCD tracks (main).
- ArgoCD detects the change and re-syncs the cluster to the reverted digest.
- The previous image, still available in the registry under its immutable digest, is redeployed.
For canary-in-progress failures, Argo Rollouts aborts and rolls back automatically without any manual Git revert. See Rollout Strategy for details.
:::note
There is no helm rollback in this workflow. The Helm release history in the cluster is not the source of truth — the GitOps repo commit history is. A helm rollback would be overwritten by ArgoCD’s next reconciliation cycle anyway.
:::