BitVault Platform Engineering — Design

Audience: platform / DevOps / SRE engineers. Scope: how BitVault is built, shipped, run, and recovered — containers, Kubernetes, IaC, GitOps, CI/CD, secrets, releases, backup, and disaster recovery — at production grade, for the SaaS deployment, with self-host parity where it matters.

Builds on ADR-0012 tiered packaging, ADR-0013 observability, ADR-0014 KMS, the storage and sync subsystems, and the high-level deployment topology.

No implementation code — architecture and documentation only. Each decision carries Tradeoffs / Alternatives / Scaling; the contested ones are ADRs 0028–0034.


0. Reading order & task map

| #  | Doc | Task(s) |
|----|-----|---------|
| 01 | Containerization (Docker) | Docker |
| 02 | Kubernetes namespaces | 1. namespaces |
| 03 | Environment strategy | 3. environments |
| 04 | Deployment strategy | 2. deployment |
| 05 | Helm & config | Helm |
| 06 | GitOps & ArgoCD | 6. GitOps |
| 07 | CI/CD & image pipelines | 5. image builds, 7. CI/CD |
| 08 | Release workflows | 8. releases |
| 09 | Secrets management | 4. secrets |
| 10 | Infrastructure (Terraform/OpenTofu) | IaC |
| 11 | Disaster recovery | 9. DR |
| 12 | Backup strategies | 10. backups |

1. Platform principles (the rules)

  1. Git is the source of truth. Desired state — infra, manifests, config — lives in Git. The cluster is reconciled toward Git, never mutated out of band.
  2. CI pushes artifacts; CD pulls state. CI (GitHub Actions) builds, tests, scans, signs, and pushes immutable images, opens a digest-bump PR in the GitOps repo, and then stops. Deployment is pull-based GitOps (ArgoCD) reconciling from Git. CI never holds a kubeconfig (ADR-0028).
  3. IaC provisions the substrate; GitOps runs everything inside. OpenTofu creates clusters, networks, buckets, KMS, IAM — and bootstraps ArgoCD. ArgoCD then owns all in-cluster resources. A clean, documented boundary (ADR-0031).
  4. Immutable, promoted-by-digest artifacts. An image built once is promoted by digest through environments — never rebuilt per environment (ADR-0032/0034).
  5. No secrets in Git, ever. References in Git; secret material in a KMS/vault, synced by External Secrets Operator (ADR-0030).
  6. Progressive delivery with automated analysis. Canary with metric-based auto-promote/auto-rollback (Argo Rollouts), not big-bang deploys (ADR-0029).
  7. Clusters are cattle. Everything to rebuild a cluster is in IaC + Git, so DR is “re-provision + re-sync + restore data,” not heroics (ADR-0033). This is the single biggest operational payoff of the whole design.
  8. Supply-chain security by default. Keyless signing (cosign/OIDC), SBOMs, SLSA provenance, image scanning, admission-time verification (ADR-0032).
  9. Self-host parity. The same Helm charts power SaaS and self-host; self-host uses a subset (Compose or a lightweight cluster), per the tiered-packaging ADR-0012.
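
Principle 6 (metric-based auto-promote/auto-rollback) can be illustrated with a minimal Python sketch of a canary gate. This is not the Argo Rollouts API and not BitVault's analysis configuration, only the decision logic in miniature. The function name `canary_verdict`, the 0.99 success threshold, and the failure limit of 1 are assumptions chosen for illustration.

```python
from typing import Sequence


def canary_verdict(
    success_rates: Sequence[float],
    threshold: float = 0.99,  # assumed threshold, not taken from the design
    failure_limit: int = 1,   # assumed number of tolerated bad measurements
) -> str:
    """Return 'promote' or 'rollback' from a series of canary measurements.

    Each measurement is the fraction of successful requests observed during
    one analysis interval. More than `failure_limit` measurements below
    `threshold` aborts the rollout; otherwise the canary is promoted.
    """
    if not success_rates:
        # No data is treated as a failure: never promote blind.
        return "rollback"
    failures = sum(1 for rate in success_rates if rate < threshold)
    return "rollback" if failures > failure_limit else "promote"


if __name__ == "__main__":
    print(canary_verdict([0.999, 0.998, 0.995]))  # every interval healthy
    print(canary_verdict([0.999, 0.90, 0.85]))    # two intervals below threshold
```

Treating missing metrics as a rollback is a deliberate choice in this sketch: an analysis that cannot see the canary should not be allowed to promote it.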

2. The delivery topology (one picture)

flowchart TB
    classDef dev fill:#dbeafe,stroke:#1e40af,color:#111827;
    classDef ci fill:#fde68a,stroke:#b45309,color:#111827;
    classDef git fill:#fbcfe8,stroke:#be185d,color:#111827;
    classDef cd fill:#bbf7d0,stroke:#15803d,color:#111827;
    classDef infra fill:#c7d2fe,stroke:#3730a3,color:#111827;

    dev["Developer → PR → merge"]:::dev
    subgraph APP["App monorepo (ADR-0002)"]
      code["code · Dockerfiles · Helm chart source"]:::dev
    end
    subgraph CI["GitHub Actions (CI = push)"]
      build["build · test · scan · SBOM · sign (cosign/OIDC)"]:::ci
      push["push image (by digest) → registry"]:::ci
      bump["open PR: bump image digest in GitOps repo"]:::ci
    end
    reg[("Container registry<br/>signed images + SBOM")]:::ci
    subgraph GITOPS["GitOps repo (desired state)"]
      apps["ArgoCD Applications / ApplicationSets"]:::git
      vals["per-env Helm values + image digests"]:::git
    end
    subgraph CD["ArgoCD (CD = pull)"]
      argo["reconcile · sync waves · drift/self-heal"]:::cd
      ro["Argo Rollouts: canary + analysis"]:::cd
    end
    subgraph CL["Kubernetes clusters"]
      np["nonprod cluster (dev + staging + previews)"]:::cd
      prod["prod cluster"]:::cd
    end
    tofu["OpenTofu (IaC): clusters · VPC · buckets · KMS · IAM · bootstrap ArgoCD"]:::infra
    cloud[("Cloud substrate")]:::infra

    dev --> code --> build --> push --> reg
    build --> bump --> vals
    apps & vals --> argo --> ro
    ro --> np & prod
    argo -. reads .-> reg
    tofu --> cloud --> np & prod
    tofu -. installs .-> argo

The split is the whole design: the CI half produces a signed artifact and a Git change; the CD/GitOps half converges the cluster to Git. They meet only in Git and the registry, never via a deploy credential handed to CI.
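
Because the only things CI hands over are an image digest and a Git change, a cheap guard is to refuse any image reference that is not pinned by digest. The sketch below is illustrative Python, not part of the design: `is_pinned_by_digest` is a hypothetical helper, the registry host in the example is made up, and it assumes sha256 digests only.

```python
import re

# A digest-pinned reference ends in "@sha256:" plus 64 lowercase hex characters.
# Assumption: only sha256 is accepted; other digest algorithms are rejected.
_DIGEST_RE = re.compile(r"[^@\s]+@sha256:[0-9a-f]{64}")


def is_pinned_by_digest(image_ref: str) -> bool:
    """True if the reference names an image by digest rather than only by tag."""
    return _DIGEST_RE.fullmatch(image_ref) is not None


if __name__ == "__main__":
    # Hypothetical registry and repository names, for illustration only.
    pinned = "registry.example.com/bitvault/api@sha256:" + "a" * 64
    tagged = "registry.example.com/bitvault/api:1.4.2"
    print(is_pinned_by_digest(pinned))
    print(is_pinned_by_digest(tagged))
```

A check like this could run in CI before the digest-bump PR is opened, so a mutable tag never reaches the GitOps repo.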


3. Repository strategy (three concerns, separated)

| Repo / area | Holds | Who writes | Who reads |
|-------------|-------|------------|-----------|
| bitvault (app monorepo, ADR-0002) | source, Dockerfiles, Helm chart source, CI workflows | engineers (PRs) | CI |
| bitvault-gitops (config repo) | ArgoCD Application/ApplicationSet, per-env values, pinned image digests, ExternalSecret refs | CI (digest bumps), platform (config PRs) | ArgoCD |
| bitvault-infra (IaC) | OpenTofu modules + per-env stacks, remote-state config | platform (PRs + plan/apply) | OpenTofu |

Keep ArgoCD resources and Kubernetes app manifests in separate repos/areas (industry guidance: do not mix the two). The GitOps repo is the contract between CI and CD; the app repo is where humans work; the IaC repo is the substrate. Promotion = a PR in the GitOps repo (ADR-0034).
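
To make "promotion = a PR in the GitOps repo" concrete, here is a minimal Python sketch of what such a PR changes: the target environment takes the exact digest the source environment already runs. The data shape (`{env: {service: digest}}`), the function name `promote`, and the service name `api` are assumptions for illustration; the real per-env values files are not specified here.

```python
from copy import deepcopy
from typing import Dict

# Hypothetical shape of the per-env pins: {environment: {service: image digest}}.
EnvValues = Dict[str, Dict[str, str]]


def promote(values: EnvValues, service: str, source: str, target: str) -> EnvValues:
    """Return a new mapping in which `target` runs the digest `source` runs.

    Nothing is rebuilt: the digest string is copied as-is. The input mapping
    is left untouched so the result can be diffed against it, as a PR would be.
    """
    digest = values[source][service]  # KeyError if the source env never ran it
    updated = deepcopy(values)
    updated.setdefault(target, {})[service] = digest
    return updated


if __name__ == "__main__":
    current = {
        "staging": {"api": "sha256:" + "a" * 64},
        "prod": {"api": "sha256:" + "b" * 64},
    }
    proposed = promote(current, "api", source="staging", target="prod")
    print(proposed["prod"]["api"] == current["staging"]["api"])
```

Returning a copy instead of mutating in place mirrors the review step: the old and new states both exist until the change is merged.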


4. Environment & cluster topology

flowchart LR
    classDef c fill:#c7d2fe,stroke:#3730a3,color:#111827;
    classDef n fill:#bbf7d0,stroke:#15803d,color:#111827;
    subgraph NP["Nonprod cluster (cost-shared)"]
      dev["ns: dev"]:::n
      stg["ns: staging"]:::n
      prev["ns: pr-123 (ephemeral preview)"]:::n
      sysn["ns: platform (argocd, eso, ingress, monitoring)"]:::n
    end
    subgraph PR["Prod cluster (isolated blast radius)"]
      app["ns: bitvault"]:::n
      data["ns: bitvault-data (operators)"]:::n
      sysp["ns: platform"]:::n
    end
    NP:::c
    PR:::c
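
The topology above can be read as a mapping from environment to (cluster, namespace). The Python sketch below restates it under stated assumptions: the cluster labels `nonprod` and `prod`, the environment name `preview`, and the function name `placement` are illustrative. Only the namespace names (`dev`, `staging`, `pr-<number>`, `bitvault`) come from the diagram.

```python
from typing import Optional, Tuple


def placement(env: str, pr_number: Optional[int] = None) -> Tuple[str, str]:
    """Map an environment to (cluster, namespace), following the diagram."""
    if env == "prod":
        # Prod runs alone in its own cluster to isolate the blast radius.
        return ("prod", "bitvault")
    if env in ("dev", "staging"):
        # Long-lived nonprod environments share one cluster, one namespace each.
        return ("nonprod", env)
    if env == "preview":
        # Ephemeral previews get one namespace per pull request.
        if pr_number is None or pr_number <= 0:
            raise ValueError("preview environments need a positive PR number")
        return ("nonprod", f"pr-{pr_number}")
    raise ValueError(f"unknown environment: {env!r}")


if __name__ == "__main__":
    for env, pr in (("dev", None), ("staging", None), ("preview", 123), ("prod", None)):
        print(env, placement(env, pr))
```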

5. The platform stack (what runs in every cluster)

| Concern | Component |
|---------|-----------|
| GitOps CD | ArgoCD + Argo Rollouts (06, 04) |
| Ingress / TLS | ingress controller + cert-manager (ACME) |
| Secrets | External Secrets Operator ← cloud KMS/Vault (09) |
| Stateful data | operators: CloudNativePG (Postgres), Redis, NATS, OpenSearch (12) |
| Observability | OpenTelemetry Collector + Prometheus/Tempo/Loki (or vendor) (ADR-0013) |
| Backup/DR | Velero + CloudNativePG PITR (11, 12) |
| Policy/security | PodSecurity admission, NetworkPolicies, image-signature verification (02, 07) |
| Autoscaling | HPA (+ optional KEDA for queue depth), Cluster Autoscaler/Karpenter |

| ADR  | Decision |
|------|----------|
| 0028 | Pull-based GitOps with ArgoCD; separate config repo |
| 0029 | Progressive delivery with Argo Rollouts (canary + analysis) |
| 0030 | External Secrets Operator + cloud KMS |
| 0031 | IaC with OpenTofu; IaC/GitOps boundary |
| 0032 | GitHub Actions + OIDC keyless + supply-chain security |
| 0033 | Backup & DR (Velero + PITR; RTO/RPO targets) |
| 0034 | Environment & promotion model (promote by digest via PR) |

Inherited: 0001, 0002, 0012, 0013, 0014.

References (research grounding)