09 — Evolution Roadmap (Monolith → Services)
Supporting plan that operationalizes 01 §4. This is a sequencing document, not a schedule — phases are gated by demonstrated need, not dates. The thesis: ship correctness in a modular monolith, then extract services with evidence. The extraction is the portfolio centerpiece (ADR-0001).
1. Guiding principles
- Correctness before distribution. Sync, the commit protocol, and tenant isolation must be right before anything is split. A wrong monolith is easier to fix than a wrong distributed system.
- Each phase ships something usable. No phase is pure plumbing.
- Every split has a forcing function (05 §5) and an ADR.
- Observability precedes scale. You cannot extract safely what you cannot trace.
- Demonstrate, then document. Each phase produces an artifact (trace, load test, chaos result) that proves it works — that is the portfolio.
2. The phases
```mermaid
flowchart LR
classDef p0 fill:#e5e7eb,stroke:#6b7280,color:#111827;
classDef p1 fill:#fde68a,stroke:#b45309,color:#111827;
classDef p2 fill:#fed7aa,stroke:#c2410c,color:#111827;
classDef p3 fill:#bfdbfe,stroke:#1d4ed8,color:#111827;
classDef p4 fill:#bbf7d0,stroke:#15803d,color:#111827;
P0["P0 · Walking Skeleton<br/>1 binary, PG+MinIO<br/>upload/download + OTel"]:::p0
P1["P1 · Core Product<br/>namespace, versions, sharing,<br/>commit protocol, RLS, CLI"]:::p1
P2["P2 · Sync<br/>change journal, deltas,<br/>conflict copies, web app"]:::p2
P3["P3 · Async plane<br/>NATS+outbox, search,<br/>notifications, GC worker"]:::p3
P4["P4 · Extraction & Scale<br/>split workers/sync, Helm full,<br/>HPA, load+chaos proof"]:::p4
P5["P5 · Breadth<br/>more storage adapters,<br/>previews, mobile, multi-region*"]:::p4
P0 --> P1 --> P2 --> P3 --> P4 --> P5
```
P0 — Walking Skeleton (prove the spine)
- Goal: thinnest end-to-end path: authenticate → upload (presigned) → commit → download, in bitvaultd, on Postgres + MinIO.
- Build: internal/platform/* (config, db, server, OTel from here), Identity (minimal), Storage (MinIO adapter only), File & Metadata (commit protocol), Gateway (REST↔gRPC).
- Proves: the dual-write defense (R2) and that one upload = one trace (NFR-7).
- Deps tier: lite (Postgres + MinIO).
- Artifact: a trace screenshot of the upload commit flow; a chaos test killing the process between PUT and commit, showing GC reclaims the orphan.
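The P0 chaos artifact can be pictured with a small in-memory model. This is a minimal sketch, not the project's commit protocol: the types `blobStore` and `metaStore`, the functions `commit` and `sweepOrphans`, and the one-hour grace period are all hypothetical stand-ins for MinIO, Postgres, and the GC step. It only illustrates the idea that metadata is the source of truth and that a blob with no committed reference is reclaimable.

```go
package main

import (
	"fmt"
	"time"
)

// blobStore stands in for the object store (MinIO in P0). Hypothetical type.
type blobStore struct {
	objects map[string]time.Time // blob key -> time the PUT completed
}

// metaStore stands in for Postgres: a file exists only if it is recorded here.
type metaStore struct {
	committed map[string]bool // blob key -> referenced by a committed row
}

// put simulates the presigned PUT: bytes land, metadata is untouched.
func (b *blobStore) put(key string, now time.Time) { b.objects[key] = now }

// commit records the metadata reference; it fails if the blob never arrived.
func commit(b *blobStore, m *metaStore, key string) error {
	if _, ok := b.objects[key]; !ok {
		return fmt.Errorf("commit %q: blob not found", key)
	}
	m.committed[key] = true
	return nil
}

// sweepOrphans deletes blobs older than grace that no committed row
// references, and returns the keys it removed.
func sweepOrphans(b *blobStore, m *metaStore, now time.Time, grace time.Duration) []string {
	var removed []string
	for key, putAt := range b.objects {
		if !m.committed[key] && now.Sub(putAt) > grace {
			delete(b.objects, key)
			removed = append(removed, key)
		}
	}
	return removed
}

// runCrashScenario uploads two blobs, commits only the first (the process
// "dies" before the second commit), then sweeps two hours later.
func runCrashScenario() (removed []string, remaining int) {
	t0 := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	blobs := &blobStore{objects: map[string]time.Time{}}
	meta := &metaStore{committed: map[string]bool{}}

	blobs.put("blob-a", t0)
	if err := commit(blobs, meta, "blob-a"); err != nil {
		panic(err)
	}
	blobs.put("blob-b", t0) // crash between PUT and commit: no metadata row

	removed = sweepOrphans(blobs, meta, t0.Add(2*time.Hour), time.Hour)
	return removed, len(blobs.objects)
}

func main() {
	removed, remaining := runCrashScenario()
	fmt.Println("reclaimed:", removed, "remaining blobs:", remaining)
}
```

The grace period matters in this sketch because an upload that is still in flight also has no committed reference; sweeping only blobs older than the grace window avoids racing a slow but healthy client.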
P1 — Core Product (make it genuinely useful)
- Goal: real file management for one tenant, multi-user.
- Build: folders/move/copy/rename, versioning, trash, RLS multi-tenancy (ADR-0007), Sharing (internal + public links), RBAC, Go CLI, Postgres-FTS name/metadata search.
- Proves: tenant isolation invariant (I3) via a cross-tenant access test; namespace ops are byte-free (I5).
- Deps tier: lite/standard.
- Artifact: CLI demo; the isolation test in CI.
P2 — Synchronization (the headline)
- Goal: correct multi-device sync.
- Build: change journal, device cursors, delta pull/push, conflict = conflicted copy (ADR-0008), Next.js web app consuming the REST API.
- Proves: the conflict harness (two offline devices edit same file → both versions survive, no silent loss) — the single most important test in the project.
- Deps tier: standard.
- Artifact: the conflict-resolution test + a web UI walkthrough.
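The conflict harness scenario can be sketched in a few lines. This is an illustrative model only, assuming each push carries the version the device last saw; the `server` type, the `push` signature, and the "(conflicted copy from …)" naming are hypothetical and not taken from ADR-0008.

```go
package main

import "fmt"

// fileVersion is the server's head state for one path. Hypothetical type.
type fileVersion struct {
	version int
	content string
}

type server struct {
	files map[string]fileVersion
}

// push applies an edit that a device made against baseVersion and returns
// the path the content was written to. A stale base never overwrites the
// head: the content is stored next to it as a conflicted copy.
func (s *server) push(path, device string, baseVersion int, content string) string {
	head, exists := s.files[path]
	if !exists || head.version == baseVersion {
		// New file or fast-forward edit: advance the head.
		s.files[path] = fileVersion{version: head.version + 1, content: content}
		return path
	}
	// The device edited an older version: keep both.
	copyPath := fmt.Sprintf("%s (conflicted copy from %s)", path, device)
	s.files[copyPath] = fileVersion{version: 1, content: content}
	return copyPath
}

// twoOfflineEdits replays the harness scenario: both devices start from
// version 1, edit offline, and push one after the other.
func twoOfflineEdits() map[string]string {
	s := &server{files: map[string]fileVersion{}}
	s.push("notes.txt", "laptop", 0, "v1")          // creates version 1
	s.push("notes.txt", "laptop", 1, "laptop edit") // base matches head: version 2
	s.push("notes.txt", "phone", 1, "phone edit")   // base is stale: conflicted copy

	out := map[string]string{}
	for path, f := range s.files {
		out[path] = f.content
	}
	return out
}

func main() {
	files := twoOfflineEdits()
	fmt.Println(files["notes.txt"])
	fmt.Println(files["notes.txt (conflicted copy from phone)"])
}
```

Both edits survive under different paths, which is the "no silent loss" property the harness asserts. A real implementation would also need unique copy names when the same device conflicts more than once; the sketch ignores that.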
Note: P2’s journal is still fed in-process from the File context; NATS is not required yet. The event bus interface (internal/platform/bus) is in place from P0 with an in-proc implementation, so P3 is a swap, not a rewrite.
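A minimal sketch of what "a swap, not a rewrite" could look like follows. The `Bus` interface, `Event` fields, `InProcBus` type, and the subject name `file.committed` are assumptions for illustration; the actual shape of internal/platform/bus is not specified in this document.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// Event is a hypothetical envelope; the real bus package may differ.
type Event struct {
	ID      string
	Subject string
	Payload []byte
}

// Handler processes one event; a non-nil error reports a failed delivery.
type Handler func(ctx context.Context, e Event) error

// Bus is the seam callers depend on. A later NATS-backed implementation
// would satisfy the same interface.
type Bus interface {
	Publish(ctx context.Context, e Event) error
	Subscribe(subject string, h Handler)
}

// InProcBus delivers events synchronously inside the process.
type InProcBus struct {
	mu       sync.RWMutex
	handlers map[string][]Handler
}

// Compile-time check that InProcBus implements Bus.
var _ Bus = (*InProcBus)(nil)

func NewInProcBus() *InProcBus {
	return &InProcBus{handlers: map[string][]Handler{}}
}

func (b *InProcBus) Subscribe(subject string, h Handler) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.handlers[subject] = append(b.handlers[subject], h)
}

func (b *InProcBus) Publish(ctx context.Context, e Event) error {
	// Copy the handler slice so handlers run without holding the lock.
	b.mu.RLock()
	hs := append([]Handler(nil), b.handlers[e.Subject]...)
	b.mu.RUnlock()

	for _, h := range hs {
		if err := h(ctx, e); err != nil {
			return err
		}
	}
	return nil
}

// publishAndCollect subscribes to subject on any Bus, publishes one event
// per ID, and returns the IDs the subscriber observed.
func publishAndCollect(b Bus, subject string, ids ...string) []string {
	var got []string
	b.Subscribe(subject, func(_ context.Context, e Event) error {
		got = append(got, e.ID)
		return nil
	})
	for _, id := range ids {
		if err := b.Publish(context.Background(), Event{ID: id, Subject: subject}); err != nil {
			panic(err)
		}
	}
	return got
}

func main() {
	fmt.Println(publishAndCollect(NewInProcBus(), "file.committed", "e1", "e2"))
}
```

Because `publishAndCollect` takes the interface rather than the concrete type, the same caller code would run unchanged against a different implementation, which is the property the note relies on.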
P3 — The async derivation plane (add the event-driven story)
- Goal: introduce real eventing where it earns its place.
- Build: NATS JetStream behind the existing bus interface, transactional outbox drainer (ADR-0006), OpenSearch content search (ADR-0009), notifications/webhooks, GC/finalizer worker, usage metering, audit sink. Idempotent consumers + DLQs.
- Proves: at-least-once with idempotency; index is rebuildable (I6); derived stores never block the control plane.
- Deps tier: full.
- Artifact: an end-to-end trace spanning REST→gRPC→NATS→indexer; a “rebuild the search index from the journal” demo.
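At-least-once delivery means a consumer can see the same event more than once, so the handler has to make redelivery harmless. The sketch below is one common shape for that, under stated assumptions: the `consumer` type, the `maxAttempts` policy, and the in-memory `seen` set are hypothetical, and a real consumer would persist the dedupe record together with its side effect.

```go
package main

import (
	"errors"
	"fmt"
)

type event struct {
	ID   string
	Body string
}

// consumer deduplicates by event ID and parks poison messages in a DLQ.
type consumer struct {
	seen        map[string]bool // IDs already handled (in memory for the sketch)
	attempts    map[string]int  // failed attempts per event ID
	maxAttempts int
	applied     []string // IDs whose side effect ran, in order
	dlq         []event
	apply       func(event) error
}

func newConsumer(maxAttempts int) *consumer {
	c := &consumer{
		seen:        map[string]bool{},
		attempts:    map[string]int{},
		maxAttempts: maxAttempts,
	}
	// The side effect stands in for "update the search index".
	c.apply = func(e event) error {
		if e.Body == "poison" {
			return errors.New("cannot index event")
		}
		c.applied = append(c.applied, e.ID)
		return nil
	}
	return c
}

// handle returns true when the delivery can be acknowledged and false when
// the broker should redeliver it.
func (c *consumer) handle(e event) bool {
	if c.seen[e.ID] {
		return true // duplicate delivery: ack without re-applying
	}
	if err := c.apply(e); err != nil {
		c.attempts[e.ID]++
		if c.attempts[e.ID] >= c.maxAttempts {
			c.dlq = append(c.dlq, e)
			c.seen[e.ID] = true // parked: stop further redelivery
			return true
		}
		return false
	}
	c.seen[e.ID] = true
	return true
}

// runDeliveries feeds a duplicate and a poison message through the consumer.
func runDeliveries() (applied []string, deadLettered int) {
	c := newConsumer(3)
	deliveries := []event{
		{ID: "e1", Body: "ok"},
		{ID: "e1", Body: "ok"}, // redelivered duplicate
		{ID: "p1", Body: "poison"},
		{ID: "p1", Body: "poison"},
		{ID: "p1", Body: "poison"}, // third failure: goes to the DLQ
		{ID: "e2", Body: "ok"},
	}
	for _, e := range deliveries {
		c.handle(e)
	}
	return c.applied, len(c.dlq)
}

func main() {
	applied, dead := runDeliveries()
	fmt.Println("applied:", applied, "dead-lettered:", dead)
}
```

The duplicate `e1` is applied once, and the poison message stops blocking the stream after three failures, so `e2` is still processed.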
P4 — Extraction & Scale (the principal-grade demonstration)
- Goal: turn modules into services with evidence, and prove horizontal scale.
- Build: extract the workers (indexer/notifier/GC) and Sync into their own cmd/* binaries + deployments (forcing functions from 05 §5); Helm full profile, HPA, PDB, NetworkPolicies, mTLS between services; load tests validating NFR-3/4 SLOs and chaos tests for partial failure.
- Proves: data plane scales independently of control plane (R5); extraction changed deployment, not callers (because they already used the gRPC API).
- Artifact: before/after architecture + the load/chaos results — this is the story a senior reviewer wants to see.
P5 — Breadth (only after depth)
- Goal: widen, now that the core is proven.
- Build: additional storage adapters (S3 → R2 → GCS → Azure, each passing the conformance suite, ADR-0005), previews/thumbnails worker, React Native mobile app on the now-stable API, optional multi-region groundwork (NG9 — only if pursued).
- Deps tier: full.
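One way to read "each passing the conformance suite" is a single set of contract checks that every adapter must satisfy. The sketch below assumes a deliberately small `Adapter` interface and a handful of checks; the interface, the `memAdapter` reference implementation, and the check names are hypothetical and are not the contract defined in ADR-0005.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

var errNotFound = errors.New("object not found")

// Adapter is a hypothetical minimal storage contract.
type Adapter interface {
	Put(key string, data []byte) error
	Get(key string) ([]byte, error)
	Delete(key string) error
}

// memAdapter is an in-memory reference implementation for the sketch.
type memAdapter struct{ objects map[string][]byte }

func newMemAdapter() *memAdapter {
	return &memAdapter{objects: map[string][]byte{}}
}

func (m *memAdapter) Put(key string, data []byte) error {
	m.objects[key] = append([]byte(nil), data...) // copy so callers cannot mutate stored bytes
	return nil
}

func (m *memAdapter) Get(key string) ([]byte, error) {
	d, ok := m.objects[key]
	if !ok {
		return nil, errNotFound
	}
	return append([]byte(nil), d...), nil
}

func (m *memAdapter) Delete(key string) error {
	delete(m.objects, key)
	return nil
}

// checkConformance runs the same contract checks against any adapter and
// returns the names of the checks that failed.
func checkConformance(a Adapter) []string {
	var failures []string
	payload := []byte("hello")

	if err := a.Put("k", payload); err != nil {
		failures = append(failures, "put")
	}
	if got, err := a.Get("k"); err != nil || !bytes.Equal(got, payload) {
		failures = append(failures, "read-after-write")
	}
	if err := a.Delete("k"); err != nil {
		failures = append(failures, "delete")
	}
	if _, err := a.Get("k"); !errors.Is(err, errNotFound) {
		failures = append(failures, "get-after-delete")
	}
	if err := a.Delete("k"); err != nil {
		failures = append(failures, "delete-is-idempotent")
	}
	return failures
}

func main() {
	fmt.Println("failed checks:", len(checkConformance(newMemAdapter())))
}
```

The design point is that the checks take the interface, so adding an adapter means passing the existing suite rather than writing new tests per backend.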
3. What is intentionally deferred at each gate
| Until you have… | Don’t build… |
|---|---|
| a working commit protocol (P0) | NATS, OpenSearch, multiple services |
| correct sync (P2) | previews, mobile, extra adapters |
| an event backbone + outbox (P3) | extracted services |
| traces + load tests (P4) | multi-region, service mesh, operators-for-everything |
| a stable public API (P4) | the mobile app |
This table is the antidote to the Overengineering Ledger (01 §3): each row is a forcing function that unlocks the next layer of complexity.
4. Definition of done for the project-as-portfolio
The project is “complete enough to demonstrate principal-level work” when:
- ✅ Sync conflict harness passes (no silent data loss).
- ✅ Dual-write chaos test passes (no orphans/dangling refs).
- ✅ Cross-tenant isolation test passes at the DB layer.
- ✅ One upload renders as one distributed trace across ≥3 components.
- ✅ Load test shows control-plane horizontal scaling + data-plane independence.
- ✅ At least one module was extracted to a service with a documented forcing function and the change touched deployment, not callers.
- ✅ Every major decision has an ADR; every ADR has consequences honestly stated.
Hitting these matters far more than the count of microservices or storage adapters.