01 — Project Critique, Architectural Risks & Overengineering Review

Status: Draft for review · Audience: founding team · Author: Principal review

Covers tasks 1–3: critique the idea, architectural risks, overengineering risks.

This is a pre-development design review. It is deliberately blunt. The goal is to spend disagreement now — on documents — instead of later, on rewrites.


1. Critique of the Project Idea

1.1 What BitVault actually is

Stripped of the buzzwords, BitVault is four products wearing one name:

  1. File storage — durable bytes, quotas, lifecycle (this is the “easy” part; object stores already solve it).
  2. File synchronization — multi-device, offline-tolerant, conflict-aware reconciliation (this is the hard part).
  3. Sharing & collaboration — links, permissions, expiry, external recipients.
  4. Content management — vague, and the most dangerous word in the brief.

Each of (1)–(3) is a serious system. Dropbox, Box, and Nextcloud are each the output of large teams over many years. Building a credible, deep version of even two of these is an ambitious portfolio project. Building a shallow version of all four is a worse outcome than building one well.

1.2 The honest verdict

As a product: unfocused and over-scoped. “Storage + sync + sharing + CMS, multi-tenant, self-hostable, SaaS, multi-cloud” is not a wedge; it is a wish list. There is no sharp answer to “who is the first user and what one job do we do better than their current tool?”

As a portfolio / platform-engineering showcase: excellent — if disciplined. The problem space legitimately exercises distributed systems, storage, eventing, multi-tenancy, K8s, and observability. The breadth that makes it a weak product makes it a strong learning vehicle. The entire risk is whether the execution is deep or a mile-wide-inch-deep demo.

1.3 The “Content Management” trap (resolve this first)

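“Content management platform” is doing too much work in the brief. It could mean two very different things: a web CMS (authoring and publishing pages and sites), or lightweight digital asset management (DAM-lite) over files that are already stored.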

Decision required: BitVault’s “content management” = DAM-lite over stored files (tags, metadata, versions, previews, search), not a web CMS. The rest of this review assumes that interpretation.

1.4 The sync question is the whole game

Sync is where real distributed-systems credibility lives, and it is the single feature most clones fake. “Upload + download with a Sync button” is not sync. Real sync means: a change feed, content identity, delta transfer, offline edits, and a defensible conflict policy that never silently loses data. If BitVault nails sync and is honest about everything else, it is a strong project. If it treats sync as an afterthought, it is one of a thousand S3 wrappers.

Strategic recommendation: make sync correctness the headline competency. Everything else (multi-cloud, K8s, eventing) is supporting cast for that story.


2. Architectural Risks

Ranked by “how badly does this hurt if we get it wrong.”

R1 — Sync correctness & conflict resolution (SEVERITY: critical)

Multi-device offline edits create concurrent divergent histories. The failure mode is silent data loss (last-writer-wins clobbering a user’s work) — the most trust-destroying bug a storage product can ship.
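
A minimal sketch of a conflict policy that avoids last-writer-wins, assuming the server keeps a per-file version counter and each device reports the version it last synced. The type and function names (FileState, Edit, Resolve) and the conflicted-copy naming scheme are hypothetical, not part of the brief.

```go
package main

import "fmt"

// FileState is the server's current head for one path (hypothetical shape).
type FileState struct {
	Version     int64  // incremented on every accepted write
	ContentHash string // content identity, e.g. a hex SHA-256 of the bytes
}

// Edit is a change uploaded by a device that last synced at BaseVersion.
type Edit struct {
	Path        string
	BaseVersion int64
	ContentHash string
	Device      string
}

// Outcome reports what Resolve decided.
type Outcome int

const (
	Applied        Outcome = iota // edit became the new head
	NoOp                          // same content as head, nothing to store
	ConflictedCopy                // edit stored under a sibling name, head untouched
)

// Resolve decides how to store an edit. It returns the new head, the outcome,
// and the path the edit's bytes should be stored under. No branch overwrites
// bytes that the editing device had not seen.
func Resolve(head FileState, e Edit) (FileState, Outcome, string) {
	switch {
	case e.ContentHash == head.ContentHash:
		return head, NoOp, e.Path
	case e.BaseVersion == head.Version:
		next := FileState{Version: head.Version + 1, ContentHash: e.ContentHash}
		return next, Applied, e.Path
	default:
		// The device edited a stale version: keep both sides.
		name := fmt.Sprintf("%s (conflicted copy from %s)", e.Path, e.Device)
		return head, ConflictedCopy, name
	}
}
```

The property worth testing is the default branch: an edit based on a stale version never replaces the head, it lands next to it.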

R2 — The dual-write problem: object store and metadata DB (SEVERITY: critical)

A file is two writes: the bytes (object storage) and the truth (Postgres row). If they diverge you get orphaned blobs (waste) or dangling references (404s / corruption). Naive PutObject + INSERT is not atomic.
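
One way to order the two writes, sketched with in-memory stand-ins: write the bytes first, then record the file row and an outbox event in a single transaction. A failure can then only leave an unreferenced blob, never a row pointing at missing bytes. MemDB, Tx, CommitUpload, and Orphans are hypothetical names; a real version would use Postgres transactions and an object-store client.

```go
package main

import (
	"errors"
	"sort"
)

type FileRow struct{ Path, BlobKey string }
type OutboxEvent struct{ Type, BlobKey string }

// Tx is the slice of a database transaction this sketch needs (hypothetical).
type Tx interface {
	InsertFile(FileRow) error
	InsertOutbox(OutboxEvent) error
}

// CommitUpload writes the bytes first, then records the file row and its
// outbox event in one transaction, so metadata never points at missing bytes.
func CommitUpload(blobs map[string][]byte, db *MemDB, path, key string, data []byte) error {
	blobs[key] = data // step 1: bytes; a crash after this leaves only an orphan
	return db.InTx(func(tx Tx) error { // step 2: row and event commit together
		if err := tx.InsertFile(FileRow{Path: path, BlobKey: key}); err != nil {
			return err
		}
		return tx.InsertOutbox(OutboxEvent{Type: "file.committed", BlobKey: key})
	})
}

// MemDB is an in-memory stand-in for Postgres, only to exercise the flow.
type MemDB struct {
	Files      []FileRow
	Outbox     []OutboxEvent
	FailInsert bool // simulates a failed or rolled-back transaction
}

type memTx struct {
	files  []FileRow
	outbox []OutboxEvent
	fail   bool
}

func (t *memTx) InsertFile(r FileRow) error {
	if t.fail {
		return errors.New("insert failed")
	}
	t.files = append(t.files, r)
	return nil
}

func (t *memTx) InsertOutbox(e OutboxEvent) error {
	t.outbox = append(t.outbox, e)
	return nil
}

// InTx stages writes and applies them only if fn returns nil.
func (d *MemDB) InTx(fn func(Tx) error) error {
	tx := &memTx{fail: d.FailInsert}
	if err := fn(tx); err != nil {
		return err // rollback: staged writes are dropped
	}
	d.Files = append(d.Files, tx.files...)
	d.Outbox = append(d.Outbox, tx.outbox...)
	return nil
}

// Orphans lists blob keys that no file row references: GC candidates.
func Orphans(blobs map[string][]byte, db *MemDB) []string {
	referenced := map[string]bool{}
	for _, f := range db.Files {
		referenced[f.BlobKey] = true
	}
	var out []string
	for k := range blobs {
		if !referenced[k] {
			out = append(out, k)
		}
	}
	sort.Strings(out)
	return out
}
```

A real sweeper would also need a grace period so it does not delete the blob of an upload whose transaction is still in flight.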

R3 — Storage abstraction leakiness (SEVERITY: high)

MinIO, S3, R2, GCS, and Azure Blob differ in consistency, multipart semantics, presigned-URL capabilities, conditional writes, and error taxonomies. A naive “one interface for all” either leaks provider quirks or collapses to a useless lowest-common-denominator.
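
A capability-flagged interface is one way out of that dilemma: keep a small portable core and make every optional feature something the caller must ask for. The sketch below is illustrative only. The flag names, the Store interface, and MemStore are assumptions, and it makes no claim about which real provider supports which capability.

```go
package main

import (
	"context"
	"errors"
	"io"
)

// Capability is a bit flag describing an optional provider feature.
type Capability uint8

const (
	CapPresignedPut     Capability = 1 << iota // client can upload directly via a signed URL
	CapMultipart                               // provider supports multipart uploads
	CapConditionalWrite                        // provider supports write-if-unchanged
)

// Store is the portable core every adapter must implement.
type Store interface {
	Capabilities() Capability
	Put(ctx context.Context, key string, r io.Reader) error
}

// ErrUnsupported is returned instead of silently emulating a missing feature.
var ErrUnsupported = errors.New("storage: capability not supported by this provider")

// Require reports whether s offers every capability in want.
func Require(s Store, want Capability) error {
	if s.Capabilities()&want != want {
		return ErrUnsupported
	}
	return nil
}

// MemStore is a toy adapter used only to exercise the interface.
type MemStore struct {
	Caps Capability
	Data map[string][]byte
}

func (m *MemStore) Capabilities() Capability { return m.Caps }

func (m *MemStore) Put(_ context.Context, key string, r io.Reader) error {
	b, err := io.ReadAll(r)
	if err != nil {
		return err
	}
	m.Data[key] = b
	return nil
}
```

Callers check Require at startup or before choosing a code path, so a missing feature surfaces as an explicit error rather than as a provider quirk at runtime.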

R4 — Multi-tenant isolation (SEVERITY: high — it’s a security boundary)

Logical (shared-DB) multi-tenancy is one forgotten WHERE tenant_id = ? away from a cross-tenant data breach. App-layer scoping alone is insufficient.
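
A sketch of database-enforced scoping using Postgres row-level security, assuming a table named files with a tenant_id uuid column and a custom setting named app.tenant_id (both names are illustrative). The Go side only has to set the tenant once per transaction. Execer is a hypothetical minimal interface; a thin adapter over *sql.Tx would be needed because its Exec also returns a result.

```go
package main

import "errors"

// PolicySQL is a one-time migration. FORCE makes the policy apply to the
// table owner as well; superusers and BYPASSRLS roles still bypass it, so
// the application should connect as an ordinary role.
const PolicySQL = `
ALTER TABLE files ENABLE ROW LEVEL SECURITY;
ALTER TABLE files FORCE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON files
    USING (tenant_id = current_setting('app.tenant_id')::uuid);
`

// ScopeSQL sets the tenant for the current transaction only: the third
// argument (is_local = true) makes the value revert when the transaction ends.
const ScopeSQL = `SELECT set_config('app.tenant_id', $1, true)`

// Execer is the minimal statement runner this sketch needs (hypothetical).
type Execer interface {
	Exec(query string, args ...any) error
}

var ErrNoTenant = errors.New("tenant: missing tenant id")

// ScopeToTenant must be the first statement of every transaction.
func ScopeToTenant(tx Execer, tenantID string) error {
	if tenantID == "" {
		return ErrNoTenant // refuse to run unscoped
	}
	return tx.Exec(ScopeSQL, tenantID)
}

// Recorder is a fake Execer that records statements, for tests only.
type Recorder struct {
	Queries []string
	Args    [][]any
}

func (r *Recorder) Exec(q string, args ...any) error {
	r.Queries = append(r.Queries, q)
	r.Args = append(r.Args, args)
	return nil
}
```

With the policy in place, a query that forgets its WHERE clause returns only the current tenant's rows, because the filter lives in the database rather than in each handler.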

R5 — Data-plane scaling & the presigned-URL authz gap (SEVERITY: high)

Proxying file bytes through Go services is a memory/CPU/egress disaster at scale. The fix — direct-to-storage transfer via presigned URLs — bypasses your authz and audit layer.
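
The gap can be narrowed by doing authorization and audit at the moment the URL is minted, and by scoping the signature to one tenant, one object, one method, and a short deadline. The sketch below uses a generic HMAC token as a stand-in; it is not any provider's presigning scheme, and Grant, Mint, and Verify are hypothetical names.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// Grant is exactly what a transfer URL allows: one tenant, one object, one
// HTTP method, until one deadline.
type Grant struct {
	TenantID    string
	ObjectKey   string
	Method      string // "GET" or "PUT"
	ExpiresUnix int64
}

// sign computes an HMAC-SHA256 over every field of the grant.
func sign(secret []byte, g Grant) string {
	mac := hmac.New(sha256.New, secret)
	fmt.Fprintf(mac, "%s\n%s\n%s\n%d", g.TenantID, g.ObjectKey, g.Method, g.ExpiresUnix)
	return hex.EncodeToString(mac.Sum(nil))
}

var ErrDenied = errors.New("authz: denied")

// Mint runs the authorization check and writes the audit record before any
// signature exists, because the later transfer never touches the API.
func Mint(secret []byte, g Grant, allowed func(Grant) bool, audit func(Grant)) (string, error) {
	if !allowed(g) {
		return "", ErrDenied
	}
	audit(g)
	return sign(secret, g), nil
}

// Verify checks expiry and that the signature covers this exact grant.
func Verify(secret []byte, g Grant, sig string, nowUnix int64) bool {
	if nowUnix > g.ExpiresUnix {
		return false
	}
	return hmac.Equal([]byte(sign(secret, g)), []byte(sig))
}
```

The audit record here captures intent (a URL was issued), not completion. Confirming that the transfer actually happened needs a separate signal, such as a commit call from the client or a storage event.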

R6 — Event-driven complexity & eventual consistency (SEVERITY: medium-high)

NATS JetStream delivers at-least-once by default. Without discipline you get duplicate processing, out-of-order effects, lost events, and undebuggable flows. Eventual consistency also leaks into the UI (for example, a file is uploaded but not yet searchable).
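
The usual discipline for redelivered messages is an idempotent consumer: every event carries a stable ID, and the handler skips IDs it has already applied. A minimal in-memory sketch follows; Event, Consumer, and Handle are hypothetical names, and a real consumer would persist the seen-ID record in the same database transaction as the side effect.

```go
package main

// Event carries a producer-assigned ID that stays the same across redeliveries.
type Event struct {
	ID     string
	FileID string
}

// Consumer remembers which event IDs it has already applied.
type Consumer struct {
	seen map[string]bool
}

func NewConsumer() *Consumer { return &Consumer{seen: map[string]bool{}} }

// Handle runs apply for each event ID until it succeeds once, then ignores
// further deliveries of that ID.
func (c *Consumer) Handle(e Event, apply func(Event) error) error {
	if c.seen[e.ID] {
		return nil // duplicate delivery: acknowledge without re-applying
	}
	if err := apply(e); err != nil {
		return err // not marked as seen, so a redelivery will retry
	}
	c.seen[e.ID] = true
	return nil
}
```

This covers duplicates only. Out-of-order effects need a separate guard, for example comparing a per-file version before applying.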

R7 — Search index drift (SEVERITY: medium)

OpenSearch is a derived store; it will drift from Postgres (lag, failures, reindexes).
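
Drift is manageable if it is measured. A periodic reconciliation job can compare the version of each file in the source of truth with the version recorded in the index and repair the difference. The sketch below assumes both sides can be read as file ID to version maps; Drift is a hypothetical helper, and a real job would page through both stores rather than load them whole.

```go
package main

import "sort"

// Drift compares the source of truth (file ID -> current version) with the
// derived index (file ID -> indexed version). It returns the IDs that must be
// reindexed (missing or stale) and the IDs that must be removed from the index.
func Drift(truth, index map[string]int64) (reindex, remove []string) {
	for id, v := range truth {
		if iv, ok := index[id]; !ok || iv != v {
			reindex = append(reindex, id)
		}
	}
	for id := range index {
		if _, ok := truth[id]; !ok {
			remove = append(remove, id)
		}
	}
	sort.Strings(reindex) // stable output for logs and tests
	sort.Strings(remove)
	return reindex, remove
}
```

The lengths of the two result slices double as a drift metric that can be exported and alerted on.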

R8 — Dual API surface drift: gRPC internal + REST external (SEVERITY: medium)

Two contracts for the same operations invite divergence and double maintenance.

R9 — Large-file & resumable uploads (SEVERITY: medium)

Naive single-request uploads fail on flaky networks and on large files, and buffering whole files in the service creates memory pressure.
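
Resumability comes down to bookkeeping: split the file into fixed-size parts, record which parts are confirmed, and let a reconnecting client ask what is still missing. A minimal sketch of that bookkeeping follows, assuming a positive part size. Upload and its methods are hypothetical names; the transfer of each part is out of scope here.

```go
package main

// Upload tracks which fixed-size parts of a file the server has confirmed.
type Upload struct {
	Size     int64
	PartSize int64 // assumed to be greater than zero
	done     map[int]bool
}

func NewUpload(size, partSize int64) *Upload {
	return &Upload{Size: size, PartSize: partSize, done: map[int]bool{}}
}

// NumParts is the ceiling of Size / PartSize.
func (u *Upload) NumParts() int {
	return int((u.Size + u.PartSize - 1) / u.PartSize)
}

// PartRange returns the byte offset and length of part i (0-based).
// The last part may be shorter than PartSize.
func (u *Upload) PartRange(i int) (offset, length int64) {
	offset = int64(i) * u.PartSize
	length = u.PartSize
	if rest := u.Size - offset; rest < length {
		length = rest
	}
	return offset, length
}

// MarkDone records that part i was stored and verified.
func (u *Upload) MarkDone(i int) { u.done[i] = true }

// Missing lists the parts a client still has to send after a reconnect.
func (u *Upload) Missing() []int {
	var out []int
	for i := 0; i < u.NumParts(); i++ {
		if !u.done[i] {
			out = append(out, i)
		}
	}
	return out
}
```

Because each part is addressed by offset and length, neither side has to hold more than one part in memory at a time.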

R10 — Operational debuggability (SEVERITY: medium, compounding)

A distributed, event-driven system without correlation IDs and traces that follow a request across services and events is impossible to debug.
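
The core mechanic is small: attach an ID at the entry point, carry it in the context, copy it into every event, and restore it on the consumer side. The stdlib-only sketch below is a stand-in for what a tracing library's context propagation would do; all names in it are hypothetical.

```go
package main

import "context"

// correlationKey is an unexported key type, so other packages cannot collide
// with or overwrite the value.
type correlationKey struct{}

// WithCorrelationID returns a context that carries id.
func WithCorrelationID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, correlationKey{}, id)
}

// CorrelationID returns the ID stored in ctx, or "" if none was set.
func CorrelationID(ctx context.Context) string {
	id, _ := ctx.Value(correlationKey{}).(string)
	return id
}

// NewRequestContext is what an HTTP or gRPC entry point would call once per
// incoming request.
func NewRequestContext(id string) context.Context {
	return WithCorrelationID(context.Background(), id)
}

// Envelope is the wrapper every published event travels in.
type Envelope struct {
	CorrelationID string
	Type          string
}

// NewEnvelope copies the ID out of the context on the producer side.
func NewEnvelope(ctx context.Context, typ string) Envelope {
	return Envelope{CorrelationID: CorrelationID(ctx), Type: typ}
}

// ContextFor restores the ID on the consumer side, so logs written while
// handling the event can be joined to the original request.
func ContextFor(e Envelope) context.Context {
	return NewRequestContext(e.CorrelationID)
}
```

Every log line and span on both sides of the event bus can then be filtered by one value.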

R11 — Secrets & encryption key management (SEVERITY: medium, high if mishandled)

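Per-tenant encryption keys, provider credentials, and signing keys all need secure storage, scoping, and rotation. A leaked key exposes tenant data; a lost key makes it unrecoverable.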

R12 — Cost / egress (SEVERITY: low now, high at scale)

Multi-cloud egress fees; proxying data multiplies cost.


3. Overengineering Risks

The whole brief reads as premature optimization for scale you do not have. This is the most common way ambitious portfolio projects die: all energy goes into infrastructure, and the actual file-sync product never ships. A Principal’s job is to delete complexity until it hurts, then add back only what a forcing function demands.

The Overengineering Ledger

| Requirement in brief | The risk | v1 prescription | Add complexity when… |
|---|---|---|---|
| Microservices from day one | Distributed monolith: network calls, partial failure, and ops overhead with zero organizational scaling benefit | Modular monolith, one binary (bitvaultd), modules with clean interfaces along bounded-context lines | A module has a different scaling/deploy profile (e.g. indexer, thumbnailer) or a team owns it independently |
| NATS JetStream everywhere | Async indirection for what are in-process function calls | In-process event bus + transactional outbox; outbox can publish to NATS the day you split | You extract a service and need cross-process async |
| 5 storage providers | Building a 5-way abstraction for hypothetical customers | One capability-flagged interface, MinIO/S3 adapter only | A real deployment target needs R2/GCS/Azure |
| OpenSearch | A JVM cluster to search filenames | Postgres FTS for name/metadata search | Full-text search inside document contents becomes a committed feature |
| 3 deployment models (self-host, SaaS, multi-tenant) | Conflicting constraints multiply work | One artifact, two packagings: Docker Compose (self-host) + Helm (K8s/SaaS), driven by config profiles | A profile’s needs genuinely diverge |
| Service mesh / operators / CRDs | Platform plumbing for traffic you don’t have | Plain K8s + Ingress + NetworkPolicies | You have multiple services needing mTLS/traffic-shaping |
| CRDTs for sync | Solving real-time co-editing, which is a non-goal | Change journal + conflicted-copies + version history | Real-time collaborative editing becomes a goal |
| Client-side E2E encryption | Crypto complexity; breaks server-side search/preview/dedup | At-rest envelope encryption + TLS in transit | A privacy-tier product requirement exists |
| 6 stateful infra deps | Self-host adoption killer (Nextcloud needs ~2) | Tiered deps: Core = Postgres + object store; Standard = +Redis +NATS; Full = +OpenSearch | The tier’s feature is actually enabled |

3.1 The reframe that makes this a stronger portfolio

There is a real tension: the brief lists microservices, NATS, K8s, and multi-cloud because the point is to demonstrate them. Cutting them to a monolith seems to defeat the purpose. It does not. The resolution:

Microservices is the target architecture. The modular monolith is the starting architecture. The migration between them is the portfolio centerpiece.

To a senior reviewer, “I shipped 12 microservices for an app with no traffic” reads as cargo-culting. “I built a modular monolith with strict bounded-context boundaries, then extracted the search-indexer and storage-worker into services when the async/scaling profile justified it — here is the strangler-fig migration, the outbox that made it safe, and the traces that proved it worked” reads as a Principal. The discipline is the demonstration.

You still get to show every technology on the list. You show them in the order a competent team would actually adopt them, each with a written justification (the ADRs). That narrative is rarer and more valuable than the tech itself.

3.2 The one-sentence guardrail

Every piece of infrastructure must be justified by a demonstrated need (a metric, a load test, a real deployment target) or an explicit portfolio learning goal recorded in an ADR — never by “we might need it.”


4. Summary of Recommendations

  1. Cut the web-CMS. Redefine “content management” as DAM-lite over stored files.
  2. Make sync correctness the headline. It is the hard, credible, differentiating part.
  3. Start as a modular monolith (bitvaultd), bounded-context modules, extractable.
  4. Tier the infrastructure. Postgres + object store is the core; everything else is opt-in.
  5. Ship one storage adapter (MinIO/S3) behind a capability-flagged interface.
  6. Solve dual-write deliberately (commit protocol + outbox + ref-counted GC) from day one — this is correctness, not optimization.
  7. Enforce tenant isolation in the database (RLS), not just app code.
  8. Direct-to-storage data plane via tightly-scoped presigned URLs.
  9. OpenTelemetry from commit #1. Observability is not a later phase here.
  10. Treat the monolith→services extraction as the deliverable, documented via ADRs and traces.

See 09-evolution-roadmap.md for the phased plan that operationalizes these recommendations, and ../adr/ for the decisions.