01 — Project Critique, Architectural Risks & Overengineering Review

Status: Draft for review · Audience: founding team · Author: Principal review
Covers tasks 1–3: critique the idea, architectural risks, overengineering risks.

This is a pre-development design review. It is deliberately blunt. The goal is to spend disagreement now — on documents — instead of later, on rewrites.

1. Critique of the Project Idea

1.1 What BitVault actually is

Stripped of the buzzwords, BitVault is four products wearing one name:

(1) File storage: durable bytes, quotas, lifecycle (this is the “easy” part; object stores already solve it).
(2) File synchronization: multi-device, offline-tolerant, conflict-aware reconciliation (this is the hard part).
(3) Sharing & collaboration: links, permissions, expiry, external recipients.
(4) Content management: vague, and the most dangerous word in the brief.

Each of (1)–(3) is a serious system. Dropbox, Box, and Nextcloud are each the output of large teams over many years. Building a credible, deep version of even two of these is an ambitious portfolio project. Building a shallow version of all four is a worse outcome than building one well.

1.2 The honest verdict

As a product: unfocused and over-scoped. “Storage + sync + sharing + CMS, multi-tenant, self-hostable, SaaS, multi-cloud” is not a wedge; it is a wish list. There is no sharp answer to “who is the first user and what one job do we do better than their current tool?”

As a portfolio / platform-engineering showcase: excellent — if disciplined. The problem space legitimately exercises distributed systems, storage, eventing, multi-tenancy, K8s, and observability. The breadth that makes it a weak product makes it a strong learning vehicle. The entire risk is whether the execution is deep or a mile-wide-inch-deep demo.

1.3 The “Content Management” trap (resolve this first)

“Content management platform” is doing too much work in the brief. There are two very different things it could mean:

A web CMS (structured content types, editorial workflows, publishing, rendering) — e.g. Contentful, Strapi. This is a different product from file sync and would fracture the domain model. Recommendation: cut entirely.
Digital Asset Management (DAM-lite: rich metadata, tags, versioning, previews/thumbnails, content search over the files you already store) — this is a coherent extension of file storage. Recommendation: this is the intended meaning; rename it “Asset & Metadata Management” and treat it as a later phase.

Decision required: BitVault’s “content management” = DAM-lite over stored files (tags, metadata, versions, previews, search), not a web CMS. The rest of this review assumes that interpretation.

1.4 The sync question is the whole game

Sync is where real distributed-systems credibility lives, and it is the single feature most clones fake. “Upload + download with a Sync button” is not sync. Real sync means: a change feed, content identity, delta transfer, offline edits, and a defensible conflict policy that never silently loses data. If BitVault nails sync and is honest about everything else, it is a strong project. If it treats sync as an afterthought, it is one of a thousand S3 wrappers.

Strategic recommendation: make sync correctness the headline competency. Everything else (multi-cloud, K8s, eventing) is supporting cast for that story.

2. Architectural Risks

Ranked by “how badly does this hurt if we get it wrong.”

R1 — Sync correctness & conflict resolution (SEVERITY: critical)

Multi-device offline edits create concurrent divergent histories. The failure mode is silent data loss (last-writer-wins clobbering a user’s work) — the most trust-destroying bug a storage product can ship.

Mitigation: a monotonic per-tenant change journal, content-addressed file identity, and an explicit conflict policy: create conflicted copies and never overwrite, with version history as the safety net. See ADR-0008.
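A minimal sketch of the "conflicted copy, never overwrite" rule, assuming each device reports the journal version it last synced. The type names, field names, and the conflicted-copy naming format are hypothetical and not taken from ADR-0008.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// FileState is the server's current view of one path (hypothetical shape).
type FileState struct {
	Path        string
	ContentHash string // content-addressed identity, e.g. a hex SHA-256
	Version     int64  // position in the per-tenant change journal
}

// Change is an edit a device made against the version it had last synced.
type Change struct {
	BaseVersion int64
	ContentHash string
	Device      string
}

// Outcome tells the caller where the incoming bytes should land.
type Outcome struct {
	Path     string
	Conflict bool
}

// Resolve never overwrites content the device has not seen.
func Resolve(cur FileState, ch Change) Outcome {
	switch {
	case ch.ContentHash == cur.ContentHash:
		// Identical bytes are already stored: nothing diverged.
		return Outcome{Path: cur.Path}
	case ch.BaseVersion == cur.Version:
		// The device edited the latest version: fast-forward in place.
		return Outcome{Path: cur.Path}
	default:
		// The device edited a stale version: keep both copies.
		return Outcome{Path: conflictedName(cur.Path, ch.Device), Conflict: true}
	}
}

// conflictedName inserts a marker before the extension (assumed format).
func conflictedName(p, device string) string {
	ext := path.Ext(p)
	base := strings.TrimSuffix(p, ext)
	return fmt.Sprintf("%s (conflicted copy from %s)%s", base, device, ext)
}
```

In both non-conflict branches the previous content would still be retained in version history, so the rule adds a safety net rather than replacing one.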

R2 — The dual-write problem: object store and metadata DB (SEVERITY: critical)

A file is two writes: the bytes (object storage) and the truth (Postgres row). If they diverge you get orphaned blobs (waste) or dangling references (404s / corruption). Naive PutObject + INSERT is not atomic.

Mitigation: commit protocol — upload to a staging key → server verifies (HEAD/size/ETag/hash) → metadata written in a transaction with a transactional outbox row → background finalizer + reference-counted GC for orphans. Postgres is the source of truth; an object with no metadata row does not exist. See ADR-0004, ADR-0006.
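The verify-then-commit step can be sketched as follows. This is an in-memory stand-in, not the ADR-0004 design: `Store` plays the role of Postgres, the mutex-guarded `Commit` models a single transaction, and every type and field name is an assumption made for illustration.

```go
package main

import (
	"errors"
	"sync"
)

// Declared is what the client claimed when it requested the upload.
type Declared struct {
	FileID string
	Size   int64
	SHA256 string
}

// StagedObject is what the server observes on the staging key (e.g. via HEAD).
type StagedObject struct {
	Key    string
	Size   int64
	SHA256 string
}

type FileRow struct{ FileID, ObjectKey string }
type OutboxRow struct{ EventType, FileID string }

// Store stands in for Postgres; Files and Outbox change only together.
type Store struct {
	mu     sync.Mutex
	Files  map[string]FileRow
	Outbox []OutboxRow
}

func NewStore() *Store { return &Store{Files: make(map[string]FileRow)} }

var ErrMismatch = errors.New("staged object does not match declared size/hash")

// Commit verifies the staged object, then writes the metadata row and the
// outbox row as one unit. On any failure nothing is written, so the staged
// blob stays an orphan for the GC to collect.
func (s *Store) Commit(d Declared, o StagedObject) error {
	if o.Size != d.Size || o.SHA256 != d.SHA256 {
		return ErrMismatch
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, done := s.Files[d.FileID]; done {
		return nil // retried commit: already recorded, stay idempotent
	}
	s.Files[d.FileID] = FileRow{FileID: d.FileID, ObjectKey: o.Key}
	s.Outbox = append(s.Outbox, OutboxRow{EventType: "file.committed", FileID: d.FileID})
	return nil
}
```

With a real database the two writes would share one SQL transaction, and a separate relay would publish outbox rows after commit.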

R3 — Storage abstraction leakiness (SEVERITY: high)

MinIO, S3, R2, GCS, and Azure Blob differ in consistency, multipart semantics, presigned-URL capabilities, conditional writes, and error taxonomies. A naive “one interface for all” either leaks provider quirks or collapses to a useless lowest-common-denominator.

Mitigation: a narrow, capability-flagged interface (Put/Get/Head/Delete, multipart, presign, list); per-provider adapters; a conformance test suite every adapter must pass. Ship one adapter first (MinIO/S3). See ADR-0005.
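One possible shape for a capability-flagged interface is sketched below. The method set is trimmed to the core operations, and the capability names are hypothetical; ADR-0005 may define them differently.

```go
package main

import (
	"context"
	"errors"
	"io"
)

// Capability is a bit flag describing an optional provider feature.
type Capability uint32

const (
	CapMultipart Capability = 1 << iota
	CapPresignPut
	CapConditionalWrite
)

// Has reports whether every flag in want is set.
func (c Capability) Has(want Capability) bool { return c&want == want }

// ObjectInfo is the provider-neutral result of a Head or Put.
type ObjectInfo struct {
	Key  string
	Size int64
	ETag string
}

// ObjectStore is the narrow core every adapter must implement.
// Optional features are advertised through Capabilities, not assumed.
type ObjectStore interface {
	Capabilities() Capability
	Put(ctx context.Context, key string, body io.Reader, size int64) (ObjectInfo, error)
	Get(ctx context.Context, key string) (io.ReadCloser, error)
	Head(ctx context.Context, key string) (ObjectInfo, error)
	Delete(ctx context.Context, key string) error
}

var ErrUnsupported = errors.New("capability not supported by this provider")

// Require lets a caller fail fast, or choose a fallback path, before
// relying on an optional feature.
func Require(s ObjectStore, want Capability) error {
	if !s.Capabilities().Has(want) {
		return ErrUnsupported
	}
	return nil
}
```

A conformance suite could then run the same test table against each adapter and skip cases whose capability flag is absent.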

R4 — Multi-tenant isolation (SEVERITY: high — it’s a security boundary)

Logical (shared-DB) multi-tenancy is one forgotten WHERE tenant_id = ? away from a cross-tenant data breach. App-layer scoping alone is insufficient.

Mitigation: enforce isolation at the lowest layer with Postgres Row-Level Security, tenant context set per connection/transaction; tenant-scoped object key prefixes; per-tenant rate limits. Documented path to schema/DB-per-tenant for enterprise tenants. See ADR-0007.
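A sketch of setting tenant context per transaction, assuming a `files` table with a `uuid` `tenant_id` column and a custom setting named `app.tenant_id`; those names and the key layout are hypothetical. `set_config(name, value, is_local)` is a standard Postgres function, and passing `true` scopes the value to the current transaction.

```go
package main

import (
	"context"
	"database/sql"
)

// tenantPolicySQL is an assumed migration, shown for reference only.
// FORCE makes the policy apply to the table owner as well; superusers and
// roles with BYPASSRLS still bypass it, so the app must not connect as one.
const tenantPolicySQL = `
ALTER TABLE files ENABLE ROW LEVEL SECURITY;
ALTER TABLE files FORCE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON files
    USING (tenant_id = current_setting('app.tenant_id')::uuid);
`

// WithTenant runs fn inside a transaction whose tenant context is set
// before any query executes, so a forgotten WHERE clause cannot leak rows.
func WithTenant(ctx context.Context, db *sql.DB, tenantID string, fn func(*sql.Tx) error) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // returns an ignorable error after a successful Commit

	// is_local = true: the setting is discarded when the transaction ends,
	// so a pooled connection cannot carry it over to another tenant.
	if _, err := tx.ExecContext(ctx, "SELECT set_config('app.tenant_id', $1, true)", tenantID); err != nil {
		return err
	}
	if err := fn(tx); err != nil {
		return err
	}
	return tx.Commit()
}

// ObjectKey applies the same boundary to object storage via a key prefix.
func ObjectKey(tenantID, fileID string) string {
	return "tenants/" + tenantID + "/objects/" + fileID
}
```

Using the transaction-local form matters with connection pooling: a session-level setting could outlive the request that set it.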

R5 — Data-plane scaling & the presigned-URL authz gap (SEVERITY: high)

Proxying file bytes through Go services is a memory/CPU/egress disaster at scale. The fix — direct-to-storage transfer via presigned URLs — bypasses your authz and audit layer.

Mitigation: never proxy bulk bytes; issue presigned URLs only after an authz check, scoped tightly (method, exact key, content-length-range, content-type, short TTL). Accept that the data plane is a deliberate, controlled bypass of the control plane. See ADR-0011.
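The "authorize first, then scope tightly" order can be sketched without committing to a provider SDK. Here `UploadGrant` is a hypothetical description of what the presigner would be asked to sign, and the 15-minute ceiling and staging key layout are assumptions, not values from ADR-0011.

```go
package main

import (
	"errors"
	"time"
)

// MaxGrantTTL is an assumed upper bound on presigned URL lifetime.
const MaxGrantTTL = 15 * time.Minute

var ErrForbidden = errors.New("caller may not write this file")

// Authorizer is the control-plane check that must pass before signing.
type Authorizer func(userID, tenantID, fileID string) bool

// UploadRequest is what the client asks for.
type UploadRequest struct {
	UserID, TenantID, FileID string
	ContentType              string
	Size                     int64
	TTL                      time.Duration
}

// UploadGrant lists the constraints handed to the provider's presigner.
type UploadGrant struct {
	Method      string
	Key         string // one exact key, never a prefix
	MaxBytes    int64
	ContentType string
	TTL         time.Duration
}

// IssueUploadGrant refuses to describe a URL for an unauthorized caller
// and clamps the lifetime to MaxGrantTTL.
func IssueUploadGrant(allow Authorizer, r UploadRequest) (UploadGrant, error) {
	if !allow(r.UserID, r.TenantID, r.FileID) {
		return UploadGrant{}, ErrForbidden
	}
	ttl := r.TTL
	if ttl <= 0 || ttl > MaxGrantTTL {
		ttl = MaxGrantTTL
	}
	return UploadGrant{
		Method:      "PUT",
		Key:         "tenants/" + r.TenantID + "/staging/" + r.FileID,
		MaxBytes:    r.Size,
		ContentType: r.ContentType,
		TTL:         ttl,
	}, nil
}
```

How each constraint is enforced depends on the provider, so the size and content-type limits should also be re-checked when the upload is committed.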

R6 — Event-driven complexity & eventual consistency (SEVERITY: medium-high)

NATS JetStream is at-least-once. Without discipline you get duplicate processing, out-of-order effects, lost events, and undebuggable flows. Eventual consistency leaks into the UI (a file is uploaded but not yet searchable).

Mitigation: idempotent consumers (dedup keys), per-aggregate ordering, explicit DLQs, the transactional outbox as the only event source, and end-to-end tracing. Design the UI to expect “indexing…” states. See ADR-0006.
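A minimal in-memory sketch of an idempotent consumer with per-aggregate ordering. It assumes each event carries a unique ID and a per-aggregate sequence number starting at 1, and it drops stale events, which suits last-state-wins projections such as indexing but is only one possible policy. The names are hypothetical.

```go
package main

import "sync"

// Event is the envelope a consumer receives (assumed shape).
type Event struct {
	ID          string // dedup key, unique per event
	AggregateID string // e.g. a file ID
	Seq         int64  // per-aggregate sequence, assigned at outbox write
}

// Consumer applies each event at most once and never moves an
// aggregate backwards.
type Consumer struct {
	mu      sync.Mutex
	seen    map[string]bool
	lastSeq map[string]int64
	apply   func(Event)
}

func NewConsumer(apply func(Event)) *Consumer {
	return &Consumer{
		seen:    make(map[string]bool),
		lastSeq: make(map[string]int64),
		apply:   apply,
	}
}

// Handle returns true when the event was applied and false when it was
// skipped as a redelivery or as older than what was already applied.
func (c *Consumer) Handle(e Event) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.seen[e.ID] {
		return false // at-least-once redelivery
	}
	if e.Seq <= c.lastSeq[e.AggregateID] {
		return false // out-of-order arrival of an older event
	}
	c.apply(e)
	c.seen[e.ID] = true
	c.lastSeq[e.AggregateID] = e.Seq
	return true
}
```

In a durable version the dedup record would be written in the same transaction as the side effect, since an in-memory set is lost on restart.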

R7 — Search index drift (SEVERITY: medium)

OpenSearch is a derived store; it will drift from Postgres (lag, failures, reindexes).

Mitigation: treat the index as disposable and rebuildable from the source of truth; event-driven indexing with a reconciliation/backfill job; never read authoritative state from the index. See ADR-0009.
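The reconciliation job reduces to a diff between the source of truth and the index. The sketch below assumes both sides can be summarized as a map from file ID to version; that representation and the function name are illustrative only.

```go
package main

import "sort"

// Reconcile compares the authoritative versions (from Postgres) with the
// versions the index claims to hold. It returns the IDs to (re)index and
// the IDs to delete from the index, each sorted for stable output.
func Reconcile(truth, index map[string]int64) (upsert, remove []string) {
	for id, v := range truth {
		if iv, ok := index[id]; !ok || iv != v {
			upsert = append(upsert, id) // missing or stale in the index
		}
	}
	for id := range index {
		if _, ok := truth[id]; !ok {
			remove = append(remove, id) // indexed but no longer exists
		}
	}
	sort.Strings(upsert)
	sort.Strings(remove)
	return upsert, remove
}
```

For example, with truth `{a:3, b:1, c:2}` and index `{a:3, b:0, d:5}`, the job would re-index `b` and `c` and remove `d`. A real job would page through both stores rather than load them into memory.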

R8 — Dual API surface drift: gRPC internal + REST external (SEVERITY: medium)

Two contracts for the same operations invite divergence and double maintenance.

Mitigation: protobuf is the single source of truth; generate REST/OpenAPI from it (grpc-gateway or a thin BFF). One schema, two transports. See ADR-0003.

R9 — Large-file & resumable uploads (SEVERITY: medium)

Naive single-request uploads fail on flaky networks and with large files, and buffering whole files in the service creates memory pressure.

Mitigation: multipart + resumable protocol (tus-style or multipart with client-tracked parts), content-length-range enforcement, backpressure.
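Resuming with client-tracked parts comes down to planning fixed-size parts and re-sending only the ones not yet acknowledged. A sketch under those assumptions follows, with 1-based part numbers as in S3-style multipart uploads; the function names are hypothetical.

```go
package main

// Part is one contiguous byte range of the file.
type Part struct {
	Number int   // 1-based
	Offset int64 // starting byte
	Length int64 // bytes in this part; the last part may be shorter
}

// PlanParts splits total bytes into parts of at most partSize bytes.
func PlanParts(total, partSize int64) []Part {
	if total <= 0 || partSize <= 0 {
		return nil
	}
	var parts []Part
	for off, n := int64(0), 1; off < total; off, n = off+partSize, n+1 {
		length := partSize
		if total-off < partSize {
			length = total - off
		}
		parts = append(parts, Part{Number: n, Offset: off, Length: length})
	}
	return parts
}

// Missing returns the parts still to be sent after an interruption.
func Missing(plan []Part, done map[int]bool) []Part {
	var out []Part
	for _, p := range plan {
		if !done[p.Number] {
			out = append(out, p)
		}
	}
	return out
}
```

Because each part is read from its offset on demand, the client never needs to hold more than one part in memory at a time.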

R10 — Operational debuggability (SEVERITY: medium, compounding)

A distributed, event-driven system without correlation is impossible to debug.

Mitigation: OpenTelemetry from commit #1 — trace IDs propagated across REST → gRPC → NATS; structured logs; RED/USE metrics; SLOs. See ADR-0013.

R11 — Secrets & encryption key management (SEVERITY: medium, high if mishandled)

Per-tenant encryption keys, provider credentials, and signing keys all have to be stored, rotated, and access-controlled somewhere.

Mitigation: envelope encryption with a KMS/Vault abstraction; no secrets in config files; at-rest encryption in v1, with E2E as a future track (it breaks server-side search/preview; be honest about the tradeoff). See ADR-0014.
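Envelope encryption can be illustrated with the standard library alone. In this sketch the key-encryption key (KEK) is a local 32-byte slice, which is an assumption for the sake of a runnable example; with a real KMS or Vault, wrapping and unwrapping the data key would be remote calls and the KEK would never leave that service.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
)

// Envelope is what gets stored: the data key wrapped by the KEK, and
// the payload encrypted by the data key.
type Envelope struct {
	WrappedDEK []byte
	Ciphertext []byte
}

// seal encrypts with AES-GCM and prepends the random nonce.
func seal(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// unseal reverses seal and fails if the key or data is wrong.
func unseal(key, sealed []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("sealed data too short")
	}
	nonce, ct := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}

// Encrypt generates a fresh data key per object and wraps it with the KEK.
func Encrypt(kek, plaintext []byte) (Envelope, error) {
	dek := make([]byte, 32)
	if _, err := rand.Read(dek); err != nil {
		return Envelope{}, err
	}
	ct, err := seal(dek, plaintext)
	if err != nil {
		return Envelope{}, err
	}
	wrapped, err := seal(kek, dek)
	if err != nil {
		return Envelope{}, err
	}
	return Envelope{WrappedDEK: wrapped, Ciphertext: ct}, nil
}

// Decrypt unwraps the data key, then decrypts the payload.
func Decrypt(kek []byte, env Envelope) ([]byte, error) {
	dek, err := unseal(kek, env.WrappedDEK)
	if err != nil {
		return nil, err
	}
	return unseal(dek, env.Ciphertext)
}
```

The design benefit is that rotating a tenant's KEK only requires re-wrapping the small data keys, not re-encrypting stored objects.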

R12 — Cost / egress (SEVERITY: low now, high at scale)

Multi-cloud deployments incur egress fees, and proxying data through the services multiplies the cost.

Mitigation: direct-to-storage (R5), provider co-location, lifecycle tiering.

3. Overengineering Risks

The whole brief reads as premature optimization for scale you do not have. This is the most common way ambitious portfolio projects die: all energy goes into infrastructure, and the actual file-sync product never ships. A Principal’s job is to delete complexity until it hurts, then add back only what a forcing function demands.

The Overengineering Ledger

Requirement in brief	The risk	v1 prescription	Add complexity when…
Microservices from day one	Distributed monolith: network calls, partial failure, and ops overhead with zero organizational scaling benefit	Modular monolith, one binary (`bitvaultd`), modules with clean interfaces along bounded-context lines	A module has a different scaling/deploy profile (e.g. indexer, thumbnailer) or a team owns it independently
NATS JetStream everywhere	Async indirection for what are in-process function calls	In-process event bus + transactional outbox; outbox can publish to NATS the day you split	You extract a service and need cross-process async
5 storage providers	Building a 5-way abstraction for hypothetical customers	One capability-flagged interface, MinIO/S3 adapter only	A real deployment target needs R2/GCS/Azure
OpenSearch	A JVM cluster to search filenames	Postgres FTS for name/metadata search	Full-text search inside document contents becomes a committed feature
3 deployment models (self-host, SaaS, multi-tenant)	Conflicting constraints multiply work	One artifact, two packagings: Docker Compose (self-host) + Helm (K8s/SaaS), driven by config profiles	A profile’s needs genuinely diverge
Service mesh / operators / CRDs	Platform plumbing for traffic you don’t have	Plain K8s + Ingress + NetworkPolicies	You have multiple services needing mTLS/traffic-shaping
CRDTs for sync	Solving real-time co-editing — which is a non-goal	Change journal + conflicted-copies + version history	Real-time collaborative editing becomes a goal
Client-side E2E encryption	Crypto complexity; breaks server-side search/preview/dedup	At-rest envelope encryption + TLS in transit	A privacy-tier product requirement exists
6 stateful infra deps	Self-host adoption killer (Nextcloud needs ~2)	Tiered deps: Core = Postgres + object store; Standard = +Redis +NATS; Full = +OpenSearch	The tier’s feature is actually enabled
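The "in-process event bus now, NATS later" row can be sketched as a small publisher interface with an in-process implementation. The interface and type names are hypothetical; the point is only that modules depend on `Publisher`, so a NATS-backed implementation could replace `InProcBus` when a service is extracted.

```go
package main

import "sync"

// Publisher is the only thing modules (and the outbox relay) depend on.
type Publisher interface {
	Publish(subject string, payload []byte) error
}

// Handler processes one message; an error stops delivery for that publish.
type Handler func(payload []byte) error

// InProcBus delivers messages synchronously inside one binary.
type InProcBus struct {
	mu       sync.RWMutex
	handlers map[string][]Handler
}

func NewInProcBus() *InProcBus {
	return &InProcBus{handlers: make(map[string][]Handler)}
}

// Subscribe registers a handler for an exact subject.
func (b *InProcBus) Subscribe(subject string, h Handler) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.handlers[subject] = append(b.handlers[subject], h)
}

// Publish calls each handler in registration order. Publishing to a
// subject with no subscribers is not an error.
func (b *InProcBus) Publish(subject string, payload []byte) error {
	b.mu.RLock()
	hs := append([]Handler(nil), b.handlers[subject]...)
	b.mu.RUnlock()
	for _, h := range hs {
		if err := h(payload); err != nil {
			return err
		}
	}
	return nil
}

// Compile-time check that InProcBus satisfies Publisher.
var _ Publisher = (*InProcBus)(nil)
```

Handlers written against this bus should already be idempotent, because the cross-process replacement would deliver at least once rather than exactly once.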

3.1 The reframe that makes this a stronger portfolio

There is a real tension: the brief lists microservices, NATS, K8s, and multi-cloud because the point is to demonstrate them. Cutting them to a monolith seems to defeat the purpose. It does not. The resolution:

Microservices is the target architecture. The modular monolith is the starting architecture. The migration between them is the portfolio centerpiece.

To a senior reviewer, “I shipped 12 microservices for an app with no traffic” reads as cargo-culting. “I built a modular monolith with strict bounded-context boundaries, then extracted the search-indexer and storage-worker into services when the async/scaling profile justified it — here is the strangler-fig migration, the outbox that made it safe, and the traces that proved it worked” reads as a Principal. The discipline is the demonstration.

You still get to show every technology on the list. You show them in the order a competent team would actually adopt them, each with a written justification (the ADRs). That narrative is rarer and more valuable than the tech itself.

3.2 The one-sentence guardrail

Every piece of infrastructure must be justified by a demonstrated need (a metric, a load test, a real deployment target) or an explicit portfolio learning goal recorded in an ADR — never by “we might need it.”

4. Summary of Recommendations

Cut the web-CMS. Redefine “content management” as DAM-lite over stored files.
Make sync correctness the headline. It is the hard, credible, differentiating part.
Start as a modular monolith (bitvaultd) with bounded-context modules that can be extracted later.
Tier the infrastructure. Postgres + object store is the core; everything else is opt-in.
Ship one storage adapter (MinIO/S3) behind a capability-flagged interface.
Solve dual-write deliberately (commit protocol + outbox + ref-counted GC) from day one — this is correctness, not optimization.
Enforce tenant isolation in the database (RLS), not just app code.
Direct-to-storage data plane via tightly-scoped presigned URLs.
OpenTelemetry from commit #1. Observability is not a later phase here.
Treat the monolith→services extraction as the deliverable, documented via ADRs and traces.

See 09-evolution-roadmap.md for the phased plan that operationalizes these recommendations, and ../adr/ for the decisions.