01 — Project Critique, Architectural Risks & Overengineering Review
Status: Draft for review · Audience: founding team · Author: Principal review
Covers tasks 1–3: critique the idea, architectural risks, overengineering risks.
This is a pre-development design review. It is deliberately blunt. The goal is to spend disagreement now — on documents — instead of later, on rewrites.
1. Critique of the Project Idea
1.1 What BitVault actually is
Stripped of the buzzwords, BitVault is four products wearing one name:
- File storage — durable bytes, quotas, lifecycle (this is the “easy” part; object stores already solve it).
- File synchronization — multi-device, offline-tolerant, conflict-aware reconciliation (this is the hard part).
- Sharing & collaboration — links, permissions, expiry, external recipients.
- Content management — vague, and the most dangerous word in the brief.
Each of the first three is a serious system. Dropbox, Box, and Nextcloud are each the output of large teams over many years. Building a credible, deep version of even two of these is an ambitious portfolio project. Building a shallow version of all four is a worse outcome than building one well.
1.2 The honest verdict
As a product: unfocused and over-scoped. “Storage + sync + sharing + CMS, multi-tenant, self-hostable, SaaS, multi-cloud” is not a wedge; it is a wish list. There is no sharp answer to “who is the first user and what one job do we do better than their current tool?”
As a portfolio / platform-engineering showcase: excellent — if disciplined. The problem space legitimately exercises distributed systems, storage, eventing, multi-tenancy, K8s, and observability. The breadth that makes it a weak product makes it a strong learning vehicle. The entire risk is whether the execution is deep or a mile-wide-inch-deep demo.
1.3 The “Content Management” trap (resolve this first)
“Content management platform” is doing too much work in the brief. There are two very different things it could mean:
- A web CMS (structured content types, editorial workflows, publishing, rendering) — e.g. Contentful, Strapi. This is a different product from file sync and would fracture the domain model. Recommendation: cut entirely.
- Digital Asset Management (DAM-lite: rich metadata, tags, versioning, previews/thumbnails, content search over the files you already store) — this is a coherent extension of file storage. Recommendation: this is the intended meaning; rename it “Asset & Metadata Management” and treat it as a later phase.
Decision required: BitVault’s “content management” = DAM-lite over stored files (tags, metadata, versions, previews, search), not a web CMS. The rest of this review assumes that interpretation.
1.4 The sync question is the whole game
Sync is where real distributed-systems credibility lives, and it is the single
feature most clones fake. “Upload + download with a Sync button” is not sync.
Real sync means: a change feed, content identity, delta transfer, offline edits,
and a defensible conflict policy that never silently loses data. If BitVault
nails sync and is honest about everything else, it is a strong project. If it
treats sync as an afterthought, it is one of a thousand S3 wrappers.
Strategic recommendation: make sync correctness the headline competency. Everything else (multi-cloud, K8s, eventing) is supporting cast for that story.
2. Architectural Risks
Ranked by “how badly does this hurt if we get it wrong.”
R1 — Sync correctness & conflict resolution (SEVERITY: critical)
Multi-device offline edits create concurrent divergent histories. The failure mode is silent data loss (last-writer-wins clobbering a user’s work) — the most trust-destroying bug a storage product can ship.
- Mitigation: a monotonic per-tenant change journal, content-addressed file identity, and an explicit conflict policy: create conflicted copies, never overwrite, with version history as the safety net. See ADR-0008.
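The conflict policy above can be sketched in a few lines. This is a minimal in-memory illustration, not the ADR-0008 design: the `Journal` and `Change` types, the `Commit` signature, and the conflicted-copy naming are all assumptions made for the example. The one rule it encodes is that an edit based on a stale version is stored under a new path and the current head is left alone.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Change is one entry in a tenant's change journal.
type Change struct {
	Seq  uint64 // strictly increasing per tenant
	Path string
	Hash string // content-addressed identity of the bytes
}

// Journal is an in-memory stand-in for one tenant's journal (hypothetical type).
type Journal struct {
	seq     uint64
	head    map[string]string // path -> hash of the current version
	Entries []Change          // append-only; doubles as version history here
}

func NewJournal() *Journal { return &Journal{head: map[string]string{}} }

// ContentHash derives the content identity from the bytes alone.
func ContentHash(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

func (j *Journal) record(path, hash string) Change {
	j.seq++
	c := Change{Seq: j.seq, Path: path, Hash: hash}
	j.head[path] = hash
	j.Entries = append(j.Entries, c)
	return c
}

// Commit applies an edit a device made on top of baseHash ("" for a new file).
// If the server head has moved since baseHash, the edit is stored as a
// conflicted copy under a new path; the existing head is never overwritten.
func (j *Journal) Commit(path, baseHash, device string, data []byte) (Change, bool) {
	cur := j.head[path] // "" when the path does not exist yet
	if cur == baseHash {
		return j.record(path, ContentHash(data)), false
	}
	conflictPath := fmt.Sprintf("%s (conflicted copy, %s, seq %d)", path, device, j.seq+1)
	return j.record(conflictPath, ContentHash(data)), true
}

func main() {
	j := NewJournal()
	v1, _ := j.Commit("notes.txt", "", "laptop", []byte("v1"))
	// Both devices edit offline on top of v1, then reconnect.
	j.Commit("notes.txt", v1.Hash, "laptop", []byte("laptop edit"))
	c, conflicted := j.Commit("notes.txt", v1.Hash, "phone", []byte("phone edit"))
	fmt.Println(conflicted, c.Path)
}
```

The second writer loses the race for the path but not the data: both edits end up in the journal, and the client can surface the conflicted copy to the user.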
R2 — The dual-write problem: object store and metadata DB (SEVERITY: critical)
A file is two writes: the bytes (object storage) and the truth (Postgres
row). If they diverge you get orphaned blobs (waste) or dangling references
(404s / corruption). Naive PutObject + INSERT is not atomic.
- Mitigation: commit protocol — upload to a staging key → server verifies (HEAD/size/ETag/hash) → metadata written in a transaction with a transactional outbox row → background finalizer + reference-counted GC for orphans. Postgres is the source of truth; an object with no metadata row does not exist. See ADR-0004, ADR-0006.
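A sketch of the verify-then-commit step and orphan detection, under heavy simplification. `MetaDB` is an in-memory stand-in for Postgres, the object store is a plain map, and every name here (`Finalize`, `CommitTx`, `Orphans`, the event string) is hypothetical. Promotion from the staging key to a final key, reference counting, and a grace period for in-flight uploads are omitted.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// ObjectInfo is what a HEAD on the staged object reports.
type ObjectInfo struct {
	Size   int64
	SHA256 string
}

// MetaDB is an in-memory stand-in for Postgres (hypothetical type).
type MetaDB struct {
	Files  map[string]string // file ID -> object key; the source of truth
	Outbox []string          // events awaiting publication
}

// CommitTx models one transaction: the file row and the outbox row are
// written together or not at all.
func (db *MetaDB) CommitTx(fileID, key, event string) error {
	if _, exists := db.Files[fileID]; exists {
		return fmt.Errorf("file %q already committed", fileID)
	}
	db.Files[fileID] = key
	db.Outbox = append(db.Outbox, event)
	return nil
}

var ErrMismatch = errors.New("staged object does not match declared size/hash")

// Finalize verifies the staged object, then commits metadata and outbox row.
func Finalize(store map[string]ObjectInfo, db *MetaDB, fileID, stagingKey string, declared ObjectInfo) error {
	got, ok := store[stagingKey]
	if !ok {
		return fmt.Errorf("staged object %q not found", stagingKey)
	}
	if got != declared {
		return ErrMismatch
	}
	return db.CommitTx(fileID, stagingKey, "file.committed:"+fileID)
}

// Orphans lists object keys that no metadata row references. A real GC
// would also skip uploads that are still in progress.
func Orphans(store map[string]ObjectInfo, db *MetaDB) []string {
	referenced := make(map[string]bool, len(db.Files))
	for _, key := range db.Files {
		referenced[key] = true
	}
	var orphans []string
	for key := range store {
		if !referenced[key] {
			orphans = append(orphans, key)
		}
	}
	sort.Strings(orphans)
	return orphans
}

func main() {
	store := map[string]ObjectInfo{
		"staging/u1": {Size: 3, SHA256: "abc"},
		"staging/u2": {Size: 9, SHA256: "def"}, // upload abandoned by the client
	}
	db := &MetaDB{Files: map[string]string{}}
	err := Finalize(store, db, "file-1", "staging/u1", ObjectInfo{Size: 3, SHA256: "abc"})
	fmt.Println(err, db.Outbox, Orphans(store, db))
}
```

The ordering is the point: bytes first, verification second, metadata last. A crash at any step leaves either a committed file or an unreferenced blob, never a row pointing at nothing.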
R3 — Storage abstraction leakiness (SEVERITY: high)
MinIO, S3, R2, GCS, and Azure Blob differ in consistency, multipart semantics, presigned-URL capabilities, conditional writes, and error taxonomies. A naive “one interface for all” either leaks provider quirks or collapses to a useless lowest-common-denominator.
- Mitigation: a narrow, capability-flagged interface (Put/Get/Head/Delete, multipart, presign, list); per-provider adapters; a conformance test suite every adapter must pass. Ship one adapter first (MinIO/S3). See ADR-0005.
R4 — Multi-tenant isolation (SEVERITY: high — it’s a security boundary)
Logical (shared-DB) multi-tenancy is one forgotten WHERE tenant_id = ? away from
a cross-tenant data breach. App-layer scoping alone is insufficient.
- Mitigation: enforce isolation at the lowest layer with Postgres Row-Level Security, tenant context set per connection/transaction; tenant-scoped object key prefixes; per-tenant rate limits. Documented path to schema/DB-per-tenant for enterprise tenants. See ADR-0007.
R5 — Data-plane scaling & the presigned-URL authz gap (SEVERITY: high)
Proxying file bytes through Go services is a memory/CPU/egress disaster at scale. The fix — direct-to-storage transfer via presigned URLs — bypasses your authz and audit layer.
- Mitigation: never proxy bulk bytes; issue presigned URLs only after an authz check, scoped tightly (method, exact key, content-length-range, content-type, short TTL). Accept that the data plane is a deliberate, controlled bypass of the control plane. See ADR-0011.
R6 — Event-driven complexity & eventual consistency (SEVERITY: medium-high)
NATS JetStream delivers at-least-once by default. Without discipline you get duplicate processing, out-of-order effects, lost events, and undebuggable flows. Eventual consistency leaks into the UI (a file is uploaded but not yet searchable).
- Mitigation: idempotent consumers (dedup keys), per-aggregate ordering, explicit DLQs, the transactional outbox as the only event source, and end-to-end tracing. Design the UI to expect “indexing…” states. See ADR-0006.
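A minimal sketch of an idempotent consumer with per-aggregate ordering. The `Event` fields and the `Consumer` type are assumptions for illustration, and state is held in memory; in practice the dedup state would need to be persisted in the same transaction as the side effect, or a crash between the two reintroduces duplicates. DLQ handling is omitted.

```go
package main

import "fmt"

// Event carries the two things a consumer needs to be safe under
// at-least-once delivery: a dedup key and a per-aggregate sequence.
type Event struct {
	ID        string // dedup key, assigned once when the outbox row is written
	Aggregate string // e.g. a file ID; ordering is enforced per aggregate
	Seq       uint64 // position within the aggregate, starting at 1
}

type Outcome int

const (
	Applied    Outcome = iota // side effect ran; ack
	Duplicate                 // already processed; ack without side effect
	OutOfOrder                // a predecessor is missing; nak for redelivery
)

func (o Outcome) String() string {
	return [...]string{"applied", "duplicate", "out-of-order"}[o]
}

// Consumer is an in-memory sketch (hypothetical type).
type Consumer struct {
	seen    map[string]bool
	lastSeq map[string]uint64
	Log     []string // IDs of events whose side effect ran, in order
}

func NewConsumer() *Consumer {
	return &Consumer{seen: map[string]bool{}, lastSeq: map[string]uint64{}}
}

// Handle applies each event at most once and only in sequence order.
func (c *Consumer) Handle(e Event) Outcome {
	if c.seen[e.ID] || e.Seq <= c.lastSeq[e.Aggregate] {
		return Duplicate
	}
	if e.Seq != c.lastSeq[e.Aggregate]+1 {
		return OutOfOrder
	}
	c.seen[e.ID] = true
	c.lastSeq[e.Aggregate] = e.Seq
	c.Log = append(c.Log, e.ID)
	return Applied
}

func main() {
	c := NewConsumer()
	deliveries := []Event{
		{ID: "e1", Aggregate: "file-1", Seq: 1},
		{ID: "e1", Aggregate: "file-1", Seq: 1}, // redelivery
		{ID: "e3", Aggregate: "file-1", Seq: 3}, // arrives before e2
		{ID: "e2", Aggregate: "file-1", Seq: 2},
		{ID: "e3", Aggregate: "file-1", Seq: 3}, // redelivered after nak
	}
	for _, e := range deliveries {
		fmt.Println(e.ID, c.Handle(e))
	}
}
```

Ordering is scoped to the aggregate on purpose: events for different files can be processed in parallel, and only events for the same file have to wait for each other.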
R7 — Search index drift (SEVERITY: medium)
OpenSearch is a derived store; it will drift from Postgres (lag, failures, reindexes).
- Mitigation: treat the index as disposable and rebuildable from the source of truth; event-driven indexing with a reconciliation/backfill job; never read authoritative state from the index. See ADR-0009.
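The reconciliation job reduces to a diff between the source of truth and the index. A sketch, assuming each document carries a version number in both stores; the `Plan` type and `Reconcile` function are hypothetical, and the maps stand in for paged scans of Postgres and the search index.

```go
package main

import (
	"fmt"
	"sort"
)

// Plan lists what one reconciliation pass must do to the index.
type Plan struct {
	Upsert []string // missing from the index, or indexed at a different version
	Delete []string // present in the index, gone from the source of truth
}

// Reconcile compares document versions. source is authoritative; the index
// is only ever corrected toward it, never the other way around.
func Reconcile(source, index map[string]uint64) Plan {
	var p Plan
	for id, v := range source {
		if iv, ok := index[id]; !ok || iv != v {
			p.Upsert = append(p.Upsert, id)
		}
	}
	for id := range index {
		if _, ok := source[id]; !ok {
			p.Delete = append(p.Delete, id)
		}
	}
	sort.Strings(p.Upsert)
	sort.Strings(p.Delete)
	return p
}

func main() {
	source := map[string]uint64{"a": 3, "b": 1, "c": 2}
	index := map[string]uint64{"a": 2, "b": 1, "z": 7}
	fmt.Printf("%+v\n", Reconcile(source, index))
}
```

A full rebuild is the same function with an empty index: everything in the source lands in `Upsert`. That is what "disposable and rebuildable" means in practice.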
R8 — Dual API surface drift: gRPC internal + REST external (SEVERITY: medium)
Two contracts for the same operations invite divergence and double maintenance.
- Mitigation: protobuf is the single source of truth; generate REST/OpenAPI from it (grpc-gateway or a thin BFF). One schema, two transports. See ADR-0003.
R9 — Large-file & resumable uploads (SEVERITY: medium)
Naive uploads fail on flaky networks and large files; memory pressure if buffered.
- Mitigation: multipart + resumable protocol (tus-style or multipart with client-tracked parts), content-length-range enforcement, backpressure.
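The client-tracked-parts variant comes down to two small functions: split the file into fixed-size parts, then on resume send only the parts the server has not acknowledged. A sketch with hypothetical names (`Part`, `PlanParts`, `Remaining`); part numbers are 1-based to match S3-style multipart uploads.

```go
package main

import (
	"errors"
	"fmt"
)

// Part describes one byte range of the file.
type Part struct {
	Number int   // 1-based, as in S3-style multipart uploads
	Offset int64 // starting byte within the file
	Length int64 // the last part may be shorter than partSize
}

// PlanParts splits size bytes into parts of at most partSize bytes.
func PlanParts(size, partSize int64) ([]Part, error) {
	if size <= 0 || partSize <= 0 {
		return nil, errors.New("size and partSize must be positive")
	}
	var parts []Part
	for off := int64(0); off < size; off += partSize {
		n := partSize
		if size-off < n {
			n = size - off
		}
		parts = append(parts, Part{Number: len(parts) + 1, Offset: off, Length: n})
	}
	return parts, nil
}

// Remaining returns the parts not yet acknowledged. Resuming an upload
// means re-sending only these.
func Remaining(parts []Part, done map[int]bool) []Part {
	var todo []Part
	for _, p := range parts {
		if !done[p.Number] {
			todo = append(todo, p)
		}
	}
	return todo
}

func main() {
	parts, _ := PlanParts(25, 10)
	fmt.Println(parts)
	// The connection dropped after parts 1 and 3 were acknowledged.
	fmt.Println(Remaining(parts, map[int]bool{1: true, 3: true}))
}
```

Because each part is read by offset and length, the client never needs more than one part in memory, which addresses the buffering concern as well as the flaky-network one.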
R10 — Operational debuggability (SEVERITY: medium, compounding)
A distributed, event-driven system without correlation is impossible to debug.
- Mitigation: OpenTelemetry from commit #1 — trace IDs propagated across REST → gRPC → NATS; structured logs; RED/USE metrics; SLOs. See ADR-0013.
R11 — Secrets & encryption key management (SEVERITY: medium, high if mishandled)
Per-tenant encryption, provider credentials, signing keys.
- Mitigation: envelope encryption with a KMS/Vault abstraction; no secrets in config files; at-rest encryption in v1, with E2E as a future track (it breaks server-side search/preview; be honest about the tradeoff). See ADR-0014.
R12 — Cost / egress (SEVERITY: low now, high at scale)
Multi-cloud egress fees; proxying data multiplies cost.
- Mitigation: direct-to-storage (R5), provider co-location, lifecycle tiering.
3. Overengineering Risks
The whole brief reads as premature optimization for scale you do not have. This is the most common way ambitious portfolio projects die: all energy goes into infrastructure, and the actual file-sync product never ships. A Principal’s job is to delete complexity until it hurts, then add back only what a forcing function demands.
The Overengineering Ledger
| Requirement in brief | The risk | v1 prescription | Add complexity when… |
|---|---|---|---|
| Microservices from day one | Distributed monolith: network calls, partial failure, and ops overhead with zero organizational scaling benefit | Modular monolith, one binary (bitvaultd), modules with clean interfaces along bounded-context lines | A module has a different scaling/deploy profile (e.g. indexer, thumbnailer) or a team owns it independently |
| NATS JetStream everywhere | Async indirection for what are in-process function calls | In-process event bus + transactional outbox; outbox can publish to NATS the day you split | You extract a service and need cross-process async |
| 5 storage providers | Building a 5-way abstraction for hypothetical customers | One capability-flagged interface, MinIO/S3 adapter only | A real deployment target needs R2/GCS/Azure |
| OpenSearch | A JVM cluster to search filenames | Postgres FTS for name/metadata search | Full-text search inside document contents becomes a committed feature |
| 3 deployment models (self-host, SaaS, multi-tenant) | Conflicting constraints multiply work | One artifact, two packagings: Docker Compose (self-host) + Helm (K8s/SaaS), driven by config profiles | A profile’s needs genuinely diverge |
| Service mesh / operators / CRDs | Platform plumbing for traffic you don’t have | Plain K8s + Ingress + NetworkPolicies | You have multiple services needing mTLS/traffic-shaping |
| CRDTs for sync | Solving real-time co-editing — which is a non-goal | Change journal + conflicted-copies + version history | Real-time collaborative editing becomes a goal |
| Client-side E2E encryption | Crypto complexity; breaks server-side search/preview/dedup | At-rest envelope encryption + TLS in transit | A privacy-tier product requirement exists |
| 6 stateful infra deps | Self-host adoption killer (Nextcloud needs ~2) | Tiered deps: Core = Postgres + object store; Standard = +Redis +NATS; Full = +OpenSearch | The tier’s feature is actually enabled |
3.1 The reframe that makes this a stronger portfolio
There is a real tension: the brief lists microservices, NATS, K8s, and multi-cloud because the point is to demonstrate them. Cutting them to a monolith seems to defeat the purpose. It does not. The resolution:
Microservices is the target architecture. The modular monolith is the starting architecture. The migration between them is the portfolio centerpiece.
To a senior reviewer, “I shipped 12 microservices for an app with no traffic” reads as cargo-culting. “I built a modular monolith with strict bounded-context boundaries, then extracted the search-indexer and storage-worker into services when the async/scaling profile justified it — here is the strangler-fig migration, the outbox that made it safe, and the traces that proved it worked” reads as a Principal. The discipline is the demonstration.
You still get to show every technology on the list. You show them in the order a competent team would actually adopt them, each with a written justification (the ADRs). That narrative is rarer and more valuable than the tech itself.
3.2 The one-sentence guardrail
Every piece of infrastructure must be justified by a demonstrated need (a metric, a load test, a real deployment target) or an explicit portfolio learning goal recorded in an ADR — never by “we might need it.”
4. Summary of Recommendations
- Cut the web-CMS. Redefine “content management” as DAM-lite over stored files.
- Make sync correctness the headline. It is the hard, credible, differentiating part.
- Start as a modular monolith (bitvaultd), bounded-context modules, extractable.
- Tier the infrastructure. Postgres + object store is the core; everything else is opt-in.
- Ship one storage adapter (MinIO/S3) behind a capability-flagged interface.
- Solve dual-write deliberately (commit protocol + outbox + ref-counted GC) from day one — this is correctness, not optimization.
- Enforce tenant isolation in the database (RLS), not just app code.
- Direct-to-storage data plane via tightly-scoped presigned URLs.
- OpenTelemetry from commit #1. Observability is not a later phase here.
- Treat the monolith→services extraction as the deliverable, documented via ADRs and traces.
See 09-evolution-roadmap.md for the phased plan that
operationalizes these recommendations, and ../adr/ for the decisions.