05 — Uploads: Multipart vs Chunked vs Resumable

Topics: multipart uploads, chunked uploads, resumable uploads. Decision in ADR-0021. The brief lists these three as if they were peers; they are not. Disambiguating them is the first job of this doc.

1. Three words, three different layers (stop conflating them)

Term	What it actually is	Layer	Set by
Chunked	splitting content into content-defined dedup units (~1 MiB)	logical / data model	BitVault CDC (02)
Multipart	a provider transfer mechanism for one large object, in parts ≥5 MiB	physical / transfer	provider (S3 multipart)
Resumable	a property: survive interruption without re-sending received bytes	behavior	achieved via tracking (parts, offset, or committed chunks)

A single upload can be all three at once: a 10 GiB file is chunked into ~10k dedup chunks (logical), transferred to the provider via multipart parts of 64 MiB (physical), and the whole thing is resumable because we persist progress. They are orthogonal. The rest of this doc designs each layer and how they compose.
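As a rough check of the numbers in that example, the sketch below computes both counts. It assumes an average chunk of exactly 1 MiB; real CDC chunk sizes vary, so the chunk count is only approximate.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

file_size = 10 * GIB
avg_chunk = 1 * MIB    # assumed average CDC chunk size (logical layer)
part_size = 64 * MIB   # multipart part size from the example (physical layer)

# Logical layer: number of dedup chunks, the "~10k" in the text.
dedup_chunks = file_size // avg_chunk

# Physical layer: number of multipart parts (ceiling division).
transfer_parts = -(-file_size // part_size)
```

The two counts differ by almost two orders of magnitude, which is one way to see that chunks and parts are separate layers rather than two names for the same unit.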

2. Upload modes (BitVault supports two; pick by client capability)

flowchart TB
    classDef d fill:#fde68a,stroke:#b45309,color:#111827;
    start{"Client type?"}:::d
    start -->|"smart (CLI, sync, mobile)"| smart["Mode A: client-side chunked + dedup<br/>(delta sync; upload only new chunks)"]:::d
    start -->|"simple (browser, 3rd-party, large single file)"| big["Mode B: whole-object transfer<br/>(provider multipart OR tus) + async server-side chunking"]:::d
    smart --> commit["Commit manifest (verify → write → outbox)"]:::d
    big --> commit

Mode A — client-side chunked upload (the dedup/delta-sync path)

The high-value path: CDC + dedup happen on the client, so unchanged data never leaves the device.

1. Client CDC-chunks, hashes (BLAKE3), NegotiateChunks → server returns missing subset (tenant-scoped, 03).
2. Client uploads only missing chunks, direct-to-storage via presigned PUTs.
   - Small chunks → one presigned PUT each (batched issuance).
   - A large chunk (rare with 4 MiB cap) or a batch can use provider multipart.
3. Commit(manifest) → server verifies all chunks present + checksums (SI-1), writes manifest, refs++, emits outbox event.

Resumability is intrinsic: committed chunks are durable; an interrupted upload resumes by re-running NegotiateChunks — already-uploaded chunks are now “present” and skipped. No special resume protocol needed; content addressing gives resumability for free.
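The resume-by-renegotiation behavior can be sketched with an in-memory stand-in for the server. Everything here is illustrative: `FakeServer`, `upload`, and `fail_after` are hypothetical names, presigned PUTs are replaced by a direct method call, and `hashlib.blake2b` stands in for BLAKE3, which is not in the Python standard library.

```python
import hashlib

def chunk_id(data: bytes) -> str:
    # Stand-in hash; the design specifies BLAKE3.
    return hashlib.blake2b(data, digest_size=32).hexdigest()

class FakeServer:
    """Hypothetical in-memory stand-in for the tenant-scoped chunk index."""
    def __init__(self):
        self.chunks = {}      # chunk_id -> bytes, durable once uploaded
        self.manifests = []

    def negotiate_chunks(self, ids):
        # Return only the ids the server does not already hold.
        return [i for i in ids if i not in self.chunks]

    def put_chunk(self, cid, data):
        if chunk_id(data) != cid:
            raise ValueError("checksum mismatch")
        self.chunks[cid] = data

    def commit(self, manifest):
        missing = [i for i in manifest if i not in self.chunks]
        if missing:
            return 409, missing
        self.manifests.append(list(manifest))
        return 200, []

def upload(server, chunks, fail_after=None):
    """Upload only missing chunks; optionally simulate a drop after N PUTs."""
    by_id = {chunk_id(c): c for c in chunks}
    manifest = [chunk_id(c) for c in chunks]
    sent = 0
    for cid in server.negotiate_chunks(list(by_id)):
        if fail_after is not None and sent == fail_after:
            raise ConnectionError("simulated interruption")
        server.put_chunk(cid, by_id[cid])
        sent += 1
    status, _ = server.commit(manifest)
    return status, sent

server = FakeServer()
chunks = [b"alpha", b"beta", b"gamma", b"alpha"]  # one duplicate chunk
try:
    upload(server, chunks, fail_after=2)          # interrupted after 2 PUTs
except ConnectionError:
    pass
status, sent = upload(server, chunks)             # resume: plain re-run
```

The second call is an ordinary upload, not a special resume path: negotiation reports only the one chunk that never arrived, so only that chunk is sent before the commit succeeds.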

Mode B — whole-object transfer (the simple/large-file path)

For clients that can’t or shouldn’t chunk (browsers, third-party tools, opaque large files). The object is transferred whole, then chunked/deduped server-side, asynchronously.

Two transfer mechanisms, chosen by environment:

B1 — direct-to-provider multipart (preferred): client uploads parts straight to the provider via presigned part URLs; bytes never touch our compute (ADR-0011). Resumable via the provider’s ListParts.
B2 — tus (proxied): client uploads to the gateway using the tus protocol (HTTP PATCH + Upload-Offset); the gateway streams to the provider. For browsers/CORS/firewall situations where presigned multipart is impractical. Resumable via tus HEAD → Upload-Offset.
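For B1, resuming reduces to comparing the planned parts with what the provider reports. The sketch below assumes the client persisted the part size it chose when the upload started, and that the `ListParts` response has already been reduced to a `{part_number: size}` mapping; `plan_parts` and `missing_parts` are hypothetical helper names.

```python
MIB = 1024 ** 2

def plan_parts(file_size, part_size):
    """Return {part_number: (offset, length)} with 1-based part numbers."""
    plan = {}
    offset, number = 0, 1
    while offset < file_size:
        length = min(part_size, file_size - offset)
        plan[number] = (offset, length)
        offset += length
        number += 1
    return plan

def missing_parts(file_size, part_size, listed):
    """listed: {part_number: size} as reported by the provider's ListParts.

    A part counts as done only if it is listed with the expected size;
    anything else is scheduled for (re-)upload.
    """
    plan = plan_parts(file_size, part_size)
    return {n: rng for n, rng in plan.items() if listed.get(n) != rng[1]}

# A 200 MiB object in 64 MiB parts: parts 1 and 2 landed before the drop.
listed = {1: 64 * MIB, 2: 64 * MIB}
todo = missing_parts(200 * MIB, 64 * MIB, listed)
```

Here `todo` contains parts 3 and 4, with part 4 being the short 8 MiB tail. A production client would likely compare checksums or ETags as well, not only sizes.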

After the object lands in staging, an ingest worker CDC-chunks it, dedups against the tenant index, writes the manifest, and schedules packing. The user’s file is available immediately (from the whole staged object); dedup/packing settle asynchronously.

3. Provider multipart: the limits that shape part sizing

S3 (and S3-compatible MinIO/R2) multipart hard limits — load-bearing:

Limit	Value
Min part size	5 MiB (last part may be smaller)
Max part size	5 GiB
Max parts per upload	10,000
Max object size	5 TiB

Part-sizing math (adaptive): part_size = clamp(ceil(file_size / 10000), 5 MiB, 5 GiB), rounded up to a nice boundary. This is the smallest part size that fits the 10,000-part limit, not necessarily the size we pick. A 5 TiB file needs parts of at least ~525 MiB (5 TiB / 10,000 ≈ 524.3 MiB; 512 MiB parts would top out at about 4.88 TiB). A 100 MiB file is nowhere near the part limit, so it uses small parts (8–16 MiB) that keep retries cheap. The adapter exposes provider-specific caps; Placement picks a provider that can hold the object (09). Azure uses block blobs (≤50k blocks, ≤~4000 MiB each); the adapter maps part→block.
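A minimal sketch of the part-sizing rule, using the S3 limits from the table above. The `preferred` floor of 8 MiB and the 1 MiB rounding boundary are assumptions chosen for illustration; the adapter's real defaults may differ.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4

MIN_PART = 5 * MIB
MAX_PART = 5 * GIB
MAX_PARTS = 10_000
MAX_OBJECT = 5 * TIB

def ceil_div(a, b):
    return -(-a // b)

def part_size(file_size, preferred=8 * MIB, boundary=MIB):
    """Pick a part size that fits MAX_PARTS and respects the provider clamp."""
    if file_size > MAX_OBJECT:
        raise ValueError("object exceeds the provider's maximum object size")
    needed = ceil_div(file_size, MAX_PARTS)       # smallest size that fits
    size = max(needed, preferred, MIN_PART)       # never go below the floor
    size = ceil_div(size, boundary) * boundary    # round up to a boundary
    return min(size, MAX_PART)

big = part_size(5 * TIB)        # 525 MiB
small = part_size(100 * MIB)    # 8 MiB (the assumed preferred floor)
big_parts = ceil_div(5 * TIB, big)
small_parts = ceil_div(100 * MIB, small)
```

For the 5 TiB case the result is 525 MiB, which yields 9,987 parts and so stays under the 10,000-part limit; the 100 MiB case lands on the preferred floor and produces 13 parts.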

Chunk ≠ part. Our dedup chunks (~1 MiB average, 4 MiB cap) all fall below the 5 MiB multipart minimum. That is fine and expected: multipart parts are a transfer batching of the staged whole object (Mode B) or of a large blob; dedup chunks are logical units extracted from content. In Mode A we usually PUT chunks directly (no multipart) and pack them later.

4. The commit protocol (defeats dual-write — SI-1)

Every mode ends in the same commit, the heart of correctness (refines 06 §4 high-level upload):

sequenceDiagram
    autonumber
    participant C as Client
    participant S as Storage Coordinator
    participant O as Object Store
    participant DB as Postgres (Chunk/Manifest Index + Outbox)
    Note over C,O: bytes already in STAGING (Mode A chunks or Mode B object)
    C->>S: Commit(version, manifest|object_ref)
    S->>O: Head each chunk/part — size + checksum
    O-->>S: verified ✔
    alt all present & valid
        Note over S,DB: BEGIN TX<br/>upsert chunks (ref++ via edge rows)<br/>insert manifest<br/>insert outbox(NodeChanged/BlobCommitted)<br/>COMMIT
        S-->>C: 200 committed
    else missing/corrupt
        S-->>C: 409 — re-upload listed chunks/parts
    end

Invariant: a manifest is durable only after all its bytes are verified present. A crash before commit leaves only staging bytes (refcount 0) → reclaimed by GC (11). No dangling references, ever.
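The verify-then-transact shape of the commit can be sketched with `sqlite3` standing in for Postgres and a dict standing in for the object store. Table names, the `commit` signature, and the use of SHA-256 as the chunk checksum are all assumptions for illustration, not the actual schema.

```python
import hashlib
import sqlite3

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()   # stand-in checksum

def open_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE chunk_refs (manifest_id TEXT, chunk_id TEXT,
                                 PRIMARY KEY (manifest_id, chunk_id));
        CREATE TABLE manifests (manifest_id TEXT PRIMARY KEY, chunk_ids TEXT);
        CREATE TABLE outbox (id INTEGER PRIMARY KEY, event TEXT, manifest_id TEXT);
    """)
    return db

def commit(db, staging, manifest_id, chunk_ids):
    """staging: chunk_id -> bytes. Verify first, then write in one transaction."""
    bad = [c for c in chunk_ids
           if c not in staging or digest(staging[c]) != c]
    if bad:
        return 409, bad                       # nothing has been written
    with db:                                  # commits, or rolls back on error
        db.executemany("INSERT OR IGNORE INTO chunk_refs VALUES (?, ?)",
                       [(manifest_id, c) for c in chunk_ids])
        db.execute("INSERT INTO manifests VALUES (?, ?)",
                   (manifest_id, ",".join(chunk_ids)))
        db.execute("INSERT INTO outbox (event, manifest_id) VALUES (?, ?)",
                   ("BlobCommitted", manifest_id))
    return 200, []

db = open_db()
a, b = b"chunk-a", b"chunk-b"
ids = [digest(a), digest(b)]
staging = {ids[0]: a}                          # second chunk not uploaded yet
first = commit(db, staging, "m1", ids)
rows_after_reject = db.execute("SELECT COUNT(*) FROM manifests").fetchone()[0]
staging[ids[1]] = b
second = commit(db, staging, "m1", ids)
outbox_rows = db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
```

The rejected attempt leaves no manifest, reference, or outbox row behind, and the manifest and its outbox event only ever appear together because they share one transaction.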

5. Resumability — how each path recovers

Path	Progress tracked by	Resume action	Cleanup of abandoned
Mode A (chunked)	committed chunks in tenant index	re-`NegotiateChunks`; skip present	staging chunks ref=0 → GC after TTL
Mode B1 (provider multipart)	provider-side uploaded parts	`ListParts` → upload missing parts → complete	provider lifecycle aborts stale MPU (e.g. 7d) + our reconcile
Mode B2 (tus)	server `Upload-Offset`	`HEAD` → `PATCH` from offset	staging object TTL → GC
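The tus row can be illustrated with an in-memory model of the offset handshake. `FakeTusUpload` and `send` are hypothetical; no HTTP is involved, and the method names merely mirror the `HEAD` and `PATCH` requests of the protocol. The 409 on offset mismatch follows the tus core specification.

```python
class FakeTusUpload:
    """In-memory stand-in for one tus upload resource on the gateway."""
    def __init__(self):
        self.data = bytearray()

    def head(self):
        # Models HEAD: the server reports its current Upload-Offset.
        return len(self.data)

    def patch(self, offset, body):
        # Models PATCH: the offset must match what the server already holds.
        if offset != len(self.data):
            return 409
        self.data += body
        return 204

def send(upload, payload, chunk=4, drop_after=None):
    """Send payload from the server's offset; optionally simulate a drop."""
    offset = upload.head()
    patches = 0
    while offset < len(payload):
        if drop_after is not None and patches == drop_after:
            raise ConnectionError("simulated interruption")
        body = payload[offset:offset + chunk]
        if upload.patch(offset, body) != 204:
            offset = upload.head()      # resynchronize and retry
            continue
        offset += len(body)
        patches += 1
    return patches

payload = b"0123456789abcdef"
up = FakeTusUpload()
try:
    send(up, payload, drop_after=2)     # 8 bytes land, then the link drops
except ConnectionError:
    pass
offset_before_resume = up.head()
patches_on_resume = send(up, payload)   # continues from the server's offset
```

The client never tracks progress itself; it asks the server for the offset and continues from there, so bytes already received are not re-sent.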

Abandoned-upload reclamation is mandatory at scale: incomplete multipart uploads accrue storage cost silently on S3 until aborted. We set provider lifecycle rules to auto-abort stale MPUs and run a reconciler that aborts/deletes staging older than the TTL — belt and suspenders, because lifecycle rules differ per provider.
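Both halves of that approach can be sketched briefly. The dictionary below follows the shape S3 accepts for a bucket lifecycle configuration, with the 7-day value taken from the table above as an example. The reconciler half is a hypothetical helper that picks stale uploads from a listing shaped like the provider's ListMultipartUploads result; actually aborting them is left out.

```python
from datetime import datetime, timedelta, timezone

# Lifecycle rule: the provider aborts incomplete multipart uploads itself.
LIFECYCLE = {
    "Rules": [{
        "ID": "abort-stale-mpu",        # rule name is arbitrary
        "Status": "Enabled",
        "Filter": {"Prefix": ""},       # apply to the whole bucket
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
    }]
}

def stale_uploads(uploads, now, ttl=timedelta(days=7)):
    """Return upload IDs whose initiation time is older than the TTL.

    uploads: iterable of dicts with "UploadId" and "Initiated" (aware datetime).
    """
    return [u["UploadId"] for u in uploads if now - u["Initiated"] > ttl]

now = datetime(2024, 1, 10, tzinfo=timezone.utc)
listing = [
    {"UploadId": "old", "Initiated": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"UploadId": "new", "Initiated": datetime(2024, 1, 8, tzinfo=timezone.utc)},
]
to_abort = stale_uploads(listing, now)
```

Running the reconciler with the same TTL as the lifecycle rule keeps the two mechanisms consistent on providers where the rule is honored, and covers the ones where it is not.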

6. Backpressure, ordering & integrity on the upload path

Backpressure: presigned direct-to-storage means the provider absorbs upload throughput, not our compute — the data plane scales independently (ADR-0011). The control plane only does negotiate/commit (small, fast). tus (B2) is the one path that streams through compute → it is rate-limited and memory-bounded (fixed buffer, stream-through, never buffer the whole file).
Parallelism: chunks/parts upload concurrently with a bounded pool; commit is the single serialization point.
Integrity: every chunk/part carries a checksum verified at commit (04); the provider also validates its native checksum on PUT.
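Two of these points lend themselves to a short sketch: a bounded worker pool for parallel PUTs, and a fixed-buffer copy for the tus path. The pool size, buffer size, and the `fake_put` function are assumptions for illustration; a real `put` would issue the presigned request.

```python
import io
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def upload_all(items, put, max_workers=4):
    """Run put() over items with at most max_workers in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(put, items))   # results keep input order

def stream_through(src, dst, buf_size=64 * 1024):
    """Copy src to dst holding at most buf_size bytes in memory at once."""
    total = 0
    while True:
        buf = src.read(buf_size)
        if not buf:
            return total
        dst.write(buf)
        total += len(buf)

stats = {"in_flight": 0, "peak": 0}
lock = threading.Lock()

def fake_put(part_number):
    # Records peak concurrency so the bound can be observed.
    with lock:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
    time.sleep(0.01)                        # pretend to transfer bytes
    with lock:
        stats["in_flight"] -= 1
    return part_number

results = upload_all(range(1, 21), fake_put, max_workers=4)
src, dst = io.BytesIO(b"x" * 200_000), io.BytesIO()
copied = stream_through(src, dst)
```

Commit would run only after `upload_all` returns, which matches the single serialization point described above.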

7. Tradeoffs / Alternatives / Scaling

Tradeoffs.

Mode A maximizes bandwidth savings and gives free resumability + delta sync, but requires a capable client and must follow the dedup side-channel discipline (per-tenant, 03). Mode B is universal and simple but moves full bytes and defers dedup to async ingest (the staged object occupies its full 1× size until packing/dedup runs).
Small files: we do not chunk files below a threshold (somewhere in the 1–4 MiB range); they are stored whole (one chunk = one object). Chunking tiny files wastes CPU and bloats the index for no dedup gain, and storing them whole keeps the browser download path trivial (06).

Alternatives considered.

Always proxy uploads through our API (no presign): simplest client, but the R5/R12 cost disaster — rejected except as the bounded tus fallback (B2).
Only provider multipart, no client chunking: loses delta sync and cross-version dedup — rejected; Mode A is core value.
Client-side global dedup negotiation: the Harnik side channel — rejected (per-tenant only, ADR-0018).

Scaling concerns.

NegotiateChunks fan-out: batched (N hashes per call) + Bloom pre-filter (03) to keep index reads down at high upload concurrency.
Presigned URL issuance is cheap and stateless → scales with control-plane replicas; URLs are short-TTL, single-key, single-method (ADR-0011).
Staging churn: millions of in-flight uploads ⇒ aggressive staging TTL + GC + provider lifecycle abort, or storage cost balloons.
Hot tenants: per-tenant rate limits on negotiate/commit protect the shared index; the data plane is unaffected (provider-side).
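The Bloom pre-filter idea can be sketched as follows: a hash the filter has definitely not seen is reported missing without touching the index, and only "maybe present" hashes cost an index read. The filter parameters, the `Bloom` class, and the `negotiate` helper are illustrative assumptions, with a Python set standing in for the tenant index.

```python
import hashlib

class Bloom:
    def __init__(self, bits=1 << 16, hashes=4):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, key):
        # Derive independent bit positions by salting one hash function.
        for i in range(self.hashes):
            h = hashlib.blake2b(key.encode(), digest_size=8,
                                salt=bytes([i])).digest()
            yield int.from_bytes(h, "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.array[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.array[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

def negotiate(batch, bloom, index):
    """Return (missing_ids, index_reads) for one batched call."""
    missing, reads = [], 0
    for cid in batch:
        if not bloom.might_contain(cid):
            missing.append(cid)          # definitely absent: skip the index
            continue
        reads += 1                       # maybe present: confirm in the index
        if cid not in index:
            missing.append(cid)          # Bloom false positive
    return missing, reads

index = {"c1", "c2"}                     # chunks the tenant already has
bloom = Bloom()
for cid in index:
    bloom.add(cid)

missing, reads = negotiate(["c1", "c2", "c3", "c4"], bloom, index)
```

The result is always correct regardless of false positives, because the index remains the source of truth; the filter only reduces how often it is consulted.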

References

S3 multipart limits: https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html
Aborting incomplete multipart uploads (cost!): https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpu-abort-incomplete-mpu-lifecycle-config.html
tus resumable upload protocol: https://tus.io/protocols/resumable-upload
Azure block blobs: https://learn.microsoft.com/azure/storage/blobs/storage-blobs-introduction