05 — Uploads: Multipart vs Chunked vs Resumable
Topics: multipart uploads, chunked uploads, resumable uploads. Decision in ADR-0021. The brief lists these three as if peers; they are not. Disambiguating them is the first job of this doc.
1. Three words, three different layers (stop conflating them)
| Term | What it actually is | Layer | Set by |
|---|---|---|---|
| Chunked | splitting content into content-defined dedup units (~1 MiB) | logical / data model | BitVault CDC (02) |
| Multipart | a provider transfer mechanism for one large object, in parts ≥5 MiB | physical / transfer | provider (S3 multipart) |
| Resumable | a property: survive interruption without re-sending received bytes | behavior | achieved via tracking (parts, offset, or committed chunks) |
A single upload can be all three at once: a 10 GiB file is chunked into ~10k dedup chunks (logical), transferred to the provider via multipart parts of 64 MiB (physical), and the whole thing is resumable because we persist progress. They are orthogonal. The rest of this doc designs each layer and how they compose.
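The arithmetic behind the 10 GiB example can be checked in a few lines. This is a rough sketch: it treats ~1 MiB as an exact average chunk size, whereas real content-defined chunk sizes vary.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

file_size = 10 * GIB
avg_chunk = 1 * MIB    # assumed average; CDC chunk sizes vary around this
part_size = 64 * MIB   # transfer part size used in the example above

dedup_chunks = file_size // avg_chunk        # logical layer
transfer_parts = -(-file_size // part_size)  # physical layer (ceiling division)

print(dedup_chunks, transfer_parts)  # 10240 160
```

The two counts differ by a factor of 64 because they measure different layers of the same upload.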
2. Upload modes (BitVault supports two; pick by client capability)
flowchart TB
classDef d fill:#fde68a,stroke:#b45309,color:#111827;
start{"Client type?"}:::d
start -->|"smart (CLI, sync, mobile)"| smart["Mode A: client-side chunked + dedup<br/>(delta sync; upload only new chunks)"]:::d
start -->|"simple (browser, 3rd-party, large single file)"| big["Mode B: whole-object transfer<br/>(provider multipart OR tus) + async server-side chunking"]:::d
smart --> commit["Commit manifest (verify → write → outbox)"]:::d
big --> commit
Mode A — client-side chunked upload (the dedup/delta-sync path)
The high-value path: CDC + dedup happen on the client, so unchanged data never leaves the device.
- Client CDC-chunks the file, hashes each chunk (BLAKE3), and calls NegotiateChunks → server returns the missing subset (tenant-scoped, 03).
- Client uploads only missing chunks, direct-to-storage via presigned PUTs.
  - Small chunks → one presigned PUT each (batched issuance).
  - A large chunk (rare with 4 MiB cap) or a batch can use provider multipart.
- Commit(manifest) → server verifies all chunks present + checksums (SI-1), writes manifest, refs++, emits outbox event.
Resumability is intrinsic: committed chunks are durable; an interrupted upload
resumes by re-running NegotiateChunks — already-uploaded chunks are now “present”
and skipped. No special resume protocol needed; content addressing gives
resumability for free.
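A minimal sketch of why re-running the negotiation is enough to resume. Everything here is a stand-in: FakeServer, its method names, and the fail_after switch are hypothetical, and BLAKE2b from the standard library substitutes for BLAKE3.

```python
import hashlib

class FakeServer:
    """In-memory stand-in for the tenant chunk index and manifest store (hypothetical)."""
    def __init__(self):
        self.chunks = {}     # hash -> bytes; plays the role of durable chunks
        self.manifests = {}  # version -> ordered list of chunk hashes

    def negotiate_chunks(self, hashes):
        # Return only the hashes the server does not hold yet, without duplicates.
        return [h for h in dict.fromkeys(hashes) if h not in self.chunks]

    def put_chunk(self, h, data):
        # Stands in for a presigned PUT straight to object storage.
        self.chunks[h] = data

    def commit(self, version, hashes):
        if any(h not in self.chunks for h in hashes):
            return False
        self.manifests[version] = list(hashes)
        return True

def chunk_hash(data):
    # Assumption: BLAKE2b substitutes for BLAKE3, which is not in the standard library.
    return hashlib.blake2b(data, digest_size=32).hexdigest()

def upload(server, version, chunks, fail_after=None):
    """Send only missing chunks, then commit; fail_after simulates an interruption."""
    hashes = [chunk_hash(c) for c in chunks]
    by_hash = dict(zip(hashes, chunks))
    sent = 0
    for h in server.negotiate_chunks(hashes):
        if fail_after is not None and sent >= fail_after:
            raise ConnectionError("simulated interruption")
        server.put_chunk(h, by_hash[h])
        sent += 1
    return sent, server.commit(version, hashes)

server = FakeServer()
chunks = [b"alpha", b"beta", b"gamma", b"delta"]
try:
    upload(server, "v1", chunks, fail_after=2)  # dies after two chunks
except ConnectionError:
    pass
sent, committed = upload(server, "v1", chunks)  # resume is a plain re-run
print(sent, committed)  # 2 True
```

The second call sends only the two chunks that never arrived; there is no resume token or offset to persist on the client.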
Mode B — whole-object transfer (the simple/large-file path)
For clients that can’t or shouldn’t chunk (browsers, third-party tools, opaque large files). The object is transferred whole, then chunked/deduped server-side, asynchronously.
Two transfer mechanisms, chosen by environment:
- B1 (direct-to-provider multipart, preferred): client uploads parts straight to the provider via presigned part URLs; bytes never touch our compute (ADR-0011). Resumable via the provider's ListParts.
- B2 (tus, proxied): client uploads to the gateway using the tus protocol (HTTP PATCH + Upload-Offset); the gateway streams to the provider. For browsers/CORS/firewall situations where presigned multipart is impractical. Resumable via tus HEAD → Upload-Offset.
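The B2 resume handshake can be sketched with an in-memory stand-in. FakeTusUpload is hypothetical and speaks no HTTP; it only mirrors two rules of the tus protocol: HEAD reports the server's Upload-Offset, and a PATCH whose Upload-Offset does not match is rejected with 409 Conflict.

```python
class FakeTusUpload:
    """In-memory stand-in for one tus upload resource (hypothetical, no HTTP)."""
    def __init__(self):
        self.data = bytearray()

    def head(self):
        # tus HEAD: report how many bytes the server has stored (Upload-Offset).
        return len(self.data)

    def patch(self, offset, body):
        # tus PATCH: the client's Upload-Offset must equal the server's.
        if offset != len(self.data):
            raise ValueError("409 Conflict: offset mismatch")
        self.data.extend(body)
        return len(self.data)

def resume(upload, payload, piece=4):
    """Ask the server where it stopped, then send the rest in small pieces."""
    offset = upload.head()
    while offset < len(payload):
        offset = upload.patch(offset, payload[offset:offset + piece])
    return offset

payload = b"0123456789"
up = FakeTusUpload()
up.patch(0, payload[:6])            # first attempt got six bytes through
final_offset = resume(up, payload)  # second attempt continues from offset 6
print(final_offset, bytes(up.data) == payload)  # 10 True
```

The client keeps no state between attempts; the server's offset is the single source of truth.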
After the object lands in staging, an ingest worker CDC-chunks it, dedups against the tenant index, writes the manifest, and schedules packing. The user’s file is available immediately (from the whole staged object); dedup/packing settle asynchronously.
3. Provider multipart: the limits that shape part sizing
S3 (and S3-compatible MinIO/R2) multipart hard limits — load-bearing:
| Limit | Value |
|---|---|
| Min part size | 5 MiB (last part may be smaller) |
| Max part size | 5 GiB |
| Max parts per upload | 10,000 |
| Max object size | 5 TiB |
Part-sizing math (adaptive): part_size = clamp(ceil(file_size / 10000), 5 MiB, 5 GiB), rounded up to a nice boundary. The clamp yields the smallest part size that satisfies the limits; small files can use a larger one. A 5 TiB file needs parts of at least ~525 MiB to fit in 10k parts (5 TiB / 10,000 ≈ 524.3 MiB; 512 MiB parts would need 10,240 parts); a 100 MiB file uses 8–16 MiB parts for good retry granularity. The adapter exposes provider-specific caps; Placement picks a provider that can hold the object (09). Azure uses block blobs (≤50k blocks, ≤~4000 MiB each); the adapter maps part→block.
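A sketch of the part-sizing rule under the S3 limits in the table above. The preferred floor of 8 MiB and the 1 MiB rounding boundary are assumptions chosen for illustration, not values fixed by this design.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4

MIN_PART = 5 * MIB
MAX_PART = 5 * GIB
MAX_PARTS = 10_000
MAX_OBJECT = 5 * TIB

def part_size(file_size, preferred=8 * MIB, boundary=MIB):
    """Smallest boundary-aligned part size that fits file_size into MAX_PARTS parts.

    preferred and boundary are illustrative assumptions.
    """
    if not 0 < file_size <= MAX_OBJECT:
        raise ValueError("file size outside provider limits")
    needed = -(-file_size // MAX_PARTS)      # ceil(file_size / 10000)
    size = max(needed, preferred, MIN_PART)  # lower clamp plus retry-friendly floor
    size = -(-size // boundary) * boundary   # round up to the boundary
    return min(size, MAX_PART)               # upper clamp

def part_count(file_size):
    return -(-file_size // part_size(file_size))

print(part_size(5 * TIB) // MIB, part_count(5 * TIB))      # 525 9987
print(part_size(100 * MIB) // MIB, part_count(100 * MIB))  # 8 13
```

For a 5 TiB object the division gives about 524.3 MiB, which rounds up to 525 MiB and yields 9,987 parts, just under the 10,000 cap.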
Chunk ≠ part. Our 1 MiB dedup chunks are far below the 5 MiB multipart minimum. That is fine and expected: multipart parts are a transfer batching of the staged whole object (Mode B) or of a large blob; dedup chunks are logical units extracted from content. In Mode A we usually PUT chunks directly (no multipart) and pack them later.
4. The commit protocol (defeats dual-write — SI-1)
Every mode ends in the same commit, the heart of correctness (refines 06 §4 high-level upload):
sequenceDiagram
autonumber
participant C as Client
participant S as Storage Coordinator
participant O as Object Store
participant DB as Postgres (Chunk/Manifest Index + Outbox)
Note over C,O: bytes already in STAGING (Mode A chunks or Mode B object)
C->>S: Commit(version, manifest|object_ref)
S->>O: Head each chunk/part — size + checksum
O-->>S: verified ✔
alt all present & valid
Note over S,DB: BEGIN TX<br/>upsert chunks (ref++ via edge rows)<br/>insert manifest<br/>insert outbox(NodeChanged/BlobCommitted)<br/>COMMIT
S-->>C: 200 committed
else missing/corrupt
S-->>C: 409 — re-upload listed chunks/parts
end
Invariant: a manifest is durable only after all its bytes are verified present. A crash before commit leaves only staging bytes (refcount 0) → reclaimed by GC (11). No dangling references, ever.
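The verify-then-write ordering can be sketched with SQLite standing in for Postgres and a dict standing in for the object store. The table names, the staging dict, and the use of BLAKE2b as the checksum are all assumptions made for illustration.

```python
import hashlib
import sqlite3

SCHEMA = """
CREATE TABLE chunks    (hash TEXT PRIMARY KEY, size INTEGER NOT NULL);
CREATE TABLE edges     (version TEXT, idx INTEGER, hash TEXT, PRIMARY KEY (version, idx));
CREATE TABLE manifests (version TEXT PRIMARY KEY);
CREATE TABLE outbox    (id INTEGER PRIMARY KEY, event TEXT, version TEXT);
"""

def digest(data):
    # Assumption: BLAKE2b stands in for the real chunk checksum.
    return hashlib.blake2b(data, digest_size=32).hexdigest()

def commit(db, staging, version, manifest):
    """manifest: list of (hash, size). staging: dict hash -> bytes (stands in for HEAD)."""
    bad = [h for h, size in manifest
           if h not in staging or len(staging[h]) != size or digest(staging[h]) != h]
    if bad:
        return 409, bad  # nothing written; client re-uploads the listed chunks
    with db:  # one transaction: commits on success, rolls back on any exception
        for idx, (h, size) in enumerate(manifest):
            db.execute("INSERT OR IGNORE INTO chunks (hash, size) VALUES (?, ?)", (h, size))
            db.execute("INSERT INTO edges (version, idx, hash) VALUES (?, ?, ?)", (version, idx, h))
        db.execute("INSERT INTO manifests (version) VALUES (?)", (version,))
        db.execute("INSERT INTO outbox (event, version) VALUES ('BlobCommitted', ?)", (version,))
    return 200, []

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
a, b = b"first chunk", b"second chunk"
manifest = [(digest(a), len(a)), (digest(b), len(b))]

staging = {digest(a): a}  # second chunk never arrived
status_before, missing = commit(db, staging, "v1", manifest)
staging[digest(b)] = b    # client re-uploads it
status_after, _ = commit(db, staging, "v1", manifest)
print(status_before, status_after)  # 409 200
```

Because the manifest row, the edge rows, and the outbox row share one transaction, a reader never sees a manifest without its event, and a failed verification leaves the database untouched.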
5. Resumability — how each path recovers
| Path | Progress tracked by | Resume action | Cleanup of abandoned |
|---|---|---|---|
| Mode A (chunked) | committed chunks in tenant index | re-NegotiateChunks; skip present | staging chunks ref=0 → GC after TTL |
| Mode B1 (provider multipart) | provider-side uploaded parts | ListParts → upload missing parts → complete | provider lifecycle aborts stale MPU (e.g. 7d) + our reconcile |
| Mode B2 (tus) | server Upload-Offset | HEAD → PATCH from offset | staging object TTL → GC |
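For the B1 row, the resume step reduces to comparing the planned parts against what the provider reports. In this sketch the uploaded mapping is a hypothetical simplification of a ListParts response (part number to size), and plan_resume is an illustrative name.

```python
MIB = 1024 ** 2

def plan_resume(file_size, part_size, uploaded):
    """Return (part_number, byte_offset, length) for every part still to send.

    uploaded: {part_number: size}, a simplified view of a ListParts response.
    """
    total_parts = -(-file_size // part_size)  # ceiling division
    todo = []
    for number in range(1, total_parts + 1):  # part numbers start at 1
        start = (number - 1) * part_size
        expected = min(part_size, file_size - start)  # last part may be shorter
        if uploaded.get(number) != expected:
            todo.append((number, start, expected))
    return todo

file_size = 20 * MIB + 3
uploaded = {1: 8 * MIB, 3: 4 * MIB + 3}  # part 2 was lost in the interruption
print(plan_resume(file_size, 8 * MIB, uploaded))  # [(2, 8388608, 8388608)]
```

Once the returned list is empty, the client can issue the completion call.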
Abandoned-upload reclamation is mandatory at scale: incomplete multipart uploads accrue storage cost silently on S3 until aborted. We set provider lifecycle rules to auto-abort stale MPUs and run a reconciler that aborts/deletes staging older than the TTL — belt and suspenders, because lifecycle rules differ per provider.
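Both halves of that approach can be sketched briefly. The dict follows the shape of an S3 lifecycle configuration with an AbortIncompleteMultipartUpload rule; the rule ID, the empty prefix, and the upload record shape used by stale_uploads are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER_DAYS = 7  # mirrors the "e.g. 7d" in the table; an example value

# With boto3 this dict would be passed as LifecycleConfiguration to
# put_bucket_lifecycle_configuration. Rule ID and prefix are assumptions.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "abort-stale-multipart-uploads",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": STALE_AFTER_DAYS},
        }
    ]
}

def stale_uploads(uploads, now, ttl=timedelta(days=STALE_AFTER_DAYS)):
    """Reconciler side: IDs of in-flight uploads older than the TTL (hypothetical records)."""
    return [u["upload_id"] for u in uploads if now - u["initiated"] > ttl]

now = datetime(2024, 1, 10, tzinfo=timezone.utc)
uploads = [
    {"upload_id": "old", "initiated": now - timedelta(days=9)},
    {"upload_id": "fresh", "initiated": now - timedelta(hours=2)},
]
print(stale_uploads(uploads, now))  # ['old']
```

The reconciler covers providers whose lifecycle support differs from S3; the lifecycle rule covers the case where the reconciler itself is down.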
6. Backpressure, ordering & integrity on the upload path
- Backpressure: presigned direct-to-storage means the provider absorbs upload throughput, not our compute — the data plane scales independently (ADR-0011). The control plane only does negotiate/commit (small, fast). tus (B2) is the one path that streams through compute → it is rate-limited and memory-bounded (fixed buffer, stream-through, never buffer the whole file).
- Parallelism: chunks/parts upload concurrently with a bounded pool; commit is the single serialization point.
- Integrity: every chunk/part carries a checksum verified at commit (04); the provider also validates its native checksum on PUT.
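The bounded-pool point can be illustrated with the standard library. fake_put is a hypothetical stand-in for a presigned PUT, and the pool size of 4 is an arbitrary example value.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

state = {"in_flight": 0, "peak": 0}
lock = threading.Lock()

def fake_put(part):
    """Stand-in for a presigned PUT; records how many uploads overlap."""
    with lock:
        state["in_flight"] += 1
        state["peak"] = max(state["peak"], state["in_flight"])
    time.sleep(0.01)  # simulated network time
    with lock:
        state["in_flight"] -= 1
    return len(part)

def upload_all(parts, put, max_workers=4):
    # The executor is the bounded pool: at most max_workers PUTs run at once.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(put, parts))  # results keep input order

parts = [bytes(n) for n in range(1, 21)]  # 20 parts of 1..20 bytes
sizes = upload_all(parts, fake_put)
# Commit would run only here, after every part has finished.
print(sum(sizes), state["peak"] <= 4)  # 210 True
```

Commit stays the single serialization point because it is called only after the pool has drained.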
7. Tradeoffs / Alternatives / Scaling
Tradeoffs.
- Mode A maximizes bandwidth savings and gives free resumability + delta sync, but requires a capable client and must keep dedup negotiation per-tenant to close the dedup side channel (03). Mode B is universal and simple but moves full bytes and defers dedup to async ingest, so the object temporarily occupies its full, undeduplicated size (1× storage) until packing/dedup runs.
- Small files: we do not chunk files below a threshold (e.g. ≤1–4 MiB); they are stored whole (one chunk = one object). Chunking tiny files wastes CPU and bloats the index for no dedup gain, and storing them whole makes the browser download path trivial (06).
Alternatives considered.
- Always proxy uploads through our API (no presign): simplest client, but the R5/R12 cost disaster — rejected except as the bounded tus fallback (B2).
- Only provider multipart, no client chunking: loses delta sync and cross-version dedup — rejected; Mode A is core value.
- Client-side global dedup negotiation: the Harnik side channel — rejected (per-tenant only, ADR-0018).
Scaling concerns.
- NegotiateChunks fan-out: batched (N hashes per call) + Bloom pre-filter (03) to keep index reads down at high upload concurrency.
- Presigned URL issuance is cheap and stateless → scales with control-plane replicas; URLs are short-TTL, single-key, single-method (ADR-0011).
- Staging churn: millions of in-flight uploads ⇒ aggressive staging TTL + GC + provider lifecycle abort, or storage cost balloons.
- Hot tenants: per-tenant rate limits on negotiate/commit protect the shared index; the data plane is unaffected (provider-side).
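A sketch of how batching and a Bloom pre-filter cut index reads during negotiation. The Bloom class, its sizing, and the batch size are illustrative assumptions; the real filter is described in 03.

```python
import hashlib

class Bloom:
    """Tiny Bloom filter: no false negatives, rare false positives (illustrative sizing)."""
    def __init__(self, bits=1 << 16, hashes=4):
        self.bits = bits
        self.hashes = hashes
        self.array = bytearray(bits // 8)

    def _positions(self, key):
        raw = hashlib.blake2b(key.encode(), digest_size=4 * self.hashes).digest()
        for i in range(self.hashes):
            yield int.from_bytes(raw[4 * i:4 * i + 4], "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.array[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.array[p // 8] & (1 << (p % 8)) for p in self._positions(key))

def negotiate(hashes, bloom, index, batch_size=3):
    """Return (missing hashes, number of batched index reads)."""
    missing, maybe = [], []
    for h in hashes:
        # A negative Bloom answer is definitive, so no index read is needed.
        (maybe if bloom.might_contain(h) else missing).append(h)
    reads = 0
    for i in range(0, len(maybe), batch_size):
        batch = maybe[i:i + batch_size]
        reads += 1  # one index round trip per batch
        missing.extend(h for h in batch if h not in index)
    return missing, reads

index = {"a", "b"}  # stand-in for the tenant chunk index
bloom = Bloom()
for h in index:
    bloom.add(h)
missing, reads = negotiate(["a", "b", "c", "d"], bloom, index)
print(sorted(missing))  # ['c', 'd']
```

The index is still consulted for every possible hit, so a Bloom false positive costs one extra lookup but never produces a wrong answer.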
References
- S3 multipart limits: https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html
- Aborting incomplete multipart uploads (cost!): https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpu-abort-incomplete-mpu-lifecycle-config.html
- tus resumable upload protocol: https://tus.io/protocols/resumable-upload
- Azure block blobs: https://learn.microsoft.com/azure/storage/blobs/storage-blobs-introduction