08 — Delta Sync, Incremental Uploads & Large Files
Topics: delta synchronization, incremental uploads. Answers: How should large files be synchronized? Reuses the storage subsystem’s content-addressed chunks, manifests, and per-tenant dedup (storage/02, storage/03, storage/05).
1. The principle: never move bytes you already have
Both sides represent a file as a manifest of content-addressed CDC chunks (storage/02). The delta between any two versions is simply the set difference of their chunk hashes. Each side computes it independently, with no rolling-checksum exchange (contrast rsync, 01 §6). Editing one region of a 1 GiB file re-chunks only that region (CDC is insert-robust), so typically one or two chunks move, not 1 GiB.
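As a minimal sketch of that set difference, assuming a manifest is just an ordered list of chunk-hash strings (the function name `chunks_to_send` and the sample hashes are illustrative, not part of the storage subsystem's API):

```python
from typing import List, Set


def chunks_to_send(old_manifest: List[str], new_manifest: List[str]) -> List[str]:
    """Return hashes present in new_manifest but absent from old_manifest.

    Order follows first appearance in new_manifest; repeated hashes are
    reported once, since a chunk only ever needs to move once.
    """
    have: Set[str] = set(old_manifest)
    seen: Set[str] = set()
    missing: List[str] = []
    for h in new_manifest:
        if h not in have and h not in seen:
            seen.add(h)
            missing.append(h)
    return missing


# One region edited: only the re-chunked hash differs.
old = ["h1", "h2", "h3", "h4"]
new = ["h1", "h2", "h3x", "h4"]
print(chunks_to_send(old, new))  # ['h3x']
```

Neither side needs the other's bytes to compute this; the hashes alone are enough.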
2. Incremental upload (delta push)
sequenceDiagram
autonumber
participant E as Engine
participant GW as Gateway
participant ST as Storage
participant O as Object storage
E->>E: stream file → FastCDC → BLAKE3 per chunk
E->>GW: NegotiateChunks([h1..hn]) (tenant-scoped, ADR-0018)
GW->>ST: which exist in tenant?
ST-->>GW: missing = [h3,h7]
GW-->>E: presigned PUT for [h3,h7] only
E-->>O: PUT h3, h7 (direct, resumable, ADR-0011/0021)
E->>GW: Commit(node, base_version, manifest=[h1..hn])
GW-->>E: committed (new version) → advance Synced
- Only new chunks upload → incremental by construction.
- Unchanged chunks (h1,h2,h4..) are already in the tenant (this file’s prior version, another file, another device) → not re-sent = dedup + delta in one step.
- Resumable: committed chunks persist; an interrupted upload resumes by re-negotiating (already-uploaded chunks now report present) (storage/05 §5).
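The negotiate, upload, commit sequence can be sketched against an in-memory stand-in. Everything named here is hypothetical: `InMemoryTenant` collapses Gateway, Storage, and object storage into one class, `blake2b` from the standard library stands in for BLAKE3, the chunks are passed in pre-split rather than produced by FastCDC, and the stale `base_version` check is an assumption about how `Commit` behaves.

```python
import hashlib
from typing import Dict, List, Tuple


def chunk_hash(data: bytes) -> str:
    # Stand-in hash: the design uses BLAKE3; blake2b is used only because it is in the stdlib.
    return hashlib.blake2b(data, digest_size=32).hexdigest()


class InMemoryTenant:
    """Hypothetical stand-in for Gateway + Storage + object storage of one tenant."""

    def __init__(self) -> None:
        self.chunks: Dict[str, bytes] = {}
        self.versions: Dict[str, List[List[str]]] = {}

    def negotiate(self, hashes: List[str]) -> List[str]:
        """Return the hashes this tenant does not hold yet, each listed once."""
        missing: List[str] = []
        for h in hashes:
            if h not in self.chunks and h not in missing:
                missing.append(h)
        return missing

    def put(self, h: str, data: bytes) -> None:
        if chunk_hash(data) != h:
            raise ValueError("chunk does not match its hash")
        self.chunks[h] = data

    def commit(self, node: str, base_version: int, manifest: List[str]) -> int:
        history = self.versions.setdefault(node, [])
        if base_version != len(history):  # assumed conflict rule
            raise RuntimeError("stale base_version")
        if any(h not in self.chunks for h in manifest):
            raise RuntimeError("manifest references a chunk that was never uploaded")
        history.append(list(manifest))
        return len(history)


def push(tenant: InMemoryTenant, node: str, base_version: int,
         chunks: List[bytes]) -> Tuple[int, int]:
    """Upload only the missing chunks, then commit. Returns (new_version, chunks_uploaded)."""
    by_hash = {chunk_hash(c): c for c in chunks}
    manifest = [chunk_hash(c) for c in chunks]
    missing = tenant.negotiate(manifest)
    for h in missing:
        tenant.put(h, by_hash[h])
    return tenant.commit(node, base_version, manifest), len(missing)


tenant = InMemoryTenant()
version, sent = push(tenant, "doc", 0, [b"alpha", b"beta", b"gamma"])    # (1, 3)
version2, sent2 = push(tenant, "doc", 1, [b"alpha", b"BETA", b"gamma"])  # (2, 1)

# Simulate an interrupted upload: one new chunk landed before the crash.
tenant.put(chunk_hash(b"delta"), b"delta")
version3, sent3 = push(tenant, "doc", 2, [b"alpha", b"delta", b"epsilon"])  # (3, 1)
```

The third call shows the resume path: re-negotiating reports `delta` as present, so only `epsilon` is uploaded.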
3. Incremental download (delta pull)
sequenceDiagram
autonumber
participant E as Engine
participant GW as Gateway
participant DB as Local chunk cache ([03])
participant O as Object storage
E->>GW: ResolveVersion(node) → manifest=[h1..hn]
GW-->>E: manifest
E->>DB: which chunks already local? (prior versions, other files)
DB-->>E: have [h1,h2,h4..] — missing [h3,h7]
E->>GW: presigned GET for [h3,h7] (range reads over packs, storage/06)
E-->>O: GET h3, h7
E->>E: reconstruct in temp from local + fetched → BLAKE3 verify
E->>E: fsync + atomic rename into place ([03], ADR-0027)
- The local chunk cache (03) means a download often fetches only the changed chunks and reconstructs the rest from disk — the same property that makes “move a file you already have on another device” nearly free.
- Reconstruction is atomic: temp file → fsync → rename; a partial file is never visible, and a crash leaves only a discardable temp.
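A minimal sketch of the temp file → fsync → rename step, under stated assumptions: `blake2b` stands in for BLAKE3, the whole-content check is a flat hash over the concatenated bytes (the real root in storage/04 may be structured differently), and `reconstruct` and `digest` are illustrative names. The chunk map plays the role of "local cache plus fetched chunks".

```python
import hashlib
import os
import tempfile
from typing import Dict, List


def digest(data: bytes) -> str:
    # Stand-in for BLAKE3 (not in the standard library).
    return hashlib.blake2b(data, digest_size=32).hexdigest()


def reconstruct(dest_path: str, manifest: List[str],
                chunks: Dict[str, bytes], expected_root: str) -> None:
    """Rebuild dest_path from chunks; the destination changes only on full success."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # The temp file lives in the destination directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".sync-", suffix=".tmp")
    try:
        whole = hashlib.blake2b(digest_size=32)
        with os.fdopen(fd, "wb") as tmp:
            for h in manifest:
                data = chunks[h]
                if digest(data) != h:
                    raise ValueError(f"chunk {h} failed verification")
                tmp.write(data)       # streamed chunk by chunk
                whole.update(data)
            if whole.hexdigest() != expected_root:
                raise ValueError("whole-content hash mismatch")
            tmp.flush()
            os.fsync(tmp.fileno())    # data is durable before it becomes visible
        os.replace(tmp_path, dest_path)  # atomic rename into place
    except BaseException:
        # Any failure leaves only a discardable temp file; remove it.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


with tempfile.TemporaryDirectory() as workdir:
    parts = [b"hello ", b"world"]
    store = {digest(p): p for p in parts}
    target = os.path.join(workdir, "out.bin")
    reconstruct(target, [digest(p) for p in parts], store, digest(b"hello world"))
    with open(target, "rb") as f:
        rebuilt = f.read()  # b"hello world"
```

A stricter variant would also fsync the containing directory after the rename so the new directory entry itself survives a crash; that step is omitted here.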
4. Large files specifically
| Concern | Design |
|---|---|
| Memory | stream chunk-by-chunk; never load the whole file (hash, upload, reconstruct all streaming) |
| Resumability | committed chunks are durable; resume = re-negotiate (storage/05) |
| Atomicity | reconstruct in a temp file, verify full content hash, atomic rename (03) |
| Still-being-written | a file whose size/mtime is changing is deferred (settle period, 04); don’t sync a half-written file |
| Append-heavy (logs) | CDC keeps earlier chunks stable; only tail chunks are new → cheap incremental |
| Bandwidth | bounded parallel chunk transfers + configurable throttle; pause/resume (10) |
| Verification | BLAKE3 per chunk + whole-content root (storage/04) |
| Provider limits | multipart for very large single objects (parts of 5 MiB to 5 GiB, at most 10,000 parts per object) (storage/05) |
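The provider limits in the last row imply a small piece of arithmetic when a single large object has to go up as multipart. The sketch below is illustrative only: `plan_parts` and the 8 MiB preferred part size are assumptions, and the limits are the ones quoted in the table.

```python
from typing import Tuple

MIB = 1024 ** 2
GIB = 1024 ** 3
MIN_PART = 5 * MIB     # smallest allowed part (the final part may be smaller)
MAX_PART = 5 * GIB     # largest allowed part
MAX_PARTS = 10_000     # maximum number of parts per object


def plan_parts(object_size: int, preferred_part: int = 8 * MIB) -> Tuple[int, int]:
    """Return (part_size, part_count) that satisfies the limits above."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    # Smallest part size that still fits the object into MAX_PARTS parts (ceiling division).
    needed = -(-object_size // MAX_PARTS)
    part_size = max(preferred_part, MIN_PART, needed)
    if part_size > MAX_PART:
        raise ValueError("object too large for a single multipart upload")
    part_count = -(-object_size // part_size)
    return part_size, part_count


print(plan_parts(1 * GIB))    # (8388608, 128)
print(plan_parts(100 * GIB))  # (10737419, 10000)
```

For a 100 GiB object the preferred 8 MiB part would need 12,800 parts, so the part size grows just enough to land on the 10,000-part ceiling.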
5. LAN / peer transfer (future, optional)
Because chunks are content-addressed and verifiable, a chunk may be fetched from any source, including a peer device on the same LAN, and proven correct by hash. A future LAN-sync mode would let co-located devices exchange chunks directly (as in Dropbox “LAN Sync”), cutting WAN egress and latency. The design allows for it (content addressing makes it safe), but it is not built in v1.
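Since LAN sync is not built, the following is only a sketch of why content addressing makes it safe: any source can be tried, and a chunk is accepted only if it hashes to the requested address. The names `fetch_verified`, `bad_peer`, and `cloud` are hypothetical, and `blake2b` again stands in for BLAKE3.

```python
import hashlib
from typing import Callable, List, Optional, Tuple

Fetcher = Callable[[str], Optional[bytes]]


def chunk_hash(data: bytes) -> str:
    # Stand-in for BLAKE3 (not in the standard library).
    return hashlib.blake2b(data, digest_size=32).hexdigest()


def fetch_verified(h: str, sources: List[Tuple[str, Fetcher]]) -> Tuple[str, bytes]:
    """Try sources in order; accept the first chunk whose hash matches h."""
    for name, fetch in sources:
        data = fetch(h)
        if data is not None and chunk_hash(data) == h:
            return name, data
        # A miss or a corrupt/malicious reply is simply skipped.
    raise LookupError(f"no source returned a valid chunk for {h}")


payload = b"payload"
wanted = chunk_hash(payload)


def bad_peer(h: str) -> Optional[bytes]:
    return b"garbage"  # a peer that answers with the wrong bytes


def cloud(h: str) -> Optional[bytes]:
    return {wanted: payload}.get(h)


source, data = fetch_verified(wanted, [("lan-peer", bad_peer), ("object-storage", cloud)])
print(source)  # object-storage
```

Listing peers before object storage is what would cut WAN egress; the hash check is what keeps an untrusted peer from corrupting a file.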
6. Tradeoffs / Alternatives / Scaling
Tradeoffs. Delta sync adds the chunk index + manifest indirection and a negotiate round-trip per upload; the bandwidth savings on real workloads (edits, re-saves, shared files, moves) dwarf it. Chunk size trades dedup/delta granularity against metadata cardinality — settled at ~1 MiB CDC (storage/02, ADR-0017).
Alternatives considered.
- rsync rolling-checksum delta: the right tool when you can’t re-chunk identically on both ends; unnecessary here because CAS gives independent, exchange-free deltas (01 §6).
- Whole-file transfer (no delta): trivial but catastrophic for large frequently-edited files; rejected except for tiny files (stored whole anyway, storage/05).
- Fixed-size blocks (Dropbox 4 MiB): simpler, but an insert shifts every later block boundary, so all following blocks change; CDC chosen (ADR-0017).
Scaling concerns.
- Request amplification (many chunk GETs) → pack-range coalescing + CDN (storage/06).
- Negotiate fan-out → batched hashes + Bloom pre-filter (storage/03).
- Metered/mobile networks → throttle, Wi-Fi-only option, defer large transfers (10).
- Many large files in flight → bounded transfer pool + priority (small/interactive first) so one huge upload doesn’t starve everything.
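The last point (bounded pool plus priority) can be sketched with a priority heap and a slot counter. This is an illustrative scheduler, not the engine's: `TransferScheduler` and its methods are hypothetical, and the ordering rule (interactive first, then smaller size, then submission order) is one assumed reading of "small/interactive first".

```python
import heapq
import itertools
from typing import List, Set, Tuple

KIB, MIB, GIB = 1024, 1024 ** 2, 1024 ** 3


class TransferScheduler:
    """Start at most max_in_flight transfers, highest priority first."""

    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight
        self._heap: List[Tuple[Tuple[int, int, int], str]] = []
        self._in_flight: Set[str] = set()
        self._seq = itertools.count()  # tie-breaker: submission order

    def submit(self, name: str, size_bytes: int, interactive: bool = False) -> None:
        # Lower tuples pop first: interactive (0) before background (1), then smaller size.
        key = (0 if interactive else 1, size_bytes, next(self._seq))
        heapq.heappush(self._heap, (key, name))

    def start_ready(self) -> List[str]:
        """Fill free slots and return the names of the transfers just started."""
        started: List[str] = []
        while self._heap and len(self._in_flight) < self.max_in_flight:
            _, name = heapq.heappop(self._heap)
            self._in_flight.add(name)
            started.append(name)
        return started

    def finish(self, name: str) -> None:
        self._in_flight.discard(name)


sched = TransferScheduler(max_in_flight=2)
sched.submit("video.mov", 4 * GIB)
sched.submit("notes.txt", 2 * KIB)
sched.submit("photo.jpg", 3 * MIB)
sched.submit("open-doc.docx", 50 * MIB, interactive=True)

first = sched.start_ready()   # ['open-doc.docx', 'notes.txt']
sched.finish("notes.txt")
second = sched.start_ready()  # ['photo.jpg']
sched.finish("open-doc.docx")
third = sched.start_ready()   # ['video.mov']
```

Strict small-first ordering can starve a huge upload if small files keep arriving; a real scheduler would likely need aging or a reserved slot, which this sketch leaves out.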
References
- Dropbox Streaming File Synchronization (block delta): https://dropbox.tech/infrastructure/streaming-file-synchronization
- rsync technical report: https://www.samba.org/rsync/tech_report/