08 — Delta Sync, Incremental Uploads & Large Files

Topics: delta synchronization, incremental uploads. Answers: How should large files be synchronized? Reuses the storage subsystem’s content-addressed chunks, manifests, and per-tenant dedup (storage/02, storage/03, storage/05).

1. The principle: never move bytes you already have

Both sides represent a file as a manifest of content-addressed CDC chunks (storage/02). The delta between any two versions is the set difference of their chunk hashes, computed independently on each side with no rolling-checksum exchange (contrast rsync, 01 §6). Editing one region of a 1 GiB file re-chunks only that region: CDC boundaries depend on local content, so they resynchronize shortly after an insert. Typically one or two chunks move, not 1 GiB.
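A minimal sketch of this principle, under loud assumptions: the chunker below is a toy (a CRC32 over a 16-byte window, no minimum or maximum chunk size) standing in for FastCDC, and `hashlib.blake2b` stands in for BLAKE3, which is not in the Python standard library. The names `cdc_chunks`, `manifest`, and `delta` are hypothetical. It only illustrates that an insert disturbs a handful of chunks and that the delta is a plain set difference.

```python
import hashlib
import random
import zlib

WINDOW = 16   # bytes of context that decide a boundary (toy value)
MASK = 0xFF   # cut when the low 8 bits are zero: ~256-byte average chunks


def chunk_id(data: bytes) -> str:
    # Stand-in for BLAKE3 (assumption: any collision-resistant hash works here).
    return hashlib.blake2b(data, digest_size=32).hexdigest()


def cdc_chunks(data: bytes) -> list[bytes]:
    """Toy content-defined chunker: a cut depends only on the last WINDOW bytes."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        if zlib.crc32(data[end - WINDOW:end]) & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk with no natural boundary
    return chunks


def manifest(data: bytes) -> list[str]:
    return [chunk_id(c) for c in cdc_chunks(data)]


def delta(old_manifest: list[str], new_manifest: list[str]) -> list[str]:
    """Hashes in the new manifest that the old manifest does not contain."""
    have = set(old_manifest)
    return [h for h in dict.fromkeys(new_manifest) if h not in have]


old = random.Random(0).randbytes(64 * 1024)
insert = b"INSERTED!!"
new = old[:30_000] + insert + old[30_000:]   # insert 10 bytes mid-file

old_m, new_m = manifest(old), manifest(new)
to_send = delta(old_m, new_m)
print(f"{len(to_send)} of {len(new_m)} chunks need to move")
```

Because a cut in this toy depends only on the preceding `WINDOW` bytes, every boundary more than `WINDOW` bytes past the insert lands where it did before, just shifted. Only the chunks overlapping the edited region get new hashes.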

2. Incremental upload (delta push)

sequenceDiagram
    autonumber
    participant E as Engine
    participant GW as Gateway
    participant ST as Storage
    participant O as Object storage
    E->>E: stream file → FastCDC → BLAKE3 per chunk
    E->>GW: NegotiateChunks([h1..hn]) (tenant-scoped, ADR-0018)
    GW->>ST: which exist in tenant?
    ST-->>GW: missing = [h3,h7]
    GW-->>E: presigned PUT for [h3,h7] only
    E->>O: PUT h3, h7 (direct, resumable, ADR-0011/0021)
    E->>GW: Commit(node, base_version, manifest=[h1..hn])
    GW-->>E: committed (new version) → advance Synced

Only new chunks upload → incremental by construction.
Unchanged chunks (h1,h2,h4..) are already in the tenant (this file’s prior version, another file, another device) → not re-sent = dedup + delta in one step.
Resumable: uploaded chunks persist, so an interrupted upload resumes by re-negotiating; chunks that already landed now report as present (storage/05 §5).
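The push flow and its resume behaviour can be sketched against an in-memory stand-in for the tenant's chunk store. Everything here is hypothetical: `FakeTenant`, `push`, and the `fail_after` knob are illustration only, `hashlib.blake2b` replaces BLAKE3, presigned URLs are collapsed into a direct `put`, and treating `base_version` as an optimistic-concurrency check is an assumption rather than something this document specifies.

```python
import hashlib


def chunk_id(data: bytes) -> str:
    # Stand-in for BLAKE3.
    return hashlib.blake2b(data, digest_size=32).hexdigest()


class FakeTenant:
    """In-memory stand-in for Gateway + Storage, scoped to one tenant."""

    def __init__(self):
        self.chunks = {}     # hash -> bytes
        self.versions = {}   # node -> list of committed manifests

    def negotiate(self, hashes):
        return [h for h in hashes if h not in self.chunks]

    def put(self, h, data):
        if chunk_id(data) != h:
            raise ValueError("chunk does not match its hash")
        self.chunks[h] = data

    def commit(self, node, base_version, manifest):
        current = len(self.versions.get(node, []))
        if base_version != current:          # assumed conflict rule
            raise RuntimeError("base_version is stale")
        if any(h not in self.chunks for h in manifest):
            raise RuntimeError("manifest references a missing chunk")
        self.versions.setdefault(node, []).append(list(manifest))
        return current + 1


def push(tenant, node, base_version, chunks, fail_after=None):
    """Negotiate, upload only missing chunks, then commit the full manifest."""
    by_hash = {chunk_id(c): c for c in chunks}
    manifest = [chunk_id(c) for c in chunks]
    missing = tenant.negotiate(list(dict.fromkeys(manifest)))
    sent = 0
    for h in missing:
        if fail_after is not None and sent >= fail_after:
            raise ConnectionError("simulated network drop")
        tenant.put(h, by_hash[h])
        sent += 1
    return tenant.commit(node, base_version, manifest), sent


tenant = FakeTenant()
version1, sent1 = push(tenant, "doc", 0, [b"aaa", b"bbb", b"ccc"])

v2 = [b"aaa", b"BBB", b"ccc", b"ddd"]        # one chunk edited, one appended
try:
    push(tenant, "doc", 1, v2, fail_after=1)  # drops after the first PUT
except ConnectionError:
    pass
version2, resent = push(tenant, "doc", 1, v2)  # resume = negotiate again
print(version1, sent1, version2, resent)       # prints: 1 3 2 1
```

The retry uploads only the one chunk that never landed; nothing tracks upload progress on the client, because the second negotiation already reflects it.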

3. Incremental download (delta pull)

sequenceDiagram
    autonumber
    participant E as Engine
    participant GW as Gateway
    participant DB as Local chunk cache ([03])
    participant O as Object storage
    E->>GW: ResolveVersion(node) → manifest=[h1..hn]
    GW-->>E: manifest
    E->>DB: which chunks already local? (prior versions, other files)
    DB-->>E: have [h1,h2,h4..] — missing [h3,h7]
    E->>GW: presigned GET for [h3,h7] (range reads over packs, storage/06)
    E->>O: GET h3, h7
    E->>E: reconstruct in temp from local + fetched → BLAKE3 verify
    E->>E: fsync + atomic rename into place ([03], ADR-0027)

The local chunk cache (03) means a download often fetches only the changed chunks and reconstructs the rest from disk — the same property that makes “move a file you already have on another device” nearly free.
Reconstruction is atomic: temp file → fsync → rename; a partial file is never visible, and a crash leaves only a discardable temp.
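A sketch of the pull side, assuming a dictionary as the local chunk cache and a caller-supplied `fetch` callable in place of presigned GETs (both hypothetical, as is the `reconstruct` name). `hashlib.blake2b` again stands in for BLAKE3. For brevity it holds fetched chunks in memory and skips the directory fsync a production engine would likely add; the point is the temp file, fsync, and rename ordering.

```python
import hashlib
import os
import tempfile


def chunk_id(data: bytes) -> str:
    # Stand-in for BLAKE3.
    return hashlib.blake2b(data, digest_size=32).hexdigest()


def reconstruct(manifest, local_cache, fetch, dest_path):
    """Rebuild dest_path from cached + fetched chunks; returns hashes fetched."""
    missing = [h for h in dict.fromkeys(manifest) if h not in local_cache]
    fetched = {h: fetch(h) for h in missing}

    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Temp file lives in the destination directory so the rename stays on
    # one filesystem and is therefore atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".sync-")
    try:
        with os.fdopen(fd, "wb") as out:
            for h in manifest:
                data = local_cache[h] if h in local_cache else fetched[h]
                if chunk_id(data) != h:      # verify every chunk, local or remote
                    raise ValueError(f"chunk {h[:8]} failed verification")
                out.write(data)
            out.flush()
            os.fsync(out.fileno())           # durable before it becomes visible
        os.replace(tmp_path, dest_path)      # atomic swap into place
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)              # a failure leaves only a discardable temp
        raise
    return missing


cache = {chunk_id(c): c for c in (b"alpha-", b"beta-")}
remote = {chunk_id(b"GAMMA-"): b"GAMMA-"}
wanted = [chunk_id(b"alpha-"), chunk_id(b"GAMMA-"), chunk_id(b"beta-")]

with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "report.bin")
    fetched_hashes = reconstruct(wanted, cache, remote.__getitem__, target)
    with open(target, "rb") as f:
        result = f.read()
    leftovers = [n for n in os.listdir(workdir) if n != "report.bin"]
```

Only the one chunk absent from the cache is fetched; the other two are read from local state, and no temp file survives a successful run.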

4. Large files specifically

Concern	Design
Memory	stream chunk-by-chunk; never load the whole file (hash, upload, reconstruct all streaming)
Resumability	uploaded chunks are durable; resume = re-negotiate (storage/05)
Atomicity	reconstruct in a temp file, verify full content hash, atomic rename (03)
Still-being-written	a file whose size/mtime is changing is deferred (settle period, 04); don’t sync a half-written file
Append-heavy (logs)	CDC keeps earlier chunks stable; only tail chunks are new → cheap incremental
Bandwidth	bounded parallel chunk transfers + configurable throttle; pause/resume (10)
Verification	BLAKE3 per chunk + whole-content root (storage/04)
Provider limits	multipart upload for very large single objects (parts of 5 MiB–5 GiB, at most 10,000 parts) (storage/05)
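The provider-limits row implies a small piece of arithmetic when sizing multipart uploads. The sketch below assumes the limits quoted in the table and a hypothetical `plan_parts` helper; it picks the smallest legal part size that keeps the part count within the cap. Whether the final part may be smaller than the minimum depends on the provider (S3 allows it).

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

MIN_PART = 5 * MIB     # limits as quoted in the table above
MAX_PART = 5 * GIB
MAX_PARTS = 10_000


def plan_parts(object_size: int) -> tuple[int, int]:
    """Return (part_size, part_count) for a multipart upload of object_size bytes."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    # Smallest part size that fits within MAX_PARTS, rounded up.
    part_size = max(MIN_PART, -(-object_size // MAX_PARTS))
    if part_size > MAX_PART:
        raise ValueError("object too large for a single multipart upload")
    part_count = -(-object_size // part_size)   # ceiling division
    return part_size, part_count


small = plan_parts(1 * GIB)      # minimum part size is enough
large = plan_parts(100 * GIB)    # part size must grow to respect the part cap
print(small, large)
```

For a 1 GiB object the minimum part size suffices (205 parts of 5 MiB); at 100 GiB the part size has to rise to roughly 10.2 MiB so the count stays at 10,000.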

5. LAN / peer transfer (future, optional)

Because chunks are content-addressed and verifiable, a chunk may be fetched from any source — including a peer device on the same LAN — and proven correct by hash. A future LAN-sync mode lets co-located devices exchange chunks directly (Dropbox “LAN Sync”), cutting WAN egress and latency. Designed-for (content addressing makes it safe), not built in v1.
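Why any source is safe can be shown in a few lines. This is a sketch only: `fetch_verified` and the source callables are hypothetical, no peer discovery or transport is modelled, and `hashlib.blake2b` stands in for BLAKE3. The idea is that an untrusted peer can waste a request but cannot inject wrong bytes, because every candidate is checked against the requested hash.

```python
import hashlib


def chunk_id(data: bytes) -> str:
    # Stand-in for BLAKE3.
    return hashlib.blake2b(data, digest_size=32).hexdigest()


def fetch_verified(h, sources):
    """Try each source in order; accept the first chunk whose hash matches h."""
    for fetch in sources:
        try:
            data = fetch(h)
        except (KeyError, OSError):
            continue                      # source lacks the chunk or is unreachable
        if data is not None and chunk_id(data) == h:
            return data                   # proven correct by content address
    raise LookupError(f"no source produced a valid chunk for {h[:8]}")


good = b"chunk payload"
want = chunk_id(good)

lan_peer = {want: b"corrupted bytes"}     # a peer that returns bad data
origin = {want: good}                     # object storage fallback

data = fetch_verified(want, [lan_peer.__getitem__, origin.__getitem__])
```

Here the peer's corrupted reply is rejected and the engine falls through to the origin, so trying LAN peers first only risks a wasted request, never a wrong file.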

6. Tradeoffs / Alternatives / Scaling

Tradeoffs. Delta sync adds the chunk index + manifest indirection and a negotiate round-trip per upload; the bandwidth savings on real workloads (edits, re-saves, shared files, moves) dwarf it. Chunk size trades dedup/delta granularity against metadata cardinality — settled at ~1 MiB CDC (storage/02, ADR-0017).

Alternatives considered.

rsync rolling-checksum delta: the right tool when you can’t re-chunk identically on both ends; unnecessary here because CAS gives independent, exchange-free deltas (01 §6).
Whole-file transfer (no delta): trivial but catastrophic for large frequently-edited files; rejected except for tiny files (stored whole anyway, storage/05).
Fixed-size blocks (Dropbox 4 MiB): simpler, but an insert shifts every later block boundary, so all subsequent blocks change hash and must be re-sent; CDC chosen (ADR-0017).

Scaling concerns.

Request amplification (many chunk GETs) → pack-range coalescing + CDN (storage/06).
Negotiate fan-out → batched hashes + Bloom pre-filter (storage/03).
Metered/mobile networks → throttle, Wi-Fi-only option, defer large transfers (10).
Many large files in flight → bounded transfer pool + priority (small/interactive first) so one huge upload doesn’t starve everything.
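The last concern, a bounded pool with small/interactive-first priority, might look like the following. `TransferQueue` and its methods are hypothetical names, and the ordering rule (interactive before background, then smaller before larger, then submission order) is one plausible reading of "small/interactive first" rather than the documented policy.

```python
import heapq
import itertools


class TransferQueue:
    """Bounded in-flight set with interactive-first, smallest-first ordering."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self.in_flight = set()
        self._heap = []
        self._seq = itertools.count()     # tie-breaker: submission order

    def submit(self, name: str, size: int, interactive: bool = False):
        # False sorts before True, so interactive transfers come first.
        heapq.heappush(self._heap, (not interactive, size, next(self._seq), name))

    def start_ready(self) -> list[str]:
        """Start as many queued transfers as free slots allow."""
        started = []
        while self._heap and len(self.in_flight) < self.max_in_flight:
            *_, name = heapq.heappop(self._heap)
            self.in_flight.add(name)
            started.append(name)
        return started

    def finish(self, name: str):
        self.in_flight.discard(name)


q = TransferQueue(max_in_flight=2)
q.submit("video.mov", 8 * 1024 ** 3)
q.submit("notes.txt", 4 * 1024)
q.submit("photo.jpg", 3 * 1024 ** 2, interactive=True)

first = q.start_ready()      # ['photo.jpg', 'notes.txt']
q.finish("photo.jpg")
second = q.start_ready()     # ['video.mov']
```

The 8 GiB upload was submitted first but waits for a free slot behind the interactive and small transfers. A real scheduler would also need an aging rule so large transfers are not deferred indefinitely.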

References

Dropbox Streaming File Synchronization (block delta): https://dropbox.tech/infrastructure/streaming-file-synchronization
rsync technical report: https://www.samba.org/rsync/tech_report/