08 — Delta Sync, Incremental Uploads & Large Files

Topics: delta synchronization, incremental uploads. Answers: How should large files be synchronized? Reuses the storage subsystem’s content-addressed chunks, manifests, and per-tenant dedup (storage/02, storage/03, storage/05).


1. The principle: never move bytes you already have

Both sides represent a file as a manifest of content-addressed CDC chunks (storage/02). The delta between any two versions is the set difference of their chunk hashes, which each side computes independently with no rolling-checksum exchange (contrast rsync, 01 §6). Because CDC boundaries are content-defined and robust to insertions, editing one region of a 1 GiB file re-chunks only that region, so roughly one chunk moves rather than 1 GiB.
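
The set-difference idea can be sketched in a few lines. This is a minimal illustration, not the engine's code: `chunk_hash` and `manifest_delta` are hypothetical names, the chunks are hand-picked byte strings rather than real CDC output, and SHA-256 stands in for BLAKE3 only because it ships with the Python standard library.

```python
import hashlib


def chunk_hash(data: bytes) -> str:
    # Stand-in digest: the design specifies BLAKE3; SHA-256 is used here
    # only because it is available in the standard library.
    return hashlib.sha256(data).hexdigest()


def manifest_delta(old: list[str], new: list[str]) -> list[str]:
    """Return hashes present in `new` but not in `old`, in manifest order."""
    have = set(old)
    seen: set[str] = set()
    missing = []
    for h in new:
        if h not in have and h not in seen:
            seen.add(h)          # a repeated chunk only needs to move once
            missing.append(h)
    return missing


# Hypothetical chunking of two versions of a file: only the middle chunk changed.
old_manifest = [chunk_hash(c) for c in (b"header", b"body-v1", b"footer")]
new_manifest = [chunk_hash(c) for c in (b"header", b"body-v2", b"footer")]

delta = manifest_delta(old_manifest, new_manifest)
print(len(delta), "chunk(s) to transfer")
```

Neither side needs the other's bytes to compute this; exchanging the hash lists is enough.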


2. Incremental upload (delta push)

sequenceDiagram
    autonumber
    participant E as Engine
    participant GW as Gateway
    participant ST as Storage
    participant O as Object storage
    E->>E: stream file → FastCDC → BLAKE3 per chunk
    E->>GW: NegotiateChunks([h1..hn]) (tenant-scoped, ADR-0018)
    GW->>ST: which exist in tenant?
    ST-->>GW: missing = [h3,h7]
    GW-->>E: presigned PUT for [h3,h7] only
    E-->>O: PUT h3, h7 (direct, resumable, ADR-0011/0021)
    E->>GW: Commit(node, base_version, manifest=[h1..hn])
    GW-->>E: committed (new version) → advance Synced
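
The push sequence above can be sketched against an in-memory stand-in. Everything here is an assumption made for illustration: `FakeGateway` collapses Gateway, Storage, and object storage into one class, the method names merely mirror the diagram's messages, the `base_version` check is one plausible way to reject stale commits, and SHA-256 substitutes for BLAKE3.

```python
import hashlib


def digest(data: bytes) -> str:
    # Stand-in for BLAKE3 (assumption: any collision-resistant hash works here).
    return hashlib.sha256(data).hexdigest()


class FakeGateway:
    """Hypothetical in-memory stand-in for Gateway + Storage + object storage."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}          # tenant-scoped chunk store
        self.versions: dict[str, list[list[str]]] = {}

    def negotiate_chunks(self, hashes: list[str]) -> list[str]:
        # Report only the hashes the tenant does not already hold.
        return [h for h in dict.fromkeys(hashes) if h not in self.chunks]

    def put_chunk(self, h: str, data: bytes) -> None:
        if digest(data) != h:
            raise ValueError("chunk does not match its hash")
        self.chunks[h] = data

    def commit(self, node: str, base_version: int, manifest: list[str]) -> int:
        history = self.versions.setdefault(node, [])
        if base_version != len(history):
            raise RuntimeError("stale base_version")
        if any(h not in self.chunks for h in manifest):
            raise RuntimeError("manifest references missing chunks")
        history.append(list(manifest))
        return len(history)                          # new version number


def delta_push(gw: FakeGateway, node: str, base_version: int,
               chunks: list[bytes]) -> tuple[int, int]:
    """Upload only missing chunks, then commit. Returns (version, chunks_sent)."""
    manifest = [digest(c) for c in chunks]
    by_hash = dict(zip(manifest, chunks))
    missing = gw.negotiate_chunks(manifest)
    for h in missing:
        gw.put_chunk(h, by_hash[h])
    return gw.commit(node, base_version, manifest), len(missing)


gw = FakeGateway()
v1, sent1 = delta_push(gw, "doc", 0, [b"a", b"b", b"c"])    # first upload: all chunks
v2, sent2 = delta_push(gw, "doc", v1, [b"a", b"B", b"c"])   # edit: one chunk moves
```

In the real flow the PUTs go directly to object storage through presigned URLs; the sketch folds that hop into `put_chunk` to stay self-contained.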

3. Incremental download (delta pull)

sequenceDiagram
    autonumber
    participant E as Engine
    participant GW as Gateway
    participant DB as Local chunk cache ([03])
    participant O as Object storage
    E->>GW: ResolveVersion(node)
    GW-->>E: manifest=[h1..hn]
    E->>DB: which chunks already local? (prior versions, other files)
    DB-->>E: have [h1,h2,h4..] — missing [h3,h7]
    E->>GW: presigned GET for [h3,h7] (range reads over packs, storage/06)
    E-->>O: GET h3, h7
    E->>E: reconstruct in temp from local + fetched → BLAKE3 verify
    E->>E: fsync + atomic rename into place ([03], ADR-0027)
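
The reconstruct, verify, and rename steps of the pull sequence can be sketched as follows. This is an illustration under stated assumptions: `reconstruct` and its parameters are hypothetical names, the local cache is a plain dict, `fetch` stands in for the presigned GET, SHA-256 replaces BLAKE3, and the whole-file check is a flat hash of the content rather than the root defined in storage/04.

```python
import hashlib
import os
import tempfile
from typing import Callable


def reconstruct(manifest: list[str], local_cache: dict[str, bytes],
                fetch: Callable[[str], bytes], dest_path: str,
                expected_root: str) -> list[str]:
    """Rebuild a file from cached + fetched chunks; returns the hashes fetched."""
    fetched: list[str] = []
    root = hashlib.sha256()                      # stand-in whole-content hash
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # The temp file lives in the destination directory so the rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
    try:
        with os.fdopen(fd, "wb") as tmp:
            for h in manifest:
                data = local_cache.get(h)
                if data is None:
                    data = fetch(h)              # only missing chunks hit the network
                    fetched.append(h)
                if hashlib.sha256(data).hexdigest() != h:
                    raise ValueError(f"chunk {h} failed verification")
                local_cache[h] = data            # keep verified chunks for later reuse
                tmp.write(data)
                root.update(data)
            if root.hexdigest() != expected_root:
                raise ValueError("whole-file hash mismatch")
            tmp.flush()
            os.fsync(tmp.fileno())               # make the bytes durable before rename
        os.replace(tmp_path, dest_path)          # atomic swap into place
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)                  # never leave a partial temp file
        raise
    return fetched
```

A reader of `dest_path` therefore sees either the old version or the fully verified new one, never a partial write.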

4. Large files specifically

| Concern | Design |
| --- | --- |
| Memory | Stream chunk-by-chunk; never load the whole file (hashing, upload, and reconstruction are all streaming) |
| Resumability | Committed chunks are durable; resume = re-negotiate (storage/05) |
| Atomicity | Reconstruct in a temp file, verify the full content hash, then atomic rename (03) |
| Still-being-written | A file whose size/mtime is still changing is deferred (settle period, 04); don’t sync a half-written file |
| Append-heavy (logs) | CDC keeps earlier chunks stable; only tail chunks are new → cheap incremental |
| Bandwidth | Bounded parallel chunk transfers + configurable throttle; pause/resume (10) |
| Verification | BLAKE3 per chunk + whole-content root (storage/04) |
| Provider limits | Multipart for very large single objects (parts of 5 MiB–5 GiB, at most 10,000 parts) (storage/05) |
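
The provider limits in the last row constrain how a very large single object is split. A small sketch of that arithmetic follows; `plan_multipart` and the 8 MiB preferred part size are assumptions for illustration, while the 5 MiB, 5 GiB, and 10,000-part bounds are the ones quoted above.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

MIN_PART = 5 * MIB      # smallest allowed part (the final part may be smaller)
MAX_PART = 5 * GIB      # largest allowed part
MAX_PARTS = 10_000      # maximum number of parts per upload


def plan_multipart(object_size: int, preferred_part: int = 8 * MIB) -> tuple[int, int]:
    """Return (part_size, part_count) that respects the limits above.

    `preferred_part` is a hypothetical default, not a value from the design.
    """
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    # Smallest part size that keeps the part count within MAX_PARTS
    # (ceiling division written with integer arithmetic).
    needed = -(-object_size // MAX_PARTS)
    part_size = max(MIN_PART, preferred_part, needed)
    if part_size > MAX_PART:
        raise ValueError("object too large for a single multipart upload")
    part_count = -(-object_size // part_size)
    return part_size, part_count


small = plan_multipart(1 * GIB)        # preferred size is already sufficient
large = plan_multipart(1024 * GIB)     # part size must grow to stay under 10,000 parts
```

For a 1 GiB object the preferred 8 MiB part yields 128 parts; for a 1 TiB object the part size has to rise to roughly 105 MiB so the count stays at or under 10,000.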

5. LAN / peer transfer (future, optional)

Because chunks are content-addressed and verifiable, a chunk may be fetched from any source, including a peer device on the same LAN, and proven correct by its hash. A future LAN-sync mode would let co-located devices exchange chunks directly (as in Dropbox “LAN Sync”), cutting WAN egress and latency. The design allows for this, since content addressing makes it safe, but it is not built in v1.
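
Why an untrusted source is safe can be shown with a short sketch. It is hypothetical throughout: `fetch_verified` and the source callables are illustrative names, no LAN protocol is implied, and SHA-256 stands in for BLAKE3.

```python
import hashlib
from typing import Callable, Iterable, Optional

Source = Callable[[str], Optional[bytes]]


def fetch_verified(chunk_hash: str, sources: Iterable[Source]) -> bytes:
    """Try each source in order; accept the first reply that matches the hash."""
    for fetch in sources:
        try:
            data = fetch(chunk_hash)
        except Exception:
            continue                      # unreachable peer: fall through to the next
        if data is not None and hashlib.sha256(data).hexdigest() == chunk_hash:
            return data                   # proven correct regardless of who sent it
    raise LookupError(f"no source returned a valid copy of {chunk_hash}")


good = b"chunk payload"
wanted = hashlib.sha256(good).hexdigest()

corrupt_peer: Source = lambda h: b"tampered bytes"   # peer returns wrong data
offline_peer: Source = lambda h: None                 # peer does not have the chunk
origin: Source = lambda h: good                       # object storage fallback

result = fetch_verified(wanted, [corrupt_peer, offline_peer, origin])
```

A bad or malicious peer can waste a round trip but cannot inject content, because every reply is checked against the hash the manifest already names.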


6. Tradeoffs / Alternatives / Scaling

Tradeoffs. Delta sync adds the chunk index, the manifest indirection, and one negotiate round-trip per upload; on real workloads (edits, re-saves, shared files, moves) the bandwidth savings dwarf that cost. Chunk size trades dedup/delta granularity against metadata cardinality; this is settled at ~1 MiB CDC (storage/02, ADR-0017).
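
The chunk-size tradeoff can be made concrete with back-of-envelope arithmetic. The helper below is a hypothetical sketch that assumes a 32-byte digest per manifest entry (BLAKE3's default output length) and ignores any other per-chunk metadata.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def manifest_stats(file_size: int, avg_chunk_size: int,
                   hash_bytes: int = 32) -> tuple[int, int]:
    """Return (chunk_count, manifest_bytes) for a file at a given average chunk size."""
    chunk_count = -(-file_size // avg_chunk_size)   # ceiling division
    return chunk_count, chunk_count * hash_bytes


# 1 GiB file at three candidate average chunk sizes.
fine = manifest_stats(1 * GIB, 64 * 1024)    # many small chunks
chosen = manifest_stats(1 * GIB, 1 * MIB)    # the ~1 MiB setting
coarse = manifest_stats(1 * GIB, 16 * MIB)   # few large chunks
```

At ~1 MiB a 1 GiB file has 1,024 manifest entries (32 KiB of hashes) and a small edit costs about 1 MiB of transfer; 64 KiB chunks cut the edit cost sixteenfold but multiply the entries to 16,384, while 16 MiB chunks shrink the manifest to 64 entries at the price of moving about 16 MiB per small edit.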

Alternatives considered.

Scaling concerns.

References