06 — Sync State Machines
Deliverable: the sync state machine. Three machines compose the engine: the per-node lifecycle (the core one), the operation/queue-item lifecycle, and the engine lifecycle. Explicit states make the invariants — especially atomicity and “never lose data” — checkable.
1. Per-node sync state machine (the core)
Each node moves through states driven by observations (watcher/scan, cursor delta)
and operation completions. Synced is the stable resting state, where S == R == L: the Synced (S), remote (R), and local (L) views of the node agree.
stateDiagram-v2
[*] --> Synced: S == R == L
Synced --> LocalDirty: local change observed ([04])
Synced --> RemotePending: remote change in cursor delta ([07])
Synced --> Conflicted: both diverged (planner, [05])
LocalDirty --> Hashing: stat changed → hash (CDC)
Hashing --> Synced: hash == synced (false alarm / our own write, [04])
Hashing --> Uploading: content differs → negotiate + PUT new chunks ([08])
Uploading --> Committing: all chunks present
Committing --> Synced: server accepts (advance S)
Committing --> Conflicted: server rejects (base version stale, [09])
RemotePending --> Downloading: fetch missing chunks ([08])
Downloading --> Verifying: reconstruct → BLAKE3 verify
Verifying --> Applying: temp file fsync'd
Applying --> Synced: atomic rename into place (advance S, [03])
Verifying --> Downloading: hash mismatch → refetch
Conflicted --> Resolving: create conflicted copy ([09])
Resolving --> Synced: both sides materialized as nodes
LocalDirty --> Error: io error
Uploading --> Error: transfer/permission/quota error
Downloading --> Error: transfer/disk-full error
Error --> Retrying: backoff + jitter ([10])
Retrying --> LocalDirty: re-plan
Retrying --> RemotePending: re-plan
Error --> DeadLetter: permanent (perm denied, quota) → surface to user
Synced --> Deleting: delete observed (one side)
Deleting --> Synced: delete applied to other side (advance S)
Deleting --> Conflicted: edit-vs-delete (edit wins, [09])
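One way to make the diagram above checkable is to transcribe its edges into a transition table and reject anything not listed. The following is a minimal sketch, not the engine's actual code: `NodeState`, `ALLOWED`, `transition`, and `can_reach` are hypothetical names, and the table is only as accurate as the diagram it was copied from.

```python
from collections import deque
from enum import Enum


class NodeState(Enum):
    SYNCED = "Synced"
    LOCAL_DIRTY = "LocalDirty"
    HASHING = "Hashing"
    UPLOADING = "Uploading"
    COMMITTING = "Committing"
    REMOTE_PENDING = "RemotePending"
    DOWNLOADING = "Downloading"
    VERIFYING = "Verifying"
    APPLYING = "Applying"
    CONFLICTED = "Conflicted"
    RESOLVING = "Resolving"
    DELETING = "Deleting"
    ERROR = "Error"
    RETRYING = "Retrying"
    DEAD_LETTER = "DeadLetter"


N = NodeState  # short alias to keep the table readable

# Edges transcribed from the per-node diagram (hypothetical encoding).
ALLOWED = {
    N.SYNCED: {N.LOCAL_DIRTY, N.REMOTE_PENDING, N.CONFLICTED, N.DELETING},
    N.LOCAL_DIRTY: {N.HASHING, N.ERROR},
    N.HASHING: {N.SYNCED, N.UPLOADING},
    N.UPLOADING: {N.COMMITTING, N.ERROR},
    N.COMMITTING: {N.SYNCED, N.CONFLICTED},
    N.REMOTE_PENDING: {N.DOWNLOADING},
    N.DOWNLOADING: {N.VERIFYING, N.ERROR},
    N.VERIFYING: {N.APPLYING, N.DOWNLOADING},
    N.APPLYING: {N.SYNCED},
    N.CONFLICTED: {N.RESOLVING},
    N.RESOLVING: {N.SYNCED},
    N.DELETING: {N.SYNCED, N.CONFLICTED},
    N.ERROR: {N.RETRYING, N.DEAD_LETTER},
    N.RETRYING: {N.LOCAL_DIRTY, N.REMOTE_PENDING},
    N.DEAD_LETTER: set(),  # terminal until the user intervenes
}


class IllegalTransition(Exception):
    """Raised when a transition is not an edge of the diagram."""


def transition(current: NodeState, target: NodeState) -> NodeState:
    """Return the new state, or raise if the edge is not in the table."""
    if target not in ALLOWED[current]:
        raise IllegalTransition(f"{current.value} -> {target.value}")
    return target


def can_reach(start: NodeState, goal: NodeState) -> bool:
    """Breadth-first search over the table: is `goal` reachable from `start`?"""
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        if state is goal:
            return True
        for nxt in ALLOWED[state] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return False
```

With the table in this form, a unit test can assert that every state except DeadLetter has a path back to Synced, which is a structural check and not a proof of liveness.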
State invariants (the safety contract):
- Applying is atomic: a download is written to a temp file, fsync'd, then atomically renamed into place; a partial file is never visible, and a crash mid-download leaves only an orphan temp (03 §4, ADR-0027).
- Committing carries the base version: the server uses optimistic concurrency; a stale base ⇒ Conflicted, never an overwrite (09, ADR-0008).
- Hashing → Synced is the feedback-loop breaker: a change that hashes equal to Synced (our own write, a touch) produces no transfer (04 §4).
- Conflicted always terminates by materializing both sides; it cannot livelock.
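The atomic-apply invariant can be illustrated with a small sketch. It makes several assumptions: `apply_download` is a hypothetical helper, the whole payload fits in memory, and `hashlib.blake2b` stands in for BLAKE3 (which is not in the Python standard library). It also verifies the bytes before writing the temp file, which simplifies the Verifying/Applying split in the diagram, and it does not fsync the parent directory, which a production engine would likely also want.

```python
import hashlib
import os
import tempfile


def apply_download(dest_path: str, data: bytes, expected_hex: str) -> None:
    """Verify, write to a temp file, fsync, then atomically rename into place."""
    # Stand-in digest: blake2b here, where the design calls for BLAKE3.
    if hashlib.blake2b(data).hexdigest() != expected_hex:
        raise ValueError("hash mismatch: caller should refetch")

    # The temp file lives in the destination directory so the rename stays
    # on one filesystem, which is what makes it atomic.
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".sync-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # data is durable before it becomes visible
        os.replace(tmp_path, dest_path)  # readers see the old or new file, never a partial one
    except BaseException:
        # Ordinary failure (not a crash): clean up the temp file ourselves.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

A crash between `mkstemp` and `os.replace` leaves only the `.sync-*.tmp` file behind, matching the "orphan temp" case described above; the destination is untouched.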
2. Operation (queue-item) state machine
Each planner-emitted operation is a durable queue row (03, 10):
stateDiagram-v2
[*] --> Ready: enqueued, no unmet deps
[*] --> Blocked: waiting on parent op (mkdir, rename)
Blocked --> Ready: dependency satisfied
Ready --> Running: scheduler dispatches to worker
Running --> Done: success → advance Synced tree
Running --> Failed: transient error
Failed --> Ready: backoff + jitter, attempts++
Failed --> DeadLetter: attempts exhausted / permanent error
Running --> Superseded: newer plan replaced this op
Superseded --> [*]
Done --> [*]
DeadLetter --> [*]: surfaced to user; node stays Error
Superseded matters: because the planner re-runs, an in-flight op may become obsolete
(the user reverted, or a newer remote change arrived). Ops are idempotent and
checked against current tree state before commit, so a superseded op is dropped safely.
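The retry and supersede edges can be sketched as follows. Everything here is illustrative: `Op`, `on_failure`, `on_commit`, the `base_version` field, and the default limits are hypothetical, and "full jitter" exponential backoff is one common choice, not necessarily the policy defined in 10.

```python
import random
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class OpState(Enum):
    BLOCKED = "Blocked"
    READY = "Ready"
    RUNNING = "Running"
    DONE = "Done"
    SUPERSEDED = "Superseded"
    DEAD_LETTER = "DeadLetter"


@dataclass
class Op:
    node_id: str
    base_version: int  # tree version the planner saw when it emitted the op
    state: OpState = OpState.READY
    attempts: int = 0


def backoff_delay(attempts: int, base: float = 1.0, cap: float = 300.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempts))."""
    return rng() * min(cap, base * (2 ** attempts))


def on_failure(op: Op, permanent: bool, max_attempts: int = 5) -> Optional[float]:
    """Running -> Failed, then on to Ready (returns a delay) or DeadLetter (returns None)."""
    op.attempts += 1
    if permanent or op.attempts >= max_attempts:
        op.state = OpState.DEAD_LETTER
        return None
    op.state = OpState.READY
    return backoff_delay(op.attempts)  # scheduler waits this long before re-dispatch


def on_commit(op: Op, current_version: int) -> OpState:
    """Check the op against current tree state just before it would commit."""
    if op.state is not OpState.RUNNING:
        raise ValueError(f"cannot commit an op in state {op.state.value}")
    # If the tree moved on since planning, the op is obsolete and is dropped.
    op.state = OpState.DONE if current_version == op.base_version else OpState.SUPERSEDED
    return op.state
```

Because the version check happens at commit time, dropping a superseded op needs no rollback: its effects were never applied to the Synced tree.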
3. Engine (global) state machine
stateDiagram-v2
[*] --> Starting
Starting --> InitialScan: open DB, load trees
InitialScan --> Reconciling: full scan + initial cursor pull
Reconciling --> Executing: ops scheduled
Executing --> Idle: trees converged
Idle --> Reconciling: watcher event / notification / retry timer
Idle --> Paused: user pause / metered network / battery
Paused --> Reconciling: resume
Executing --> Offline: network lost
Idle --> Offline: network lost
Offline --> CatchUp: network restored
CatchUp --> Reconciling: cursor pull (or 409 → full re-list, [07])
Reconciling --> SafetyHold: bulk-delete threshold tripped ([11], ADR-0027)
SafetyHold --> Reconciling: user confirms
Executing --> Error: fatal (DB corrupt → rebuild, [03])
Error --> Starting: rebuild cache + restart
- Offline → CatchUp → Reconciling is the offline-first path: queued local ops persist; on reconnect the engine pulls the remote delta, re-plans (which may surface conflicts), then drains the queue (10).
- SafetyHold is the bulk-delete circuit breaker: if a plan would delete more than a threshold of files (a classic “remote wiped my data” scenario), the engine pauses and asks before executing (11, ADR-0027).
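The bulk-delete circuit breaker reduces to a guard evaluated on each plan before it executes. The sketch below is only one possible shape: the text does not say whether the threshold is a count or a fraction, so `needs_safety_hold`, `max_fraction`, and `min_count` are assumptions chosen for illustration.

```python
def needs_safety_hold(planned_deletes: int, total_files: int,
                      max_fraction: float = 0.25, min_count: int = 20) -> bool:
    """Return True if the plan should pause in SafetyHold and ask the user.

    Hypothetical policy: trip only when the plan deletes at least `min_count`
    files AND more than `max_fraction` of all tracked files.
    """
    if planned_deletes < min_count:
        return False  # small deletes never trip the breaker
    if total_files <= 0:
        return True  # inconsistent input: be conservative and hold
    return planned_deletes / total_files > max_fraction


# Usage: checked once per plan, before any delete op is dispatched.
if needs_safety_hold(planned_deletes=900, total_files=1000):
    next_engine_state = "SafetyHold"
else:
    next_engine_state = "Executing"
```

The minimum count keeps the breaker quiet on small folders, where deleting three of five files is routine and not a sign of a wiped remote.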
4. Tradeoffs / Alternatives / Scaling
Tradeoffs. Explicit, fine-grained states add bookkeeping but make the safety properties (atomic apply, no-overwrite, feedback-loop suppression, delete safety) statically inspectable and testable rather than emergent. For a data-custody product that trade is mandatory.
Alternatives. A coarse “dirty/clean” flag per node is simpler but hides the exact point of failure and makes resumability ad hoc; the explicit machine makes crash recovery a matter of “what state was persisted.” Implicit/inferred state (recompute everything each time) is cleaner conceptually but loses in-flight transfer progress on every restart — unacceptable for large files.
Scaling. State lives in the node/queue rows (03); only
non-Synced nodes carry active state, so at rest the machine touches almost
nothing. The dirty-set planner (05) ensures
transitions are driven only for changed nodes.