06 — Sync State Machines

Deliverable: the sync state machines. Three machines compose the engine: the per-node lifecycle (the core one), the operation/queue-item lifecycle, and the engine lifecycle. Explicit states make the invariants checkable, especially atomicity and “never lose data”.

1. Per-node sync state machine (the core)

Each node moves through states driven by observations (watcher/scan, cursor delta) and operation completions. Synced is the stable resting state in which the Synced (S), Remote (R), and Local (L) trees agree for that node: S == R == L.

stateDiagram-v2
    [*] --> Synced: S == R == L
    Synced --> LocalDirty: local change observed ([04])
    Synced --> RemotePending: remote change in cursor delta ([07])
    Synced --> Conflicted: both diverged (planner, [05])

    LocalDirty --> Hashing: stat changed → hash (CDC)
    Hashing --> Synced: hash == synced (false alarm / our own write, [04])
    Hashing --> Uploading: content differs → negotiate + PUT new chunks ([08])
    Uploading --> Committing: all chunks present
    Committing --> Synced: server accepts (advance S)
    Committing --> Conflicted: server rejects (base version stale, [09])

    RemotePending --> Downloading: fetch missing chunks ([08])
    Downloading --> Verifying: reconstruct → BLAKE3 verify
    Verifying --> Applying: temp file fsync'd
    Applying --> Synced: atomic rename into place (advance S, [03])
    Verifying --> Downloading: hash mismatch → refetch

    Conflicted --> Resolving: create conflicted copy ([09])
    Resolving --> Synced: both sides materialized as nodes

    LocalDirty --> Error: io error
    Uploading --> Error: transfer/permission/quota error
    Downloading --> Error: transfer/disk-full error
    Error --> Retrying: backoff + jitter ([10])
    Retrying --> LocalDirty: re-plan
    Retrying --> RemotePending: re-plan
    Error --> DeadLetter: permanent (perm denied, quota) → surface to user

    Synced --> Deleting: delete observed (one side)
    Deleting --> Synced: delete applied to other side (advance S)
    Deleting --> Conflicted: edit-vs-delete (edit wins, [09])
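
The diagram above can be encoded as a transition table so that illegal moves fail loudly instead of silently corrupting state. The following is a minimal sketch, not the engine's actual code: the names `NodeState`, `TRANSITIONS`, `IllegalTransition`, and `advance` are hypothetical, and the table simply mirrors the edges drawn in the diagram.

```python
from enum import Enum


class NodeState(Enum):
    SYNCED = "Synced"
    LOCAL_DIRTY = "LocalDirty"
    REMOTE_PENDING = "RemotePending"
    CONFLICTED = "Conflicted"
    HASHING = "Hashing"
    UPLOADING = "Uploading"
    COMMITTING = "Committing"
    DOWNLOADING = "Downloading"
    VERIFYING = "Verifying"
    APPLYING = "Applying"
    RESOLVING = "Resolving"
    ERROR = "Error"
    RETRYING = "Retrying"
    DEAD_LETTER = "DeadLetter"
    DELETING = "Deleting"


N = NodeState

# One entry per state; each value lists the targets drawn in the diagram.
TRANSITIONS = {
    N.SYNCED: {N.LOCAL_DIRTY, N.REMOTE_PENDING, N.CONFLICTED, N.DELETING},
    N.LOCAL_DIRTY: {N.HASHING, N.ERROR},
    N.HASHING: {N.SYNCED, N.UPLOADING},
    N.UPLOADING: {N.COMMITTING, N.ERROR},
    N.COMMITTING: {N.SYNCED, N.CONFLICTED},
    N.REMOTE_PENDING: {N.DOWNLOADING},
    N.DOWNLOADING: {N.VERIFYING, N.ERROR},
    N.VERIFYING: {N.APPLYING, N.DOWNLOADING},
    N.APPLYING: {N.SYNCED},
    N.CONFLICTED: {N.RESOLVING},
    N.RESOLVING: {N.SYNCED},
    N.ERROR: {N.RETRYING, N.DEAD_LETTER},
    N.RETRYING: {N.LOCAL_DIRTY, N.REMOTE_PENDING},
    N.DELETING: {N.SYNCED, N.CONFLICTED},
    N.DEAD_LETTER: set(),  # terminal until the user intervenes
}


class IllegalTransition(Exception):
    """Raised when a caller attempts an edge that the diagram does not contain."""


def advance(current: NodeState, target: NodeState) -> NodeState:
    """Return `target` if the edge current -> target exists, else raise."""
    if target not in TRANSITIONS[current]:
        raise IllegalTransition(f"{current.value} -> {target.value}")
    return target
```

With a table like this, a test can assert properties such as "no state other than Applying, Committing, Hashing, Resolving, or Deleting leads directly into Synced" without running the engine.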

State invariants (the safety contract):

Applying is atomic: a download is written to a temp file, fsync'd, then atomically renamed into place; a partial file is never visible, and a crash mid-download leaves only an orphan temp (03 §4, ADR-0027).
Committing carries the base version: the server uses optimistic concurrency; a stale base ⇒ Conflicted, never an overwrite (09, ADR-0008).
Hashing → Synced is the feedback-loop breaker: a change whose content hash equals the Synced hash (our own write, or a touch that alters only metadata) produces no transfer (04 §4).
Conflicted always terminates by materializing both sides — it cannot livelock.
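
The Applying invariant follows a well-known write-temp, fsync, rename pattern. The sketch below illustrates it with the Python standard library; the function name `atomic_apply` and the `.sync-tmp-` prefix are assumptions made for illustration, and hash verification (the Verifying state) is assumed to have already happened before this function is called.

```python
import os
import tempfile


def atomic_apply(dest_path: str, data: bytes) -> None:
    """Write `data` so that readers see either the old file or the new one."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # The temp file lives in the destination directory so the final rename
    # stays on one filesystem, which is what makes it atomic.
    fd, tmp_path = tempfile.mkstemp(prefix=".sync-tmp-", dir=dest_dir)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # contents reach disk before the rename
        os.replace(tmp_path, dest_path)  # atomic swap into place
    except BaseException:
        # On failure, remove the temp file; the destination is untouched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

A crash between `mkstemp` and `os.replace` leaves only a `.sync-tmp-*` orphan, which a startup sweep could delete. A production version would likely also fsync the parent directory after the rename so that the rename itself survives power loss; that step is omitted here for portability.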

2. Operation (queue-item) state machine

Each planner-emitted operation is a durable queue row (03, 10):

stateDiagram-v2
    [*] --> Ready: enqueued, no unmet deps
    [*] --> Blocked: waiting on parent op (mkdir, rename)
    Blocked --> Ready: dependency satisfied
    Ready --> Running: scheduler dispatches to worker
    Running --> Done: success → advance Synced tree
    Running --> Failed: transient error
    Failed --> Ready: backoff + jitter, attempts++
    Failed --> DeadLetter: attempts exhausted / permanent error
    Running --> Superseded: newer plan replaced this op
    Superseded --> [*]
    Done --> [*]
    DeadLetter --> [*]: surfaced to user; node stays Error

Superseded matters: because the planner re-runs, an in-flight op may become obsolete (the user reverted, or a newer remote change arrived). Ops are idempotent and checked against current tree state before commit, so a superseded op is dropped safely.
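
The queue-item lifecycle, including the supersede check and the retry path, can be sketched as follows. This is an illustration under stated assumptions: `plan_version` is a hypothetical stand-in for "checked against current tree state", and `MAX_ATTEMPTS`, `BASE_DELAY_S`, and `MAX_DELAY_S` are invented values. The backoff shown is the common "full jitter" variant, which is only one possible reading of "backoff + jitter".

```python
import random
from dataclasses import dataclass
from enum import Enum


class OpState(Enum):
    BLOCKED = "Blocked"
    READY = "Ready"
    RUNNING = "Running"
    DONE = "Done"
    FAILED = "Failed"
    SUPERSEDED = "Superseded"
    DEAD_LETTER = "DeadLetter"


MAX_ATTEMPTS = 5      # assumption: retry budget per op
BASE_DELAY_S = 1.0    # assumption: first backoff ceiling is 2 * base
MAX_DELAY_S = 300.0   # assumption: backoff cap


@dataclass
class Op:
    node_id: str
    plan_version: int  # hypothetical: the plan generation that emitted this op
    state: OpState = OpState.READY
    attempts: int = 0


def backoff_delay(attempts: int, rng=random) -> float:
    """Full jitter: uniform in [0, min(cap, base * 2**attempts)]."""
    ceiling = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** attempts))
    return rng.uniform(0.0, ceiling)


def dispatch(op: Op) -> None:
    """Ready -> Running."""
    if op.state is not OpState.READY:
        raise ValueError(f"cannot dispatch op in state {op.state.value}")
    op.state = OpState.RUNNING


def finish(op: Op, current_plan_version: int, ok: bool, permanent: bool = False):
    """Settle a Running op; return the retry delay in seconds, or None."""
    if op.state is not OpState.RUNNING:
        raise ValueError(f"cannot finish op in state {op.state.value}")
    if op.plan_version != current_plan_version:
        op.state = OpState.SUPERSEDED  # result is dropped, never committed
        return None
    if ok:
        op.state = OpState.DONE
        return None
    op.attempts += 1
    if permanent or op.attempts >= MAX_ATTEMPTS:
        op.state = OpState.DEAD_LETTER
        return None
    op.state = OpState.FAILED
    return backoff_delay(op.attempts)


def requeue(op: Op) -> None:
    """Failed -> Ready, called once the backoff delay has elapsed."""
    if op.state is not OpState.FAILED:
        raise ValueError(f"cannot requeue op in state {op.state.value}")
    op.state = OpState.READY
```

Note that the supersede check runs before the success check: an op that finished successfully against an outdated plan is still discarded, which is safe only because ops are idempotent.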

3. Engine (global) state machine

stateDiagram-v2
    [*] --> Starting
    Starting --> InitialScan: open DB, load trees
    InitialScan --> Reconciling: full scan + initial cursor pull
    Reconciling --> Executing: ops scheduled
    Executing --> Idle: trees converged
    Idle --> Reconciling: watcher event / notification / retry timer
    Idle --> Paused: user pause / metered network / battery
    Paused --> Reconciling: resume
    Executing --> Offline: network lost
    Idle --> Offline: network lost
    Offline --> CatchUp: network restored
    CatchUp --> Reconciling: cursor pull (or 409 → full re-list, [07])
    Reconciling --> SafetyHold: bulk-delete threshold tripped ([11], ADR-0027)
    SafetyHold --> Reconciling: user confirms
    Executing --> Error: fatal (DB corrupt → rebuild, [03])
    Error --> Starting: rebuild cache + restart

Offline → CatchUp → Reconciling is the offline-first path: queued local ops persist; on reconnect the engine pulls the remote delta, re-plans (which may surface conflicts), then drains the queue (10).
SafetyHold is the bulk-delete circuit breaker: if a plan would delete more files than the bulk-delete threshold (a classic “remote wiped my data” scenario), the engine pauses and asks the user before executing (11, ADR-0027).
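
The SafetyHold gate can be illustrated with a small decision function placed between planning and execution. Everything named here is hypothetical: `PlannedOp`, `DEFAULT_DELETE_THRESHOLD`, and `next_engine_state` are illustrative, and the real threshold rule (absolute count, fraction of the tree, or both) is not specified in this section.

```python
from dataclasses import dataclass
from typing import Iterable

DEFAULT_DELETE_THRESHOLD = 50  # assumption: the actual value is defined elsewhere


@dataclass(frozen=True)
class PlannedOp:
    kind: str  # e.g. "delete", "upload", "download"
    path: str


def next_engine_state(
    plan: Iterable[PlannedOp],
    delete_threshold: int = DEFAULT_DELETE_THRESHOLD,
    user_confirmed: bool = False,
) -> str:
    """Decide where Reconciling goes next for a freshly computed plan."""
    deletes = sum(1 for op in plan if op.kind == "delete")
    if deletes > delete_threshold and not user_confirmed:
        return "SafetyHold"  # pause and ask before touching any file
    return "Executing"
```

After the user confirms, the engine returns to Reconciling and the same plan passes the gate with `user_confirmed=True`, matching the SafetyHold → Reconciling → Executing path in the diagram.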

4. Tradeoffs / Alternatives / Scaling

Tradeoffs. Explicit, fine-grained states add bookkeeping but make the safety properties (atomic apply, no-overwrite, feedback-loop suppression, delete safety) statically inspectable and testable rather than emergent. For a data-custody product that trade is mandatory.

Alternatives. A coarse “dirty/clean” flag per node is simpler but hides the exact point of failure and makes resumability ad hoc; the explicit machine makes crash recovery a matter of “what state was persisted.” Implicit/inferred state (recompute everything each time) is cleaner conceptually but loses in-flight transfer progress on every restart — unacceptable for large files.

Scaling. State lives in the node/queue rows (03); only non-Synced nodes carry active state, so at rest the machine touches almost nothing. The dirty-set planner (05) ensures transitions are driven only for changed nodes.
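
One way to make "only non-Synced nodes carry active state" cheap in practice is a partial index over the node rows, so that restart recovery scans just the in-flight nodes. The sketch below uses SQLite through Python's `sqlite3` module; the `nodes` schema, the `nodes_active` index, and the `open_db` and `active_nodes` helpers are assumptions for illustration, not the schema defined in 03.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical node table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE nodes (
            id    INTEGER PRIMARY KEY,
            path  TEXT NOT NULL UNIQUE,
            state TEXT NOT NULL DEFAULT 'Synced'
        );
        -- Partial index: only rows that are not at rest are indexed,
        -- so the index stays small when the tree is mostly Synced.
        CREATE INDEX nodes_active ON nodes(state) WHERE state != 'Synced';
        """
    )
    return conn


def active_nodes(conn: sqlite3.Connection):
    """Return (path, state) for every node that still has work pending."""
    cur = conn.execute(
        "SELECT path, state FROM nodes WHERE state != 'Synced' ORDER BY path"
    )
    return cur.fetchall()


if __name__ == "__main__":
    db = open_db()
    db.executemany(
        "INSERT INTO nodes(path, state) VALUES (?, ?)",
        [("a.txt", "Synced"), ("b.txt", "Uploading"), ("c.bin", "Downloading")],
    )
    for path, state in active_nodes(db):
        print(path, state)
```

On restart, the engine would read this set and resume each node from its persisted state, which is the crash-recovery property argued for under Alternatives.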