10 — Offline-First Workflows & the Sync Queue

Topics: offline-first workflows, sync queues. How the engine behaves with no network, and how the durable scheduler drains work safely when there is one.


1. Offline-first is free, because of “state, not activity”

The engine never assumes the network is available. All it ever does is record observed state in the Local tree and, when the server is reachable, in the Remote tree, and then re-plan. Offline simply means the Remote tree stops advancing.

This is the deep payoff of ADR-0022: “offline for 3 weeks” and “offline for 3 seconds” use the exact same code path (update the Remote tree, then re-plan). There is no special offline-merge logic to get wrong. The state diagram below shows the lifecycle.

stateDiagram-v2
    [*] --> Online
    Online --> Offline: network lost
    Offline --> Offline: local edits queue durably ([03])
    Offline --> CatchUp: network restored
    CatchUp --> Replan: pull delta (or 409 → full re-list, [07])
    Replan --> Online: re-plan from 3 trees → drain queue (conflicts surfaced)
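The "same code path" claim can be illustrated with a minimal sketch of re-planning from three trees. Everything here is an assumption made for illustration: the trees are modelled as plain path-to-version dicts, and `Op`, `replan`, and the op names are hypothetical, not the engine's actual planner API.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical model: a tree maps a path to a version token (e.g. a content hash).
Tree = Dict[str, str]


@dataclass(frozen=True)
class Op:
    kind: str  # "upload", "download", "delete_remote", "delete_local", "conflict"
    path: str


def replan(local: Tree, remote: Tree, synced: Tree) -> List[Op]:
    """Derive ops purely from current state; how long we were offline is irrelevant."""
    ops: List[Op] = []
    for path in sorted(set(local) | set(remote) | set(synced)):
        l, r, s = local.get(path), remote.get(path), synced.get(path)
        if l == r:
            continue  # both sides already agree; nothing to transfer
        local_changed = l != s
        remote_changed = r != s
        if local_changed and remote_changed:
            # Both sides diverged from the last synced state: surface, never guess.
            ops.append(Op("conflict", path))
        elif local_changed:
            ops.append(Op("upload" if l is not None else "delete_remote", path))
        else:
            # l == s and l != r, so only the remote side changed.
            ops.append(Op("download" if r is not None else "delete_local", path))
    return ops


# After reconnecting: a.txt and c.txt were edited locally while offline;
# remotely, b.txt was deleted, c.txt was edited, and d.txt was added.
synced = {"a.txt": "v1", "b.txt": "v1", "c.txt": "v1"}
local = {"a.txt": "v2", "b.txt": "v1", "c.txt": "v2"}
remote = {"a.txt": "v1", "c.txt": "v3", "d.txt": "v1"}
plan = replan(local, remote, synced)
```

Nothing in `replan` takes the offline duration as input. A longer outage only means the Remote tree has moved further from the Synced tree by the time it is refreshed.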

2. The sync queue / scheduler

Operations from the planner are durable QUEUE_OP rows (03) executed by the scheduler:

| Concern | Design |
| --- | --- |
| Durability | persisted in SQLite → survives restart/crash; offline-safe |
| Ordering | dependency DAG (05 §3): parents before children for creates, children-first for deletes, renames before content |
| Concurrency | bounded worker pools (separate upload/download); independent subtrees parallel |
| Priority | user-initiated > background; small/interactive files before huge; recently-used first |
| Fairness | round-robin across files so one huge transfer can’t starve others |
| Retry | exponential backoff + jitter on transient errors; attempts capped |
| Dead-letter | permanent errors (403, quota, too-large) → DLQ, surfaced, node stays Error (06) |
| Throttle | global + per-direction bandwidth caps; Wi-Fi-only / metered-aware; battery-aware pause |
| Server limits | honor per-tenant rate limits; back off on 429 |

flowchart TB
    classDef a fill:#c7d2fe,stroke:#3730a3,color:#111827;
    classDef d fill:#fde68a,stroke:#b45309,color:#111827;
    classDef o fill:#bbf7d0,stroke:#15803d,color:#111827;
    plan["Planner emits ops"]:::a --> q[("Durable queue (SQLite)")]:::a
    q --> dep{"deps satisfied?"}:::d
    dep -- no --> blocked["Blocked (wait on parent)"]:::d
    dep -- yes --> ready["Ready (priority-ordered)"]:::a
    ready --> pool["Bounded worker pools (up/down), fair-scheduled"]:::a
    pool --> res{"result?"}:::d
    res -- success --> adv["advance Synced tree"]:::o
    res -- transient --> rty["backoff + jitter → Ready"]:::d
    res -- permanent --> dlq["Dead-letter → surface to user"]:::d
    res -- superseded --> drop["drop (newer plan)"]:::o
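The "deps satisfied?" gate and the priority-ordered Ready set can be sketched against SQLite. This is a simplified illustration under stated assumptions: the `queue_op` schema, its column names, and the helper functions are hypothetical (the real QUEUE_OP layout is defined in 03), and each op has at most one dependency rather than a full DAG.

```python
import sqlite3

# Hypothetical, simplified schema; the real QUEUE_OP table lives in (03).
SCHEMA = """
CREATE TABLE IF NOT EXISTS queue_op (
    id         INTEGER PRIMARY KEY,
    kind       TEXT NOT NULL,
    path       TEXT NOT NULL,
    priority   INTEGER NOT NULL DEFAULT 0,       -- higher runs first
    depends_on INTEGER REFERENCES queue_op(id),  -- single parent op, or NULL
    state      TEXT NOT NULL DEFAULT 'pending'   -- pending | done | dead
);
"""


def open_queue(db_path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn


def enqueue(conn, kind, path, priority=0, depends_on=None) -> int:
    with conn:  # commits on success, so the op is on disk before we return
        cur = conn.execute(
            "INSERT INTO queue_op (kind, path, priority, depends_on) "
            "VALUES (?, ?, ?, ?)",
            (kind, path, priority, depends_on),
        )
    return cur.lastrowid


def next_ready(conn):
    """Return the highest-priority pending op whose dependency, if any, is done."""
    return conn.execute(
        """
        SELECT op.id, op.kind, op.path
        FROM queue_op AS op
        LEFT JOIN queue_op AS dep ON dep.id = op.depends_on
        WHERE op.state = 'pending'
          AND (op.depends_on IS NULL OR dep.state = 'done')
        ORDER BY op.priority DESC, op.id ASC
        LIMIT 1
        """
    ).fetchone()


def mark_done(conn, op_id: int) -> None:
    with conn:
        conn.execute("UPDATE queue_op SET state = 'done' WHERE id = ?", (op_id,))
```

In this sketch, an upload into a new folder stays blocked until the folder's create op is marked done, even if the upload has the higher priority. Because the ready set is recomputed from persisted rows on every call, a restarted process can resume without any in-memory state.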

3. Error taxonomy (drives retry vs surface)

| Class | Examples | Action |
| --- | --- | --- |
| Transient | network drop, 5xx, 429, timeout, disk-busy | retry w/ backoff + jitter |
| Conflict | 409 stale base version | resolve (09), not a retry |
| Permanent | 403 perm denied, 413 too large, 507/quota exceeded | dead-letter + surface; do not silently drop |
| Local | disk full, file locked, permission | pause affected op, surface; keep data safe |

The discipline: transient → retry, permanent → surface, conflict → resolve, never silently discard a local change.
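A minimal sketch of how this taxonomy might drive the scheduler's decision for HTTP results. The names (`ErrorClass`, `classify_http`, `backoff_delay`, `next_action`), the `MAX_ATTEMPTS` value, and the choice of "full jitter" are assumptions for illustration. So is the handling of unlisted status codes and of exhausted retries, which this sketch surfaces through the dead-letter path rather than dropping.

```python
import random
from enum import Enum


class ErrorClass(Enum):
    TRANSIENT = "retry"
    CONFLICT = "resolve"
    PERMANENT = "dead-letter"


PERMANENT_STATUSES = {403, 413, 507}
MAX_ATTEMPTS = 8  # assumed cap; the source only says attempts are capped


def classify_http(status: int) -> ErrorClass:
    if status == 409:
        return ErrorClass.CONFLICT
    # Checked before the generic 5xx rule, because 507 is permanent here.
    if status in PERMANENT_STATUSES:
        return ErrorClass.PERMANENT
    if status == 429 or 500 <= status <= 599:
        return ErrorClass.TRANSIENT
    # Assumption: anything unrecognised is surfaced rather than retried blindly.
    return ErrorClass.PERMANENT


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def next_action(status: int, attempt: int) -> str:
    cls = classify_http(status)
    if cls is ErrorClass.TRANSIENT and attempt >= MAX_ATTEMPTS:
        # Assumption: an exhausted retry budget is surfaced, never dropped.
        return ErrorClass.PERMANENT.value
    return cls.value
```

Every branch ends in retry, resolve, or dead-letter; none of them returns "discard", which is the property the discipline above asks for.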


4. Crash recovery


5. Tradeoffs / Alternatives / Scaling

Tradeoffs. A persisted queue costs DB writes per op, but is what makes offline-first and crash recovery real rather than aspirational. Re-planning instead of rebasing the queue repeats some planning work on reconnect, but that work is cheap (metadata-only, scoped to the dirty set) and far safer than hand-written rebase logic.

Alternatives considered.

Scaling concerns.