10 — Offline-First Workflows & the Sync Queue
Topics: offline-first workflows, sync queues. How the engine behaves with no network, and how the durable scheduler drains work safely when there is one.
1. Offline-first is free, because of “state, not activity”
The engine never assumes the network. All it ever does is record observed state in the Local tree and (when reachable) the Remote tree, then re-plan. Offline simply means the Remote tree stops advancing:
- While offline: local edits flow into the Local tree as normal; the planner emits operations that persist in the durable queue (03); upload ops sit in Blocked/Ready, waiting for connectivity. Nothing is lost; the user works unimpeded.
- On reconnect: pull the cursor delta → update the Remote tree → re-plan from the three trees. Because the planner is pure and recomputes from state, there is no fragile “rebase the queued ops” step: the old plan is discarded and a fresh, correct plan is produced against the new remote reality. Divergences that arose during the offline window surface as conflicts (09) and resolve normally.
This is the deep payoff of ADR-0022: “offline for 3 weeks” and “offline for 3 seconds” use the exact same code path — update Remote tree, re-plan. There is no special offline-merge logic to get wrong.
stateDiagram-v2
[*] --> Online
Online --> Offline: network lost
Offline --> Offline: local edits queue durably ([03])
Offline --> CatchUp: network restored
CatchUp --> Replan: pull delta (or 409 → full re-list, [07])
Replan --> Online: re-plan from 3 trees → drain queue (conflicts surfaced)
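The reconnect path described in this section can be illustrated with a minimal Python sketch of a pure three-tree planner. Everything concrete here is an assumption made for illustration: trees are modeled as flat dicts from path to content hash, and the op names (`upload`, `download`, `delete_remote`, `delete_local`, `conflict`) are hypothetical, not the engine's actual vocabulary.

```python
from typing import Dict, List, Tuple

Tree = Dict[str, str]  # path -> content hash (hypothetical representation)


def plan(local: Tree, remote: Tree, synced: Tree) -> List[Tuple[str, str]]:
    """Pure planner: derive ops from the three trees, never from old ops."""
    ops = []
    for path in sorted(set(local) | set(remote) | set(synced)):
        l, r, s = local.get(path), remote.get(path), synced.get(path)
        if l == r:
            continue  # both sides already agree; nothing to transfer
        if r == s:
            # Only the local side moved away from the last synced state.
            ops.append(("upload" if l is not None else "delete_remote", path))
        elif l == s:
            # Only the remote side moved away from the last synced state.
            ops.append(("download" if r is not None else "delete_local", path))
        else:
            # Both sides changed, and to different content.
            ops.append(("conflict", path))
    return ops


synced = {"a": "1", "b": "1", "c": "1"}
local = {"a": "2", "b": "1", "c": "2"}

# Offline: the Remote tree has not advanced, so it still equals Synced.
offline_plan = plan(local, synced, synced)
# -> [("upload", "a"), ("upload", "c")]

# Reconnect: the Remote tree is refreshed and the plan is recomputed from scratch.
remote = {"a": "1", "b": "3", "c": "9"}
online_plan = plan(local, remote, synced)
# -> [("upload", "a"), ("download", "b"), ("conflict", "c")]
```

The offline plan is never patched: the second call simply replaces it, and the concurrent edit to `c` shows up as a conflict without any offline-specific branch.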
2. The sync queue / scheduler
Operations from the planner are durable QUEUE_OP rows (03), executed by the scheduler:
| Concern | Design |
|---|---|
| Durability | persisted in SQLite → survives restart/crash; offline-safe |
| Ordering | dependency DAG (05 §3): parents before children for creates, children-first for deletes, renames before content |
| Concurrency | bounded worker pools (separate upload/download); independent subtrees parallel |
| Priority | user-initiated > background; small/interactive files before huge; recently-used first |
| Fairness | round-robin across files so one huge transfer can’t starve others |
| Retry | exponential backoff + jitter on transient errors; attempts capped |
| Dead-letter | permanent errors (403, quota, too-large) → DLQ, surfaced, node stays Error (06) |
| Throttle | global + per-direction bandwidth caps; Wi-Fi-only / metered-aware; battery-aware pause |
| Server limits | honor per-tenant rate limits; back off on 429 |
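The Retry row can be sketched as follows. This is one common variant ("full jitter" exponential backoff), chosen for illustration; the source does not say which jitter scheme the engine uses, and the function names, base delay, cap, and attempt limit below are all assumed values.

```python
import random
from typing import Optional


def retry_delay(attempt: int, base: float = 1.0, cap: float = 300.0,
                rng: Optional[random.Random] = None) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    if rng is None:
        rng = random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def schedule_retry(attempt: int, max_attempts: int = 8) -> Optional[float]:
    """Return the delay before the next try, or None once attempts are capped."""
    if attempt >= max_attempts:
        return None  # caller decides what happens next (assumed: surface the op)
    return retry_delay(attempt)
```

Randomizing over the whole interval, rather than adding a small jitter to a fixed delay, spreads clients that failed at the same moment across the full backoff window.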
flowchart TB
classDef a fill:#c7d2fe,stroke:#3730a3,color:#111827;
classDef d fill:#fde68a,stroke:#b45309,color:#111827;
classDef o fill:#bbf7d0,stroke:#15803d,color:#111827;
plan["Planner emits ops"]:::a --> q[("Durable queue (SQLite)")]:::a
q --> dep{"deps satisfied?"}:::d
dep -- no --> blocked["Blocked (wait on parent)"]:::d
dep -- yes --> ready["Ready (priority-ordered)"]:::a
ready --> pool["Bounded worker pools (up/down), fair-scheduled"]:::a
pool --> res{"result?"}:::d
res -- success --> adv["advance Synced tree"]:::o
res -- transient --> rty["backoff + jitter → Ready"]:::d
res -- permanent --> dlq["Dead-letter → surface to user"]:::d
res -- superseded --> drop["drop (newer plan)"]:::o
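The "deps satisfied?" gate and the priority-ordered Ready set in the flowchart can be sketched in a few lines of Python. The `Op` shape, the op-id strings, and the convention that a lower number means higher priority are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Op:
    op_id: str
    priority: int                      # lower = more urgent (assumption)
    deps: FrozenSet[str] = frozenset() # op_ids that must succeed first


def partition(ops: Iterable[Op], done: Set[str]) -> Tuple[List[Op], List[Op]]:
    """Split ops into Ready (all deps done, priority-ordered) and Blocked."""
    ops = list(ops)
    ready = sorted((op for op in ops if op.deps <= done),
                   key=lambda op: op.priority)
    blocked = [op for op in ops if not (op.deps <= done)]
    return ready, blocked


ops = [
    Op("upload:/a/big.bin", priority=5, deps=frozenset({"mkdir:/a"})),
    Op("mkdir:/a", priority=1),
    Op("upload:/note.txt", priority=0),
]
ready, blocked = partition(ops, done=set())
# ready   -> upload:/note.txt, mkdir:/a   (priority order)
# blocked -> upload:/a/big.bin            (waits on its parent create)
```

Once `mkdir:/a` succeeds and is added to `done`, the next call to `partition` moves the upload into Ready.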
3. Error taxonomy (drives retry vs surface)
| Class | Examples | Action |
|---|---|---|
| Transient | network drop, 5xx, 429, timeout, disk-busy | retry w/ backoff + jitter |
| Conflict | 409 stale base version | resolve (09), not a retry |
| Permanent | 403 perm denied, 413 too large, 507/quota exceeded | dead-letter + surface; do not silently drop |
| Local | disk full, file locked, permission | pause affected op, surface; keep data safe |
The discipline: transient → retry, permanent → surface, conflict → resolve, never silently discard a local change.
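A sketch of how the taxonomy might be encoded, assuming errors arrive either as an HTTP status or as a Python exception. The `classify` function, the `ErrorClass` enum, and the choice to surface unrecognized errors (rather than retry them) are assumptions, not the engine's actual API.

```python
import errno
from enum import Enum
from typing import Optional


class ErrorClass(Enum):
    TRANSIENT = "retry with backoff + jitter"
    CONFLICT = "resolve"
    PERMANENT = "dead-letter + surface"
    LOCAL = "pause op + surface"


PERMANENT_STATUSES = {403, 413, 507}
LOCAL_ERRNOS = {errno.ENOSPC, errno.EACCES}  # disk full, permission denied


def classify(status: Optional[int] = None,
             exc: Optional[BaseException] = None) -> ErrorClass:
    if status is not None:
        if status == 409:
            return ErrorClass.CONFLICT
        if status in PERMANENT_STATUSES:   # checked before the generic 5xx rule
            return ErrorClass.PERMANENT
        if status == 429 or 500 <= status <= 599:
            return ErrorClass.TRANSIENT
        return ErrorClass.PERMANENT        # assumption: unknown status is surfaced
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return ErrorClass.TRANSIENT
    if isinstance(exc, OSError) and exc.errno in LOCAL_ERRNOS:
        return ErrorClass.LOCAL
    return ErrorClass.PERMANENT            # assumption: unknown error is surfaced
```

Note the ordering: 507 is a 5xx status, so the permanent set has to be consulted before the generic 5xx range, otherwise a quota error would be retried forever.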
4. Crash recovery
- The queue and trees are in the transactional DB → on restart the engine resumes pending ops; in-flight transfers resume from committed chunks (storage/05).
- Re-planning on startup reconciles anything that changed while down (04).
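The restart step might look like the following sketch, using Python's built-in `sqlite3`. The table layout, the state names (`ready`, `in_flight`, `done`), and the `committed_chunks` column are hypothetical; the real schema lives in 03.

```python
import sqlite3

SCHEMA = """
CREATE TABLE queue_op (
    id               INTEGER PRIMARY KEY,
    state            TEXT    NOT NULL,           -- 'ready' | 'in_flight' | 'done' (hypothetical)
    committed_chunks INTEGER NOT NULL DEFAULT 0  -- resume point for a transfer
);
"""


def recover(conn: sqlite3.Connection) -> int:
    """Requeue ops a crash left in flight; committed chunk counts are untouched."""
    with conn:  # one transaction: commit on success, roll back on error
        cur = conn.execute(
            "UPDATE queue_op SET state = 'ready' WHERE state = 'in_flight'"
        )
    return cur.rowcount  # number of ops put back in the Ready set
```

Because only `state` is rewritten, a requeued transfer still knows how many chunks were durably committed and can continue from there instead of starting over.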
5. Tradeoffs / Alternatives / Scaling
Tradeoffs. A persisted queue costs DB writes per op, but it is what makes offline-first and crash recovery real rather than aspirational. Re-planning instead of rebasing the queue does redundant planning work on reconnect, but that work is cheap (metadata-only, dirty-set scoped) and far safer than hand-written rebase logic.
Alternatives considered.
- In-memory queue: faster, but loses all pending work on crash/restart, which is unacceptable for an offline-first design.
- Rebase queued ops onto new remote state: error-prone; superseded by pure re-planning (ADR-0022).
- Strict FIFO (no priority/fairness): simple but lets a 50 GB upload block an urgent small edit; rejected.
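To make the strict-FIFO objection concrete, here is a small sketch of round-robin chunk scheduling across files. It is a simplification under stated assumptions: transfers are modeled as a chunk count per file, one worker sends one chunk per turn, and the function name is hypothetical.

```python
from collections import deque
from typing import Dict, List


def round_robin(transfers: Dict[str, int]) -> List[str]:
    """Return the chunk send order: one chunk per file per turn."""
    queue = deque((name, n) for name, n in transfers.items() if n > 0)
    order = []
    while queue:
        name, remaining = queue.popleft()
        order.append(name)                  # send one chunk of this file
        if remaining > 1:
            queue.append((name, remaining - 1))  # rejoin at the back of the line
    return order


order = round_robin({"huge.bin": 4, "note.txt": 1})
# -> ["huge.bin", "note.txt", "huge.bin", "huge.bin", "huge.bin"]
```

Under strict FIFO, `note.txt` would be sent only after all four chunks of `huge.bin`; with round-robin it completes on the second turn.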
Scaling concerns.
- Huge queues (millions of pending ops after a big import) → batch enqueue, paginate scheduling, cap in-flight; the queue is indexed by (state, priority, next_retry_at).
- Starvation → fairness scheduling + priority aging.
- Retry storms after an outage → jittered backoff + a global concurrency cap so reconnect doesn’t hammer the server (server-side jitter too, 07).
- Metered/mobile → defer large/background transfers; honor user network policy.
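The paginated, capped scheduling query over the `(state, priority, next_retry_at)` index could be sketched as below, again with `sqlite3`. The table is trimmed to the three indexed columns, and the column types, the index name, and the lower-is-more-urgent priority convention are assumptions.

```python
import sqlite3
from typing import List

SCHEMA = """
CREATE TABLE queue_op (
    id            INTEGER PRIMARY KEY,
    state         TEXT    NOT NULL,
    priority      INTEGER NOT NULL,  -- lower = more urgent (assumption)
    next_retry_at REAL    NOT NULL   -- epoch seconds; 0 means "due immediately"
);
CREATE INDEX idx_queue_sched ON queue_op (state, priority, next_retry_at);
"""


def next_batch(conn: sqlite3.Connection, now: float, limit: int) -> List[int]:
    """Fetch at most `limit` due ops, most urgent first."""
    rows = conn.execute(
        """
        SELECT id FROM queue_op
        WHERE state = 'ready' AND next_retry_at <= ?
        ORDER BY priority, next_retry_at
        LIMIT ?
        """,
        (now, limit),
    ).fetchall()
    return [row[0] for row in rows]
```

Passing the number of free worker slots as `limit` keeps in-flight work bounded no matter how many rows are pending, and ops that are still backing off are skipped by the `next_retry_at` filter rather than loaded and discarded.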