10 — Offline-First Workflows & the Sync Queue

Topics: offline-first workflows, sync queues. How the engine behaves with no network, and how the durable scheduler drains work safely when there is one.

1. Offline-first is free, because of “state, not activity”

The engine never assumes the network is reachable. It only ever observes state into the Local tree and (when reachable) the Remote tree, then re-plans. Offline simply means the Remote tree stops advancing:

While offline: local edits flow into the Local tree as normal; the planner emits operations that persist in the durable queue (03); upload ops sit in Blocked/Ready waiting for connectivity. Nothing is lost; the user works unimpeded.
On reconnect: pull the cursor delta → update the Remote tree → re-plan from the three trees. Because the planner is pure and recomputes from state, there is no fragile “rebase the queued ops” step — the old plan is discarded and a fresh, correct plan is produced against the new remote reality. Divergences that arose during the offline window surface as conflicts (09) and resolve normally.

This is the deep payoff of ADR-0022: “offline for 3 weeks” and “offline for 3 seconds” use the exact same code path — update Remote tree, re-plan. There is no special offline-merge logic to get wrong.
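
A minimal sketch of why no rebase step is needed, assuming each tree is reduced to a `path -> content hash` map. The `plan` function, the op names, and the tree representation are illustrative assumptions, not the engine's actual planner (05). The point it shows is that the planner reads only the three trees and never the previous plan:

```python
from typing import Dict, List, Tuple

Tree = Dict[str, str]   # path -> content hash (hypothetical simplification)
Op = Tuple[str, str]    # (kind, path); op kinds here are illustrative names


def plan(local: Tree, remote: Tree, synced: Tree) -> List[Op]:
    """Pure function of the three trees: no queue history is consulted."""
    ops: List[Op] = []
    for path in sorted(set(local) | set(remote) | set(synced)):
        l, r, s = local.get(path), remote.get(path), synced.get(path)
        if l == r:
            # Both sides agree; nothing to transfer for this path.
            continue
        local_changed = l != s
        remote_changed = r != s
        if local_changed and remote_changed:
            # Diverged on both sides since the last sync: surface as a conflict.
            ops.append(("conflict", path))
        elif local_changed:
            ops.append(("upload" if l is not None else "delete_remote", path))
        else:
            ops.append(("download" if r is not None else "delete_local", path))
    return ops
```

While offline, the Remote tree equals its last observed state, so local edits plan as uploads. After reconnect, the same call against the refreshed Remote tree turns any path edited on both sides into a conflict, with no special handling for how long the client was offline.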

stateDiagram-v2
    [*] --> Online
    Online --> Offline: network lost
    Offline --> Offline: local edits queue durably ([03])
    Offline --> CatchUp: network restored
    CatchUp --> Replan: pull delta (or 409 → full re-list, [07])
    Replan --> Online: re-plan from 3 trees → drain queue (conflicts surfaced)

2. The sync queue / scheduler

Operations from the planner are durable QUEUE_OP rows (03) executed by the scheduler:

Concern	Design
Durability	persisted in SQLite → survives restart/crash; offline-safe
Ordering	dependency DAG (05 §3): parents before children for creates, children-first for deletes, renames before content
Concurrency	bounded worker pools (separate upload/download); independent subtrees run in parallel
Priority	user-initiated > background; small/interactive files before huge; recently-used first
Fairness	round-robin across files so one huge transfer can’t starve others
Retry	exponential backoff + jitter on transient errors; attempts capped
Dead-letter	permanent errors (403, quota, too-large) → DLQ, surfaced, node stays `Error` (06)
Throttle	global + per-direction bandwidth caps; Wi-Fi-only / metered-aware; battery-aware pause
Server limits	honor per-tenant rate limits; back off on 429
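
The Retry row can be sketched as follows. This uses the "full jitter" variant of exponential backoff, which is one common choice and an assumption here; the function name, base delay, cap, and attempt limit are hypothetical values, not the engine's configuration:

```python
import random
from typing import Optional


def next_retry_delay(
    attempt: int,
    base: float = 1.0,        # seconds; hypothetical default
    cap: float = 300.0,       # upper bound on any single delay; hypothetical
    max_attempts: int = 8,    # hypothetical attempt cap
    rng: Optional[random.Random] = None,
) -> Optional[float]:
    """Return seconds to wait before retry number `attempt` (0-based),
    or None once attempts are exhausted and the op should be dead-lettered."""
    if attempt >= max_attempts:
        return None
    source = rng if rng is not None else random.Random()
    # Exponential ceiling, clamped so long outages do not yield huge delays.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly in [0, ceiling] to spread clients apart.
    return source.uniform(0.0, ceiling)
```

Returning `None` rather than raising keeps the "attempts capped" decision in one place: the caller moves the op to the DLQ when no further delay is offered.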

flowchart TB
    classDef a fill:#c7d2fe,stroke:#3730a3,color:#111827;
    classDef d fill:#fde68a,stroke:#b45309,color:#111827;
    classDef o fill:#bbf7d0,stroke:#15803d,color:#111827;
    plan["Planner emits ops"]:::a --> q[("Durable queue (SQLite)")]:::a
    q --> dep{"deps satisfied?"}:::d
    dep -- no --> blocked["Blocked (wait on parent)"]:::d
    dep -- yes --> ready["Ready (priority-ordered)"]:::a
    ready --> pool["Bounded worker pools (up/down), fair-scheduled"]:::a
    pool --> res{"result?"}:::d
    res -- success --> adv["advance Synced tree"]:::o
    res -- transient --> rty["backoff + jitter → Ready"]:::d
    res -- permanent --> dlq["Dead-letter → surface to user"]:::d
    res -- superseded --> drop["drop (newer plan)"]:::o
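
The Blocked/Ready split in the flowchart can be sketched as a dependency check followed by a priority sort. `QueueOp`, its fields, and the "lower number runs first" convention are assumptions for illustration; the real QUEUE_OP row is defined in 03:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class QueueOp:
    op_id: str
    priority: int                          # lower runs first (assumption)
    depends_on: Set[str] = field(default_factory=set)
    done: bool = False


def ready_ops(ops: Dict[str, QueueOp]) -> List[QueueOp]:
    """Ops whose dependencies have all completed, in priority order.

    A dependency that is missing from `ops` is treated as satisfied, on the
    assumption that completed ops may already have been pruned.
    """
    def satisfied(op: QueueOp) -> bool:
        for dep_id in op.depends_on:
            dep = ops.get(dep_id)
            if dep is not None and not dep.done:
                return False
        return True

    ready = [op for op in ops.values() if not op.done and satisfied(op)]
    # Tie-break on op_id so the order is deterministic.
    return sorted(ready, key=lambda op: (op.priority, op.op_id))
```

With a parent-folder create and a child-file create that depends on it, only the parent is Ready at first; the child becomes Ready, and sorts by its own priority, once the parent is marked done.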

3. Error taxonomy (drives retry vs surface)

Class	Examples	Action
Transient	network drop, 5xx, 429, timeout, disk-busy	retry w/ backoff + jitter
Conflict	409 stale base version	resolve (09), not a retry
Permanent	403 perm denied, 413 too large, 507/quota exceeded	dead-letter + surface; do not silently drop
Local	disk full, file locked, permission	pause affected op, surface; keep data safe

The discipline: transient → retry, conflict → resolve, permanent → surface, local → pause and surface. Never silently discard a local change.
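
A sketch of the table as a classifier over HTTP results. The names `ErrorClass`, `ACTION`, and `classify_http` are hypothetical, and the fallback for statuses the table does not list is an assumption. Note the ordering: 507 falls in the 5xx range, so the permanent check has to run before the generic 5xx rule.

```python
from enum import Enum
from typing import Dict, Optional


class ErrorClass(Enum):
    TRANSIENT = "transient"
    CONFLICT = "conflict"
    PERMANENT = "permanent"
    LOCAL = "local"


# Action per class, mirroring the table above.
ACTION: Dict[ErrorClass, str] = {
    ErrorClass.TRANSIENT: "retry with backoff + jitter",
    ErrorClass.CONFLICT: "resolve (09)",
    ErrorClass.PERMANENT: "dead-letter + surface",
    ErrorClass.LOCAL: "pause op + surface",
}


def classify_http(status: Optional[int]) -> ErrorClass:
    """Classify a failed request. `status` is None when no response arrived
    (network drop or timeout)."""
    if status is None:
        return ErrorClass.TRANSIENT
    if status == 409:
        return ErrorClass.CONFLICT
    if status in (403, 413, 507):
        # Checked before the 5xx rule so 507 is not retried.
        return ErrorClass.PERMANENT
    if status == 429 or 500 <= status <= 599:
        return ErrorClass.TRANSIENT
    # Not covered by the table. Assumption: surface it rather than retry forever.
    return ErrorClass.PERMANENT
```

Local faults (disk full, file locked) originate from the filesystem rather than an HTTP response, so they would be classified by a separate path that this sketch omits.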

4. Crash recovery

The queue and trees are in the transactional DB → on restart the engine resumes pending ops; in-flight transfers resume from committed chunks (storage/05).
Re-planning on startup reconciles anything that changed while down (04).
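
A minimal sketch of the restart step, assuming a simplified `queue_op` table with hypothetical `state` values and a `committed_chunks` counter (the real schema lives in 03 and storage/05). Ops that were mid-flight when the process died go back to Ready in one transaction, and their chunk progress is left untouched so the transfer can resume rather than restart:

```python
import sqlite3

# Hypothetical, simplified schema for illustration only.
SCHEMA = """
CREATE TABLE queue_op (
    op_id            TEXT PRIMARY KEY,
    state            TEXT NOT NULL,              -- 'blocked' | 'ready' | 'running' | 'done'
    committed_chunks INTEGER NOT NULL DEFAULT 0  -- chunks durably acknowledged so far
);
"""


def recover(conn: sqlite3.Connection) -> int:
    """Return ops left in 'running' by a crash to 'ready'.

    Returns the number of ops reset. committed_chunks is not modified.
    """
    with conn:  # commits on success, rolls back if the statement fails
        cur = conn.execute(
            "UPDATE queue_op SET state = 'ready' WHERE state = 'running'"
        )
    return cur.rowcount
```

Because the reset is a single transactional UPDATE, a second crash during recovery leaves the table either fully reset or unchanged, and running `recover` again is harmless.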

5. Tradeoffs / Alternatives / Scaling

Tradeoffs. A persisted queue costs DB writes per op, but it is what makes offline-first and crash recovery real rather than aspirational. Re-planning instead of rebasing the queue does redundant planning work on reconnect, but that work is cheap (metadata-only, scoped to the dirty set) and far safer than hand-written rebase logic.

Alternatives considered.

In-memory queue: faster, but loses all pending work on crash/restart, which is unacceptable for an offline-first engine.
Rebase queued ops onto new remote state: error-prone; superseded by pure re-planning (ADR-0022).
Strict FIFO (no priority/fairness): simple but lets a 50 GB upload block an urgent small edit; rejected.

Scaling concerns.

Huge queues (millions of pending ops after a big import) → batch enqueue, paginate scheduling, cap in-flight; the queue is indexed by (state, priority, next_retry_at).
Starvation → fairness scheduling + priority aging.
Retry storms after an outage → jittered backoff + a global concurrency cap so reconnect doesn’t hammer the server (server-side jitter too, 07).
Metered/mobile → defer large/background transfers; honor user network policy.
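
The `(state, priority, next_retry_at)` index and paginated scheduling could look roughly like this. The table layout, column types, index name, and `next_batch` are assumptions for illustration; only the indexed column triple comes from the text above:

```python
import sqlite3
from typing import List

# Hypothetical, simplified schema for illustration only.
SCHEMA = """
CREATE TABLE queue_op (
    op_id         INTEGER PRIMARY KEY,
    state         TEXT    NOT NULL,
    priority      INTEGER NOT NULL,          -- lower runs first (assumption)
    next_retry_at REAL    NOT NULL DEFAULT 0 -- epoch seconds; 0 = eligible now
);
CREATE INDEX queue_op_sched ON queue_op (state, priority, next_retry_at);
"""


def next_batch(conn: sqlite3.Connection, now: float, limit: int = 100) -> List[int]:
    """Fetch one page of runnable op ids instead of loading the whole queue."""
    rows = conn.execute(
        """
        SELECT op_id FROM queue_op
        WHERE state = 'ready' AND next_retry_at <= ?
        ORDER BY priority, next_retry_at, op_id
        LIMIT ?
        """,
        (now, limit),
    ).fetchall()
    return [row[0] for row in rows]
```

The scheduler would call `next_batch` repeatedly with a `limit` tied to the in-flight cap, so a queue with millions of rows is never materialized in memory at once. Ops still in backoff are skipped by the `next_retry_at <= now` filter until their delay has elapsed.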