11 — Failure Modes & Analysis

Deliverable: failure analysis. Answers: What are the common failure modes? Sync is full of subtle, data-destroying edge cases; this section catalogs them along with the defenses. Each row reads: failure → cause → detection → mitigation → residual risk (sections H to J use a reduced column set). The overriding principle is an asymmetry: a sync bug that loses data is catastrophic, while one that merely delays or duplicates data is recoverable, so every defense favors safety.


A. Observation failures (watcher / scan)

| Failure | Cause | Detection | Mitigation | Residual |
| --- | --- | --- | --- | --- |
| Missed change | watcher overflow (IN_Q_OVERFLOW, RDCW 0-byte, FSEvents coalesce) | overflow signal / next scan | watcher = hint; authoritative rescan + hash (04, ADR-0025) | higher latency until rescan |
| Feedback loop | our own write fires watcher → re-upload | hash == Synced | self-write suppression + hash compare (04 §4) | none (content-based) |
| No events at all | NFS/SMB, some FUSE | n/a | polling fallback (degraded latency) | latency only |
| Move seen as delete+create | watcher granularity | hash/inode match | move detection → rename op (04 §5) | rare false-move (still correct, just re-uploads) |
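As a rough illustration of the two content-based defenses in this table, the sketch below suppresses self-writes by comparing against the Synced hash and pairs a delete with a create when their hashes match. The function names, the `path -> hash` dictionaries, and the choice of SHA-256 are assumptions made for the example, not the project's actual API.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """SHA-256 of the file content (hash algorithm is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def should_upload(path: str, data: bytes, synced: dict[str, str]) -> bool:
    """Suppress self-writes: upload only if content differs from the Synced hash."""
    return synced.get(path) != content_hash(data)


def detect_moves(deleted: dict[str, str], created: dict[str, str]) -> list[tuple[str, str]]:
    """Pair a deleted path with a created path that has the same content hash.

    Both arguments map path -> content hash. Unpaired entries remain plain
    deletes/creates, which is still correct (the file is simply re-uploaded).
    """
    by_hash: dict[str, list[str]] = {}
    for old_path in sorted(deleted):
        by_hash.setdefault(deleted[old_path], []).append(old_path)
    moves = []
    for new_path in sorted(created):
        candidates = by_hash.get(created[new_path])
        if candidates:
            moves.append((candidates.pop(0), new_path))
    return moves
```

Because both checks look only at content, a watcher event caused by the client's own download never triggers an upload, regardless of event ordering.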

B. Clock & metadata

| Failure | Cause | Detection | Mitigation | Residual |
| --- | --- | --- | --- | --- |
| Wrong change decision from mtime | clock skew, mtime reset/forged, timezone | hash disagrees with mtime | content hash is truth; mtime only a fast-path hint; scheduled deep re-hash (03 §3) | extra hashing |
| Misidentification from inode reuse | OS recycles inode | hash check | identity by node-ID + hash, not inode alone | negligible |
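A minimal sketch of "mtime is a hint, hash is truth" might look like the following. The `SyncedEntry` record and the `deep` flag are hypothetical names for this example; the real record layout lives in 03.

```python
import hashlib
import os
from dataclasses import dataclass


@dataclass
class SyncedEntry:
    """Hypothetical record of what was last synced for one file."""
    size: int
    mtime_ns: int
    sha256: str


def file_hash(path: str, block_size: int = 1 << 20) -> str:
    """Hash the file in blocks so large files are never fully buffered."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def has_changed(path: str, entry: SyncedEntry, deep: bool = False) -> bool:
    """Fast path on size + mtime; the content hash decides everything else."""
    st = os.stat(path)
    if not deep and st.st_size == entry.size and st.st_mtime_ns == entry.mtime_ns:
        return False  # metadata unchanged: trust the hint, skip hashing
    return file_hash(path) != entry.sha256
```

The fast path can be fooled by a same-size edit with a restored mtime, which is why the table pairs it with a scheduled deep re-hash (`deep=True` here).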

C. Concurrency & conflict

| Failure | Cause | Detection | Mitigation | Residual |
| --- | --- | --- | --- | --- |
| Lost update (overwrite) | two devices edit same file | optimistic concurrency (stale base) | conflicted copy, never overwrite (09, ADR-0008/0026) | user reconciles |
| Conflict storm | rapid concurrent edits | rate of conflicted copies | rate-limit + coalesce + warn (09 §7) | noise (no data loss) |
| N devices → N conflict copies | client-side resolution | | server-anchored resolution → one copy fans out (09 §3) | none |
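The first and third rows can be illustrated with a toy in-memory server that accepts a commit only when the client's base version is current and otherwise stores the data as a conflicted copy. The `Server` class, the `commit` signature, and the conflicted-copy naming scheme are all assumptions for this sketch; the real rules are in 09.

```python
import os
from dataclasses import dataclass, field


@dataclass
class Server:
    """Toy stand-in for the server-side namespace (hypothetical)."""
    versions: dict[str, int] = field(default_factory=dict)
    contents: dict[str, bytes] = field(default_factory=dict)

    def commit(self, path: str, base_version: int, data: bytes, device: str) -> str:
        """Return the path the data was stored under."""
        current = self.versions.get(path, 0)
        if base_version == current:
            # Base is current: normal fast-forward commit.
            self.versions[path] = current + 1
            self.contents[path] = data
            return path
        # Stale base: never overwrite; store a uniquely named conflicted copy.
        root, ext = os.path.splitext(path)
        n = 1
        conflict_path = f"{root} (conflicted copy, {device}){ext}"
        while conflict_path in self.contents:
            n += 1
            conflict_path = f"{root} (conflicted copy, {device} {n}){ext}"
        self.versions[conflict_path] = 1
        self.contents[conflict_path] = data
        return conflict_path
```

Because the server picks the conflicted-copy name, every device receives the same single copy through normal sync instead of each minting its own.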

D. Crash & partial state

| Failure | Cause | Detection | Mitigation | Residual |
| --- | --- | --- | --- | --- |
| Partial/garbage file visible | crash mid-download | | temp + fsync + atomic rename; never expose partial (03 §4, ADR-0027) | orphan temp (GC’d) |
| Partial upload | crash mid-upload | re-negotiate | resumable via committed chunks (storage/05) | re-send last chunk |
| DB-says-synced, file-missing | crash between apply and record | startup scan | apply-then-record ordering; reconcile on scan | re-download |
| Local DB corruption | disk/power fault | integrity check | rebuildable cache: rescan + re-list, Synced=∅ (conservative, non-destructive) (03 §5) | slow rebuild |
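The "temp + fsync + atomic rename" mitigation can be sketched with standard-library calls as below. The `.sync-` prefix for temp files and the `atomic_write` name are assumptions; a real client would also stream data rather than take it as one `bytes` value.

```python
import os
import tempfile


def atomic_write(dest: str, data: bytes) -> None:
    """Write data so readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(dest))
    # Temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".sync-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # data is durable before it becomes visible
        os.replace(tmp, dest)  # atomic rename over the destination
    except BaseException:
        try:
            os.unlink(tmp)  # best-effort cleanup; a crash leaves an orphan for GC
        except FileNotFoundError:
            pass
        raise
    if os.name == "posix":
        # Persist the rename itself by syncing the containing directory.
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

A crash before `os.replace` leaves only an orphan temp file, which matches the residual risk listed in the table.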

E. Protocol

| Failure | Cause | Detection | Mitigation | Residual |
| --- | --- | --- | --- | --- |
| Cursor invalid/expired | journal pruned, namespace reset, long offline | 409 / FAILED_PRECONDITION | full re-list → rebuild Remote tree → diff (07 §2) | one expensive resync |
| Lost notification | realtime tier drop | periodic poll | notify is lossy by design; pull is authoritative (07 §1) | latency only |
| Duplicate notification/delta | at-least-once | idempotent apply | apply by node-ID + version is idempotent | none |
| Thundering herd | mass reconnect | server load | jitter on longpoll/reconnect (07 §8) | brief load spike |
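Two of these mitigations are small enough to sketch directly: applying deltas idempotently by node-ID + version, and adding jitter to reconnect delays. The delta shape (`node_id`, `version`, `name`) and the exponential "full jitter" formula are assumptions for illustration; the document only specifies that jitter is used.

```python
import random


def apply_delta(tree: dict[str, dict], delta: dict) -> bool:
    """Apply a delta only if it is newer than what we hold; return True if applied."""
    known = tree.get(delta["node_id"])
    if known is not None and delta["version"] <= known["version"]:
        return False  # duplicate or stale delta: applying again changes nothing
    tree[delta["node_id"]] = {"version": delta["version"], "name": delta["name"]}
    return True


def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 60.0, rng=random) -> float:
    """Pick a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))
```

With this shape, at-least-once delivery is harmless: re-delivering version 1 after version 2 has been applied is simply ignored.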

F. Data-loss hazards (the scary ones)

| Failure | Cause | Detection | Mitigation | Residual |
| --- | --- | --- | --- | --- |
| Mass-delete propagation | folder deleted/unmounted/ransomware encrypts → looks like edits+deletes | plan deletes/changes > threshold | bulk-change circuit breaker → SafetyHold (§ below, ADR-0027) | requires user confirm |
| Edit lost to a delete | delete vs edit race | planner | edit beats delete (09) | undone delete surfaced |
| Accidental delete | user error | | server trash + version history retention (storage/07) | restore from trash |
| Mount point empty (drive offline) | volume unmounted → all files “gone” | empty/over-threshold | circuit breaker + “is the volume mounted?” check | pause until resolved |
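For the last row, one possible pre-scan guard is sketched below: refuse to trust a scan when the sync root is missing, or is empty while the database still records synced files. The function name and the specific heuristics are assumptions; the document only states that a "is the volume mounted?" check exists.

```python
import os


def scan_is_trustworthy(root: str, synced_count: int) -> bool:
    """Return False when a scan of root would look like a mass delete by accident."""
    if not os.path.isdir(root):
        return False  # root vanished: volume probably unmounted
    with os.scandir(root) as entries:
        is_empty = next(entries, None) is None
    if is_empty and synced_count > 0:
        return False  # empty root but DB has files: pause instead of planning deletes
    return True
```

When this returns False, the safe reaction is to pause and surface the condition rather than feed the scan result to the planner.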

G. Path & encoding (cross-platform)

| Failure | Cause | Detection | Mitigation | Residual |
| --- | --- | --- | --- | --- |
| Case collision | File/file on case-insensitive FS (macOS/Windows) | name compare | keep both w/ suffix; surface (09) | one renamed |
| Unicode normalization | macOS NFD vs Linux NFC | normalized compare | normalize names; conflicted suffix on true clash | rare rename |
| Illegal chars / reserved names | : \\ CON PRN etc. per OS | validation | sanitize/escape on download; surface | display-name differences |
| Path too long | Windows MAX_PATH, depth | length check | long-path APIs; surface unsyncable | some files skipped + surfaced |
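The detection column for the first three rows could be approximated as follows. The helper names are hypothetical, and NFC + `casefold()` is only an approximation of what real case-insensitive filesystems do; the Windows character and reserved-name lists follow Microsoft's published naming rules.

```python
import unicodedata
from collections import defaultdict

WINDOWS_RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)
WINDOWS_ILLEGAL = set('<>:"/\\|?*')


def collision_key(name: str) -> str:
    """Names with the same key clash on a case-insensitive, normalizing FS."""
    return unicodedata.normalize("NFC", name).casefold()


def find_collisions(names: list[str]) -> list[list[str]]:
    """Group sibling names that differ only by case or Unicode normalization."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name in names:
        groups[collision_key(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


def windows_problems(name: str) -> list[str]:
    """List the reasons a single path component is not valid on Windows."""
    problems = []
    if any(ch in WINDOWS_ILLEGAL or ord(ch) < 32 for ch in name):
        problems.append("illegal character")
    if name.split(".")[0].rstrip(" ").upper() in WINDOWS_RESERVED:
        problems.append("reserved name")
    if name.endswith((" ", ".")):
        problems.append("trailing space or dot")
    return problems
```

Running the collision check per directory (siblings only) keeps it cheap and matches how the clash would actually appear on disk.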

H. Resource limits

| Failure | Cause | Mitigation |
| --- | --- | --- |
| Disk full (local) | downloads exceed space | pause downloads, surface; never partial-apply |
| Quota exceeded (server) | tenant over quota | uploads dead-letter + surface; local data kept safe |
| inotify watch limit | millions of dirs | bounded watches + scan-heavier mode for cold subtrees (04) |
| Memory | huge file | streaming chunking, never buffer whole (08) |
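The "streaming chunking" row amounts to reading one chunk at a time and hashing it as it goes, so memory use is bounded by the chunk size. The sketch below assumes fixed-size chunks, SHA-256, and a 4 MiB default; the actual chunking scheme is defined in 08.

```python
import hashlib
from typing import BinaryIO, Iterator


def iter_chunks(stream: BinaryIO, chunk_size: int = 4 * 1024 * 1024) -> Iterator[tuple[int, str, bytes]]:
    """Yield (index, sha256 hex, bytes) for each chunk without buffering the whole file."""
    index = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of stream
        yield index, hashlib.sha256(chunk).hexdigest(), chunk
        index += 1
```

Each yielded chunk can be uploaded and then dropped before the next read, which also gives upload resume a natural unit.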

I. Network & security

| Failure | Cause | Mitigation |
| --- | --- | --- |
| Partition / flaky | mobile, captive portal | offline queue + resume (10) |
| Metered network | mobile data | Wi-Fi-only / defer large transfers (10) |
| MITM / tampered bytes | hostile network/CDN | TLS + content-hash verify every chunk (storage/04) |
| Malicious/buggy server response | bad manifest/chunk | hash verification rejects; version checks |
| Replay of stale delta | | cursor monotonic + version compare → idempotent |
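Per-chunk content-hash verification can be sketched as a function that either returns the verified bytes or raises. The exception type, the function name, and the use of SHA-256 hex digests are assumptions here; the manifest format itself is specified in storage/04.

```python
import hashlib
import hmac


class ChunkVerificationError(Exception):
    """Raised when downloaded bytes do not match the hash in the manifest."""


def verify_chunk(data: bytes, expected_sha256_hex: str) -> bytes:
    """Return data unchanged if its SHA-256 matches, otherwise raise."""
    actual = hashlib.sha256(data).hexdigest()
    if not hmac.compare_digest(actual, expected_sha256_hex.lower()):
        raise ChunkVerificationError(
            f"chunk hash mismatch: expected {expected_sha256_hex}, got {actual}"
        )
    return data
```

Rejecting at the chunk level means a tampered or corrupted chunk is re-fetched on its own and never reaches the temp file that will be renamed into place.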

J. Special files

| Item | Policy |
| --- | --- |
| Symlinks | configurable: skip (default) or store as link metadata; never follow blindly (loop/exfiltration risk) |
| Hardlinks | synced as independent files (no hardlink semantics across devices) |
| Sparse files | preserve sparseness where FS supports; otherwise store logical bytes |
| FIFOs / devices / sockets | skipped + surfaced (not regular files) |
| Permissions / xattrs / ACLs | store as metadata where portable; document non-portability across OS |
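A scanner could classify entries for this policy table with `os.lstat`, which does not follow symlinks. The policy labels returned below are made up for the sketch, and only the default symlink policy (skip) is shown.

```python
import os
import stat


def policy_for(path: str) -> str:
    """Map a filesystem entry to a sync policy label (labels are hypothetical)."""
    st = os.lstat(path)  # lstat: inspect the link itself, never follow it
    mode = st.st_mode
    if stat.S_ISLNK(mode):
        return "skip-symlink"  # default policy; storing link metadata is the alternative
    if stat.S_ISDIR(mode):
        return "descend"
    if stat.S_ISREG(mode):
        # Hardlinked files are synced as independent files.
        return "sync-independent" if st.st_nlink > 1 else "sync"
    return "skip-and-surface"  # FIFO, socket, block/char device
```

Using `lstat` rather than `stat` is what prevents the scanner from following a symlink out of the sync root or into a loop.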

The bulk-change circuit breaker (ADR-0027)

This is the defense against the worst class of failure, “sync deleted/encrypted everything” (ransomware, a bad unmount, a script gone wrong, a buggy plan):

flowchart TB
    classDef d fill:#fde68a,stroke:#b45309,color:#111827;
    classDef c fill:#fecaca,stroke:#b91c1c,color:#111827;
    classDef o fill:#bbf7d0,stroke:#15803d,color:#111827;
    plan["Planner produced ops"]:::d --> m{"destructive ops > threshold?<br/>(e.g. > N files OR > X% of synced set<br/>deleted/overwritten)"}:::d
    m -- no --> go["execute normally"]:::o
    m -- yes --> hold["SafetyHold: pause execution"]:::c
    hold --> chk{"volume mounted?<br/>encryption signature?<br/>plausible user action?"}:::c
    chk --> ask["Surface to user: confirm or cancel<br/>(show what would change)"]:::c
    ask -- confirm --> go
    ask -- cancel --> revert["keep local + remote as-is; do not propagate"]:::o

Even if the user confirms a change that later proves wrong, server trash + version history (storage/07) keep deletes recoverable for the retention window: defense in depth.
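The threshold decision at the top of the flowchart could look roughly like the sketch below. The `BreakerConfig` fields, their default values (100 operations, 10%), and the operation labels are placeholders chosen for the example; ADR-0027 is the authority on the real thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakerConfig:
    """Hypothetical thresholds: the N and X% from the flowchart."""
    max_destructive_ops: int = 100
    max_destructive_fraction: float = 0.10


def needs_safety_hold(ops: list[str], synced_count: int, cfg: BreakerConfig = BreakerConfig()) -> bool:
    """Return True when the plan should pause in SafetyHold for user confirmation."""
    destructive = sum(1 for op in ops if op in ("delete", "overwrite"))
    if destructive > cfg.max_destructive_ops:
        return True  # more than N files affected
    if synced_count > 0 and destructive / synced_count > cfg.max_destructive_fraction:
        return True  # more than X% of the synced set affected
    return False
```

For example, 30 deletes against a synced set of 200 files is 15% and trips the breaker, while 3 deletes against 1,000 files executes normally. Non-destructive operations such as uploads never count toward either threshold.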


Testing strategy (how we gain confidence)

These map directly to the project’s “definition of done” for sync (09 evolution roadmap).