11 — Failure Modes & Analysis
Deliverable: failure analysis. Answers: What are common failure modes? Sync is
a field of subtle, data-destroying edge cases; this is the catalog and the defenses.
Each row reads: failure → cause → detection → mitigation → residual risk. The
overriding principle is an asymmetry: a sync bug that loses data is catastrophic, while one
that merely delays or duplicates is recoverable, so every defense favors safety.
A. Observation failures (watcher / scan)

| Failure | Cause | Detection | Mitigation | Residual |
|---|---|---|---|---|
| Missed change | watcher overflow (IN_Q_OVERFLOW, RDCW 0-byte, FSEvents coalesce) | overflow signal / next scan | watcher = hint; authoritative rescan + hash (04, ADR-0025) | higher latency until rescan |
| Feedback loop | our own write fires watcher → re-upload | hash == Synced | self-write suppression + hash compare (04 §4) | none (content-based) |
| No events at all | NFS/SMB, some FUSE | n/a | polling fallback (degraded latency) | latency only |
| Move seen as delete+create | watcher granularity | hash/inode match | move detection → rename op (04 §5) | rare false-move (still correct, just re-uploads) |

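The "watcher = hint, rescan is authoritative" rule in the table above can be illustrated with a minimal sketch. It assumes the scan yields a path → content-hash map and that the Synced state is stored the same way; the function name `rescan_diff` and both map shapes are hypothetical, not taken from the design.

```python
from typing import Dict, Set, Tuple


def rescan_diff(scanned: Dict[str, str], synced: Dict[str, str]) -> Tuple[Set[str], Set[str]]:
    """Compare a full scan against the Synced state, ignoring watcher events.

    Returns (changed_or_new_paths, deleted_paths). A path whose hash equals
    its Synced hash is never reported, which is what suppresses the feedback
    loop caused by our own writes.
    """
    changed = {path for path, digest in scanned.items() if synced.get(path) != digest}
    deleted = set(synced) - set(scanned)
    return changed, deleted
```

Because the diff is computed from content hashes alone, a dropped or coalesced watcher event only delays detection until the next scan; it cannot change the result.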

| Failure | Cause | Detection | Mitigation | Residual |
|---|---|---|---|---|
| Wrong change decision from mtime | clock skew, mtime reset/forged, timezone | hash disagrees with mtime | content hash is truth; mtime only a fast-path hint; scheduled deep re-hash (03 §3) | extra hashing |
| inode reuse misidentify | OS recycles inode | hash check | identity by node-ID + hash, not inode alone | negligible |

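A minimal sketch of "mtime as fast-path hint, hash as truth" follows. The `SyncedRecord` shape, the `deep` flag standing in for the scheduled deep re-hash, and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
import os
from dataclasses import dataclass


@dataclass
class SyncedRecord:
    """Hypothetical per-file record kept in the local DB."""
    size: int
    mtime_ns: int
    sha256: str


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so it is never buffered whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def is_changed(path: str, rec: SyncedRecord, deep: bool = False) -> bool:
    """Decide whether a file changed since it was last synced."""
    st = os.stat(path)
    if not deep and st.st_size == rec.size and st.st_mtime_ns == rec.mtime_ns:
        return False  # fast path: metadata matches, skip hashing
    # Metadata differs (or a deep pass was requested): the hash decides.
    return file_sha256(path) != rec.sha256
```

The fast path can be fooled by a same-size edit whose mtime was reset, which is why the table pairs it with a periodic deep pass (`deep=True` here). A bare touch that changes only mtime costs one extra hash and is then correctly reported as unchanged.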
C. Concurrency & conflict

| Failure | Cause | Detection | Mitigation | Residual |
|---|---|---|---|---|
| Lost update (overwrite) | two devices edit same file | optimistic concurrency (stale base) | conflicted copy, never overwrite (09, ADR-0008/0026) | user reconciles |
| Conflict storm | rapid concurrent edits | rate of conflicted copies | rate-limit + coalesce + warn (09 §7) | noise (no data loss) |
| N devices → N conflict copies | client-side resolution | n/a | server-anchored resolution → one copy fans out (09 §3) | none |

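The stale-base check behind "conflicted copy, never overwrite" can be sketched as a server-side commit. The `Node` type and `commit` function are hypothetical names for illustration; the real protocol is described in 09.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical server-side state for one file."""
    version: int = 0
    content: bytes = b""
    conflicted_copies: List[Tuple[str, bytes]] = field(default_factory=list)


def commit(node: Node, base_version: int, content: bytes, device: str) -> str:
    """Accept a write only if the client edited the current head version."""
    if base_version == node.version:
        node.version += 1
        node.content = content
        return "committed"
    # Stale base: keep the head untouched and preserve the loser's bytes.
    node.conflicted_copies.append((device, content))
    return "conflicted"
```

Because the decision is made once, at the server, two devices racing on the same base produce a single conflicted copy that every device then receives, rather than each client inventing its own.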
D. Crash & partial state

| Failure | Cause | Detection | Mitigation | Residual |
|---|---|---|---|---|
| Partial/garbage file visible | crash mid-download | n/a | temp + fsync + atomic rename; never expose partial (03 §4, ADR-0027) | orphan temp (GC’d) |
| Partial upload | crash mid-upload | re-negotiate | resumable via committed chunks (storage/05) | re-send last chunk |
| DB-says-synced, file-missing | crash between apply and record | startup scan | apply-then-record ordering; reconcile on scan | re-download |
| Local DB corruption | disk/power fault | integrity check | rebuildable cache: rescan + re-list, Synced=∅ (conservative, non-destructive) (03 §5) | slow rebuild |

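The "temp + fsync + atomic rename" mitigation is a standard pattern; a minimal sketch is below. The `.sync-` temp-file prefix and the function name `atomic_write` are assumptions, and the sketch omits the directory fsync that a POSIX implementation would typically add after the rename.

```python
import os
import tempfile


def atomic_write(dest: str, data: bytes) -> None:
    """Write data so readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(dest))
    # The temp file lives in the destination directory so the final rename
    # stays on one filesystem, which is what makes it atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".sync-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # force the bytes to disk before renaming
        os.replace(tmp_path, dest)  # atomic replace of the visible name
    except BaseException:
        # On failure, remove the temp file; dest was never touched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

A crash before `os.replace` leaves only an orphan temp file, which matches the residual risk listed in the table; the destination never holds partial content.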
E. Protocol

| Failure | Cause | Detection | Mitigation | Residual |
|---|---|---|---|---|
| Cursor invalid/expired | journal pruned, namespace reset, long offline | 409 / FAILED_PRECONDITION | full re-list → rebuild Remote tree → diff (07 §2) | one expensive resync |
| Lost notification | realtime tier drop | periodic poll | notify is lossy by design; pull is authoritative (07 §1) | latency only |
| Duplicate notification/delta | at-least-once | idempotent apply | apply by node-ID + version is idempotent | none |
| Thundering herd | mass reconnect | server load | jitter on longpoll/reconnect (07 §8) | brief load spike |

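Two of the protocol mitigations above are small enough to sketch together: idempotent apply keyed on node-ID + version, and randomized reconnect delay. The tree shape (node-ID → (version, payload)), the function names, and the "full jitter" backoff variant with its default base and cap are all assumptions for illustration.

```python
import random
from typing import Dict, Tuple


def apply_delta(tree: Dict[str, Tuple[int, str]], node_id: str, version: int, payload: str) -> bool:
    """Apply a delta unless this version (or a newer one) is already applied."""
    current = tree.get(node_id)
    if current is not None and current[0] >= version:
        return False  # duplicate or stale delivery: safe to ignore
    tree[node_id] = (version, payload)
    return True


def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Pick a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))
```

Delivering the same delta twice leaves the tree unchanged, so at-least-once delivery needs no deduplication layer. Spreading reconnects uniformly over the backoff window keeps clients that dropped together from returning together.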
F. Data-loss hazards (the scary ones)

| Failure | Cause | Detection | Mitigation | Residual |
|---|---|---|---|---|
| Mass-delete propagation | folder deleted/unmounted/ransomware encrypts → looks like edits+deletes | plan deletes/changes > threshold | bulk-change circuit breaker → SafetyHold (§ below, ADR-0027) | requires user confirm |
| Edit lost to a delete | delete vs edit race | planner | edit beats delete (09) | undone delete surfaced |
| Accidental delete | user error | n/a | server trash + version history retention (storage/07) | restore from trash |
| Mount point empty (drive offline) | volume unmounted → all files “gone” | empty/over-threshold | circuit breaker + “is the volume mounted?” check | pause until resolved |

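One way to implement the “is the volume mounted?” check is a marker file written into the sync root at setup time; if the marker is missing or holds a different ID, the root is treated as unavailable rather than empty. This is a sketch of that idea only: the marker name `.sync-root-id` and the function `root_available` are hypothetical and not taken from the design.

```python
import os

MARKER_NAME = ".sync-root-id"  # hypothetical marker file, created at setup


def root_available(root: str, expected_id: str) -> bool:
    """Return True only if the sync root holds the marker we wrote earlier."""
    try:
        with open(os.path.join(root, MARKER_NAME), "r", encoding="utf-8") as f:
            return f.read().strip() == expected_id
    except OSError:
        # Missing root, missing marker, or unreadable volume: do not treat
        # the absence of files as deletions.
        return False
```

When this returns False the engine would pause rather than plan deletes, which matches the "pause until resolved" residual in the table.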

| Failure | Cause | Detection | Mitigation | Residual |
|---|---|---|---|---|
| Case collision | File/file on case-insensitive FS (macOS/Windows) | name compare | keep both w/ suffix; surface (09) | one renamed |
| Unicode normalization | macOS NFD vs Linux NFC | normalized compare | normalize names; conflicted suffix on true clash | rare rename |
| Illegal chars / reserved names | : \\ CON PRN etc. per OS | validation | sanitize/escape on download; surface | display-name differences |
| Path too long | Windows MAX_PATH, depth | length check | long-path APIs; surface unsyncable | some files skipped + surfaced |

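Case and normalization collisions can both be detected by comparing names under a single folded key. The sketch below assumes NFC plus Unicode case folding as that key; real filesystems apply their own folding rules, so it approximates them rather than reproducing any one exactly. The function names are hypothetical.

```python
import unicodedata
from collections import defaultdict
from typing import Iterable, List


def collision_key(name: str) -> str:
    """Fold a file name so case and normalization variants compare equal."""
    return unicodedata.normalize("NFC", name).casefold()


def find_collisions(names: Iterable[str]) -> List[List[str]]:
    """Group names in one directory that would clash on a folding filesystem."""
    groups = defaultdict(list)
    for name in names:
        groups[collision_key(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]
```

For example, `File.txt` and `file.txt` land in one group, as do the precomposed and decomposed spellings of `café`. Each group with more than one member is a candidate for the "keep both with suffix" policy.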
H. Resource limits

| Failure | Cause | Mitigation |
|---|---|---|
| Disk full (local) | downloads exceed space | pause downloads, surface; never partial-apply |
| Quota exceeded (server) | tenant over quota | uploads dead-letter + surface; local data kept safe |
| inotify watch limit | millions of dirs | bounded watches + scan-heavier mode for cold subtrees (04) |
| Memory | huge file | streaming chunking, never buffer whole (08) |

I. Network & security

| Failure | Cause | Mitigation |
|---|---|---|
| Partition / flaky | mobile, captive portal | offline queue + resume (10) |
| Metered network | mobile data | Wi-Fi-only / defer large transfers (10) |
| MITM / tampered bytes | hostile network/CDN | TLS + content-hash verify every chunk (storage/04) |
| Malicious/buggy server response | bad manifest/chunk | hash verification rejects; version checks |
| Replay of stale delta | n/a | cursor monotonic + version compare → idempotent |

J. Special files

| Item | Policy |
|---|---|
| Symlinks | configurable: skip (default) or store as link metadata; never follow blindly (loop/exfiltration risk) |
| Hardlinks | synced as independent files (no hardlink semantics across devices) |
| Sparse files | preserve sparseness where FS supports; otherwise store logical bytes |
| FIFOs / devices / sockets | skipped + surfaced (not regular files) |
| Permissions / xattrs / ACLs | store as metadata where portable; document non-portability across OS |

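Applying these policies starts with classifying each entry without following links. A minimal sketch using `os.lstat` is shown here; the four category labels and the function name `classify` are illustrative choices, not part of the design.

```python
import os
import stat


def classify(path: str) -> str:
    """Classify a directory entry without following symlinks."""
    mode = os.lstat(path).st_mode  # lstat inspects the link itself
    if stat.S_ISLNK(mode):
        return "symlink"
    if stat.S_ISREG(mode):
        return "regular"
    if stat.S_ISDIR(mode):
        return "directory"
    return "special"  # FIFO, socket, block or character device
```

Using `lstat` rather than `stat` is what prevents a link from being followed blindly; a "special" result maps to the skipped-and-surfaced policy in the table.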
The bulk-change circuit breaker (ADR-0027)
The defense against the worst class — “sync deleted/encrypted everything” (ransomware, a
bad unmount, a script gone wrong, a buggy plan):
flowchart TB
classDef d fill:#fde68a,stroke:#b45309,color:#111827;
classDef c fill:#fecaca,stroke:#b91c1c,color:#111827;
classDef o fill:#bbf7d0,stroke:#15803d,color:#111827;
plan["Planner produced ops"]:::d --> m{"destructive ops > threshold?<br/>(e.g. > N files OR > X% of synced set<br/>deleted/overwritten)"}:::d
m -- no --> go["execute normally"]:::o
m -- yes --> hold["SafetyHold: pause execution"]:::c
hold --> chk{"volume mounted?<br/>encryption signature?<br/>plausible user action?"}:::c
chk --> ask["Surface to user: confirm or cancel<br/>(show what would change)"]:::c
ask -- confirm --> go
ask -- cancel --> revert["keep local + remote as-is; do not propagate"]:::o
Even if confirmed-and-wrong, server trash + version history (storage/07)
mean deletes are recoverable for the retention window — defense in depth.
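The threshold test at the top of the flowchart can be sketched as follows. The design leaves N and X% unspecified, so the defaults below (100 ops, 25%) are placeholder assumptions, as are the names `BreakerConfig` and `should_hold`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakerConfig:
    """Hypothetical thresholds; the design only names them N and X%."""
    max_destructive_ops: int = 100
    max_destructive_fraction: float = 0.25


def should_hold(destructive_ops: int, synced_set_size: int,
                cfg: BreakerConfig = BreakerConfig()) -> bool:
    """Return True when a plan should enter SafetyHold instead of executing."""
    if destructive_ops > cfg.max_destructive_ops:
        return True  # absolute count exceeded
    if synced_set_size > 0 and destructive_ops / synced_set_size > cfg.max_destructive_fraction:
        return True  # too large a share of the synced set
    return False
```

With only a fraction test, very small sync sets would trip the breaker on routine deletes (one of two files is 50%), so a practical version would likely add a minimum-count floor; that refinement is left out of this sketch.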
Testing strategy (how we gain confidence)
- Property tests of the pure planner (Dropbox CanopyCheck-style): generate arbitrary
(R,L,S) triples → assert the engine converges all three to equality with no data
loss (05 §4).
- Concurrent-edit harness (Trinity-style / the ADR-0008 conflict harness): two
simulated devices edit the same file offline → assert exactly one conflicted copy,
both versions recoverable (09).
- Fault injection: kill mid-upload / mid-download / mid-apply → assert no partial
files, no dangling refs, correct resume (maps to storage/11 GC invariants).
- Watcher chaos: force overflow / drop events → assert rescan recovers full
correctness.
- Cross-platform matrix: case/unicode/path-length collisions across macOS/Windows/Linux.
- Cursor-reset: invalidate cursor mid-sync → assert clean full-relist recovery.
These map directly to the project’s “definition of done” for sync
(09 evolution roadmap).
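To illustrate the planner property test, here is a toy single-path planner checked exhaustively over a small value domain. The toy `plan` function is a stand-in written for this example and is not the project's planner. It models each of Remote, Local, and Synced as a content value or `None` (absent), and returns the converged value plus any conflicted copies.

```python
from itertools import product
from typing import List, Optional, Tuple

State = Optional[str]  # content value, or None when the file is absent


def plan(remote: State, local: State, synced: State) -> Tuple[State, List[str]]:
    """Toy planner: return (converged value, conflicted copies) for one path."""
    if remote == local:
        return remote, []   # already equal
    if local == synced:
        return remote, []   # only remote changed: apply it locally
    if remote == synced:
        return local, []    # only local changed: upload it
    if remote is None:
        return local, []    # edit beats delete
    if local is None:
        return remote, []   # edit beats delete
    return remote, [local]  # two different edits: keep both


def check_no_data_loss(values=(None, "a", "b", "c")) -> int:
    """Assert every edited version survives, for all (R, L, S) triples."""
    checked = 0
    for r, l, s in product(values, repeat=3):
        final, copies = plan(r, l, s)
        survivors = {final, *copies}
        for edited in (r, l):
            if edited is not None and edited != s:
                assert edited in survivors, (r, l, s)
        checked += 1
    return checked
```

Convergence holds by construction here, since the planner returns one value that all three trees adopt. The no-data-loss property is the one being checked: any content that differs from Synced must end up as the converged value or as a conflicted copy.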