ADR-0027 — Sync safety guards (atomic apply, self-write suppression, bulk-change brake)
- Status: Accepted
- Date: 2026-06-11
- Related: sync/11 failure modes, sync/06 state machine, ADR-0022, storage ADR-0019
Context
Sync failures that lose or corrupt data are catastrophic and irreversible from the user’s view; failures that merely delay or duplicate are recoverable. Several specific hazards recur across all sync products: partially-written files exposed after a crash, an upload feedback loop from the client’s own writes, and mass-deletion propagation (a folder unmounts, ransomware encrypts a tree, or a buggy plan deletes everything — and the engine faithfully replicates the disaster to the cloud and every device).
Decision
Adopt three non-negotiable safety guards:
- Atomic local apply: downloads are written to a temp file, fsync’d, and atomically renamed into place; the DB records “synced” only after the rename (apply-then-record). A partial file is never visible; a crash leaves only a discardable temp.
- Self-write suppression: record (path, expected hash) before applying a download so the resulting watcher event is absorbed, not re-uploaded (ADR-0025).
- Bulk-change circuit breaker (SafetyHold): if a plan would delete or overwrite more than a threshold (e.g. > N files or > X% of the synced set), pause and require user confirmation (with a preview), after sanity checks (is the volume mounted? does this look like mass encryption?). Backed by server trash + version history (storage/07) so even a confirmed mistake is recoverable within the retention window.
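The first two guards can be sketched together, since the self-write record has to exist before the rename makes the file visible to the watcher. This is a minimal illustration, not the engine's actual code: `SelfWriteRegistry`, `atomic_apply`, and the `mark_synced` callback are hypothetical names, SHA-256 is an assumed choice of hash, and a production version would likely also fsync the containing directory and expire stale registry entries.

```python
import hashlib
import os
import tempfile


class SelfWriteRegistry:
    """Hypothetical in-memory record of writes the client is about to make."""

    def __init__(self):
        self._expected = {}  # absolute path -> expected SHA-256 hex digest

    def expect(self, path, digest):
        self._expected[os.path.abspath(path)] = digest

    def forget(self, path):
        self._expected.pop(os.path.abspath(path), None)

    def absorb(self, path):
        """Return True if a watcher event for `path` matches an expected self-write."""
        key = os.path.abspath(path)
        digest = self._expected.get(key)
        if digest is None:
            return False
        with open(key, "rb") as f:
            actual = hashlib.sha256(f.read()).hexdigest()
        if actual != digest:
            return False  # content differs: treat as a real local edit
        del self._expected[key]
        return True


def atomic_apply(dest, data, registry, mark_synced):
    """Write `data` to `dest` via temp file, fsync, and rename; then record."""
    digest = hashlib.sha256(data).hexdigest()
    dest_dir = os.path.dirname(os.path.abspath(dest))
    # The temp file lives in the destination directory so the rename
    # never crosses a volume boundary.
    fd, tmp = tempfile.mkstemp(dir=dest_dir, prefix=".sync-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())       # data is on disk before it becomes visible
        registry.expect(dest, digest)  # record (path, expected hash) first
        os.replace(tmp, dest)          # atomic rename (a POSIX requirement)
    except BaseException:
        registry.forget(dest)
        if os.path.exists(tmp):
            os.unlink(tmp)             # a failed apply leaves no temp behind
        raise
    mark_synced(dest, digest)          # apply-then-record: DB update comes last
    return digest
```

In this sketch a watcher handler would call `registry.absorb(path)` on each event and skip the upload when it returns True. If the process dies before `os.replace`, only the `.sync-*.tmp` file remains, and it can be deleted on the next start.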
Consequences
Positive
- Partially written files are never observed; applying downloads is crash-safe.
- No upload feedback loops.
- The worst-case data-loss class (mass delete/encrypt) is stopped before propagation — the defense that distinguishes a trustworthy sync product from a dangerous one.
Negative / costs
- The circuit breaker can false-positive on legitimate large deletes → mitigated by a clear confirm-with-preview UX and tunable thresholds; the asymmetric risk (data loss vs a confirmation prompt) justifies erring toward caution.
- Atomic rename requires temp space on the same volume; self-write tracking adds bookkeeping.
Alternatives considered
- Direct in-place writes: simpler, but exposes partial files and corrupts on crash. Rejected.
- No bulk-change brake (trust the plan): the literal cause of “the cloud deleted all my files” incidents. Rejected.
- Rely solely on server trash/versioning to undo mistakes: good defense-in-depth, but reactive; the brake prevents the bad propagation in the first place. We do both.
Scaling
Threshold checks are O(plan size) on already-computed plans (cheap); the brake only engages on rare large destructive plans, so steady-state sync is unaffected.
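A single-pass threshold check of the kind described above might look like the following sketch. The `Op` and `Verdict` types, the `check_plan` function, and the default limits (500 files, 25%) are all illustrative assumptions; this ADR leaves N and X as tunable values and does not define a plan format. The mount and mass-encryption sanity checks are omitted here.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PROCEED = "proceed"
    SAFETY_HOLD = "safety_hold"  # pause and ask the user to confirm


@dataclass(frozen=True)
class Op:
    kind: str  # hypothetical plan operation kind: "delete", "overwrite", "create", ...
    path: str


DESTRUCTIVE = {"delete", "overwrite"}


def check_plan(plan, synced_count, max_files=500, max_fraction=0.25):
    """Count destructive ops in one pass: O(len(plan)) time, O(1) extra space."""
    destructive = sum(1 for op in plan if op.kind in DESTRUCTIVE)
    if destructive > max_files:
        return Verdict.SAFETY_HOLD
    if synced_count > 0 and destructive / synced_count > max_fraction:
        return Verdict.SAFETY_HOLD
    return Verdict.PROCEED
```

For example, under these assumed limits a plan that deletes 10 files out of 1,000 synced proceeds, while the same plan against a 20-file set trips the fraction limit and is held for confirmation.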