ADR-0027 — Sync safety guards (atomic apply, self-write suppression, bulk-change brake)

Status: Accepted
Date: 2026-06-11
Related: sync/11 failure modes, sync/06 state machine, ADR-0022, storage ADR-0019

Context

Sync failures that lose or corrupt data are catastrophic and irreversible from the user’s view; failures that merely delay or duplicate are recoverable. Three hazards recur across sync products: partially written files exposed after a crash, an upload feedback loop triggered by the client’s own writes, and mass-deletion propagation. In the last case a folder unmounts, ransomware encrypts a tree, or a buggy plan deletes everything, and the engine faithfully replicates the disaster to the cloud and every device.

Decision

Adopt three non-negotiable safety guards:

Atomic local apply: downloads are written to a temp file, fsync’d, and atomically renamed into place; the DB records “synced” only after the rename (apply-then-record). A partial file is never visible; a crash leaves only a discardable temp file.
Self-write suppression: record (path, expected hash) before applying a download so the resulting watcher event is absorbed, not re-uploaded (ADR-0025).
Bulk-change circuit breaker (SafetyHold): if a plan would delete or overwrite more than a threshold (e.g. > N files or > X% of the synced set), pause the plan, run sanity checks (is the volume mounted? does this look like mass encryption?), and require user confirmation with a preview before anything is applied. Backed by server trash and version history (storage/07), so even a confirmed mistake is recoverable within the retention window.
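
The atomic local apply guard can be sketched as follows. This is a minimal illustration under stated assumptions, not the engine's actual code: the function name atomic_apply and the record_synced callback (standing in for the DB write) are hypothetical. The temp file is created in the destination directory so that the rename stays on one volume.

```python
import os
import tempfile
from typing import Callable


def atomic_apply(dest_path: str, data: bytes,
                 record_synced: Callable[[str], None]) -> None:
    """Write to a temp file, fsync, rename into place, then record."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Same directory as the destination, so the rename never crosses volumes.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".sync-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # force file contents to stable storage
        # Atomic replacement: readers see the old file or the new one,
        # never a partial write.
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Failure before the rename: only a discardable temp file remains.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Apply-then-record: the "synced" mark is written only after the rename.
    # record_synced is a hypothetical stand-in for the DB update.
    record_synced(dest_path)
```

If the process dies before the rename, the destination is untouched and the leftover .tmp file can be deleted on the next start. If it dies after the rename but before the record, the file is complete and the engine only has to re-verify it.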

Consequences

Positive

No partially written file is ever visible; applying a download is crash-safe.
No upload feedback loops.
The worst-case data-loss class (mass delete/encrypt) is stopped before propagation — the defense that distinguishes a trustworthy sync product from a dangerous one.

Negative / costs

The circuit breaker can false-positive on legitimate large deletes. This is mitigated by a clear confirm-with-preview UX and tunable thresholds; the asymmetric risk (data loss vs a confirmation prompt) justifies erring toward caution.
Atomic rename requires temp space on the same volume; self-write tracking adds bookkeeping.
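
To give a sense of the bookkeeping involved, self-write tracking can be as small as a map from path to expected hash. The sketch below is an assumption for illustration only: the class SelfWriteTracker and its methods are hypothetical, the one-shot removal on match is a design choice made here, and ADR-0025 may specify the mechanism differently. Expiry of stale entries is not shown.

```python
import hashlib
from typing import Dict


class SelfWriteTracker:
    """Remembers (path, expected hash) pairs for writes the client makes itself."""

    def __init__(self) -> None:
        self._expected: Dict[str, str] = {}

    def expect(self, path: str, content_hash: str) -> None:
        # Called before applying a download.
        self._expected[path] = content_hash

    def absorb(self, path: str, observed_hash: str) -> bool:
        """Return True if a watcher event matches our own write and should be dropped."""
        if self._expected.get(path) == observed_hash:
            # One-shot (assumption): a later event for this path is treated
            # as a genuine local change.
            del self._expected[path]
            return True
        return False


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()
```

A watcher handler would hash the changed file, call absorb, and queue an upload only when it returns False. A mismatched hash means the file was changed by something other than the download, so the event goes through.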

Alternatives considered

Direct in-place writes: simpler, but exposes partial files and corrupts on crash. Rejected.
No bulk-change brake (trust the plan): the literal cause of “the cloud deleted all my files” incidents. Rejected.
Rely solely on server trash/versioning to undo mistakes: good defense-in-depth, but reactive; the brake prevents the bad propagation in the first place. We do both.

Scaling

Threshold checks are O(plan size) on already-computed plans (cheap); the brake only engages on rare large destructive plans, so steady-state sync is unaffected.
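
A single pass over the plan is enough for the threshold check. The sketch below assumes a hypothetical PlanOp record and a needs_safety_hold function; neither name comes from the engine. The thresholds stay required parameters because this ADR leaves N and X open. The sanity checks (volume mounted, mass-encryption heuristics) and the confirmation UX are out of scope here.

```python
from dataclasses import dataclass
from typing import Iterable

# Operation kinds that destroy or replace existing data (assumed vocabulary).
DESTRUCTIVE_KINDS = frozenset({"delete", "overwrite"})


@dataclass(frozen=True)
class PlanOp:
    path: str
    kind: str  # e.g. "create", "delete", "overwrite" (hypothetical values)


def needs_safety_hold(plan: Iterable[PlanOp], synced_count: int, *,
                      max_files: int, max_fraction: float) -> bool:
    """Return True if the plan should pause for user confirmation.

    max_files corresponds to N and max_fraction to X% (as a 0..1 fraction).
    """
    # One O(plan size) pass over an already-computed plan.
    destructive = sum(1 for op in plan if op.kind in DESTRUCTIVE_KINDS)
    if destructive > max_files:
        return True
    if synced_count > 0 and destructive / synced_count > max_fraction:
        return True
    return False
```

For example, with illustrative values max_files=2 and max_fraction=0.5, a plan containing three deletes trips the brake regardless of how large the synced set is, while a plan of only creates never does.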