Skip to content

fix(backup): make the per-site backup/restore lock crash-safe (flock)#480

Open
mrrobot47 wants to merge 2 commits into
EasyEngine:developfrom
mrrobot47:fix/backup-lock-robustness
Open

fix(backup): make the per-site backup/restore lock crash-safe (flock)#480
mrrobot47 wants to merge 2 commits into
EasyEngine:developfrom
mrrobot47:fix/backup-lock-robustness

Conversation

@mrrobot47

Copy link
Copy Markdown
Member

Summary

The per-site backup/restore lock (<site>.lock) was a file-existence lock removed only on success (and one disk-space) path. Any crash/OOM/error exit left a stale lock that permanently blocked every future backup AND restore of that site ("Another backup/restore process is running") until an operator deleted it by hand. The fs->exists()-then-fs->dumpFile() was also a same-site backup↔restore TOCTOU race.

Fix

  • Convert the per-site lock to flock() on a handle opened c+e (O_CLOEXEC), acquired non-blocking in pre_backup_restore_checks() and released by an idempotent release_site_backup_lock() on the success paths + a register_shutdown_function. flock is released by the OS on process death, so the lock can no longer go stale; this also makes the lock atomic (fixes the TOCTOU).
  • Add the e (O_CLOEXEC) flag to the global lock fd too, so backup subprocesses (rclone/mysqldump/docker exec) don't inherit the descriptor and keep the lock held after the parent dies.
  • Stop EE_Site_Command::shut_down_function() from deleting the lock file — unlinking a file another process holds an flock on lets a later process re-lock a fresh inode at the same path and silently break mutual exclusion.

Verification

  • Empirically confirmed (php:7.4-cli container): with c+ the lock stays held by a lingering child after the parent exits; with c+e it is released. flock is OS-released on SIGKILL with no handler running.
  • Reviewed by two independent reviewers (correctness + design lenses); the shut_down_function lock-deletion was caught by review and fixed here.

Out of scope (noted)

restore() does not acquire the global lock (pre-existing). Left as a separate, deliberate behavior decision.

The per-site lock was a file-existence lock (fs->exists check + fs->dumpFile create), removed only on success and one disk-space error path. Any other exit -- archive/DB/upload failure, restore mismatch, OOM/SIGKILL, Ctrl-C -- left a stale <site>.lock that permanently blocked every future backup AND restore of that site with 'Another backup/restore process is running' until an operator deleted it by hand. The check-then-create was also a TOCTOU race between same-site backup and restore.

Convert it to an flock() lock on a handle opened with the 'e' (O_CLOEXEC) flag, acquired non-blocking in pre_backup_restore_checks() and released by an idempotent release_site_backup_lock() on the success paths and via register_shutdown_function. flock is released by the OS on process death (verified for SIGKILL), so the lock can no longer go stale; O_CLOEXEC stops backup subprocesses (rclone/mysqldump/docker exec) from inheriting the descriptor and holding it after the parent exits (verified: c+ stays held, c+e releases). The global lock fd gains the same 'e' flag.

Also remove the lock-file deletion from EE_Site_Command::shut_down_function(): with flock, unlinking a held lock file lets a later process re-lock a fresh inode at the same path and break mutual exclusion.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR makes backup/restore locking crash-safe by switching the per-site <site>.lock from a file-existence sentinel to an OS-managed flock() held on an open file handle, preventing stale locks after crashes and removing a same-site TOCTOU race between backup and restore.

Changes:

  • Add a per-site flock() lock handle acquired non-blocking during pre_backup_restore_checks(), and release it via an idempotent release_site_backup_lock() plus a shutdown handler.
  • Use fopen(..., 'c+e') (CLOEXEC) for both per-site and global lock handles to prevent subprocess FD inheritance from prolonging locks.
  • Remove shutdown-time unlinking of the per-site lock file to avoid breaking mutual exclusion semantics with flock().

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
src/helper/Site_Backup_Restore.php Replaces per-site lock sentinel with flock() on a CLOEXEC handle and adds an idempotent release method; also adds CLOEXEC to the global lock handle.
src/helper/class-ee-site.php Stops deleting the per-site lock file during shutdown to preserve correct flock() mutual exclusion behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/helper/class-ee-site.php Outdated
Comment on lines +2208 to +2212
// The per-site backup/restore lock is now an flock() held by
// Site_Backup_Restore and released automatically on process exit. It must
// NOT be deleted here: unlinking a file that another process currently
// holds an flock on lets a later process create a fresh inode at the same
// path and acquire its own lock, silently breaking mutual exclusion.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in 7a2404a (reworded to "a flock").

"a flock" reads correctly (flock is pronounced with a consonant sound); addresses a review comment on the lock-robustness change.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants