fix(backup): make the per-site backup/restore lock crash-safe (flock)#480
Open
mrrobot47 wants to merge 2 commits into
Open
fix(backup): make the per-site backup/restore lock crash-safe (flock)#480mrrobot47 wants to merge 2 commits into
mrrobot47 wants to merge 2 commits into
Conversation
The per-site lock was a file-existence lock (fs->exists check + fs->dumpFile create), removed only on success and one disk-space error path. Any other exit -- archive/DB/upload failure, restore mismatch, OOM/SIGKILL, Ctrl-C -- left a stale <site>.lock that permanently blocked every future backup AND restore of that site with 'Another backup/restore process is running' until an operator deleted it by hand. The check-then-create was also a TOCTOU race between same-site backup and restore. Convert it to an flock() lock on a handle opened with the 'e' (O_CLOEXEC) flag, acquired non-blocking in pre_backup_restore_checks() and released by an idempotent release_site_backup_lock() on the success paths and via register_shutdown_function. flock is released by the OS on process death (verified for SIGKILL), so the lock can no longer go stale; O_CLOEXEC stops backup subprocesses (rclone/mysqldump/docker exec) from inheriting the descriptor and holding it after the parent exits (verified: c+ stays held, c+e releases). The global lock fd gains the same 'e' flag. Also remove the lock-file deletion from EE_Site_Command::shut_down_function(): with flock, unlinking a held lock file lets a later process re-lock a fresh inode at the same path and break mutual exclusion.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR makes backup/restore locking crash-safe by switching the per-site <site>.lock from a file-existence sentinel to an OS-managed flock() held on an open file handle, preventing stale locks after crashes and removing a same-site TOCTOU race between backup and restore.
Changes:
- Add a per-site
flock()lock handle acquired non-blocking duringpre_backup_restore_checks(), and release it via an idempotentrelease_site_backup_lock()plus a shutdown handler. - Use
fopen(..., 'c+e')(CLOEXEC) for both per-site and global lock handles to prevent subprocess FD inheritance from prolonging locks. - Remove shutdown-time unlinking of the per-site lock file to avoid breaking mutual exclusion semantics with
flock().
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/helper/Site_Backup_Restore.php | Replaces per-site lock sentinel with flock() on a CLOEXEC handle and adds an idempotent release method; also adds CLOEXEC to the global lock handle. |
| src/helper/class-ee-site.php | Stops deleting the per-site lock file during shutdown to preserve correct flock() mutual exclusion behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+2208
to
+2212
| // The per-site backup/restore lock is now an flock() held by | ||
| // Site_Backup_Restore and released automatically on process exit. It must | ||
| // NOT be deleted here: unlinking a file that another process currently | ||
| // holds an flock on lets a later process create a fresh inode at the same | ||
| // path and acquire its own lock, silently breaking mutual exclusion. |
Member
Author
There was a problem hiding this comment.
Good catch — fixed in 7a2404a (reworded to "a flock").
"a flock" reads correctly (flock is pronounced with a consonant sound); addresses a review comment on the lock-robustness change.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The per-site backup/restore lock (
<site>.lock) was a file-existence lock removed only on success (and one disk-space) path. Any crash/OOM/error exit left a stale lock that permanently blocked every future backup AND restore of that site ("Another backup/restore process is running") until an operator deleted it by hand. Thefs->exists()-then-fs->dumpFile()was also a same-site backup↔restore TOCTOU race.Fix
flock()on a handle openedc+e(O_CLOEXEC), acquired non-blocking inpre_backup_restore_checks()and released by an idempotentrelease_site_backup_lock()on the success paths + aregister_shutdown_function.flockis released by the OS on process death, so the lock can no longer go stale; this also makes the lock atomic (fixes the TOCTOU).e(O_CLOEXEC) flag to the global lock fd too, so backup subprocesses (rclone/mysqldump/docker exec) don't inherit the descriptor and keep the lock held after the parent dies.EE_Site_Command::shut_down_function()from deleting the lock file — unlinking a file another process holds anflockon lets a later process re-lock a fresh inode at the same path and silently break mutual exclusion.Verification
c+the lock stays held by a lingering child after the parent exits; withc+eit is released.flockis OS-released onSIGKILLwith no handler running.shut_down_functionlock-deletion was caught by review and fixed here.Out of scope (noted)
restore()does not acquire the global lock (pre-existing). Left as a separate, deliberate behavior decision.