fix(nas-backup): timeout, free-space check, trap-based cleanup with VM resume
Three independent reliability fixes for the KVM NAS backup script, layered on
top of the existing quiesce + EXIT_CLEANUP_FAILED groundwork:
1. BACKUP_TIMEOUT env var (default 6h) bounds the libvirt domjobinfo wait
loop in backup_running_vm. Today a stuck QEMU backup holds the agent's
command slot until the orchestrator-level timeout fires. The new guard
issues domjobabort and exits non-zero so the agent reclaims the slot
promptly.
2. MIN_FREE_SPACE env var (default 1 GiB) + check_free_space() runs after
mount and before any qemu-img convert in both backup_running_vm and
backup_stopped_vm. This fails fast on a near-full NAS instead of failing
partway through a multi-GiB convert.
3. trap cleanup EXIT replaces the six explicit cleanup() call sites as the
primary cleanup mechanism, so orphan NFS mounts no longer accumulate when
the script is killed by SIGTERM or SIGINT, or hits an uncaught set -e failure
between the explicit call sites. cleanup() is now guarded by
CLEANUP_DONE so the trap doesn't re-run an already-completed cleanup
from an explicit call.
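
For reviewers, the three changes above could look roughly like the sketch
below. It is illustrative only and is not the script's actual code. The
function and variable names come from this message where available;
backup_job_active, wait_for_backup_job, MOUNT_POINT, the units (seconds for
BACKUP_TIMEOUT, bytes for MIN_FREE_SPACE), and the assumption that
virsh domjobinfo prints "Job type: None" when idle are all hypothetical.

```shell
# Illustrative sketch only; names not in the commit message are hypothetical.
BACKUP_TIMEOUT="${BACKUP_TIMEOUT:-21600}"        # assumed unit: seconds (6h)
MIN_FREE_SPACE="${MIN_FREE_SPACE:-1073741824}"   # assumed unit: bytes (1 GiB)
CLEANUP_DONE=0

# Hypothetical helper: succeeds while libvirt reports a job on the domain.
# Assumes `virsh domjobinfo` prints "Job type: None" when nothing is running.
backup_job_active() {
    local info
    info=$(virsh domjobinfo "$1" 2>/dev/null) || return 1
    case "$info" in
        *"Job type:"*None*) return 1 ;;
        *"Job type:"*)      return 0 ;;
        *)                  return 1 ;;
    esac
}

# Poll until the job finishes; abort it and return 1 once the timeout passes.
wait_for_backup_job() {
    local vm="$1" timeout="${2:-$BACKUP_TIMEOUT}" poll="${3:-5}" elapsed=0
    while backup_job_active "$vm"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            virsh domjobabort "$vm"
            return 1
        fi
        sleep "$poll"
        elapsed=$(( elapsed + poll ))
    done
    return 0
}

# Return 0 if the filesystem holding $1 has at least $2 bytes available.
check_free_space() {
    local path="$1" min_bytes="${2:-$MIN_FREE_SPACE}" avail_kib
    # POSIX `df -Pk`: line 2, column 4 is the available space in KiB.
    avail_kib=$(df -Pk "$path" 2>/dev/null | awk 'NR == 2 { print $4 }')
    [ -n "$avail_kib" ] || return 1
    [ $(( avail_kib * 1024 )) -ge "$min_bytes" ]
}

# Idempotent cleanup: the CLEANUP_DONE guard makes a second call a no-op,
# so an explicit call followed by the EXIT trap unmounts only once.
cleanup() {
    if [ "$CLEANUP_DONE" -eq 1 ]; then
        return 0
    fi
    CLEANUP_DONE=1
    # MOUNT_POINT is a hypothetical variable naming the NFS mount.
    if [ -n "${MOUNT_POINT:-}" ] && mountpoint -q "$MOUNT_POINT"; then
        umount "$MOUNT_POINT"
    fi
}
trap cleanup EXIT
```

In this sketch a caller would run check_free_space "$MOUNT_POINT" || exit 1
right after mounting, and wait_for_backup_job "$VM" || exit 1 after starting
the backup; either exit path then reaches cleanup through the trap.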
cleanup() additionally resumes the VM if it is still paused. backup-begin
holds the guest paused briefly, and a backup that fails mid-pause currently
leaves the guest stuck in 'paused' state until an operator intervenes.
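
A minimal sketch of the resume step, assuming cleanup() asks libvirt for the
domain state and resumes only when it reads "paused". resume_if_paused is a
hypothetical name, and the real cleanup() may do this differently;
virsh domstate and virsh resume are the standard virsh subcommands.

```shell
# Hypothetical helper; the script's real cleanup() may differ.
resume_if_paused() {
    local vm="$1" state
    # `virsh domstate` prints the state name, e.g. "running" or "paused".
    # If the query fails (domain gone, libvirtd down) there is nothing to do.
    state=$(virsh domstate "$vm" 2>/dev/null) || return 0
    if [ "$state" = "paused" ]; then
        virsh resume "$vm"
    fi
}
```

Checking the state first keeps the call harmless for guests that are running
or shut off, so it is safe to invoke on every cleanup path.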
Targets main; supersedes the 4.20-targeted version of this PR.