fix(DISET): prevent services from getting permanently stuck in throttle mode#8471
Draft
ryuwd wants to merge 5 commits into DIRACGrid:integration
… growth The previous throttle mechanism accepted one connection every 0.25s even while the service was overloaded. When threads were stuck (e.g. blocked on DB queries or deadlocked), each accepted connection added to the already-full queue, making recovery impossible. Now all incoming connections are rejected while wantsThrottle is True, with a brief sleep to avoid busy-spinning. This prevents the self-reinforcing stuck state where the queue grows faster than it can drain.
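The reject-all behaviour described in this commit can be sketched roughly as follows. This is an illustrative sketch, not the DISET code: handle_incoming, wants_throttle and enqueue are hypothetical names, and the listener is assumed to be any object with a socket-style accept() method.

```python
def handle_incoming(listener, wants_throttle, enqueue):
    """Accept one pending connection and reject it outright while throttled.

    listener       -- object with a socket-like accept() (assumption)
    wants_throttle -- callable returning True while the service is overloaded
    enqueue        -- callable that hands the connection to the worker queue
    Returns True if the connection was queued, False if it was rejected.
    """
    conn, _addr = listener.accept()
    if wants_throttle():
        # Overloaded: close immediately so nothing is added to the full queue
        # and the client can fail over to another instance.
        conn.close()
        return False
    enqueue(conn)
    return True
```

The point of the design is that a throttled service never adds work to its queue, so the queue can only shrink while throttling is active.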
Track when throttling starts and log state transitions with queue/thread diagnostics. Periodic warnings are emitted every 30s while throttling persists, making it easier to diagnose stuck services from logs without needing to attach a debugger. Adds Service.throttleDiagnostics() to expose queue size, max queue, active threads, and max threads for logging purposes.
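As a rough illustration of what the diagnostics could look like, the sketch below bundles the four values named above and renders them in the queue=X/Y, threads=X/Y form used in the PR description. The ThrottleDiagnostics class and its field names are assumptions made for this example; the actual return type of Service.throttleDiagnostics() is not shown in this conversation.

```python
from dataclasses import dataclass


@dataclass
class ThrottleDiagnostics:
    """Hypothetical container for the values exposed for logging."""

    queue_size: int
    max_queue: int
    active_threads: int
    max_threads: int

    def __str__(self):
        # Same shape as the "queue=X/Y, threads=X/Y" diagnostics in the logs.
        return (
            f"queue={self.queue_size}/{self.max_queue}, "
            f"threads={self.active_threads}/{self.max_threads}"
        )
```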
…ation When a service is stuck in throttle mode (all threads blocked, queue full), it cannot recover without external intervention. Add a configurable MaxThrottleDuration CS option (default: 0 = disabled) that triggers a process exit after the specified number of seconds of continuous throttling. The process supervisor (e.g. runsv) then restarts the service cleanly, clearing all stuck state. When enabled, a FATAL log message is emitted before exit with full queue/thread diagnostics for post-mortem analysis.
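The check described here might look like the sketch below. It is a sketch under stated assumptions: enforce_max_throttle, its parameters, the log format, the exit code and the use of >= for the comparison are all hypothetical, and the exit call used by the real change is not shown in this conversation.

```python
import sys


def enforce_max_throttle(throttle_duration, max_throttle_duration, diagnostics,
                         log=print, exit_fn=sys.exit):
    """Exit the process once continuous throttling exceeds the limit.

    max_throttle_duration == 0 (the default) disables the check.
    Returns True if the exit path was taken, False otherwise.
    """
    if max_throttle_duration <= 0:
        return False  # feature disabled
    if throttle_duration < max_throttle_duration:
        return False  # still within the allowed window
    # Log diagnostics first so they are available for post-mortem analysis.
    log(f"FATAL throttled for {throttle_duration:.0f}s "
        f"(limit {max_throttle_duration}s), {diagnostics}; exiting")
    exit_fn(1)  # the supervisor (e.g. runsv) is expected to restart the service
    return True
```

Passing exit_fn and log as parameters keeps the decision testable without terminating the test process.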
…very The 0.25s sleep between throttle rejections was a leftover from the old "accept 1 every 0.25s" mechanism. With the new reject-all approach, sel.select(timeout=10) already rate-limits when no connections arrive. Removing the sleep lets the service reject incoming connections immediately (faster client failover) and re-check wantsThrottle sooner (faster recovery when queue drains).
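To illustrate why no extra sleep is needed, the sketch below uses the standard selectors module: select(timeout=...) itself blocks until a registered socket is ready or the timeout expires, and returns an empty list on timeout. The drain_ready helper and on_ready callback are hypothetical names for this example, not part of ServiceReactor.

```python
import selectors
import socket


def drain_ready(sel, timeout, on_ready):
    """Wait up to `timeout` seconds, then dispatch every ready socket.

    Returns the number of ready sockets (0 means the wait timed out).
    """
    events = sel.select(timeout=timeout)  # blocks here while nothing arrives
    for key, _mask in events:
        on_ready(key.fileobj)
    return len(events)
```

With an idle listener the loop wakes up only once per timeout, so the loop cannot busy-spin. With pending connections it returns immediately, which is what allows immediate rejection and a prompt re-check of the throttle condition.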
Throttle state (start time, last log time) was local to __acceptIncomingConnection, which resets on every re-entry from serve() (e.g. after a 10s select timeout). This meant MaxThrottleDuration could never accumulate past 10s if connections arrived slowly. Move tracking into Service instance state alongside wantsThrottle, which now handles entry/exit logging and periodic diagnostics. Add throttleDuration property for the reactor to check against MaxThrottleDuration. The ServiceReactor throttle block is simplified to just rejection + max duration check.
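A minimal sketch of instance-level throttle tracking, assuming a hypothetical ThrottleTracker class with an injectable clock and logger. The names, log wording and update() signature are illustrative only; in the PR this logic lives on the Service instance alongside wantsThrottle.

```python
import time


class ThrottleTracker:
    """Keeps throttle start/last-log times on the instance, not in a local."""

    LOG_INTERVAL = 30.0  # seconds between periodic warnings

    def __init__(self, clock=time.monotonic, log=print):
        self._clock = clock
        self._log = log
        self._start = None      # when the current throttle episode began
        self._last_log = None   # when a warning was last emitted

    def update(self, overloaded, diagnostics=""):
        """Record the current overload state and log transitions."""
        now = self._clock()
        if overloaded:
            if self._start is None:
                self._start = self._last_log = now
                self._log(f"WARN entering throttle mode ({diagnostics})")
            elif now - self._last_log >= self.LOG_INTERVAL:
                self._last_log = now
                self._log(f"WARN still throttling after "
                          f"{now - self._start:.0f}s ({diagnostics})")
        elif self._start is not None:
            self._log(f"INFO throttle cleared after {now - self._start:.0f}s")
            self._start = self._last_log = None
        return overloaded

    @property
    def throttle_duration(self):
        """Seconds of continuous throttling, 0.0 when not throttled."""
        return 0.0 if self._start is None else self._clock() - self._start
```

Because the start time survives across calls, the duration keeps accumulating however often the accept loop is re-entered, which is the property the MaxThrottleDuration check depends on.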
Production DISET services can enter throttle mode and never recover, requiring manual restarts. This PR addresses the root cause and adds safety mechanisms:
- All incoming connections are rejected while wantsThrottle is True, and the unnecessary 0.25s sleep between rejections is removed, since sel.select(timeout=10) already rate-limits when no connections arrive.
- Throttle state is tracked on the Service instance, so it persists across __acceptIncomingConnection re-entries. Logs WARN on entry with queue/thread stats, a periodic WARN every 30s, and INFO on clear with duration.
- New MaxThrottleDuration CS option (default: 0 = disabled). When enabled, the service process exits after the specified number of seconds of continuous throttling, allowing the process supervisor (runsv) to restart it cleanly. Duration tracking is reliable because it lives on the Service instance rather than in a local variable that resets on select timeouts.
Operational Impact
Behavioral change
Previously, a throttled service would still accept 1 connection every 0.25s (~4 req/s). Now it rejects all connections while throttled. This means:
New logging
- Log messages carry queue=X/Y, threads=X/Y diagnostics
- Throttle state persists across __acceptIncomingConnection re-entries (no duplicate "entering" logs)
Auto-restart
- Disabled by default (MaxThrottleDuration = 0)
- Enabled per service in the CS, e.g. Systems/<System>/Services/<Service>/MaxThrottleDuration = 600
Recommended deployment
- Enable MaxThrottleDuration for services known to get stuck (suggested starting value: 600s)
Test plan
BEGINRELEASENOTES
*Core
FIX: prevent DISET services from getting permanently stuck in throttle mode by rejecting all connections while overloaded instead of accepting one every 0.25s
NEW: add throttle duration tracking with diagnostic logging (queue/thread stats on entry, periodic warnings, clear notification)
NEW: add configurable MaxThrottleDuration CS option to auto-restart services stuck in throttle mode (default: disabled)
ENDRELEASENOTES