
fix(DISET): prevent services from getting permanently stuck in throttle mode#8471

Draft
ryuwd wants to merge 5 commits into DIRACGrid:integration from ryuwd:roneil-throttle-recovery

Conversation


@ryuwd commented Feb 26, 2026

Production DISET services can enter throttle mode and never recover, requiring manual restarts. This PR addresses the root cause and adds safety mechanisms:

  • Fix throttle logic: The previous mechanism accepted one connection every 0.25s even while overloaded. When threads were stuck (blocked DB queries, deadlocks, etc.), each accepted connection grew the queue further, making recovery impossible. Now all connections are rejected while wantsThrottle is True, and the unnecessary 0.25s sleep between rejections is removed — sel.select(timeout=10) already rate-limits when no connections arrive.
  • Add diagnostic logging: Throttle state tracking (start time, periodic diagnostics) is maintained as Service instance state, persisting across __acceptIncomingConnection re-entries. Logs WARN on entry with queue/thread stats, periodic WARN every 30s, and INFO on clear with duration.
  • Add configurable auto-restart: New MaxThrottleDuration CS option (default: 0 = disabled). When enabled, the service process exits after the specified seconds of continuous throttling, allowing the process supervisor (runsv) to restart it cleanly. Duration tracking is reliable since it lives on the Service instance rather than as a local variable that resets on select timeouts.
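The reject-all decision in the first bullet can be sketched as follows; function and argument names here are illustrative, not DIRAC's actual API:

```python
def handleIncomingConnection(wantsThrottle, acceptFn, rejectFn):
    """Toy model of the new accept loop: while the service is
    overloaded, every incoming connection is rejected outright.
    There is no sleep between rejections; in the real loop,
    select() with a timeout already paces things when no
    connections arrive."""
    if wantsThrottle:
        rejectFn()      # tell the client to back off immediately
        return False    # nothing was queued, so the queue can drain
    acceptFn()          # normal path: hand the connection to a worker
    return True
```

The key property is that a throttled service never adds to its own queue, so a full queue can only shrink.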

Operational Impact

Behavioral change

Previously, a throttled service would still accept 1 connection every 0.25s (~4 req/s). Now it rejects all connections while throttled. This means:

  • Clients will see rejections sooner during genuine overload, rather than queueing behind stuck threads
  • Transient load spikes will still recover naturally (throttle clears once queue drains)
  • Net positive: prevents the self-reinforcing stuck state that required manual restarts

New logging

  • WARN on throttle entry with queue=X/Y, threads=X/Y diagnostics
  • WARN every 30s while throttling persists with updated stats
  • INFO when throttle clears with total duration
  • Existing per-rejection WARN is preserved for monitoring
  • State persists across __acceptIncomingConnection re-entries (no duplicate "entering" logs)
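A minimal sketch of the state machine behind these log lines, using hypothetical attribute names and an explicit timestamp argument so the timing logic is visible:

```python
class ThrottleLogState:
    """Illustrative bookkeeping for throttle logging (not the actual
    DIRAC attribute names).  Kept on the Service instance so it
    survives re-entries into the accept loop."""

    PERIODIC_INTERVAL = 30  # seconds between periodic WARN lines

    def __init__(self):
        self.startTime = None
        self.lastLogTime = None

    def onThrottled(self, now):
        """Return 'enter' on the first call, 'periodic' every 30s
        thereafter, else None (no log line needed)."""
        if self.startTime is None:
            self.startTime = self.lastLogTime = now
            return "enter"          # WARN with queue/thread diagnostics
        if now - self.lastLogTime >= self.PERIODIC_INTERVAL:
            self.lastLogTime = now
            return "periodic"       # WARN with updated stats
        return None

    def onClear(self, now):
        """Return total throttle duration for the INFO line, or None
        if the service was not throttled."""
        if self.startTime is None:
            return None
        duration = now - self.startTime
        self.startTime = self.lastLogTime = None
        return duration
```

Because `onThrottled` only records a start time when none is set, re-entering the accept loop cannot produce a duplicate "entering" log.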

Auto-restart

  • Disabled by default (MaxThrottleDuration = 0)
  • Opt-in per service via CS: Systems/<System>/Services/<Service>/MaxThrottleDuration = 600
  • When enabled and throttle exceeds the limit, emits a FATAL log and exits
  • Requires a process supervisor (runsv, systemd, etc.) to restart the service
  • Duration accumulates correctly even when connections arrive slowly (state on Service instance, not local variable)

Recommended deployment

  1. Deploy with auto-restart disabled (default) to validate the throttle fix
  2. Monitor logs for throttle duration patterns
  3. Enable MaxThrottleDuration for services known to get stuck (suggested starting value: 600s)
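For step 3, the opt-in would look roughly like this in the CS, following the path given above (the placeholders are left as-is since the concrete system and service names depend on the installation):

```
Systems
{
  <System>
  {
    Services
    {
      <Service>
      {
        MaxThrottleDuration = 600  # seconds; 0 (the default) disables auto-restart
      }
    }
  }
}
```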

Test plan

  • Hotfix the change into a service that throttles frequently and validate the behavior

BEGINRELEASENOTES
*Core
FIX: prevent DISET services from getting permanently stuck in throttle mode by rejecting all connections while overloaded instead of accepting one every 0.25s
NEW: add throttle duration tracking with diagnostic logging (queue/thread stats on entry, periodic warnings, clear notification)
NEW: add configurable MaxThrottleDuration CS option to auto-restart services stuck in throttle mode (default: disabled)
ENDRELEASENOTES

… growth

The previous throttle mechanism accepted one connection every 0.25s
even while the service was overloaded. When threads were stuck (e.g.
blocked on DB queries or deadlocked), each accepted connection added
to the already-full queue, making recovery impossible.

Now all incoming connections are rejected while wantsThrottle is True,
with a brief sleep to avoid busy-spinning. This prevents the
self-reinforcing stuck state where the queue grows faster than it
can drain.

Track when throttling starts and log state transitions with
queue/thread diagnostics. Periodic warnings are emitted every 30s
while throttling persists, making it easier to diagnose stuck
services from logs without needing to attach a debugger.

Adds Service.throttleDiagnostics() to expose queue size, max queue,
active threads, and max threads for logging purposes.
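A sketch of what such a helper could produce, matching the `queue=X/Y, threads=X/Y` format described in the PR body (the real method reads these values from the service's thread pool; the signature here is hypothetical):

```python
def throttleDiagnostics(queueSize, maxQueue, activeThreads, maxThreads):
    """Format queue and thread-pool occupancy for throttle log lines,
    e.g. 'queue=50/50, threads=25/25' when fully saturated."""
    return f"queue={queueSize}/{maxQueue}, threads={activeThreads}/{maxThreads}"
```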
…ation

When a service is stuck in throttle mode (all threads blocked, queue
full), it cannot recover without external intervention. Add a
configurable MaxThrottleDuration CS option (default: 0 = disabled)
that triggers a process exit after the specified number of seconds of
continuous throttling. The process supervisor (e.g. runsv) then
restarts the service cleanly, clearing all stuck state.

When enabled, a FATAL log message is emitted before exit with full
queue/thread diagnostics for post-mortem analysis.
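The exit decision itself is small; a sketch under the assumption that the reactor periodically polls a duration value (names and the log/exit hooks are hypothetical):

```python
import sys

def checkThrottleLimit(throttleDuration, maxThrottleDuration, log, exitFn=sys.exit):
    """If auto-restart is enabled (limit > 0) and the service has been
    throttled continuously for longer than the limit, log FATAL and
    exit so the supervisor (runsv, systemd, ...) restarts the process.
    Returns True if the limit was hit."""
    if maxThrottleDuration > 0 and throttleDuration >= maxThrottleDuration:
        log(f"FATAL: throttled for {throttleDuration:.0f}s "
            f"(limit {maxThrottleDuration}s), exiting for supervisor restart")
        exitFn(1)
        return True
    return False
```

With the default `maxThrottleDuration = 0` the first condition is never true, so the behavior change is strictly opt-in.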
@ryuwd force-pushed the roneil-throttle-recovery branch from 06a5f45 to 6c44a2e on February 26, 2026 at 12:40
…very

The 0.25s sleep between throttle rejections was a leftover from the
old "accept 1 every 0.25s" mechanism. With the new reject-all
approach, sel.select(timeout=10) already rate-limits when no
connections arrive. Removing the sleep lets the service reject
incoming connections immediately (faster client failover) and
re-check wantsThrottle sooner (faster recovery when queue drains).

Throttle state (start time, last log time) was local to
__acceptIncomingConnection, which resets on every re-entry from
serve() (e.g. after a 10s select timeout). This meant
MaxThrottleDuration could never accumulate past 10s if connections
arrived slowly.

Move tracking into Service instance state alongside wantsThrottle,
which now handles entry/exit logging and periodic diagnostics.
Add throttleDuration property for the reactor to check against
MaxThrottleDuration. The ServiceReactor throttle block is simplified
to just rejection + max duration check.
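The instance-state fix can be modelled with a toy class and an injectable clock; all names here are illustrative, not the real Service attributes:

```python
import time

class ServiceSketch:
    """Minimal model of throttle duration tracking held on the
    Service instance rather than in a local variable, so a select
    timeout and re-entry into the accept loop cannot reset it."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._throttleStartTime = None

    def updateThrottle(self, wantsThrottle):
        """Record the start of a throttle episode once; clear on exit."""
        if wantsThrottle and self._throttleStartTime is None:
            self._throttleStartTime = self._clock()
        elif not wantsThrottle:
            self._throttleStartTime = None

    @property
    def throttleDuration(self):
        """Seconds of continuous throttling; 0 when not throttled."""
        if self._throttleStartTime is None:
            return 0.0
        return self._clock() - self._throttleStartTime
```

Because `updateThrottle(True)` is a no-op once a start time exists, the duration keeps accumulating across accept-loop re-entries, which is what lets it exceed the 10s select timeout.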
@aldbr requested review from atsareg and chaen, and removed the request for atsareg, on February 26, 2026 at 12:54
