
fix(DISET): prevent services from getting permanently stuck in throttle mode#8471

Draft
ryuwd wants to merge 5 commits into DIRACGrid:integration from ryuwd:roneil-throttle-recovery

Conversation


@ryuwd commented Feb 26, 2026

Production DISET services can enter throttle mode and never recover, requiring manual restarts. This PR addresses the root cause and adds safety mechanisms:

  • Fix throttle logic: The previous mechanism accepted one connection every 0.25s even while overloaded. When threads were stuck (blocked DB queries, deadlocks, etc.), each accepted connection grew the queue further, making recovery impossible. Now all connections are rejected while wantsThrottle is True, and the unnecessary 0.25s sleep between rejections is removed — sel.select(timeout=10) already rate-limits when no connections arrive.
  • Add diagnostic logging: Throttle state tracking (start time, periodic diagnostics) is maintained as Service instance state, persisting across __acceptIncomingConnection re-entries. Logs WARN on entry with queue/thread stats, periodic WARN every 30s, and INFO on clear with duration.
  • Add configurable auto-restart: New MaxThrottleDuration CS option (default: 0 = disabled). When enabled, the service process exits after the specified seconds of continuous throttling, allowing the process supervisor (runsv) to restart it cleanly. Duration tracking is reliable since it lives on the Service instance rather than as a local variable that resets on select timeouts.
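The reject-all decision in the first bullet can be sketched as follows; function and argument names here are illustrative, not DIRAC's actual API:

```python
def handleIncomingConnection(wantsThrottle, acceptFn, rejectFn):
    """Toy model of the new accept loop: while the service is
    overloaded, every incoming connection is rejected outright.
    There is no sleep between rejections; in the real loop,
    select() with a timeout already paces things when no
    connections arrive."""
    if wantsThrottle:
        rejectFn()      # tell the client to back off immediately
        return False    # nothing was queued, so the queue can drain
    acceptFn()          # normal path: hand the connection to a worker
    return True
```

The key property is that a throttled service never adds to its own queue, so a full queue can only shrink.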

Operational Impact

Behavioral change

Previously, a throttled service would still accept 1 connection every 0.25s (~4 req/s). Now it rejects all connections while throttled. This means:

  • Clients will see rejections sooner during genuine overload, rather than queueing behind stuck threads
  • Transient load spikes will still recover naturally (throttle clears once queue drains)
  • Net positive: prevents the self-reinforcing stuck state that required manual restarts

New logging

  • WARN on throttle entry with queue=X/Y, threads=X/Y diagnostics
  • WARN every 30s while throttling persists with updated stats
  • INFO when throttle clears with total duration
  • Existing per-rejection WARN is preserved for monitoring
  • State persists across __acceptIncomingConnection re-entries (no duplicate "entering" logs)
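A minimal sketch of the state machine behind these log lines, using hypothetical attribute names and an explicit timestamp argument so the timing logic is visible:

```python
class ThrottleLogState:
    """Illustrative bookkeeping for throttle logging (not the actual
    DIRAC attribute names).  Kept on the Service instance so it
    survives re-entries into the accept loop."""

    PERIODIC_INTERVAL = 30  # seconds between periodic WARN lines

    def __init__(self):
        self.startTime = None
        self.lastLogTime = None

    def onThrottled(self, now):
        """Return 'enter' on the first call, 'periodic' every 30s
        thereafter, else None (no log line needed)."""
        if self.startTime is None:
            self.startTime = self.lastLogTime = now
            return "enter"          # WARN with queue/thread diagnostics
        if now - self.lastLogTime >= self.PERIODIC_INTERVAL:
            self.lastLogTime = now
            return "periodic"       # WARN with updated stats
        return None

    def onClear(self, now):
        """Return total throttle duration for the INFO line, or None
        if the service was not throttled."""
        if self.startTime is None:
            return None
        duration = now - self.startTime
        self.startTime = self.lastLogTime = None
        return duration
```

Because `onThrottled` only records a start time when none is set, re-entering the accept loop cannot produce a duplicate "entering" log.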

Auto-restart

  • Disabled by default (MaxThrottleDuration = 0)
  • Opt-in per service via CS: Systems/<System>/Services/<Service>/MaxThrottleDuration = 600
  • When enabled and throttle exceeds the limit, emits a FATAL log and exits
  • Requires a process supervisor (runsv, systemd, etc.) to restart the service
  • Duration accumulates correctly even when connections arrive slowly (state on Service instance, not local variable)

Recommended deployment

  1. Deploy with auto-restart disabled (default) to validate the throttle fix
  2. Monitor logs for throttle duration patterns
  3. Enable MaxThrottleDuration for services known to get stuck (suggested starting value: 600s)
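For step 3, the opt-in would look roughly like this in the CS, following the path given above (the placeholders are left as-is since the concrete system and service names depend on the installation):

```
Systems
{
  <System>
  {
    Services
    {
      <Service>
      {
        MaxThrottleDuration = 600  # seconds; 0 (the default) disables auto-restart
      }
    }
  }
}
```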

Test plan

  • Hotfix the change into a service that throttles frequently and validate the behavior

BEGINRELEASENOTES
*Core
FIX: prevent DISET services from getting permanently stuck in throttle mode by rejecting all connections while overloaded instead of accepting one every 0.25s
NEW: add throttle duration tracking with diagnostic logging (queue/thread stats on entry, periodic warnings, clear notification)
NEW: add configurable MaxThrottleDuration CS option to auto-restart services stuck in throttle mode (default: disabled)
ENDRELEASENOTES

… growth

The previous throttle mechanism accepted one connection every 0.25s
even while the service was overloaded. When threads were stuck (e.g.
blocked on DB queries or deadlocked), each accepted connection added
to the already-full queue, making recovery impossible.

Now all incoming connections are rejected while wantsThrottle is True,
with a brief sleep to avoid busy-spinning. This prevents the
self-reinforcing stuck state where the queue grows faster than it
can drain.

Track when throttling starts and log state transitions with
queue/thread diagnostics. Periodic warnings are emitted every 30s
while throttling persists, making it easier to diagnose stuck
services from logs without needing to attach a debugger.

Adds Service.throttleDiagnostics() to expose queue size, max queue,
active threads, and max threads for logging purposes.
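A sketch of what such a helper could produce, matching the `queue=X/Y, threads=X/Y` format described in the PR body (the real method reads these values from the service's thread pool; the signature here is hypothetical):

```python
def throttleDiagnostics(queueSize, maxQueue, activeThreads, maxThreads):
    """Format queue and thread-pool occupancy for throttle log lines,
    e.g. 'queue=50/50, threads=25/25' when fully saturated."""
    return f"queue={queueSize}/{maxQueue}, threads={activeThreads}/{maxThreads}"
```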
…ation

When a service is stuck in throttle mode (all threads blocked, queue
full), it cannot recover without external intervention. Add a
configurable MaxThrottleDuration CS option (default: 0 = disabled)
that triggers a process exit after the specified number of seconds of
continuous throttling. The process supervisor (e.g. runsv) then
restarts the service cleanly, clearing all stuck state.

When enabled, a FATAL log message is emitted before exit with full
queue/thread diagnostics for post-mortem analysis.
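The exit decision itself is small; a sketch under the assumption that the reactor periodically polls a duration value (names and the log/exit hooks are hypothetical):

```python
import sys

def checkThrottleLimit(throttleDuration, maxThrottleDuration, log, exitFn=sys.exit):
    """If auto-restart is enabled (limit > 0) and the service has been
    throttled continuously for longer than the limit, log FATAL and
    exit so the supervisor (runsv, systemd, ...) restarts the process.
    Returns True if the limit was hit."""
    if maxThrottleDuration > 0 and throttleDuration >= maxThrottleDuration:
        log(f"FATAL: throttled for {throttleDuration:.0f}s "
            f"(limit {maxThrottleDuration}s), exiting for supervisor restart")
        exitFn(1)
        return True
    return False
```

With the default `maxThrottleDuration = 0` the first condition is never true, so the behavior change is strictly opt-in.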
@ryuwd force-pushed the roneil-throttle-recovery branch from 06a5f45 to 6c44a2e on February 26, 2026 at 12:40
…very

The 0.25s sleep between throttle rejections was a leftover from the
old "accept 1 every 0.25s" mechanism. With the new reject-all
approach, sel.select(timeout=10) already rate-limits when no
connections arrive. Removing the sleep lets the service reject
incoming connections immediately (faster client failover) and
re-check wantsThrottle sooner (faster recovery when queue drains).

Throttle state (start time, last log time) was local to
__acceptIncomingConnection, which resets on every re-entry from
serve() (e.g. after a 10s select timeout). This meant
MaxThrottleDuration could never accumulate past 10s if connections
arrived slowly.

Move tracking into Service instance state alongside wantsThrottle,
which now handles entry/exit logging and periodic diagnostics.
Add throttleDuration property for the reactor to check against
MaxThrottleDuration. The ServiceReactor throttle block is simplified
to just rejection + max duration check.
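The instance-state fix can be modelled with a toy class and an injectable clock; all names here are illustrative, not the real Service attributes:

```python
import time

class ServiceSketch:
    """Minimal model of throttle duration tracking held on the
    Service instance rather than in a local variable, so a select
    timeout and re-entry into the accept loop cannot reset it."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._throttleStartTime = None

    def updateThrottle(self, wantsThrottle):
        """Record the start of a throttle episode once; clear on exit."""
        if wantsThrottle and self._throttleStartTime is None:
            self._throttleStartTime = self._clock()
        elif not wantsThrottle:
            self._throttleStartTime = None

    @property
    def throttleDuration(self):
        """Seconds of continuous throttling; 0 when not throttled."""
        if self._throttleStartTime is None:
            return 0.0
        return self._clock() - self._throttleStartTime
```

Because `updateThrottle(True)` is a no-op once a start time exists, the duration keeps accumulating across accept-loop re-entries, which is what lets it exceed the 10s select timeout.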
@aldbr requested review from atsareg and chaen, and removed the request for atsareg, on February 26, 2026 at 12:54
