PR #378 made the right call decoupling requeues from the failure circuit breaker, but it acknowledged a tradeoff:
"workers which are legitimately in a corrupted state will requeue more tests than before"
In practice this plays out badly: a worker with corrupted runtime state (global variable pollution, database connection corruption, etc.) will keep running, reserving tests, and requeueing them because it can't actually complete them. With requeues invisible to max_consecutive_failures, nothing kills that worker and it can requeue tests by the hundreds.
Proposed fix
Add a second, independent circuit breaker that counts consecutive requeues per worker rather than final failures, configured through something like max_consecutive_requeues (a setting separate from max_consecutive_failures). When a worker crosses the threshold, it gets fenced off the same way the existing breaker fences workers today.
This keeps the two behaviors cleanly decoupled:
max_consecutive_failures → "these tests are actually broken"
max_consecutive_requeues → "this worker is broken"
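To make the decoupling concrete, here is a minimal, language-agnostic sketch in Python of how the two counters could sit side by side. It is not the project's actual implementation: the WorkerBreakers class, the outcome strings ("passed", "failed", "requeued") and the reset rules are all assumptions for illustration. In particular, it assumes a requeue leaves the failure streak untouched (the behavior PR #378 introduced) and that any non-requeue outcome resets the requeue streak.

```python
from typing import Optional


class WorkerBreakers:
    """Hypothetical per-worker state: two independent consecutive-event counters."""

    def __init__(self, max_consecutive_failures: int, max_consecutive_requeues: int) -> None:
        self.max_consecutive_failures = max_consecutive_failures
        self.max_consecutive_requeues = max_consecutive_requeues
        self.consecutive_failures = 0
        self.consecutive_requeues = 0

    def record(self, outcome: str) -> None:
        if outcome == "requeued":
            # Requeues stay invisible to the failure breaker,
            # but now feed their own streak.
            self.consecutive_requeues += 1
        elif outcome == "failed":
            # A final failure extends the failure streak and
            # (by assumption) ends the requeue streak.
            self.consecutive_failures += 1
            self.consecutive_requeues = 0
        elif outcome == "passed":
            # A pass shows the worker can complete tests: reset both streaks.
            self.consecutive_failures = 0
            self.consecutive_requeues = 0
        else:
            raise ValueError(f"unknown outcome: {outcome!r}")

    def tripped(self) -> Optional[str]:
        """Return the name of the breaker that is open, or None."""
        if self.consecutive_requeues >= self.max_consecutive_requeues:
            return "max_consecutive_requeues"  # "this worker is broken"
        if self.consecutive_failures >= self.max_consecutive_failures:
            return "max_consecutive_failures"  # "these tests are actually broken"
        return None


breakers = WorkerBreakers(max_consecutive_failures=3, max_consecutive_requeues=5)
for _ in range(5):
    breakers.record("requeued")
print(breakers.tripped())  # max_consecutive_requeues
```

Whether a final failure should reset the requeue streak is a design choice worth settling in the real implementation; the sketch resets it so that each breaker only reacts to an unbroken run of its own event.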