Add a separate requeue circuit breaker to handle worker corruption #406

@ianks

PR #378 made the right call decoupling requeues from the failure circuit breaker, but it acknowledged a tradeoff:

workers which are legitimately in a corrupted state will requeue more tests than before

In practice this plays out badly: a worker with corrupted runtime state (global variable pollution, database connection corruption, etc.) will keep running, reserving tests, and requeueing them because it can't actually complete them. With requeues invisible to max_consecutive_failures, nothing kills that worker and it can requeue tests by the hundreds.

Proposed fix

Add a second, independent circuit breaker that counts consecutive requeues per worker rather than final failures, configured through something like max_consecutive_requeues (a config option separate from max_consecutive_failures). When a worker crosses the threshold, it gets fenced off the same way the existing breaker fences workers today.

This keeps the two behaviors cleanly decoupled:

  • max_consecutive_failures → "these tests are actually broken"
  • max_consecutive_requeues → "this worker is broken"
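
To make the proposal concrete, here is a minimal Python sketch of two independent consecutive-event breakers on one worker. It is not the project's actual code: the `ConsecutiveBreaker` and `Worker` classes, the `report` method, and the outcome strings are all hypothetical names chosen for illustration. It also assumes that only a passing test resets the streaks, and that a requeue never touches the failure streak (and vice versa), which is one possible reading of "cleanly decoupled".

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveBreaker:
    """Trips once `threshold` events have been recorded in a row."""

    threshold: int
    streak: int = 0

    def record(self) -> None:
        self.streak += 1

    def reset(self) -> None:
        self.streak = 0

    @property
    def open(self) -> bool:
        return self.streak >= self.threshold


class Worker:
    """Hypothetical worker holding one breaker per failure mode."""

    def __init__(self, max_consecutive_failures: int, max_consecutive_requeues: int):
        # "these tests are actually broken"
        self.failures = ConsecutiveBreaker(max_consecutive_failures)
        # "this worker is broken"
        self.requeues = ConsecutiveBreaker(max_consecutive_requeues)

    def report(self, outcome: str) -> None:
        if outcome == "passed":
            # Assumption: a pass proves the worker can still complete tests,
            # so it clears both streaks.
            self.failures.reset()
            self.requeues.reset()
        elif outcome == "requeued":
            # Requeues stay invisible to the failure breaker.
            self.requeues.record()
        elif outcome == "failed":
            # Final failures stay invisible to the requeue breaker.
            self.failures.record()
        else:
            raise ValueError(f"unknown outcome: {outcome!r}")

    @property
    def should_stop(self) -> bool:
        # Either breaker opening is enough to fence the worker off.
        return self.failures.open or self.requeues.open


worker = Worker(max_consecutive_failures=3, max_consecutive_requeues=2)
for outcome in ["requeued", "requeued"]:
    worker.report(outcome)
print(worker.should_stop, worker.failures.open, worker.requeues.open)
```

In this sketch a corrupted worker that requeues every test it reserves trips the requeue breaker after two tests, even though the failure breaker never sees a single final failure. Whether a final failure should also reset the requeue streak is a design question the sketch leaves open.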
