Skip to content

fix(queue): close TOCTOU race between schedule() and start()#165

Open
appleboy wants to merge 6 commits intomasterfrom
worktree-pr-154
Open

fix(queue): close TOCTOU race between schedule() and start()#165
appleboy wants to merge 6 commits intomasterfrom
worktree-pr-154

Conversation

@appleboy
Copy link
Copy Markdown
Member

Summary

  • Fix a time-of-check-time-of-use race where concurrent schedule() calls could observe a stale BusyWorkers() value and over-dispatch beyond workerCount
  • Introduce a mutex-protected activeSlots counter that atomically reserves worker slots, eliminating the gap between the scheduling decision and the busy counter increment
  • Add regression test TestBusyWorkersNeverExceedsWorkerCount that validates the invariant under concurrency pressure

Root Cause

schedule() checked BusyWorkers() under the mutex and sent on the ready channel, but IncBusyWorker() in start() happened much later — after receiving from ready, spawning a request goroutine, and waiting for a task. During this window, concurrent schedule() calls (from work() defers) saw stale values and could send extra ready signals.

Approach

Added an internal activeSlots int64 field that tracks reserved (committed) worker slots. Both schedule() and start() check/modify this counter under the same mutex, closing the TOCTOU gap entirely. The public BusyWorkers() metric remains unchanged — it still only counts workers actively executing tasks.

Test plan

  • go test -race -v ./... — all existing tests pass
  • go test -race -count=50 -run=TestBusyWorkersNeverExceedsWorkerCount — stress test passes
  • go test -race -count=10 -run=TestDecreaseWorkerCount — dynamic worker count still works
  • Benchmarks show no regression (0 additional allocations)

Closes #154

🤖 Generated with Claude Code

Introduce an activeSlots counter protected by the existing mutex to
track reserved worker slots. This eliminates the window where
schedule() could observe a stale BusyWorkers() value and over-dispatch
beyond workerCount.

Closes #154

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 30, 2026 06:10
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a concurrency race in the queue scheduler so concurrent schedule() calls can’t over-dispatch beyond the configured workerCount.

Changes:

  • Add a mutex-protected activeSlots counter to reserve worker capacity earlier and close the TOCTOU gap between scheduling and “busy” accounting.
  • Decrement activeSlots when a task finishes (and on early-exit paths) to release capacity safely.
  • Add a regression test intended to assert BusyWorkers() never exceeds workerCount under load.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
queue.go Introduces activeSlots and integrates it into schedule(), start(), and work() to prevent over-dispatch.
ring_test.go Adds TestBusyWorkersNeverExceedsWorkerCount to validate the busy-workers invariant under concurrent execution.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread ring_test.go
Comment thread ring_test.go Outdated
- Use a gate channel to block tasks and create scheduling pressure
- Monitor BusyWorkers continuously from a dedicated goroutine
- Replace fixed 2s sleep with deterministic WaitGroup completion

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread ring_test.go
Replace tight select/default loop with a 100µs ticker for sampling
BusyWorkers, avoiding unnecessary CPU consumption during tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread ring_test.go Outdated
Comment thread ring_test.go Outdated
- Honor ctx.Done() in task function to prevent goroutine leaks on early exit
- Add 10s timeout when sending gate tokens to fail fast on deadlock

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

appleboy and others added 2 commits April 30, 2026 16:11
- Extract tryReserveSlot, releaseSlot, signalReadyLocked helpers to replace three inline mutex+counter blocks
- Fold work() defer's slot release and schedule signal into one critical section, saving a lock round-trip per task
- Expand activeSlots field comment to clarify its lifecycle versus BusyWorkers

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Shorten trailing comment to under 120 chars; helper docs carry the detail

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Number of workers can be exceeded + Unit Test

2 participants