
fix(agentmain): replace fixed-timeout queue polling with lifecycle-aware drain_queue #374

Open

sontianye wants to merge 1 commit into lsdefine:main from sontianye:worktree-fix-task-timeout

Conversation

@sontianye
Contributor

Problem

In --task mode, any job running longer than 300 s raised an unhandled
queue.Empty exception. The background agent thread was still alive and
continued pushing items onto the display queue — but with no consumer left
to drain it, queue.put() eventually blocked, freezing the agent thread
permanently. All subsequent messages received no response.

--reflect mode had a try/except around the same pattern, so it degraded
more gracefully, but a 180 s ceiling still caused silent failures on
longer-running reflect tasks.

The root cause in both cases was the same: the consumer was racing against a
wall-clock deadline with no knowledge of whether the agent was still alive.
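
For concreteness, the old consumer pattern looked roughly like this (reconstructed for illustration, not copied from the code; 300 s is the task-mode ceiling):

```python
import queue

dq = queue.Queue()  # display queue fed by the background agent thread

# The consumer blocks with a fixed wall-clock timeout. If the agent is
# still working when 300 s elapse, queue.Empty is raised; task mode had
# no try/except, so the consumer died while the producer kept put()-ing.
item = dq.get(timeout=300)
```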

Fix

Add GenericAgent.drain_queue(dq) — a generator that ties the wait to the
agent's actual lifecycle instead of a fixed timeout:

  • Polls with a short 2 s interval so the wait doesn't spin the CPU
  • Keeps looping as long as self.is_running is True, regardless of elapsed
    time
  • When is_running flips to False, performs one final non-blocking flush
    to collect any items enqueued in the narrow window between the last put()
    and the flag being cleared
  • Yields every item so callers can handle both streaming "next" progress
    and the final "done" result in a single loop (see the sketch below)
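
A minimal sketch of the approach, reconstructed from the description above (the method and flag names come from this PR; poll_interval is an assumed parameter name, not confirmed by the diff):

```python
import queue

class GenericAgent:
    def __init__(self):
        self.is_running = False  # set by the agent thread while it works

    def drain_queue(self, dq, poll_interval=2.0):
        """Yield display-queue items until the agent finishes.

        The wait is tied to the agent's lifecycle (self.is_running)
        rather than a wall-clock deadline, so long-running tasks can
        never hit an uncaught queue.Empty.
        """
        while self.is_running:
            try:
                # Short timeout: wake up to re-check is_running instead
                # of blocking forever or busy-waiting.
                yield dq.get(timeout=poll_interval)
            except queue.Empty:
                continue
        # Final non-blocking flush: collect items enqueued in the window
        # between the last put() and is_running being cleared.
        while True:
            try:
                yield dq.get_nowait()
            except queue.Empty:
                break
```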

Both --task and --reflect modes are updated to use drain_queue.
The method is also exposed as a public API for third-party frontends and
subagent-orchestration code that currently open-code their own polling loops.
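
A consumer then handles streaming progress and the final result in one loop. This usage sketch assumes queue items are dicts with a "type" field of "next" or "done", which is inferred from the wording above rather than taken from the diff:

```python
import queue

dq = queue.Queue()
agent = GenericAgent()
# The background agent thread (started elsewhere) sets agent.is_running,
# pushes items onto dq, and clears the flag when it finishes.

for item in agent.drain_queue(dq):
    if item.get("type") == "next":    # streaming progress
        print(item.get("content", ""), end="", flush=True)
    elif item.get("type") == "done":  # final result
        print(item.get("result"))
```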

Testing

Manually verified with a task that runs ~10 minutes (70 turns, slow model):

  • Before: process hangs after 5 minutes, subsequent messages get no reply
  • After: task completes normally, output written correctly, next message
    is handled immediately

