fix(code-reviews): parallelize dispatch and fix stuck queued reviews#1177
fix(code-reviews): parallelize dispatch and fix stuck queued reviews#1177alex-alecu wants to merge 9 commits intomainfrom
Conversation
…eview Reorder so the CF worker dispatch happens before the DB status update. If dispatch fails, the review stays 'pending' and can be retried on the next dispatch cycle, eliminating the stuck-in-queued state that occurs when the process crashes between the DB write and the dispatch call.
Replace the sequential for-loop with Promise.allSettled so all pending reviews are dispatched concurrently. Each review has independent config, payload preparation, and worker dispatch — no ordering dependency. This reduces dispatch time from ~60-100s to ~3-5s for 20 reviews.
Code Review SummaryStatus: 1 Issue Found | Recommendation: Address before merge Overview
Issue Details (click to expand)CRITICAL
WARNING
SUGGESTION
Fix these issues in Kilo Cloud Other Observations (not in diff)Issues found in unchanged code that cannot receive inline comments: None. Files Reviewed (1 files)
Reviewed by gpt-5.4-20260305 · 304,699 tokens |
…t races Add a conditional UPDATE ... WHERE status = 'pending' with RETURNING to atomically claim each review before dispatching. If another concurrent dispatcher already claimed the review, skip it. This prevents the double-dispatch race that the order swap alone introduced.
Move the atomic claim to just before the worker call (after all prep work) to minimise the window where a crash leaves a review stuck in 'queued'. On dispatch failure, revert the claim back to 'pending'. As a safety net, the pending-reviews query now also picks up stale queued reviews (older than 5 minutes) that were abandoned by a crashed process, so they become eligible for re-dispatch.
…, scope stale claim predicate Worker dispatch failures (timeout, 5xx) now revert the claim to 'pending' and return false instead of re-throwing — the outer handler only marks truly non-retriable errors (missing config, payload prep failure) as 'failed'. The atomic claim WHERE for stale-queued rows now includes the same staleness predicate (updated_at < cutoff) as the selector query, ensuring only one dispatcher can revive an abandoned row.
Exclude stale queued rows from active slot accounting so abandoned claims can be reclaimed even when an owner appears to be at capacity. Keep ambiguous worker dispatch failures queued for stale recovery instead of reverting to pending, and persist agentVersion without rewriting status or failing a live review if the follow-up DB write errors.
| eq(cloud_agent_code_reviews.status, 'pending'), | ||
| and( | ||
| eq(cloud_agent_code_reviews.status, 'queued'), | ||
| lt(cloud_agent_code_reviews.updated_at, staleCutoff) |
There was a problem hiding this comment.
WARNING: Stale queued rows are not guaranteed to be abandoned
CodeReviewOrchestrator.updateStatus('running') explicitly continues when the DB callback fails (cloudflare-code-review-infra/src/code-review-orchestrator.ts:261), so a live review can sit in Postgres as stale queued even though its Durable Object is already running. Re-dispatching it here sends a second POST /review for the same reviewId; start() then resets that existing DO back to queued (cloudflare-code-review-infra/src/index.ts:84, cloudflare-code-review-infra/src/code-review-orchestrator.ts:365), which can launch the review twice.
…m:Kilo-Org/cloud into fix/parallelize-dispatch-fix-stuck-queued
| agentVersion?: string; | ||
| previousCloudAgentSessionId?: string; | ||
| }): Promise<{ status: CodeReviewStatus }> { | ||
| await this.loadState(); |
There was a problem hiding this comment.
WARNING: loadState() can clobber a live orchestrator state
start() is now the duplicate-dispatch path, so it can run while this Durable Object is already streaming a review. loadState() always replaces this.state with the last persisted snapshot before the early return, which drops any in-memory changes that have not been flushed yet (for example buffered events and token usage accumulated in processEventStream()). Guard the reload with if (!this.state) so duplicate starts stay idempotent without rewinding a live review.
Summary
queuedstatus. Previously, the DB status was updated toqueuedbefore dispatching to the CF worker — if the process crashed between those two operations, the review would never be dispatched but could never be retried (it was no longerpending). The fix reorders these operations so the worker dispatch happens first; on failure the review stayspendingand is retried on the next dispatch cycle.Promise.allSettledfor parallel execution. Each review's dispatch is independent (own config, payload prep, worker call), so there's no ordering dependency. This reduces dispatch time for 20 reviews from ~60-100s to ~3-5s, eliminating the dashboard experience of "1 running, many queued for minutes."Verification
pnpm typecheck— root project passes cleanly (pre-existing errors incloud-agent-next/wrapperunrelated to changes)pnpm run test— all 174 test suites pass, 2746 tests passdispatch-pending-reviews.ts; changes are mechanical (reorder two lines + replace for-loop with Promise.allSettled)Visual Changes
N/A
Reviewer Notes
prepareReviewPayload()is the heaviest call per dispatch (~3-5s, multiple DB + GitHub/GitLab API calls). Parallelizing means all reviews hit these APIs concurrently — should be fine since each review targets different resources.isFeatureFlagEnabled) will also run concurrently across all reviews. No contention expected.reviewId) so there's no contention on the worker side.runReview()has a guard (status !== 'queued') that prevents double execution, so even in the unlikely event of a double-dispatch, the review won't run twice.