Skip to content

fix(code-reviews): parallelize dispatch and fix stuck queued reviews#1177

Open
alex-alecu wants to merge 9 commits intomainfrom
fix/parallelize-dispatch-fix-stuck-queued
Open

fix(code-reviews): parallelize dispatch and fix stuck queued reviews#1177
alex-alecu wants to merge 9 commits intomainfrom
fix/parallelize-dispatch-fix-stuck-queued

Conversation

@alex-alecu
Copy link
Contributor

Summary

  • Fixes a crash-safety bug where reviews can get permanently stuck in queued status. Previously, the DB status was updated to queued before dispatching to the CF worker — if the process crashed between those two operations, the review would never be dispatched but could never be retried (it was no longer pending). The fix reorders these operations so the worker dispatch happens first; on failure the review stays pending and is retried on the next dispatch cycle.
  • Replaces sequential review dispatch with Promise.allSettled for parallel execution. Each review's dispatch is independent (own config, payload prep, worker call), so there's no ordering dependency. This reduces dispatch time for 20 reviews from ~60-100s to ~3-5s, eliminating the dashboard experience of "1 running, many queued for minutes."

Verification

  • pnpm typecheck — root project passes cleanly (pre-existing errors in cloud-agent-next/wrapper unrelated to changes)
  • pnpm run test — all 174 test suites pass, 2746 tests pass
  • No existing test file for dispatch-pending-reviews.ts; changes are mechanical (reorder two lines + replace for-loop with Promise.allSettled)

Visual Changes

N/A

Reviewer Notes

  • prepareReviewPayload() is the heaviest call per dispatch (~3-5s, multiple DB + GitHub/GitLab API calls). Parallelizing means all reviews hit these APIs concurrently — should be fine since each review targets different resources.
  • PostHog feature flag checks (isFeatureFlagEnabled) will also run concurrently across all reviews. No contention expected.
  • The CF worker client has a 10s timeout per dispatch. With 20 parallel dispatches, all 20 will be in-flight at once. Each creates its own DO (keyed by reviewId) so there's no contention on the worker side.
  • The DO's runReview() has a guard (status !== 'queued') that prevents double execution, so even in the unlikely event of a double-dispatch, the review won't run twice.

…eview

Reorder so the CF worker dispatch happens before the DB status update.
If dispatch fails, the review stays 'pending' and can be retried on the
next dispatch cycle, eliminating the stuck-in-queued state that occurs
when the process crashes between the DB write and the dispatch call.
Replace the sequential for-loop with Promise.allSettled so all pending
reviews are dispatched concurrently. Each review has independent config,
payload preparation, and worker dispatch — no ordering dependency.
This reduces dispatch time from ~60-100s to ~3-5s for 20 reviews.
@kilo-code-bot
Copy link
Contributor

kilo-code-bot bot commented Mar 17, 2026

Code Review Summary

Status: 1 Issue Found | Recommendation: Address before merge

Overview

Severity Count
CRITICAL 0
WARNING 1
SUGGESTION 0
Issue Details (click to expand)

CRITICAL

File Line Issue

WARNING

File Line Issue
cloudflare-code-review-infra/src/code-review-orchestrator.ts 378 Unconditionally reloading Durable Object storage on the duplicate-start path can overwrite live in-memory state and drop buffered events or usage data that have not been persisted yet.

SUGGESTION

File Line Issue

Fix these issues in Kilo Cloud

Other Observations (not in diff)

Issues found in unchanged code that cannot receive inline comments:

None.

Files Reviewed (1 files)
  • cloudflare-code-review-infra/src/code-review-orchestrator.ts - 1 issue

Reviewed by gpt-5.4-20260305 · 304,699 tokens

…t races

Add a conditional UPDATE ... WHERE status = 'pending' with RETURNING to
atomically claim each review before dispatching. If another concurrent
dispatcher already claimed the review, skip it. This prevents the
double-dispatch race that the order swap alone introduced.
Move the atomic claim to just before the worker call (after all prep
work) to minimise the window where a crash leaves a review stuck in
'queued'. On dispatch failure, revert the claim back to 'pending'.

As a safety net, the pending-reviews query now also picks up stale
queued reviews (older than 5 minutes) that were abandoned by a crashed
process, so they become eligible for re-dispatch.
…, scope stale claim predicate

Worker dispatch failures (timeout, 5xx) now revert the claim to
'pending' and return false instead of re-throwing — the outer handler
only marks truly non-retriable errors (missing config, payload prep
failure) as 'failed'.

The atomic claim WHERE for stale-queued rows now includes the same
staleness predicate (updated_at < cutoff) as the selector query,
ensuring only one dispatcher can revive an abandoned row.
Exclude stale queued rows from active slot accounting so abandoned claims
can be reclaimed even when an owner appears to be at capacity. Keep
ambiguous worker dispatch failures queued for stale recovery instead of
reverting to pending, and persist agentVersion without rewriting status
or failing a live review if the follow-up DB write errors.
eq(cloud_agent_code_reviews.status, 'pending'),
and(
eq(cloud_agent_code_reviews.status, 'queued'),
lt(cloud_agent_code_reviews.updated_at, staleCutoff)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING: Stale queued rows are not guaranteed to be abandoned

CodeReviewOrchestrator.updateStatus('running') explicitly continues when the DB callback fails (cloudflare-code-review-infra/src/code-review-orchestrator.ts:261), so a live review can sit in Postgres as stale queued even though its Durable Object is already running. Re-dispatching it here sends a second POST /review for the same reviewId; start() then resets that existing DO back to queued (cloudflare-code-review-infra/src/index.ts:84, cloudflare-code-review-infra/src/code-review-orchestrator.ts:365), which can launch the review twice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c52af03.

agentVersion?: string;
previousCloudAgentSessionId?: string;
}): Promise<{ status: CodeReviewStatus }> {
await this.loadState();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING: loadState() can clobber a live orchestrator state

start() is now the duplicate-dispatch path, so it can run while this Durable Object is already streaming a review. loadState() always replaces this.state with the last persisted snapshot before the early return, which drops any in-memory changes that have not been flushed yet (for example buffered events and token usage accumulated in processEventStream()). Guard the reload with if (!this.state) so duplicate starts stay idempotent without rewinding a live review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants