fix(bridge): subprocess reaping + health probe + concurrent-stream safety #39
Open
drewstone wants to merge 1 commit into
…fety

Bridges (3395-3399) crash every 1-2h under review load. Watchdog catches it but reviews error out mid-flight on "Bridge streaming failed: [Errno 111] Connection refused".

Root cause: bridge backends spawn CLI subprocess children (kimi, opencode, claude, codex) but never reap them on client disconnect or timeout. Children stay alive for 20+ hours, leak fds and memory, and eventually trigger OOM in the bridge process itself.

Fixes:
- New executors/process-tree.ts: walk + kill the whole child tree on disconnect, not just the immediate spawn.
- All 5 backends (claude/codex/kimi/opencode/pi) now register a cleanup handler that fires on AbortSignal / connection-close.
- Health probe (routes/health.ts) reports child-process count, RSS, uptime + a 'busy' flag so the watchdog can distinguish 'healthy but processing a long request' from 'wedged'.
- Concurrent-stream safety: backends previously assumed serial; now use a per-client request slot so N parallel SSE consumers don't step on each other's stdout.
- url-translate helper + routes/translate.ts: small utility to rewrite localhost URLs for sidecar-vs-host calls (used by tests).

Tests:
- docker-executor.test.ts + smoke.test.ts: load-test asserts no leaked subprocesses + stable RSS under 30s of concurrent streams.

Branch out, force-push.
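To make the "walk + kill the whole child tree" idea concrete, here is a minimal sketch of the walk step only. It is not the contents of executors/process-tree.ts, which this page does not show. The `ProcRow` type, the `collectDescendants` name, and the idea of feeding it a pid/ppid snapshot (for example parsed from `ps` output) are all assumptions made for illustration.

```typescript
// Hypothetical sketch: not the contents of executors/process-tree.ts.
// One row per live process, e.g. parsed from a pid/ppid listing.
interface ProcRow {
  pid: number;
  ppid: number;
}

// Returns every descendant of rootPid, deepest first, so that leaves can be
// signalled before their parents. rootPid itself is not included.
function collectDescendants(rows: ProcRow[], rootPid: number): number[] {
  // Index children by parent pid for cheap lookups during the walk.
  const childrenOf = new Map<number, number[]>();
  for (const { pid, ppid } of rows) {
    const siblings = childrenOf.get(ppid) ?? [];
    siblings.push(pid);
    childrenOf.set(ppid, siblings);
  }

  const ordered: number[] = [];
  const seen = new Set<number>([rootPid]);
  const visit = (pid: number): void => {
    for (const child of childrenOf.get(pid) ?? []) {
      if (seen.has(child)) continue; // guard against malformed or cyclic input
      seen.add(child);
      visit(child); // descend first...
      ordered.push(child); // ...then record, which yields post-order
    }
  };
  visit(rootPid);
  return ordered;
}

// Example snapshot: 100 is the spawned CLI, 101 and 102 are its children,
// 103 is a grandchild, and 200 is an unrelated process.
const snapshot: ProcRow[] = [
  { pid: 100, ppid: 1 },
  { pid: 101, ppid: 100 },
  { pid: 102, ppid: 100 },
  { pid: 103, ppid: 101 },
  { pid: 200, ppid: 1 },
];
const killOrder = collectDescendants(snapshot, 100);
console.log(killOrder); // → [103, 101, 102]
```

In a Node process, each pid in the returned list could then be passed to `process.kill(pid, "SIGTERM")`, followed by the root itself. Taking the snapshot and signalling are separate steps, so a real implementation would also have to tolerate pids that exit in between.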
Bridges (3395-3399) crash every 1-2h under review load. Watchdog catches it but reviews error out mid-flight with "Bridge streaming failed: [Errno 111] Connection refused" (observed today during creative#62 and gtm#82 reviews).

Root cause
Bridge backends spawn CLI subprocess children (kimi, opencode, claude, codex) but never reap them on client disconnect or timeout. Children stay alive for 20+ hours, leak file descriptors and memory, and eventually trigger an OOM in the bridge process itself. Observed today: 23 orphan opencode children dating back to May 14 were still alive.
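A minimal sketch of tying a child's lifetime to the request follows. It illustrates the missing reaping step and is not the PR's actual handler. The `Killable` interface, the `reapOnAbort` name, and the 2-second grace period are assumptions. `Killable` is kept deliberately small so the example runs without spawning anything; Node's `ChildProcess` has a compatible `kill` method.

```typescript
// Hypothetical sketch; names and the grace period are illustrative only.
// Minimal shape needed from a child process.
interface Killable {
  kill(signal?: "SIGTERM" | "SIGKILL"): boolean;
}

// Sends SIGTERM when the signal aborts (client disconnect or timeout), then
// escalates to SIGKILL after graceMs. Returns a dispose function to call once
// the child has exited, so the listener and timer do not outlive it.
function reapOnAbort(child: Killable, signal: AbortSignal, graceMs = 2000): () => void {
  let timer: ReturnType<typeof setTimeout> | undefined;

  const onAbort = (): void => {
    child.kill("SIGTERM");
    timer = setTimeout(() => child.kill("SIGKILL"), graceMs);
  };

  if (signal.aborted) {
    onAbort(); // already cancelled before we registered
  } else {
    signal.addEventListener("abort", onAbort, { once: true });
  }

  return () => {
    signal.removeEventListener("abort", onAbort);
    if (timer !== undefined) clearTimeout(timer);
  };
}

// Demo with a fake child that records the signals it receives.
const sent: string[] = [];
const fakeChild: Killable = {
  kill: (sig = "SIGTERM") => {
    sent.push(sig);
    return true;
  },
};
const controller = new AbortController();
const dispose = reapOnAbort(fakeChild, controller.signal);
controller.abort(); // simulates the client going away
dispose(); // the child "exited": cancel the pending SIGKILL
console.log(sent); // → ["SIGTERM"]
```

A request timeout can feed the same path by passing `AbortSignal.timeout(ms)` as the signal. Note that this sketch only signals the direct child, whereas the fix described in this PR targets the whole child tree.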
Fixes
- executors/process-tree.ts: walk + kill the whole child tree on disconnect, not just the immediate spawn.
- Health probe (routes/health.ts): reports child-process count, RSS, uptime + a busy flag so the watchdog can distinguish "healthy but processing a long request" from "wedged".
- routes/translate.ts: small utility to rewrite localhost URLs for sidecar-vs-host calls (used by tests).

Tests

- docker-executor.test.ts + smoke.test.ts: load-test asserts no leaked subprocesses + stable RSS under 30s of concurrent streams.
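As a rough illustration of how a watchdog might consume the health probe's fields, the sketch below classifies one probe result. The payload field names, the `classify` function, the extra "leaking" verdict, and both thresholds are assumptions; the real routes/health.ts schema is not shown in this PR description.

```typescript
// Hypothetical payload shape; field names are assumptions, not the real schema.
interface HealthReport {
  childProcessCount: number;
  rssBytes: number; // in Node this could come from process.memoryUsage().rss
  uptimeSeconds: number; // in Node this could come from process.uptime()
  busy: boolean; // true while at least one request is in flight
}

type Verdict = "healthy" | "busy" | "leaking" | "wedged";

// Assumed limits; real values would depend on the deployment.
interface Limits {
  maxRssBytes: number;
  maxChildren: number;
}

// report === null models a probe that timed out or was refused.
function classify(report: HealthReport | null, limits: Limits): Verdict {
  if (report === null) return "wedged"; // the bridge did not answer at all
  if (report.rssBytes > limits.maxRssBytes || report.childProcessCount > limits.maxChildren) {
    return "leaking"; // answering, but accumulating memory or children
  }
  // The probe answered within limits: a long request is not a failure.
  return report.busy ? "busy" : "healthy";
}

const limits: Limits = { maxRssBytes: 1_000_000_000, maxChildren: 10 };
const longRequest: HealthReport = {
  childProcessCount: 2,
  rssBytes: 300_000_000,
  uptimeSeconds: 5400,
  busy: true,
};
console.log(classify(longRequest, limits)); // → "busy"
console.log(classify(null, limits)); // → "wedged"
```

The point of the design is that a bridge which answers the probe with the busy flag set is left alone, and only a missing answer is treated as wedged. The "leaking" verdict is an extra assumption added here to show how the child count and RSS could be used.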