fix(bridge): subprocess reaping + health probe + concurrent-stream safety #39

Open: drewstone wants to merge 1 commit into main from fix/bridge-health-watchdog-stability

@drewstone (Owner)

Bridges (3395-3399) crash every 1-2 hours under review load. The watchdog catches the crash, but in-flight reviews fail with "Bridge streaming failed: [Errno 111] Connection refused" (observed today during the creative#62 and gtm#82 reviews).

Root cause

Bridge backends spawn CLI subprocess children (kimi, opencode, claude, codex) but never reap them on client disconnect or timeout. The children stay alive for 20+ hours, leaking file descriptors and memory, and eventually trigger an OOM in the bridge process itself. Observed today: 23 orphaned opencode children, the oldest dating back to May 14, were still alive.
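The reaping step that the root cause describes as missing can be sketched roughly as follows. This is a minimal illustration, not the contents of executors/process-tree.ts: `ProcRow`, `killOrder`, `reapOnAbort`, and the injected `snapshot` and `kill` callbacks are all hypothetical names, and the process table is assumed to come from something like `ps -eo pid,ppid`.

```typescript
// Hypothetical shape: one row per live process (e.g. parsed from `ps -eo pid,ppid`).
interface ProcRow {
  pid: number;
  ppid: number;
}

// Injected so the sketch stays testable; in a bridge this could wrap process.kill.
type KillFn = (pid: number, signal: "SIGTERM" | "SIGKILL") => void;

/** Returns rootPid and all of its descendants, deepest first, root last. */
function killOrder(rows: ProcRow[], rootPid: number): number[] {
  const childrenOf = new Map<number, number[]>();
  for (const { pid, ppid } of rows) {
    const siblings = childrenOf.get(ppid) ?? [];
    siblings.push(pid);
    childrenOf.set(ppid, siblings);
  }

  const order: number[] = [];
  const seen = new Set<number>();
  const visit = (pid: number): void => {
    if (seen.has(pid)) return; // guard against a malformed (cyclic) table
    seen.add(pid);
    for (const child of childrenOf.get(pid) ?? []) visit(child);
    order.push(pid); // post-order: children are listed before their parent
  };
  visit(rootPid);
  return order;
}

/** Kills the whole tree under rootPid when the request's AbortSignal fires. */
function reapOnAbort(
  signal: AbortSignal,
  rootPid: number,
  snapshot: () => ProcRow[],
  kill: KillFn,
): void {
  const reap = (): void => {
    for (const pid of killOrder(snapshot(), rootPid)) {
      try {
        kill(pid, "SIGTERM");
      } catch {
        // The process may already have exited; nothing left to reap.
      }
    }
  };
  // An already-aborted signal never fires "abort" again, so handle it up front.
  if (signal.aborted) reap();
  else signal.addEventListener("abort", reap, { once: true });
}
```

In a real backend the `kill` callback would presumably forward to `process.kill`, and the same `AbortSignal` would also be aborted by the request timeout, so disconnects and timeouts share one cleanup path. Killing leaf-first is a design choice of this sketch; escalating to SIGKILL after a grace period is left out for brevity.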

Fixes

  • New executors/process-tree.ts: walks and kills the whole child tree on disconnect, not just the immediately spawned process.
  • All 5 backends (claude/codex/kimi/opencode/pi) now register a cleanup handler that fires on AbortSignal abort or connection close.
  • Health probe (routes/health.ts): reports child-process count, RSS, uptime, and a busy flag so the watchdog can distinguish "healthy but processing a long request" from "wedged".
  • Concurrent-stream safety: backends previously assumed requests were serial; they now use a per-client request slot so N parallel SSE consumers don't step on each other's stdout.
  • url-translate helper + routes/translate.ts: small utility that rewrites localhost URLs for sidecar-vs-host calls (used by tests).
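To illustrate how a busy flag lets a watchdog tell a long-running request apart from a wedged bridge, here is a minimal sketch. The report fields, the `Limits` thresholds, and the rule that a missing probe response or a breached limit counts as "wedged" are assumptions made for illustration; they are not taken from routes/health.ts.

```typescript
// Hypothetical report shape; field names are illustrative only.
interface HealthReport {
  childProcesses: number; // live CLI children owned by the bridge
  rssBytes: number; // resident set size of the bridge process
  uptimeSec: number;
  busy: boolean; // true while at least one request is in flight
}

// Assumed watchdog thresholds; real values would need tuning.
interface Limits {
  maxChildren: number;
  maxRssBytes: number;
}

type Verdict = "healthy" | "busy" | "wedged";

/** Bridge side: assemble the report from counters the bridge already tracks. */
function buildReport(
  inFlightRequests: number,
  childPids: ReadonlySet<number>,
  rssBytes: number,
  uptimeSec: number,
): HealthReport {
  return {
    childProcesses: childPids.size,
    rssBytes,
    uptimeSec,
    busy: inFlightRequests > 0,
  };
}

/**
 * Watchdog side. `report === null` means the probe got no answer within the
 * watchdog's own timeout.
 */
function classify(report: HealthReport | null, limits: Limits): Verdict {
  if (report === null) return "wedged"; // not even the probe responds
  if (
    report.childProcesses > limits.maxChildren ||
    report.rssBytes > limits.maxRssBytes
  ) {
    return "wedged"; // assumption: runaway children or memory warrant a restart
  }
  // The probe answered and resources look sane: a slow request is just "busy".
  return report.busy ? "busy" : "healthy";
}
```

In Node, `rssBytes` and `uptimeSec` could be filled from `process.memoryUsage().rss` and `process.uptime()`. The point of the sketch is that a watchdog restarts only on "wedged", so a bridge that is slow because it is streaming a long review is left alone.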

Tests

  • docker-executor.test.ts + smoke.test.ts: the load test asserts no leaked subprocesses and stable RSS under 30s of concurrent streams.
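The two assertions could be expressed with helpers along these lines. This is a sketch under stated assumptions, not the code in smoke.test.ts: `leakedPids` and `rssIsStable` are hypothetical names, and "stable" is assumed here to mean that, after discarding warm-up samples, RSS never exceeds the first kept sample by more than a given fraction.

```typescript
/** PIDs present after the load run but absent before it: candidate leaks. */
function leakedPids(before: Iterable<number>, after: Iterable<number>): number[] {
  const baseline = new Set(before);
  return Array.from(after)
    .filter((pid) => !baseline.has(pid))
    .sort((a, b) => a - b);
}

/**
 * Assumed definition of "stable RSS": drop the first `warmup` samples, then
 * require every later sample to stay within `tolerance` (e.g. 0.1 = 10%) of
 * the first kept sample.
 */
function rssIsStable(samples: number[], warmup: number, tolerance: number): boolean {
  const [baseline, ...rest] = samples.slice(warmup);
  if (baseline === undefined || rest.length === 0) {
    throw new Error("need at least two post-warm-up samples");
  }
  return Math.max(...rest) <= baseline * (1 + tolerance);
}
```

A load test would then snapshot child PIDs before and after the 30s of concurrent streams, sample RSS periodically during the run, and fail if `leakedPids` is non-empty or `rssIsStable` returns false.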

Branch out, force-push.