Skip to content

fix(mcp): resolve poll/getline FILE* buffering mismatch causing tools/list hang#99

Open
halindrome wants to merge 5 commits intoDeusData:mainfrom
halindrome:fix/mcp-stdio-buffering
Open

fix(mcp): resolve poll/getline FILE* buffering mismatch causing tools/list hang#99
halindrome wants to merge 5 commits intoDeusData:mainfrom
halindrome:fix/mcp-stdio-buffering

Conversation

@halindrome
Copy link

Closes #98

Root Cause

cbm_mcp_server_run mixes poll() on the raw fd with getline() on a buffered FILE*. When a client sends multiple messages in rapid succession, getline() over-reads the kernel buffer into libc's FILE* buffer on the first call. Subsequent poll() calls see an empty kernel fd and block for STORE_IDLE_TIMEOUT_S (60 seconds) even though the next messages are already in the FILE* buffer.

Triggered reliably by Claude Code 2.1.80, which sends initialize + notifications/initialized + tools/list as a single burst with no inter-message delays. The MCP spec does not require delays, so this is a server-side bug.

The comment at the original call site claimed "mixing poll() on the raw fd with getline() on the buffered FILE is safe in practice"* — this is incorrect when multiple messages arrive in one kernel receive event.

Fix

Three-phase event loop in cbm_mcp_server_run:

  1. Phase 1: Non-blocking poll(timeout=0) — fast path for data in the kernel fd
  2. Phase 2: fgetc(in) + ungetc() peek — detects data already in the FILE* buffer from a prior over-reading getline(); if found, skips the blocking poll entirely
  3. Phase 3: Only when both phases confirm no data — blocking poll(STORE_IDLE_TIMEOUT_S * 1000) for idle eviction

POSIX-portable. Does not require non-blocking fd (avoids EAGAIN complexity in getline()) or GNU-only __fpending().

The inaccurate comment is corrected to document the actual hazard.

Files Changed

  • src/mcp/mcp.c — three-phase poll loop (+41 / -6 lines)
  • tests/test_mcp.cmcp_server_run_rapid_messages unit test (pipe + alarm(5))
  • scripts/test_mcp_rapid_init.py — Python integration test (simultaneous send, 5s deadline)

Test Results

  • scripts/test.sh: 2043/2043 pass
  • scripts/test_mcp_rapid_init.py: PASS against built binary and installed binary
  • Tested locally against Claude Code 2.1.80 — tools/list now responds immediately on startup

QA

Minimum 3 QA rounds per CONTRIBUTING.md guidelines. Reports will be posted as PR comments.

shanemccarron-maker and others added 4 commits March 21, 2026 11:59
- Use O_NONBLOCK + clearerr() in Phase 2 fgetc probe to preserve the
  60s idle eviction timeout when both kernel fd and FILE* buffer are
  empty (fgetc on a blocking fd would otherwise block indefinitely,
  bypassing Phase 3 poll timeout and preventing cbm_mcp_server_evict_idle)
- Add #include <fcntl.h> for fcntl()/O_NONBLOCK
- Fix comment: "two-phase" → "three-phase" (implementation has 3 phases)
- Improve Python integration test: verify id:1 (initialize) and id:2
  (tools/list) response IDs are both present, not just "tools" substring

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@halindrome
Copy link
Author

QA Round 1 — Opus 4.6

Verdict: PASS WITH MINOR FINDINGS

Check Result
Build (scripts/build.sh) ✅ PASS
Unit tests (scripts/test.sh) ✅ PASS — 2043/2043
Python integration test ✅ PASS

Findings addressed in fix(mcp): address QA round 1 findings (5ca5f32)

[Minor] fgetc() blocks on blocking fd, bypassing 60s idle eviction timeout
When both the kernel fd and FILE* buffer are empty, fgetc() on a blocking fd would block indefinitely — preventing the Phase 3 poll() timeout from firing and cbm_mcp_server_evict_idle() from running. Fixed by temporarily setting O_NONBLOCK via fcntl() before the fgetc() probe, then restoring the original flags. Added clearerr(in) to clear the ferror indicator set by EAGAIN before the blocking poll.

[Minor] Missing clearerr() before blocking poll when fgetc() returns EOF via ferror
Fixed as part of the O_NONBLOCK change above — clearerr(in) is now called before entering Phase 3.

[Nit] Comment said "two-phase" but implementation has three phases
Fixed: "two-phase" → "three-phase".

[Nit] Python integration test didn't verify response IDs
Fixed: test now checks that id:1 (initialize) and id:2 (tools/list) are both present in output, not just a substring match on "tools".

- Add explicit fallback path when fcntl(F_GETFL) fails: skip the FILE*
  peek and fall through directly to blocking poll so idle eviction still
  fires on timeout (Finding 1)
- Strengthen C unit test: verify id:1 (initialize) and id:2 (tools/list)
  response IDs are both present, not just a substring match on "tools"
  (Finding 2)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@halindrome
Copy link
Author

QA Round 2 — Opus 4.6

Verdict: PASS WITH MINOR FINDINGS

Check Result
Build ✅ PASS
Unit tests ✅ PASS — 2043/2043
Python integration test ✅ PASS

Findings addressed in fix(mcp): address QA round 2 findings (5b534ce)

[Minor] fcntl(F_GETFL) failure degraded to blocking fgetc()
If fcntl(F_GETFL) returned -1, the code fell through to fgetc() on a blocking fd, which would block indefinitely and bypass Phase 3's idle eviction timeout. Fixed: when saved_flags < 0, skip the FILE* peek entirely and fall straight through to the blocking poll() so eviction still fires.

[Minor] C unit test only checked substring "tools", not response IDs
mcp_server_run_rapid_messages now asserts both "id":1 (initialize response) and "id":2 (tools/list response) are present in output, matching the Python integration test's validation.

Pre-existing issues noted but not in scope for this PR

  • Windows WaitForSingleObject path has the same conceptual FILE* buffering issue (pre-existing, Windows not primary platform)
  • No POLLHUP/POLLERR check on poll() return (pre-existing style choice; getline() handles it gracefully)

@halindrome
Copy link
Author

QA Round 3 (Final) — Opus 4.6

Verdict: PASS — no new findings

Check Result
Build ✅ PASS — zero warnings with -Wall -Wextra -Werror
Unit tests ✅ PASS — 2043/2043
Python integration test ✅ PASS

All findings from rounds 1 and 2 have been resolved. The three-phase poll loop is correct and well-structured. No further issues identified.

This PR is ready for review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working editor/integration Editor compatibility and CLI integration

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(mcp): tools/list hangs 60s when client sends initialize + notifications/initialized + tools/list in rapid succession

3 participants