Skip to content

fix: reconnect MCP client on terminated session#8694

Open
EterUltimate wants to merge 1 commit into
AstrBotDevs:masterfrom
EterUltimate:codex/automation-4-mcp-reconnect-20260609
Open

fix: reconnect MCP client on terminated session#8694
EterUltimate wants to merge 1 commit into
AstrBotDevs:masterfrom
EterUltimate:codex/automation-4-mcp-reconnect-20260609

Conversation

@EterUltimate

@EterUltimate EterUltimate commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Treat MCP Session terminated failures as reconnectable errors, while keeping generic terminated text out of the retry path.
  • Preserve the existing tenacity retry/backoff flow and only catch Exception, so cancellation/shutdown signals still propagate.
  • Add a regression test for reconnect-on-session-terminated and narrow error matching.

Fixes #8681.

Evidence

Verification

  • git diff --check
  • uv run pytest tests/unit/test_mcp_client_reconnect.py tests/unit/test_mcp_client_schema.py -q (7 passed)
  • uv run ruff check .
  • uv run ruff format --check .
  • uv run astrbot --help

Local full-suite note

  • uv run pytest tests/ -q initially failed because pytest could not access C:\Users\zacza\AppData\Local\Temp\pytest-of-zacza.
  • Rerun with --basetemp C:\tmp\pytest-automation-4-mcp completed collection/execution but exposed 21 existing Windows-path/newline/fixture failures unrelated to astrbot/core/agent/mcp_client.py.

Summary by Sourcery

Handle MCP client reconnects on session-terminated errors while preserving existing retry behavior and error propagation.

Bug Fixes:

  • Treat MCP 'session terminated' errors as reconnectable so MCP clients can recover after server restarts.
  • Avoid overbroad retry matching by restricting reconnect logic to specific MCP session termination messages.

Tests:

  • Add regression tests to verify reconnect behavior on session-terminated errors and ensure narrow error detection.

@dosubot dosubot Bot added size:S This PR changes 10-29 lines, ignoring generated files. area:core The bug / feature is about astrbot's core, backend labels Jun 9, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request expands the MCP client's reconnection logic to trigger on a broader set of session termination errors, rather than just anyio.ClosedResourceError. It introduces a helper function _is_mcp_reconnect_error to identify these errors, updates the retry decorator, and adds comprehensive unit tests. The review feedback points out a potential NameError at runtime if anyio is not installed, as it is an optional dependency, and suggests guarding the check with "anyio" in globals().

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +117 to +122
def _is_mcp_reconnect_error(exc: BaseException) -> bool:
if isinstance(exc, anyio.ClosedResourceError):
return True

message = str(exc).lower()
return any(marker in message for marker in _MCP_RECONNECT_ERROR_MESSAGES)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Since anyio is an optional dependency (imported inside a try...except block on line 100), referencing anyio.ClosedResourceError directly will raise a NameError at runtime if anyio is not installed and _is_mcp_reconnect_error is executed (for example, when call_tool_with_reconnect raises a ValueError because self.session is None). Guarding the check with "anyio" in globals() prevents this runtime error.

Suggested change
def _is_mcp_reconnect_error(exc: BaseException) -> bool:
if isinstance(exc, anyio.ClosedResourceError):
return True
message = str(exc).lower()
return any(marker in message for marker in _MCP_RECONNECT_ERROR_MESSAGES)
def _is_mcp_reconnect_error(exc: BaseException) -> bool:
if "anyio" in globals() and isinstance(exc, anyio.ClosedResourceError):
return True
message = str(exc).lower()
return any(marker in message for marker in _MCP_RECONNECT_ERROR_MESSAGES)

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 2 issues

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location path="tests/unit/test_mcp_client_reconnect.py" line_range="30-33" />
<code_context>
+        }
+
+
+def test_mcp_reconnect_error_detection_is_narrow() -> None:
+    assert mcp_client._is_mcp_reconnect_error(RuntimeError("Session terminated"))
+    assert not mcp_client._is_mcp_reconnect_error(
+        RuntimeError("business flow terminated normally")
+    )
+
</code_context>
<issue_to_address>
**suggestion (testing):** Add more edge cases to the reconnect error detection test to better pin down the intended matching behavior.

Since `_is_mcp_reconnect_error` drives retry behavior, it would be good to exercise a few more explicit cases here. Consider adding:

- An `anyio.ClosedResourceError` to cover the type-based branch.
- A case-insensitivity check (e.g. `"SESSION TERMINATED"`).
- The alternate message `"session was terminated"` from `_MCP_RECONNECT_ERROR_MESSAGES`.
- A `RuntimeError("terminated")` or similar to show that plain "terminated" without "session" is not treated as reconnectable.

Parametrizing these cases would clearly document the contract and help prevent regressions in the matching logic.

Suggested implementation:

```python
import anyio
import pytest
from tenacity import wait_none

from astrbot.core.agent import mcp_client

```

```python
@pytest.mark.parametrize(
    ("error", "expected"),
    [
        # Exact known reconnectable message
        (RuntimeError("Session terminated"), True),
        # Case insensitivity
        (RuntimeError("SESSION TERMINATED"), True),
        # Alternate configured reconnectable message
        (RuntimeError("session was terminated"), True),
        # Type-based reconnectable error
        (anyio.ClosedResourceError(), True),
        # Non-reconnectable messages
        (RuntimeError("business flow terminated normally"), False),
        # "terminated" without "session" should not be treated as reconnectable
        (RuntimeError("terminated"), False),
    ],
)
def test_mcp_reconnect_error_detection_is_narrow(error: BaseException, expected: bool) -> None:
    assert mcp_client._is_mcp_reconnect_error(error) is expected

```
</issue_to_address>

### Comment 2
<location path="tests/unit/test_mcp_client_reconnect.py" line_range="37-46" />
<code_context>
+    )
+
+
+@pytest.mark.asyncio
+async def test_call_tool_reconnects_on_session_terminated(monkeypatch) -> None:
+    monkeypatch.setattr(mcp_client, "wait_exponential", lambda **_: wait_none())
+
+    client = mcp_client.MCPClient()
+    session = FlakyMcpSession()
+    reconnects = 0
+
+    async def reconnect() -> None:
+        nonlocal reconnects
+        reconnects += 1
+        client.session = session
+
+    client.session = session
+    client._reconnect = reconnect
+
+    result = await client.call_tool_with_reconnect(
+        tool_name="lookup",
+        arguments={"url": "https://example.com"},
+        read_timeout_seconds=timedelta(seconds=5),
+    )
+
+    assert result == {
+        "name": "lookup",
+        "arguments": {"url": "https://example.com"},
+        "timeout": 5.0,
+    }
+    assert session.calls == 2
+    assert reconnects == 1
</code_context>
<issue_to_address>
**suggestion (testing):** Add a test ensuring non-reconnectable exceptions are not retried and propagate instead of triggering reconnect.

To exercise the new narrow error handling, please add a test where the first call raises a non-reconnectable exception (e.g., `ValueError("business logic failed")`) and assert that no reconnects occur (e.g., `reconnects == 0`) and the exception is re-raised (e.g., `with pytest.raises(ValueError): call_tool_with_reconnect(...)`). This will cover the `if not _is_mcp_reconnect_error(exc): raise` path and guard against masking unrelated failures.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread tests/unit/test_mcp_client_reconnect.py Outdated
Comment on lines +30 to +33
def test_mcp_reconnect_error_detection_is_narrow() -> None:
assert mcp_client._is_mcp_reconnect_error(RuntimeError("Session terminated"))
assert not mcp_client._is_mcp_reconnect_error(
RuntimeError("business flow terminated normally")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Add more edge cases to the reconnect error detection test to better pin down the intended matching behavior.

Since _is_mcp_reconnect_error drives retry behavior, it would be good to exercise a few more explicit cases here. Consider adding:

  • An anyio.ClosedResourceError to cover the type-based branch.
  • A case-insensitivity check (e.g. "SESSION TERMINATED").
  • The alternate message "session was terminated" from _MCP_RECONNECT_ERROR_MESSAGES.
  • A RuntimeError("terminated") or similar to show that plain "terminated" without "session" is not treated as reconnectable.

Parametrizing these cases would clearly document the contract and help prevent regressions in the matching logic.

Suggested implementation:

import anyio
import pytest
from tenacity import wait_none

from astrbot.core.agent import mcp_client
@pytest.mark.parametrize(
    ("error", "expected"),
    [
        # Exact known reconnectable message
        (RuntimeError("Session terminated"), True),
        # Case insensitivity
        (RuntimeError("SESSION TERMINATED"), True),
        # Alternate configured reconnectable message
        (RuntimeError("session was terminated"), True),
        # Type-based reconnectable error
        (anyio.ClosedResourceError(), True),
        # Non-reconnectable messages
        (RuntimeError("business flow terminated normally"), False),
        # "terminated" without "session" should not be treated as reconnectable
        (RuntimeError("terminated"), False),
    ],
)
def test_mcp_reconnect_error_detection_is_narrow(error: BaseException, expected: bool) -> None:
    assert mcp_client._is_mcp_reconnect_error(error) is expected

Comment on lines +37 to +46
@pytest.mark.asyncio
async def test_call_tool_reconnects_on_session_terminated(monkeypatch) -> None:
monkeypatch.setattr(mcp_client, "wait_exponential", lambda **_: wait_none())

client = mcp_client.MCPClient()
session = FlakyMcpSession()
reconnects = 0

async def reconnect() -> None:
nonlocal reconnects

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Add a test ensuring non-reconnectable exceptions are not retried and propagate instead of triggering reconnect.

To exercise the new narrow error handling, please add a test where the first call raises a non-reconnectable exception (e.g., ValueError("business logic failed")) and assert that no reconnects occur (e.g., reconnects == 0) and the exception is re-raised (e.g., with pytest.raises(ValueError): call_tool_with_reconnect(...)). This will cover the if not _is_mcp_reconnect_error(exc): raise path and guard against masking unrelated failures.

@EterUltimate EterUltimate force-pushed the codex/automation-4-mcp-reconnect-20260609 branch from b9a2502 to 7985fdd Compare June 9, 2026 02:22
@dosubot dosubot Bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Jun 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:core The bug / feature is about astrbot's core, backend size:M This PR changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]重启MCP服务器后,AstrBot一直报错Session terminated

1 participant