Ride out daemon restarts in MCP daemon clients#179
Conversation
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b9e6a2a687
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
`tracedecay update` restarts the daemon service; between the old daemon unlinking its socket and the new one binding it, client connects fail with NotFound/ConnectionRefused. Retry transient connect errors for up to 8s (200ms poll) in the serve stdio proxy and CLI call_tool paths so live MCP sessions survive a self-update. Non-transient errors still fail fast, and retries only happen before any bytes are written. Also drop the listener and unlink the socket before draining engine state on shutdown, so clients connecting during the drain window get a retryable error instead of a connection that is never served, and add restart-aware hints to connect/close errors.
1788694 to
2410265
Compare
Summary
tracedecay updatereplaces the binary and (viapost-update) restarts the daemon service withsystemctl --user restart tracedecay.service. The old daemon unlinks its socket on shutdown, and the new one rebinds it a moment later. Because Cursor'stracedecay servestdio proxy opens a fresh socket connection per request (send_daemon_request_line), any MCP request landing in that restart window failed instantly with a hard JSON-RPC error (No such file or directory/Connection refused) even though the daemon was back milliseconds later.This PR makes live MCP sessions ride out a self-update without breaking:
connect_with_restart_grace: retries transient connect failures (NotFound,ConnectionRefused) for up to 8s (200ms poll) before erroring. Retrying at connect time is safe — nothing has been written yet, so requests can't be duplicated. Non-transient errors (e.g. permission denied) still fail immediately.servestdio proxy (send_daemon_request_line) and the CLI daemon tool-call path (call_tool).tracedecay update, and the mid-request "daemon closed the connection" error now says the daemon may have restarted and the request can be retried.Not changed (assessment): the daemon restart itself already works —
tracedecay updateruns the new binary'spost-update, which rewrites the systemd unit and restarts the service, and the per-request reconnect design means the proxy transparently talks to the new daemon afterwards. The oldserveprocess keeps running the old binary as a thin line proxy, which is compatible since all MCP logic lives daemon-side.Test plan
src/daemon.rs:connect_with_restart_grace_reconnects_once_daemon_rebinds— socket absent at first, daemon binds 300ms later, connect succeeds.connect_with_restart_grace_gives_up_with_restart_hint— no daemon ever binds; error names the socket and hints at the update/restart cause.proxied_request_survives_daemon_restart_window— fullsend_daemon_request_lineround trip against a fake daemon that only starts listening mid-request.transient_daemon_connect_errors_cover_restart_window_only— error-kind classification.cargo test --lib daemon::(29 passed)cargo test --test mcp_cli_serve_test --test tool_daemon_test(31 passed)cargo clippy --all-targetsandcargo fmt --checkclean