fix: resilient worker accept loop — survive client disconnects #80

Open
cjchanh wants to merge 1 commit into evilsocket:main from cjchanh:fix/accept-loop-resilience

cjchanh commented Apr 11, 2026

Summary

The worker accept loop (Worker::run) silently exits on any accept() error, permanently killing the TCP listener. This causes a "one connection works, then dead worker" failure mode.

On iOS this is particularly severe: the first master disconnect (broken pipe, network hiccup, master restart) leaves the worker completely unreachable until the app is force-killed and relaunched.

Changes

  • Hold TcpListener in Option for in-place rebinding
  • Classify transient errors (ECONNABORTED, EINTR, TimedOut, WouldBlock) and retry immediately
  • On fatal accept errors, drop and rebind the listener on the same port without reloading model weights
  • Log listener fd and local address before each accept cycle for diagnostics
  • Add describe_listener() helper for cross-platform fd reporting
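
For illustration, the transient-error classification listed above could look roughly like the sketch below. It assumes the errno names map onto `std::io::ErrorKind` variants (ECONNABORTED to `ConnectionAborted`, EINTR to `Interrupted`); the function name `is_transient_accept_error` is hypothetical and not taken from the PR diff.

```rust
use std::io::{Error, ErrorKind};

/// Hypothetical helper: returns true for accept() errors that are worth
/// retrying on the same listener instead of tearing it down.
fn is_transient_accept_error(err: &Error) -> bool {
    matches!(
        err.kind(),
        ErrorKind::ConnectionAborted // ECONNABORTED: peer gave up before accept()
            | ErrorKind::Interrupted // EINTR: syscall interrupted by a signal
            | ErrorKind::TimedOut
            | ErrorKind::WouldBlock
    )
}
```

Anything that does not match is treated as fatal for the current listener, which is the case the rebind path is meant to handle.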

Testing

  • `cargo test -p cake-core --lib`: 589 passed, 0 failed
  • Live iOS worker (iPad Air M3, iPadOS 26.2): 20 sequential inference requests over 6 minutes with zero drops or reconnections needed
  • Before this fix: worker died after first master disconnect, every time

Diagnostic log lines (after fix)

worker accept loop awaiting master on 0.0.0.0:10128 fd=9
[10.0.0.203:52341] connection loop ended: broken pipe
worker accept loop awaiting master on 0.0.0.0:10128 fd=9   ← re-enters accept
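
A minimal sketch of how a helper like `describe_listener()` might produce the `addr fd=N` portion of these log lines. It assumes a `std::net::TcpListener`; the real worker may use an async listener type, and the body and signature here are guesses for illustration, not the code in the PR.

```rust
use std::net::TcpListener;

/// Hypothetical sketch: format "<local addr> fd=<n>" for diagnostics.
/// On Windows the OS handle is a socket rather than a file descriptor.
fn describe_listener(listener: &TcpListener) -> String {
    let addr = listener
        .local_addr()
        .map(|a| a.to_string())
        .unwrap_or_else(|_| "<unknown>".to_string());

    #[cfg(unix)]
    let handle = {
        use std::os::unix::io::AsRawFd;
        format!("fd={}", listener.as_raw_fd())
    };
    #[cfg(windows)]
    let handle = {
        use std::os::windows::io::AsRawSocket;
        format!("socket={}", listener.as_raw_socket())
    };
    #[cfg(not(any(unix, windows)))]
    let handle = String::from("fd=?");

    format!("{addr} {handle}")
}
```

Logging the descriptor alongside the address makes it possible to tell from the logs whether the loop re-entered `accept()` on the same socket or on a freshly bound one.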

Recovery path (fatal accept error):

accept failed on 0.0.0.0:10128; dropping listener and rebinding ...
worker listener bound on 0.0.0.0:10128 fd=12
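
The recovery path above can be sketched as follows. This is a simplified, blocking version built on `std::net`, written under assumptions: the `Worker` fields, `ensure_listener`, `accept_one`, and `is_transient` are hypothetical names, and the actual worker may be async and structured differently.

```rust
use std::io;
use std::net::{SocketAddr, TcpListener, TcpStream};

/// Hypothetical classifier for errors that should be retried in place.
fn is_transient(err: &io::Error) -> bool {
    matches!(
        err.kind(),
        io::ErrorKind::ConnectionAborted
            | io::ErrorKind::Interrupted
            | io::ErrorKind::TimedOut
            | io::ErrorKind::WouldBlock
    )
}

/// Hypothetical worker shape: the listener lives in an Option so it can be
/// dropped and replaced without touching any other state (e.g. loaded weights).
struct Worker {
    addr: SocketAddr,
    listener: Option<TcpListener>,
}

impl Worker {
    /// Return the current listener, binding a fresh one if the slot is empty.
    fn ensure_listener(&mut self) -> io::Result<&TcpListener> {
        if self.listener.is_none() {
            let bound = TcpListener::bind(self.addr)?;
            // Remember the concrete address so a later rebind reuses the same port.
            self.addr = bound.local_addr()?;
            self.listener = Some(bound);
        }
        Ok(self.listener.as_ref().expect("listener was just bound"))
    }

    /// Accept one connection, retrying transient errors and rebinding on fatal ones.
    fn accept_one(&mut self) -> io::Result<(TcpStream, SocketAddr)> {
        loop {
            let result = self.ensure_listener()?.accept();
            match result {
                Ok(conn) => return Ok(conn),
                Err(e) if is_transient(&e) => continue,
                Err(e) => {
                    eprintln!(
                        "accept failed on {}; dropping listener and rebinding ({e})",
                        self.addr
                    );
                    // Dropping the old listener closes its socket; the next
                    // loop iteration binds a new one on the same address.
                    self.listener = None;
                }
            }
        }
    }
}
```

In this sketch a failed rebind is returned to the caller rather than retried, and there is no backoff between attempts; whether the PR does either is not stated in the description.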

Fixes #79

The worker accept loop (`Worker::run`) previously used
`while let Ok(...) = self.listener.accept()` which silently exited on
any accept error, permanently killing the TCP listener. This caused a
"one connection works, then dead worker" failure mode, particularly
severe on iOS where the first master disconnect left the worker
unreachable until a full app restart.
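
To make the failure mode concrete, here is a small simulation of the two loop shapes. It does not use real sockets: a `Vec` of results stands in for successive `accept()` calls, and all names are hypothetical.

```rust
use std::io::{self, Error, ErrorKind};

/// Simulated sequence of accept() outcomes: a connection, an error, a connection.
fn sample_events() -> Vec<io::Result<u32>> {
    vec![Ok(1), Err(Error::from(ErrorKind::ConnectionAborted)), Ok(2)]
}

/// Old shape, analogous to `while let Ok(..) = listener.accept()`:
/// the first Err ends the loop silently, so connection 2 is never served.
fn serve_while_let(events: Vec<io::Result<u32>>) -> Vec<u32> {
    let mut served = Vec::new();
    let mut it = events.into_iter();
    while let Some(Ok(conn)) = it.next() {
        served.push(conn);
    }
    served
}

/// Resilient shape: errors are logged and the loop keeps accepting.
fn serve_resilient(events: Vec<io::Result<u32>>) -> Vec<u32> {
    let mut served = Vec::new();
    for event in events {
        match event {
            Ok(conn) => served.push(conn),
            Err(e) => eprintln!("accept failed: {e}; continuing"),
        }
    }
    served
}
```

With the sample input, `serve_while_let` serves only connection 1, while `serve_resilient` serves both 1 and 2.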

Changes:
- Hold TcpListener in Option for in-place rebinding
- Classify transient errors (ECONNABORTED, EINTR, TimedOut, WouldBlock)
  and retry immediately instead of exiting
- On fatal accept errors, drop and rebind the listener on the same port
  without reloading model weights
- Log listener fd and local address before each accept cycle for
  diagnostics
- Add describe_listener() helper for cross-platform fd reporting

Tested: 20 sequential inference requests over 6 minutes on an iOS
worker (iPad M3) with zero drops or reconnections needed.

Fixes evilsocket#79
Development

Successfully merging this pull request may close these issues.

iOS worker: TCP listener accepts SYN probes but refuses full connect() from master
