fix: resilient worker accept loop — survive client disconnects by cjchanh · Pull Request #80 · evilsocket/cake

cjchanh · 2026-04-11T00:09:39Z

Summary

The worker accept loop (Worker::run) silently exits on any accept() error, permanently killing the TCP listener. This causes a "one connection works, then dead worker" failure mode.

On iOS this is particularly severe: the first master disconnect (broken pipe, network hiccup, master restart) leaves the worker completely unreachable until the app is force-killed and relaunched.
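
The commit message for this PR describes the previous loop as `while let Ok(...) = self.listener.accept()`. The sketch below illustrates why that pattern stops serving after the first error. It is not the project's code: `serve_until_first_error` and the scripted `accept` closure are hypothetical stand-ins that replay a fixed sequence of accept results instead of touching a real socket.

```rust
use std::collections::VecDeque;
use std::io;

/// Replays a scripted sequence of accept results through a `while let Ok(..)`
/// loop and returns (connections served, connections never accepted).
/// Hypothetical helper for illustration only.
fn serve_until_first_error(mut outcomes: VecDeque<io::Result<u32>>) -> (usize, usize) {
    let mut served = 0;
    // Stand-in for `self.listener.accept()`; an exhausted script counts as an error.
    let mut accept = || {
        outcomes
            .pop_front()
            .unwrap_or_else(|| Err(io::ErrorKind::Other.into()))
    };
    // The pattern match fails on the first `Err`, so the loop ends for good,
    // even if later calls to accept would have succeeded.
    while let Ok(_conn) = accept() {
        served += 1;
    }
    (served, outcomes.len())
}
```

With a script of one success, one `ConnectionAborted` error, and two further successes, the loop serves the first connection and never reaches the last two, which matches the "one connection works, then dead worker" symptom.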

Changes

- Hold `TcpListener` in `Option` for in-place rebinding
- Classify transient errors (ECONNABORTED, EINTR, TimedOut, WouldBlock) and retry immediately
- On fatal accept errors, drop and rebind the listener on the same port without reloading model weights
- Log listener fd and local address before each accept cycle for diagnostics
- Add `describe_listener()` helper for cross-platform fd reporting
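
The first three changes could look roughly like the sketch below. This is a minimal illustration under stated assumptions, not the code in this PR: it uses the blocking `std::net::TcpListener` to stay self-contained (the real worker may use an async listener), and `AcceptLoop`, `accept_one`, and `is_transient_accept_error` are hypothetical names.

```rust
use std::io::{self, ErrorKind};
use std::net::{SocketAddr, TcpListener, TcpStream};

/// Returns true for accept errors that are worth retrying on the same listener.
/// The set mirrors the kinds listed in the PR description.
fn is_transient_accept_error(err: &io::Error) -> bool {
    matches!(
        err.kind(),
        ErrorKind::ConnectionAborted // ECONNABORTED
            | ErrorKind::Interrupted // EINTR
            | ErrorKind::TimedOut
            | ErrorKind::WouldBlock
    )
}

/// Hypothetical holder: the listener lives in an `Option` so it can be
/// dropped and replaced without rebuilding anything else (e.g. model weights).
struct AcceptLoop {
    listener: Option<TcpListener>,
    addr: SocketAddr,
}

impl AcceptLoop {
    fn bind(addr: SocketAddr) -> io::Result<Self> {
        let listener = TcpListener::bind(addr)?;
        // Record the resolved address so a rebind reuses the same port.
        let addr = listener.local_addr()?;
        Ok(Self { listener: Some(listener), addr })
    }

    /// Blocks until one connection is accepted, retrying transient errors
    /// and rebinding after fatal ones. Returns an error only if rebinding fails.
    fn accept_one(&mut self) -> io::Result<(TcpStream, SocketAddr)> {
        loop {
            if self.listener.is_none() {
                self.listener = Some(TcpListener::bind(self.addr)?);
            }
            let result = self
                .listener
                .as_ref()
                .expect("listener was bound above")
                .accept();
            match result {
                Ok(pair) => return Ok(pair),
                // Transient: keep the socket and try again immediately.
                Err(e) if is_transient_accept_error(&e) => continue,
                // Fatal: drop the socket; the next iteration rebinds.
                Err(_) => self.listener = None,
            }
        }
    }
}
```

A caller would wrap `accept_one` in an outer loop and handle each connection in turn. Production code would probably also want a short back-off between retries so a persistent error cannot spin the CPU; that is omitted here for brevity.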

Testing

- cargo test -p cake-core --lib: 589 passed, 0 failed
- Live iOS worker (iPad Air M3, iPadOS 26.2): 20 sequential inference requests over 6 minutes with zero drops or reconnections needed
- Before this fix: worker died after first master disconnect, every time

Diagnostic log lines (after fix)

worker accept loop awaiting master on 0.0.0.0:10128 fd=9
[10.0.0.203:52341] connection loop ended: broken pipe
worker accept loop awaiting master on 0.0.0.0:10128 fd=9   ← re-enters accept

Recovery path (fatal accept error):

accept failed on 0.0.0.0:10128; dropping listener and rebinding ...
worker listener bound on 0.0.0.0:10128 fd=12
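
The `fd=` field in these log lines comes from the `describe_listener()` helper mentioned under Changes. Its signature is not shown in this description, so the following is only a guess at one possible shape, assuming a blocking `std::net::TcpListener`; `raw_handle` is a hypothetical helper that hides the platform difference.

```rust
use std::net::TcpListener;

/// Hypothetical helper: formats the OS-level handle of the listener.
#[cfg(unix)]
fn raw_handle(listener: &TcpListener) -> String {
    use std::os::unix::io::AsRawFd;
    format!("fd={}", listener.as_raw_fd())
}

/// On Windows a socket is identified by a SOCKET handle rather than an fd.
#[cfg(windows)]
fn raw_handle(listener: &TcpListener) -> String {
    use std::os::windows::io::AsRawSocket;
    format!("socket={}", listener.as_raw_socket())
}

/// Fallback for targets that expose neither API.
#[cfg(not(any(unix, windows)))]
fn raw_handle(_listener: &TcpListener) -> String {
    "fd=unknown".to_string()
}

/// Assumed shape of `describe_listener`: "<local address> fd=<n>".
fn describe_listener(listener: &TcpListener) -> String {
    let addr = listener
        .local_addr()
        .map(|a| a.to_string())
        .unwrap_or_else(|_| "<unknown>".to_string());
    format!("{} {}", addr, raw_handle(listener))
}
```

Logging the handle alongside the address makes a rebind visible in the logs, since the new listener normally gets a different descriptor (fd=9 versus fd=12 in the lines above).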

Fixes #79

The worker accept loop (`Worker::run`) previously used `while let Ok(...) = self.listener.accept()` which silently exited on any accept error, permanently killing the TCP listener. This caused a "one connection works, then dead worker" failure mode, particularly severe on iOS where the first master disconnect left the worker unreachable until a full app restart.

Changes:
- Hold TcpListener in Option for in-place rebinding
- Classify transient errors (ECONNABORTED, EINTR, TimedOut, WouldBlock) and retry immediately instead of exiting
- On fatal accept errors, drop and rebind the listener on the same port without reloading model weights
- Log listener fd and local address before each accept cycle for diagnostics
- Add describe_listener() helper for cross-platform fd reporting

Tested: 20 sequential inference requests over 6 minutes on an iOS worker (iPad M3) with zero drops or reconnections needed.

Fixes evilsocket#79

cjchanh wants to merge 1 commit into evilsocket:main from cjchanh:fix/accept-loop-resilience
