fix: issue 842 pool crash recovery#1946
Open
eastwood-c wants to merge 2 commits into unclecode:develop from
Conversation
Tests for the three-layer crash recovery mechanism added in the
bugfix/issue-842-pool-crash-recovery PR.
TestIsCrawlerAlive (12 tests):
- Standard browser: alive/dead via transport reference, is_connected() lag,
transport=None override
- Persistent context: alive/dead paths
- Edge cases: missing strategy/browser_manager/browser, exception safety
TestReplacePermanent (4 tests):
- Closes old browser via _close_and_kill
- Creates new browser with the config saved at init_permanent() time
- Updates PERMANENT global and _PERMANENT_STARTED_AT
TestReplacePooled (2 tests):
- Closes and replaces a dead hot-pool entry
- New browser starts with active_requests=1
TestGetCrawlerRecovery (4 tests):
- Dead permanent/hot/cold → respective replace helper is called
- Alive permanent → no replacement triggered
TestJanitorHealthSweep (4 tests):
- Dead permanent → _replace_permanent called with reason='dead'
- Dead hot/cold → evicted and killed
- Alive hot → left untouched
49 total tests across all three test files pass.
Refs unclecode#842
Fixes #842
Refs #997, #1326, #1842
Problem
The Docker API's tiered browser pool (PERMANENT / HOT / COLD) performs no health checks.
Once Chromium dies from OOM or `/dev/shm` exhaustion, the pool continues returning the dead `AsyncWebCrawler` reference, and every subsequent request fails. This continues indefinitely until the container is manually restarted.
In a Docker deployment (12 GB memory, `MAX_CONCURRENT_TASKS=3`) the permanent browser dies after several hours of sustained crawling. When it does, all concurrent requests enter `get_crawler()` and receive the same dead reference — yielding cascading timeouts.

Root Cause
Commit `d8cbeff` (fixing #842) reset `BrowserManager._playwright_instance = None` in the `close()` path, which handles intentional close + reopen. It does nothing for the production scenario where Chromium is killed unexpectedly while the pool holds a live Python reference to the dead browser. `get_crawler()` never checks whether a browser is actually responsive before returning it.

Solution
Three-layer crash recovery, all in `crawler_pool.py`:

1. `_is_crawler_alive(crawler)` — liveness detection

Inspects Playwright's internal state without a live probe (we hold the pool `LOCK`, and a probe could hang). `browser.is_connected()` is updated via an async callback chain (`Connection.close()` → `Browser._on_close()`). It can return `True` for several event-loop cycles after the OS pipe has already closed, creating a window where a dead browser passes the liveness check.

Instead, we check `_channel._connection._transport` directly. This reference is cleared to `None` synchronously inside `Connection.close()` the instant asyncio fires `connection_lost()` — giving an immediate signal independent of the async callback backlog.

- Standard browsers: `browser._channel._connection._transport is not None` (fast path), with `browser.is_connected()` as fallback
- Persistent contexts: `context._impl_obj._connection._transport is not None` (persistent contexts don't expose `is_connected()`)

2. On-demand recovery in `get_crawler()`

Before returning any pooled browser, `get_crawler()` checks `_is_crawler_alive()`. If dead:

- `_replace_permanent()`: closes the dead crawler, resets `BrowserManager._playwright_instance`, and starts a fresh instance using the `BrowserConfig` saved in `_PERMANENT_CFG` during `init_permanent()`
- `_replace_pooled()`: the same for non-default configs

3. Proactive janitor health sweep

The `janitor()` loop now scans all three tiers before idle-timeout cleanup. Dead hot/cold entries are evicted; a dead permanent triggers `_replace_permanent()`. This ensures the next `get_crawler()` gets a healthy browser even without a request arriving first.
Summary
Three helper functions added (`_is_crawler_alive`, `_replace_permanent`, `_replace_pooled`) and one module-level variable (`_PERMANENT_CFG`). Existing functions `get_crawler`, `init_permanent`, and `janitor` each receive targeted changes to call these helpers.

`_is_crawler_alive` uses `_channel._connection._transport` as the primary dead-browser signal for regular browsers rather than `is_connected()`, which lags behind the actual pipe close due to async callback scheduling. The transport reference is cleared synchronously and is therefore reliable across all timing scenarios.
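The on-demand replacement flow for the permanent tier can be sketched like this. The names `PERMANENT`, `_PERMANENT_CFG`, `_PERMANENT_STARTED_AT`, `_close_and_kill`, and `_replace_permanent` come from the PR; the bodies below use a stub crawler class and are an illustrative reconstruction, not the merged code:

```python
# Illustrative reconstruction of the on-demand recovery path; FakeCrawler
# and the helper bodies are stand-ins, not the merged crawler_pool.py code.
import asyncio
import time

PERMANENT = None
_PERMANENT_CFG = {"browser": "chromium"}   # stand-in for the saved BrowserConfig
_PERMANENT_STARTED_AT = 0.0
LOCK = asyncio.Lock()

class FakeCrawler:
    def __init__(self, cfg):
        self.cfg = cfg
        self.alive = True
        self.closed = False

def _is_crawler_alive(crawler) -> bool:
    # Stub for the transport-based check described above.
    return crawler is not None and crawler.alive

async def _close_and_kill(crawler):
    crawler.closed = True   # the real helper also kills the OS process

async def _replace_permanent(reason: str):
    """Close the dead permanent browser and start a fresh one from _PERMANENT_CFG."""
    global PERMANENT, _PERMANENT_STARTED_AT
    if PERMANENT is not None:
        await _close_and_kill(PERMANENT)
    PERMANENT = FakeCrawler(_PERMANENT_CFG)   # config captured at init_permanent() time
    _PERMANENT_STARTED_AT = time.monotonic()

async def get_crawler():
    """Never hand out a dead permanent browser."""
    async with LOCK:
        if not _is_crawler_alive(PERMANENT):
            await _replace_permanent(reason="dead")
        return PERMANENT

async def demo():
    global PERMANENT
    PERMANENT = FakeCrawler(_PERMANENT_CFG)
    first = await get_crawler()
    first.alive = False                 # simulate Chromium dying (OOM)
    second = await get_crawler()        # transparently replaced under LOCK
    return first, second
```

The check-and-replace happens inside the pool `LOCK`, so concurrent callers racing into `get_crawler()` after a crash trigger at most one replacement.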
List of files changed and why
- `deploy/docker/crawler_pool.py`

How Has This Been Tested?
All tests pass:
- `tests/docker/test_pool_release.py`: `release_crawler()` and `active_requests` lifecycle
- `tests/docker/test_crash_recovery.py`: `_is_crawler_alive`, `_replace_permanent`, `_replace_pooled`, `get_crawler()` recovery paths, `janitor()` health sweep
- `tests/test_issue_1842_browser_none.py`: `BrowserManager.create_browser_context()` guard when `self.browser` is `None`

`tests/docker/test_crash_recovery.py` is new in this PR — it directly tests every function added to `crawler_pool.py`, including the critical `transport=None` override of `is_connected()` that is the core of the fix (see `test_transport_none_overrides_is_connected`).

Docker build:
In-container integration (run inside the container):
Production deployment:
This patch has been running in a production Docker deployment since the fix was applied and continues to be monitored. During that period a timing edge case was identified — `browser.is_connected()` lags behind the actual OS pipe close event — and addressed by switching to a direct transport check (`_channel._connection._transport`), which is cleared synchronously when the pipe closes.

Checklist: