
fix: issue 842 pool crash recovery #1946

Open

eastwood-c wants to merge 2 commits into unclecode:develop from eastwood-c:bugfix/issue-842-pool-crash-recovery


Conversation


eastwood-c commented Apr 28, 2026

Fixes #842
Refs #997, #1326, #1842

Problem

The Docker API's tiered browser pool (PERMANENT / HOT / COLD) has no health checking.
Once Chromium dies from OOM or /dev/shm exhaustion, the pool keeps returning the
dead AsyncWebCrawler reference. Every subsequent request then fails with:

BrowserContext.new_page: Connection closed while reading from the driver

or:

Page.content: Target page, context or browser has been closed

This continues indefinitely until the container is manually restarted.

In a Docker deployment (12 GB memory, MAX_CONCURRENT_TASKS=3) the permanent browser
dies after several hours of sustained crawling. When it does, all concurrent requests
enter get_crawler() and receive the same dead reference — yielding cascading timeouts:

WARN  Crawl4AI timeout on first attempt — retrying once (query: "lorem ipsum news")
WARN  Crawl4AI timeout on first attempt — retrying once (query: "lorem ipsum report")
WARN  Crawl4AI timeout on first attempt — retrying once (query: "lorem ipsum analysis")
...  (8 concurrent queries all fail simultaneously)

Root Cause

Commit d8cbeff (fixing #842) reset BrowserManager._playwright_instance = None in the
close() path, which handles intentional close + reopen. It does nothing for the
production scenario where Chromium is killed unexpectedly while the pool holds a live
Python reference to the dead browser.
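
For context, the earlier fix amounts to roughly the following; this is an illustrative sketch, not the exact d8cbeff diff:

```python
class BrowserManager:
    _playwright_instance = None  # cached Playwright handle shared across launches

    async def close(self):
        # Intentional shutdown: close the browser, then drop the cached
        # Playwright instance so a later start() re-launches cleanly.
        if self.browser is not None:
            await self.browser.close()
        BrowserManager._playwright_instance = None
        # A crashed Chromium never reaches this path, so the pool keeps
        # handing out the stale reference, which is the bug this PR fixes.
```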

get_crawler() never checks whether a browser is actually responsive before returning it.

Solution

Three-layer crash recovery, all in crawler_pool.py:

1. _is_crawler_alive(crawler) — liveness detection

Inspects Playwright's internal state without a live probe (we hold the pool LOCK and
a probe could hang).

browser.is_connected() is updated via an async callback chain
(Connection.close() → Browser._on_close()). It can return True for several
event-loop cycles after the OS pipe has already closed, creating a window in which a dead
browser passes the liveness check.

Instead, we check _channel._connection._transport directly. This reference is
cleared to None synchronously inside Connection.close() the instant asyncio fires
connection_lost(), giving an immediate signal that is independent of the async
callback backlog. Both cases are sketched after the list below.

  • Standard browsers: browser._channel._connection._transport is not None (fast path),
    with browser.is_connected() as fallback
  • Persistent contexts: context._impl_obj._connection._transport is not None
    (persistent contexts don't expose is_connected())
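
A minimal sketch of the check; the crawler_strategy/browser_manager traversal and the default_context attribute name are assumptions about crawl4ai internals:

```python
def _is_crawler_alive(crawler) -> bool:
    try:
        mgr = crawler.crawler_strategy.browser_manager
        browser = getattr(mgr, "browser", None)
        if browser is not None:
            # Fast path: the transport is cleared synchronously inside
            # Connection.close(), so None here means the pipe is gone even
            # if is_connected() has not caught up yet.
            if browser._channel._connection._transport is None:
                return False
            return browser.is_connected()  # fallback signal
        context = getattr(mgr, "default_context", None)  # attribute name assumed
        if context is not None:
            # Persistent contexts don't expose is_connected().
            return context._impl_obj._connection._transport is not None
        return False  # nothing to check: treat as dead
    except Exception:
        return False  # any introspection failure counts as dead
```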

2. On-demand recovery in get_crawler()

Before returning any pooled browser, get_crawler() calls _is_crawler_alive(). If the browser is dead (see the sketch after this list):

  • Permanent pool → _replace_permanent(): closes dead crawler, resets
    BrowserManager._playwright_instance, starts fresh instance using the
    BrowserConfig saved in _PERMANENT_CFG during init_permanent()
  • Hot/Cold pools → _replace_pooled(): same for non-default configs
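
A condensed sketch of that flow; the pool globals and the _matches_permanent/_sig helpers here are assumptions standing in for crawler_pool.py's actual structure:

```python
import asyncio

LOCK = asyncio.Lock()
PERMANENT = None        # permanent-tier crawler
_PERMANENT_CFG = None   # BrowserConfig captured in init_permanent()
POOL = {}               # hot/cold entries keyed by config signature

async def get_crawler(cfg):
    global PERMANENT
    async with LOCK:
        # Permanent tier: replaced in place so the module-level reference
        # stays valid for all callers.
        if PERMANENT is not None and _matches_permanent(cfg):  # hypothetical helper
            if not _is_crawler_alive(PERMANENT):
                await _replace_permanent()  # rebuilds from _PERMANENT_CFG
            return PERMANENT
        # Hot/cold tiers: a dead entry is closed and replaced before handout.
        key = _sig(cfg)  # hypothetical signature helper
        crawler = POOL.get(key)
        if crawler is not None and not _is_crawler_alive(crawler):
            crawler = await _replace_pooled(key, cfg)
        return crawler
```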

3. Proactive janitor health sweep

The janitor() loop now scans all three tiers before its idle-timeout cleanup. Dead
hot/cold entries are evicted; a dead permanent browser triggers _replace_permanent(). This
ensures the next get_crawler() receives a healthy browser even when no request has
arrived to trigger on-demand recovery first.
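
Roughly, using the same assumed globals as the get_crawler() sketch above:

```python
async def janitor():
    while True:
        async with LOCK:
            # Permanent tier is never evicted, only replaced in place.
            if PERMANENT is not None and not _is_crawler_alive(PERMANENT):
                await _replace_permanent(reason="dead")
            # Dead hot/cold entries are closed and dropped; the next
            # get_crawler() for that config starts a fresh browser.
            for key, crawler in list(POOL.items()):
                if not _is_crawler_alive(crawler):
                    await _close_and_kill(crawler)  # helper named in the tests
                    POOL.pop(key, None)
        # The pre-existing idle-timeout cleanup runs after this sweep (omitted).
        await asyncio.sleep(60)  # sweep interval is an assumption
```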

Summary

Three helper functions added (_is_crawler_alive, _replace_permanent,
_replace_pooled) and one module-level variable (_PERMANENT_CFG). Existing functions
get_crawler, init_permanent, and janitor each receive targeted changes to call
these helpers.

_is_crawler_alive uses _channel._connection._transport as the primary dead-browser
signal for regular browsers rather than is_connected(), which lags behind the actual
pipe close due to async callback scheduling. The transport reference is cleared
synchronously and is therefore reliable across all timing scenarios.

List of files changed and why

| File | Reason |
| --- | --- |
| deploy/docker/crawler_pool.py | All crash-recovery logic is contained here. No public API changes, no new dependencies. |

How Has This Been Tested?

All tests pass:

# uv run uses the global pytest (wrong Python); run via the venv directly:
$ uv pip install pytest pytest-asyncio   # once after uv sync
$ .venv/bin/python -m pytest \
    tests/docker/test_pool_release.py \
    tests/docker/test_crash_recovery.py \
    tests/test_issue_1842_browser_none.py -v
49 passed in 0.55s
| Test file | Tests | What it covers |
| --- | --- | --- |
| tests/docker/test_pool_release.py | 10 | release_crawler() and active_requests lifecycle |
| tests/docker/test_crash_recovery.py | 26 | _is_crawler_alive, _replace_permanent, _replace_pooled, get_crawler() recovery paths, janitor() health sweep |
| tests/test_issue_1842_browser_none.py | 13 | BrowserManager.create_browser_context() guard when self.browser is None |

tests/docker/test_crash_recovery.py is new in this PR — it directly tests every function added to crawler_pool.py, including the critical transport=None override of is_connected() that is the core of the fix (see test_transport_none_overrides_is_connected).
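
That test is shaped roughly like the following sketch, using stand-in objects; the real file may build its fakes differently:

```python
import types

from deploy.docker.crawler_pool import _is_crawler_alive  # import path illustrative

def test_transport_none_overrides_is_connected():
    # is_connected() still reports True (async callbacks lag behind),
    # but the transport was already cleared synchronously at pipe close.
    browser = types.SimpleNamespace(
        is_connected=lambda: True,
        _channel=types.SimpleNamespace(
            _connection=types.SimpleNamespace(_transport=None)
        ),
    )
    crawler = types.SimpleNamespace(
        crawler_strategy=types.SimpleNamespace(
            browser_manager=types.SimpleNamespace(browser=browser)
        )
    )
    # The synchronous transport signal must win over the stale flag.
    assert _is_crawler_alive(crawler) is False
```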

Docker build:

docker compose build crawl4ai 

In-container integration (run inside the container):

# Normal path
await init_permanent(cfg)        # OK
crawler = await get_crawler(cfg) # active_requests=1
_is_crawler_alive(crawler)       # True
await release_crawler(crawler)   # active_requests=0

# Crash recovery path
await browser.close()            # simulate OOM kill
_is_crawler_alive(PERMANENT)     # False  ← correctly detected
crawler = await get_crawler(cfg) # auto-recovered, new browser alive, active_requests=1

Production deployment:

This patch has been running in a production Docker deployment since the fix was applied and continues to be monitored. During that period a timing edge case was identified — browser.is_connected() lags behind the actual OS pipe close event — and addressed by switching to a direct transport check (_channel._connection._transport), which is cleared synchronously when the pipe closes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added/updated unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Tests for the three-layer crash recovery mechanism added in the
bugfix/issue-842-pool-crash-recovery PR.

TestIsCrawlerAlive (12 tests):
  - Standard browser: alive/dead via transport reference, is_connected() lag,
    transport=None override
  - Persistent context: alive/dead paths
  - Edge cases: missing strategy/browser_manager/browser, exception safety

TestReplacePermanent (4 tests):
  - Closes old browser via _close_and_kill
  - Creates new browser with the config saved at init_permanent() time
  - Updates PERMANENT global and _PERMANENT_STARTED_AT

TestReplacePooled (2 tests):
  - Closes and replaces a dead hot-pool entry
  - New browser starts with active_requests=1

TestGetCrawlerRecovery (4 tests):
  - Dead permanent/hot/cold → respective replace helper is called
  - Alive permanent → no replacement triggered

TestJanitorHealthSweep (4 tests):
  - Dead permanent → _replace_permanent called with reason='dead'
  - Dead hot/cold → evicted and killed
  - Alive hot → left untouched

49 total tests across all three test files pass.

Refs unclecode#842
eastwood-c changed the title from "Bugfix/issue 842 pool crash recovery" to "fix: issue 842 pool crash recovery" on Apr 28, 2026