
fix: issue 842 pool crash recovery #1946

Open

eastwood-c wants to merge 2 commits into unclecode:develop from eastwood-c:bugfix/issue-842-pool-crash-recovery


Conversation


eastwood-c commented Apr 28, 2026

Fixes #842
Refs #997, #1326, #1842

Problem

The Docker API's tiered browser pool (PERMANENT / HOT / COLD) has no health checking.
Once Chromium dies from OOM or /dev/shm exhaustion, the pool keeps returning the
dead AsyncWebCrawler reference. Every subsequent request then fails with:

BrowserContext.new_page: Connection closed while reading from the driver

or:

Page.content: Target page, context or browser has been closed

This continues indefinitely until the container is manually restarted.

In a Docker deployment (12 GB memory, MAX_CONCURRENT_TASKS=3) the permanent browser
dies after several hours of sustained crawling. When it does, all concurrent requests
enter get_crawler() and receive the same dead reference — yielding cascading timeouts:

WARN  Crawl4AI timeout on first attempt — retrying once (query: "lorem ipsum news")
WARN  Crawl4AI timeout on first attempt — retrying once (query: "lorem ipsum report")
WARN  Crawl4AI timeout on first attempt — retrying once (query: "lorem ipsum analysis")
...  (8 concurrent queries all fail simultaneously)

Root Cause

Commit d8cbeff (fixing #842) reset BrowserManager._playwright_instance = None in the
close() path, which handles intentional close + reopen. It does nothing for the
production scenario where Chromium is killed unexpectedly while the pool holds a live
Python reference to the dead browser.
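
For context, the earlier fix amounts to roughly the following; this is an illustrative sketch, not the exact d8cbeff diff:

```python
class BrowserManager:
    _playwright_instance = None  # cached Playwright handle shared across launches

    async def close(self):
        # Intentional shutdown: close the browser, then drop the cached
        # Playwright instance so a later start() re-launches cleanly.
        if self.browser is not None:
            await self.browser.close()
        BrowserManager._playwright_instance = None
        # A crashed Chromium never reaches this path, so the pool keeps
        # handing out the stale reference, which is the bug this PR fixes.
```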

get_crawler() never checks whether a browser is actually responsive before returning it.

Solution

Three-layer crash recovery, all in crawler_pool.py:

1. _is_crawler_alive(crawler) — liveness detection

Inspects Playwright's internal state without a live probe (we hold the pool LOCK and
a probe could hang).

browser.is_connected() is updated via an async callback chain
(Connection.close() → Browser._on_close()). It can return True for several
event-loop cycles after the OS pipe has already closed, creating a window in which a dead
browser passes the liveness check.

Instead, we check _channel._connection._transport directly. This reference is
cleared to None synchronously inside Connection.close() the instant asyncio fires
connection_lost(), giving an immediate signal that is independent of the async
callback backlog. Both cases are sketched after the list below.

  • Standard browsers: browser._channel._connection._transport is not None (fast path),
    with browser.is_connected() as fallback
  • Persistent contexts: context._impl_obj._connection._transport is not None
    (persistent contexts don't expose is_connected())
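
A minimal sketch of the check; the crawler_strategy/browser_manager traversal and the default_context attribute name are assumptions about crawl4ai internals:

```python
def _is_crawler_alive(crawler) -> bool:
    try:
        mgr = crawler.crawler_strategy.browser_manager
        browser = getattr(mgr, "browser", None)
        if browser is not None:
            # Fast path: the transport is cleared synchronously inside
            # Connection.close(), so None here means the pipe is gone even
            # if is_connected() has not caught up yet.
            if browser._channel._connection._transport is None:
                return False
            return browser.is_connected()  # fallback signal
        context = getattr(mgr, "default_context", None)  # attribute name assumed
        if context is not None:
            # Persistent contexts don't expose is_connected().
            return context._impl_obj._connection._transport is not None
        return False  # nothing to check: treat as dead
    except Exception:
        return False  # any introspection failure counts as dead
```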

2. On-demand recovery in get_crawler()

Before returning any pooled browser, get_crawler() calls _is_crawler_alive(). If the browser is dead (see the sketch after this list):

  • Permanent pool → _replace_permanent(): closes dead crawler, resets
    BrowserManager._playwright_instance, starts fresh instance using the
    BrowserConfig saved in _PERMANENT_CFG during init_permanent()
  • Hot/Cold pools → _replace_pooled(): same for non-default configs
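
A condensed sketch of that flow; the pool globals and the _matches_permanent/_sig helpers here are assumptions standing in for crawler_pool.py's actual structure:

```python
import asyncio

LOCK = asyncio.Lock()
PERMANENT = None        # permanent-tier crawler
_PERMANENT_CFG = None   # BrowserConfig captured in init_permanent()
POOL = {}               # hot/cold entries keyed by config signature

async def get_crawler(cfg):
    global PERMANENT
    async with LOCK:
        # Permanent tier: replaced in place so the module-level reference
        # stays valid for all callers.
        if PERMANENT is not None and _matches_permanent(cfg):  # hypothetical helper
            if not _is_crawler_alive(PERMANENT):
                await _replace_permanent()  # rebuilds from _PERMANENT_CFG
            return PERMANENT
        # Hot/cold tiers: a dead entry is closed and replaced before handout.
        key = _sig(cfg)  # hypothetical signature helper
        crawler = POOL.get(key)
        if crawler is not None and not _is_crawler_alive(crawler):
            crawler = await _replace_pooled(key, cfg)
        return crawler
```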

3. Proactive janitor health sweep

The janitor() loop now scans all three tiers before its idle-timeout cleanup. Dead
hot/cold entries are evicted; a dead permanent browser triggers _replace_permanent(). This
ensures the next get_crawler() receives a healthy browser even when no request has
arrived to trigger on-demand recovery first.
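
Roughly, using the same assumed globals as the get_crawler() sketch above:

```python
async def janitor():
    while True:
        async with LOCK:
            # Permanent tier is never evicted, only replaced in place.
            if PERMANENT is not None and not _is_crawler_alive(PERMANENT):
                await _replace_permanent(reason="dead")
            # Dead hot/cold entries are closed and dropped; the next
            # get_crawler() for that config starts a fresh browser.
            for key, crawler in list(POOL.items()):
                if not _is_crawler_alive(crawler):
                    await _close_and_kill(crawler)  # helper named in the tests
                    POOL.pop(key, None)
        # The pre-existing idle-timeout cleanup runs after this sweep (omitted).
        await asyncio.sleep(60)  # sweep interval is an assumption
```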

Summary

Three helper functions added (_is_crawler_alive, _replace_permanent,
_replace_pooled) and one module-level variable (_PERMANENT_CFG). Existing functions
get_crawler, init_permanent, and janitor each receive targeted changes to call
these helpers.

_is_crawler_alive uses _channel._connection._transport as the primary dead-browser
signal for regular browsers rather than is_connected(), which lags behind the actual
pipe close due to async callback scheduling. The transport reference is cleared
synchronously and is therefore reliable across all timing scenarios.

List of files changed and why

| File | Reason |
| --- | --- |
| deploy/docker/crawler_pool.py | All crash-recovery logic is contained here. No public API changes, no new dependencies. |

How Has This Been Tested?

All tests pass:

# uv run uses the global pytest (wrong Python); run via the venv directly:
$ uv pip install pytest pytest-asyncio   # once after uv sync
$ .venv/bin/python -m pytest \
    tests/docker/test_pool_release.py \
    tests/docker/test_crash_recovery.py \
    tests/test_issue_1842_browser_none.py -v
49 passed in 0.55s
| Test file | Tests | What it covers |
| --- | --- | --- |
| tests/docker/test_pool_release.py | 10 | release_crawler() and active_requests lifecycle |
| tests/docker/test_crash_recovery.py | 26 | _is_crawler_alive, _replace_permanent, _replace_pooled, get_crawler() recovery paths, janitor() health sweep |
| tests/test_issue_1842_browser_none.py | 13 | BrowserManager.create_browser_context() guard when self.browser is None |

tests/docker/test_crash_recovery.py is new in this PR — it directly tests every function added to crawler_pool.py, including the critical transport=None override of is_connected() that is the core of the fix (see test_transport_none_overrides_is_connected).
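
That test is shaped roughly like the following sketch, using stand-in objects; the real file may build its fakes differently:

```python
import types

from deploy.docker.crawler_pool import _is_crawler_alive  # import path illustrative

def test_transport_none_overrides_is_connected():
    # is_connected() still reports True (async callbacks lag behind),
    # but the transport was already cleared synchronously at pipe close.
    browser = types.SimpleNamespace(
        is_connected=lambda: True,
        _channel=types.SimpleNamespace(
            _connection=types.SimpleNamespace(_transport=None)
        ),
    )
    crawler = types.SimpleNamespace(
        crawler_strategy=types.SimpleNamespace(
            browser_manager=types.SimpleNamespace(browser=browser)
        )
    )
    # The synchronous transport signal must win over the stale flag.
    assert _is_crawler_alive(crawler) is False
```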

Docker build:

docker compose build crawl4ai 

In-container integration (run inside the container):

# Normal path
await init_permanent(cfg)        # OK
crawler = await get_crawler(cfg) # active_requests=1
_is_crawler_alive(crawler)       # True
await release_crawler(crawler)   # active_requests=0

# Crash recovery path
await browser.close()            # simulate OOM kill
_is_crawler_alive(PERMANENT)     # False  ← correctly detected
crawler = await get_crawler(cfg) # auto-recovered, new browser alive, active_requests=1

Production deployment:

This patch has been running in a production Docker deployment since the fix was applied and continues to be monitored. During that period a timing edge case was identified — browser.is_connected() lags behind the actual OS pipe close event — and addressed by switching to a direct transport check (_channel._connection._transport), which is cleared synchronously when the pipe closes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added/updated unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Tests for the three-layer crash recovery mechanism added in the
bugfix/issue-842-pool-crash-recovery PR.

TestIsCrawlerAlive (12 tests):
  - Standard browser: alive/dead via transport reference, is_connected() lag,
    transport=None override
  - Persistent context: alive/dead paths
  - Edge cases: missing strategy/browser_manager/browser, exception safety

TestReplacePermanent (4 tests):
  - Closes old browser via _close_and_kill
  - Creates new browser with the config saved at init_permanent() time
  - Updates PERMANENT global and _PERMANENT_STARTED_AT

TestReplacePooled (2 tests):
  - Closes and replaces a dead hot-pool entry
  - New browser starts with active_requests=1

TestGetCrawlerRecovery (4 tests):
  - Dead permanent/hot/cold → respective replace helper is called
  - Alive permanent → no replacement triggered

TestJanitorHealthSweep (4 tests):
  - Dead permanent → _replace_permanent called with reason='dead'
  - Dead hot/cold → evicted and killed
  - Alive hot → left untouched

49 total tests across all three test files pass.

Refs unclecode#842
eastwood-c changed the title from "Bugfix/issue 842 pool crash recovery" to "fix: issue 842 pool crash recovery" on Apr 28, 2026