feat: cache retrieved docs and add PyPI package docs lookup #9
ayhammouda wants to merge 8 commits into main from fix/open-issues-cache-pypi-docs
Conversation
📝 Walkthrough
Adds persistent SQLite caching for retrieved docs and a PyPI package docs lookup.
Changes: Persistent Docs Cache and PyPI Package Lookup
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
The PR introduces two independent feature areas (persistent caching and PyPI lookup) with significant logic density across multiple services, database schema changes, subprocess-based integration tests, and careful transport/framing fixes. The changes span models, services, server wiring, retrieval logic, and comprehensive test suites with heterogeneous concerns (caching, HTTP fetching, SQLite operations, MCP protocol). While individual components are well-scoped, the interdependencies between cache/content/server and the breadth of test scenarios require careful attention to correctness and edge cases.
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 5
🧹 Nitpick comments (2)
tests/test_mcp_get_docs_cache_smoke.py (1)
10-16: ⚡ Quick win — Extract shared stdio helpers into a dedicated test utility module.
Importing private helpers from another test module tightly couples these tests and makes future refactors noisier. Move these helpers into a shared `tests` support module and have both smoke suites import from there.
Refactor sketch
```diff
-from tests.test_stdio_smoke import (
+from tests.support.stdio_helpers import (
     _assert_protocol_on_stdout_only,
     _find_response,
     _isolated_cache_env,
     _make_notification,
     _make_request,
 )
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_mcp_get_docs_cache_smoke.py` around lines 10 - 16, Move the private stdio helper functions into a shared test helpers module and update imports: create a new tests support module (e.g., tests.test_helpers or tests._helpers) and relocate the functions _assert_protocol_on_stdout_only, _find_response, _isolated_cache_env, _make_notification, and _make_request into that module; then change imports in this test (and in tests that currently import from tests.test_stdio_smoke) to import those helpers from the new module instead of importing them from _stdio_smoke. Ensure the new module exports the same helper names and update any relative import paths in affected tests to reference the new module.
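A transitional shim is one low-churn way to stage this refactor: create the shared module and have it re-export the existing helpers until their bodies are moved over. This is a sketch, not the PR's code; the module path `tests/support/stdio_helpers.py` follows the review's suggestion and the re-export step is an assumption about how to sequence the change.

```python
# tests/support/stdio_helpers.py -- hypothetical shared-helper module (sketch).
# Step 1: re-export the helpers so both smoke suites can switch their imports
# here immediately; step 2 moves the actual function bodies into this file.
from tests.test_stdio_smoke import (  # assumed current home of the private helpers
    _assert_protocol_on_stdout_only,
    _find_response,
    _isolated_cache_env,
    _make_notification,
    _make_request,
)

__all__ = [
    "_assert_protocol_on_stdout_only",
    "_find_response",
    "_isolated_cache_env",
    "_make_notification",
    "_make_request",
]
```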
PR9_MCP_TEST_PLAN.md (1)
23-26: ⚡ Quick win — Make setup steps environment-agnostic.
The hardcoded workspace path and branch-specific checkout make this plan hard to run outside your local/review environment. Prefer generic “repo root” instructions plus optional branch notes.
Proposed edit
```diff
-cd /srv/openclaw/.openclaw/workspace/tmp/python-docs-mcp-review
-git checkout fix/open-issues-cache-pypi-docs || git checkout review-pr-9
-git pull --ff-only origin fix/open-issues-cache-pypi-docs
+cd <repo-root>
+# optional: checkout the PR branch under test
+# git checkout fix/open-issues-cache-pypi-docs
+git pull --ff-only
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@PR9_MCP_TEST_PLAN.md` around lines 23 - 26, The setup steps in PR9_MCP_TEST_PLAN.md are environment-specific—replace the hardcoded path "cd /srv/openclaw/.openclaw/workspace/tmp/python-docs-mcp-review" with a generic repo-root instruction (e.g., "cd <repo-root> or open project root"), and change the branch checkout commands "git checkout fix/open-issues-cache-pypi-docs || git checkout review-pr-9" to suggest a generic branch instruction (e.g., "git fetch && git checkout <branch-or-PR-branch> (or use the PR branch name)") plus an optional note showing the exact branch names used for this review; keep "git pull --ff-only origin <branch>" and "uv sync --all-extras" but parametrize the branch placeholder so the plan is runnable in any environment.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@PR9_MCP_TEST_PLAN.md`:
- Line 42: Update the pytest summary line in PR9_MCP_TEST_PLAN.md that currently
reads "254 passed, 3 skipped" to reflect the PR's actual test totals "247
passed, 3 skipped" so the documented expected results match the reported CI/PR
summary and avoid false FAILs.
- Line 71: Change the headings and inline phrases to use hyphenated compound
adjectives for consistency: replace "Full page retrieval" with "Full-page
retrieval" and any occurrences of "page not found style response" or similar
with "page-not-found-style response" (check the heading at "### 2.1 Full page
retrieval" and the repeated occurrence around lines noted as 111-111) so all
compound adjectives in PR9_MCP_TEST_PLAN.md follow the hyphenated style.
In `@src/mcp_server_python_docs/retrieval/ranker.py`:
- Around line 83-99: The current _resolve_symbol_location logic returns empty
string "" for a missing anchor which makes get_docs treat it as a section hit;
change the return value to use None (or normalize blank anchors to None) when
there is no valid section anchor. Specifically, in the branch that checks
section_row and the final return, return doc_row["slug"], None (or convert
section_row["anchor"] or "" into None) instead of "", and apply the same change
in the other occurrences referenced (the other blocks around lines with similar
logic) so doc_row, fallback_anchor, section_row and _resolve_symbol_location
consistently return None for “no anchor.”
In `@src/mcp_server_python_docs/services/persistent_cache.py`:
- Around line 32-35: The shared sqlite connection (self._conn) is used with
check_same_thread=False but not protected against concurrent access; serialize
DB access by adding a threading.Lock (e.g., self._db_lock created where the
connection is initialized) and wrap every database operation (methods like
get(), put(), and any other places that call self._conn.execute/commit) inside a
with self._db_lock: block (or acquire/release) so execute()/commit() calls
cannot run concurrently; alternatively, change the implementation to open a
short-lived sqlite3.connect(...) per operation and close it after the
query—apply the same change for all DB use sites referenced around the
__init__/connection code and the blocks noted (lines ~66-99, 101-123).
In `@tests/test_stdio_smoke.py`:
- Around line 151-162: The helper _assert_protocol_on_stdout_only currently
verifies JSON-RPC frames but doesn't assert the process exit status; update this
function to check result.returncode == 0 (or use result.check_returncode()) and
assert a successful exit with a clear message so smoke tests fail if the server
exited non-zero; locate the function _assert_protocol_on_stdout_only and add a
return-code assertion after reading/parsing stdout/stderr responses.
---
Nitpick comments:
In `@PR9_MCP_TEST_PLAN.md`:
- Around line 23-26: The setup steps in PR9_MCP_TEST_PLAN.md are
environment-specific—replace the hardcoded path "cd
/srv/openclaw/.openclaw/workspace/tmp/python-docs-mcp-review" with a generic
repo-root instruction (e.g., "cd <repo-root> or open project root"), and change
the branch checkout commands "git checkout fix/open-issues-cache-pypi-docs ||
git checkout review-pr-9" to suggest a generic branch instruction (e.g., "git
fetch && git checkout <branch-or-PR-branch> (or use the PR branch name)") plus
an optional note showing the exact branch names used for this review; keep "git
pull --ff-only origin <branch>" and "uv sync --all-extras" but parametrize the
branch placeholder so the plan is runnable in any environment.
In `@tests/test_mcp_get_docs_cache_smoke.py`:
- Around line 10-16: Move the private stdio helper functions into a shared test
helpers module and update imports: create a new tests support module (e.g.,
tests.test_helpers or tests._helpers) and relocate the functions
_assert_protocol_on_stdout_only, _find_response, _isolated_cache_env,
_make_notification, and _make_request into that module; then change imports in
this test (and in tests that currently import from tests.test_stdio_smoke) to
import those helpers from the new module instead of importing them from
_stdio_smoke. Ensure the new module exports the same helper names and update any
relative import paths in affected tests to reference the new module.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 1968fe20-5be0-4725-92d0-2853ce1948b3
📒 Files selected for processing (17)
- PR9_MCP_TEST_PLAN.md
- PR9_SMOKE_FAILURE_FIX_PLAN.md
- README.md
- src/mcp_server_python_docs/__main__.py
- src/mcp_server_python_docs/app_context.py
- src/mcp_server_python_docs/models.py
- src/mcp_server_python_docs/retrieval/ranker.py
- src/mcp_server_python_docs/server.py
- src/mcp_server_python_docs/services/content.py
- src/mcp_server_python_docs/services/package_docs.py
- src/mcp_server_python_docs/services/persistent_cache.py
- tests/conftest.py
- tests/test_mcp_get_docs_cache_smoke.py
- tests/test_package_docs.py
- tests/test_persistent_docs_cache.py
- tests/test_services.py
- tests/test_stdio_smoke.py
```
- Ruff passes.
- Pyright passes for `src/`.
- Pytest passes: currently expected `254 passed, 3 skipped`.
```
Update the expected pytest totals to match this PR context.
Line 42 says 254 passed, 3 skipped, but the PR summary for this change set reports 247 passed, 3 skipped. This can cause false FAIL interpretation during manual validation.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@PR9_MCP_TEST_PLAN.md` at line 42, Update the pytest summary line in
PR9_MCP_TEST_PLAN.md that currently reads "254 passed, 3 skipped" to reflect the
PR's actual test totals "247 passed, 3 skipped" so the documented expected
results match the reported CI/PR summary and avoid false FAILs.
```
## Test Area 2 — `get_docs` Functional Retrieval

### 2.1 Full page retrieval
```
Use hyphenated compound adjectives for consistency.
Consider “Full-page retrieval” and “page-not-found-style response” for clearer technical writing.
Also applies to: 111-111
🧰 Tools
🪛 LanguageTool
[uncategorized] ~71-~71: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...get_docs` Functional Retrieval ### 2.1 Full page retrieval Call: ```text get_docs(slug...
(EN_COMPOUND_ADJECTIVE_INTERNAL)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@PR9_MCP_TEST_PLAN.md` at line 71, Change the headings and inline phrases to
use hyphenated compound adjectives for consistency: replace "Full page
retrieval" with "Full-page retrieval" and any occurrences of "page not found
style response" or similar with "page-not-found-style response" (check the
heading at "### 2.1 Full page retrieval" and the repeated occurrence around
lines noted as 111-111) so all compound adjectives in PR9_MCP_TEST_PLAN.md
follow the hyphenated style.
```python
    if doc_row is None:
        return fallback_slug, fallback_anchor

    if fallback_anchor:
        section_row = conn.execute(
            """
            SELECT anchor
            FROM sections
            WHERE document_id = ? AND anchor = ?
            LIMIT 1
            """,
            (doc_row["id"], fallback_anchor),
        ).fetchone()
        if section_row is not None:
            return doc_row["slug"], section_row["anchor"] or ""

    return doc_row["slug"], ""
```
Don't emit anchor="" for document-level symbol hits.
When _resolve_symbol_location() falls back to a page-only location, both symbol search paths surface an empty-string anchor. get_docs() treats any non-None anchor as a section lookup, so forwarding that hit verbatim turns a retrievable page hit into PageNotFoundError. Return None for “no anchor”, or normalize blank anchors before exposing them downstream.
Also applies to: 235-247, 352-365
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/mcp_server_python_docs/retrieval/ranker.py` around lines 83 - 99, The
current _resolve_symbol_location logic returns empty string "" for a missing
anchor which makes get_docs treat it as a section hit; change the return value
to use None (or normalize blank anchors to None) when there is no valid section
anchor. Specifically, in the branch that checks section_row and the final
return, return doc_row["slug"], None (or convert section_row["anchor"] or ""
into None) instead of "", and apply the same change in the other occurrences
referenced (the other blocks around lines with similar logic) so doc_row,
fallback_anchor, section_row and _resolve_symbol_location consistently return
None for “no anchor.”
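A minimal sketch of the suggested normalization, assuming the row shapes shown in the snippet above; `_normalize_anchor` is a hypothetical helper name, not code from the PR.

```python
def _normalize_anchor(anchor: str | None) -> str | None:
    # "" and None both mean "no section anchor"; collapse them to None so
    # get_docs() performs a page lookup instead of a section lookup.
    return anchor or None

# At the return sites in _resolve_symbol_location:
#   return doc_row["slug"], _normalize_anchor(section_row["anchor"])  # section hit
#   return doc_row["slug"], None                                      # page-only fallback
```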
```python
        try:
            self._cache_path.parent.mkdir(parents=True, exist_ok=True)
            self._conn = sqlite3.connect(str(self._cache_path), check_same_thread=False)
            self._conn.execute("PRAGMA synchronous = NORMAL")
```
🧩 Analysis chain
🌐 Web query:
Per the Python sqlite3 documentation, does setting `check_same_thread=False` make a single connection safe for concurrent `execute()`/`commit()` calls from multiple threads, or must access still be serialized by the application?
💡 Result:
No, setting check_same_thread=False does not make a single sqlite3 connection safe for concurrent execute() or commit() calls from multiple threads. Access must still be serialized by the application, particularly for write operations, to avoid data corruption.[1][2][3] The official Python sqlite3 documentation states: "If False, the connection may be accessed in multiple threads; write operations may need to be serialized by the user to avoid data corruption. See threadsafety for more information."[1][2][3] The sqlite3.threadsafety attribute indicates the underlying SQLite library's threading mode, typically 2 (multi-thread) or 3 (serialized).[1][2] In multi-thread mode (common default), no single connection should be used simultaneously by multiple threads.[4] Even in serialized mode, Python's wrapper may require additional synchronization, as noted in CPython issues.[5] Secondary sources confirm: applications must use locks to serialize access to shared connections.[6] Concurrent writes without serialization risk database corruption or errors.[7][8][9]
Citations:
- 1: https://docs.python.org/3/library/sqlite3.html
- 2: https://docs.python.org/3.13/library/sqlite3.html
- 3: https://docs.python.org/3/library/sqlite3.html?highlight=connect
- 4: https://sqlite.org/threadsafe.html
- 5: sqlite3, `check_same_thread` should be `true`? python/cpython#128697
- 6: https://stackoverflow.com/questions/22739590/how-to-share-single-sqlite-connection-in-multi-threaded-python-application
- 7: Is `check_same_thread=False` Safe? fastapi/fastapi#5199
- 8: https://stackoverflow.com/questions/393554/python-sqlite3-and-concurrency
- 9: https://stackoverflow.com/questions/35717263/how-threadsafe-is-sqlite3
Serialize access to the shared cache connection.
check_same_thread=False only disables sqlite3's thread-affinity check—it does not make a single connection safe for concurrent execute() or commit() calls. Since this cache instance is shared across tool invocations, parallel get() and put() requests will race and risk data corruption or dropped writes. Wrap all database operations with a lock, or use a short-lived connection per operation.
🔒 One straightforward fix
```diff
+import threading
 ...
 class PersistentDocsCache:
@@
     def __init__(self, cache_path: Path, index_path: Path) -> None:
@@
         self._hits = self._misses = self._writes = 0
+        self._lock = threading.Lock()
         self._conn: sqlite3.Connection | None = None
@@
         try:
-            row = self._conn.execute(
-                "SELECT result_json FROM retrieved_docs_cache WHERE index_fingerprint = ? "
-                "AND version = ? AND slug = ? AND anchor = ? AND max_chars = ? AND start_index = ?",
-                (
-                    self._fingerprint,
-                    version,
-                    slug,
-                    self._anchor_key(anchor),
-                    max_chars,
-                    start_index,
-                ),
-            ).fetchone()
+            with self._lock:
+                row = self._conn.execute(
+                    "SELECT result_json FROM retrieved_docs_cache WHERE index_fingerprint = ? "
+                    "AND version = ? AND slug = ? AND anchor = ? AND max_chars = ? AND start_index = ?",
+                    (
+                        self._fingerprint,
+                        version,
+                        slug,
+                        self._anchor_key(anchor),
+                        max_chars,
+                        start_index,
+                    ),
+                ).fetchone()
@@
         try:
-            self._conn.execute(
-                "INSERT OR REPLACE INTO retrieved_docs_cache "
-                "(index_fingerprint, version, slug, anchor, max_chars, start_index, result_json) "
-                "VALUES (?, ?, ?, ?, ?, ?, ?)",
-                (
-                    self._fingerprint,
-                    result.version,
-                    result.slug,
-                    self._anchor_key(result.anchor),
-                    max_chars,
-                    start_index,
-                    result.model_dump_json(),
-                ),
-            )
-            self._conn.commit()
+            with self._lock:
+                self._conn.execute(
+                    "INSERT OR REPLACE INTO retrieved_docs_cache "
+                    "(index_fingerprint, version, slug, anchor, max_chars, start_index, result_json) "
+                    "VALUES (?, ?, ?, ?, ?, ?, ?)",
+                    (
+                        self._fingerprint,
+                        result.version,
+                        result.slug,
+                        self._anchor_key(result.anchor),
+                        max_chars,
+                        start_index,
+                        result.model_dump_json(),
+                    ),
+                )
+                self._conn.commit()
```
Also applies to: 66-99, 101-123
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/mcp_server_python_docs/services/persistent_cache.py` around lines 32 -
35, The shared sqlite connection (self._conn) is used with
check_same_thread=False but not protected against concurrent access; serialize
DB access by adding a threading.Lock (e.g., self._db_lock created where the
connection is initialized) and wrap every database operation (methods like
get(), put(), and any other places that call self._conn.execute/commit) inside a
with self._db_lock: block (or acquire/release) so execute()/commit() calls
cannot run concurrently; alternatively, change the implementation to open a
short-lived sqlite3.connect(...) per operation and close it after the
query—apply the same change for all DB use sites referenced around the
__init__/connection code and the blocks noted (lines ~66-99, 101-123).
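For comparison, here is a minimal sketch of the connection-per-operation alternative mentioned in the prompt. The class, table, and column names are simplified placeholders, not the PR's actual schema or API.

```python
import contextlib
import sqlite3
from pathlib import Path


class PerOperationCache:
    """Each call opens its own short-lived connection, so no lock is needed."""

    def __init__(self, cache_path: Path) -> None:
        self._cache_path = cache_path
        with contextlib.closing(self._connect()) as conn:
            conn.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)")
            conn.commit()

    def _connect(self) -> sqlite3.Connection:
        # A fresh connection is never shared across threads, so sqlite3's
        # default check_same_thread=True affinity check can stay enabled.
        return sqlite3.connect(self._cache_path)

    def get(self, key: str) -> str | None:
        with contextlib.closing(self._connect()) as conn:
            row = conn.execute("SELECT value FROM cache WHERE key = ?", (key,)).fetchone()
        return row[0] if row else None

    def put(self, key: str, value: str) -> None:
        with contextlib.closing(self._connect()) as conn:
            conn.execute("INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)", (key, value))
            conn.commit()
```

The trade-off is per-call connection overhead versus the lock's serialization; for a low-traffic docs cache either approach is typically fine.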
```python
def _assert_protocol_on_stdout_only(result: subprocess.CompletedProcess) -> list[dict]:
    """Assert JSON-RPC protocol frames are emitted only on stdout."""
    responses = _read_responses(result.stdout)
    assert responses, f"No JSON-RPC responses on stdout; stderr was: {result.stderr!r}"

    for resp in responses:
        assert "_raw" not in resp, f"Non-JSON stdout pollution: {resp.get('_raw')}"
        assert resp.get("jsonrpc") == "2.0", f"Missing JSON-RPC frame on stdout: {resp}"

    stderr_frames = _jsonrpc_frames(result.stderr)
    assert stderr_frames == [], f"JSON-RPC frames leaked to stderr: {stderr_frames}"
    return responses
```
Assert successful process exit in protocol helper.
_assert_protocol_on_stdout_only validates framing but can still pass if the server emits responses and then exits non-zero. Add a return-code check to keep smoke tests aligned with the “no crash” intent.
Suggested patch
```diff
 def _assert_protocol_on_stdout_only(result: subprocess.CompletedProcess) -> list[dict]:
     """Assert JSON-RPC protocol frames are emitted only on stdout."""
+    assert result.returncode == 0, (
+        f"Server exited with non-zero status {result.returncode}; stderr was: {result.stderr!r}"
+    )
     responses = _read_responses(result.stdout)
     assert responses, f"No JSON-RPC responses on stdout; stderr was: {result.stderr!r}"
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/test_stdio_smoke.py` around lines 151 - 162, The helper
_assert_protocol_on_stdout_only currently verifies JSON-RPC frames but doesn't
assert the process exit status; update this function to check result.returncode
== 0 (or use result.check_returncode()) and assert a successful exit with a
clear message so smoke tests fail if the server exited non-zero; locate the
function _assert_protocol_on_stdout_only and add a return-code assertion after
reading/parsing stdout/stderr responses.
Summary
Fixes #8
Fixes #6
Changes
- Persistently cache `get_docs` results so retrieved docs survive MCP client/server restarts.
- Key cache entries by version, slug, anchor, `max_chars`, and `start_index`.
- Invalidate the cache whenever `index.db` is rebuilt or replaced (see the fingerprint sketch below).
- Add `lookup_package_docs`, a controlled PyPI metadata lookup tool that returns only PyPI/project-declared docs/homepage/source URLs and makes the `pypi-declared-metadata` trust boundary explicit.
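As a sketch of the invalidation mechanism described above (the function name and hashing scheme here are assumptions; the PR's actual implementation may differ):

```python
import hashlib
from pathlib import Path


def index_fingerprint(index_path: Path) -> str:
    # Hash the index database bytes; rebuilding or replacing index.db yields a
    # new fingerprint, so cached rows keyed on the old one are no longer matched.
    digest = hashlib.sha256()
    with index_path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Cache rows are then keyed by
# (index_fingerprint, version, slug, anchor, max_chars, start_index),
# matching the columns shown in the review snippets above.
```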
- `uv run ruff check src/ tests/`
- `uv run pyright src/`
- `uv run pytest --tb=short -q` — 247 passed, 3 skipped

Notes
`lookup_package_docs` does not crawl or fetch arbitrary package documentation pages. It queries the official PyPI JSON API and returns declared source URLs only, so it stays inside the controlled trust boundary instead of becoming a generic web search.
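For reference, a minimal sketch of this kind of PyPI JSON API query; the function name and return shape are illustrative, not the tool's actual interface.

```python
import json
from urllib.request import urlopen


def declared_doc_urls(package: str) -> dict[str, str]:
    # The official PyPI JSON API: https://pypi.org/pypi/<name>/json
    with urlopen(f"https://pypi.org/pypi/{package}/json", timeout=10) as resp:
        info = json.load(resp)["info"]
    urls = dict(info.get("project_urls") or {})
    if info.get("home_page"):
        urls.setdefault("Homepage", info["home_page"])
    # Only URLs the project itself declared on PyPI are returned;
    # nothing is crawled or fetched beyond this single metadata call.
    return urls
```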
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Documentation
Tests