Conversation
…date reporter delivery flow, and improve the quality check module (54.51)
…t-report guidance
…haracter for report qa
…at/enhance_dsv2
…at/dr_reasoning
…agent into hanzhou/0413
Add Tavily HTTP client, search/extract schema, WebSearchTool integration, optional large-result spill, researcher/searcher Tavily YAML presets, and run_benchmark env hooks for RESEARCHER_CONFIG / BENCH paths. Made-with: Cursor
…ack) WebSearchTool imports fetch_single_text_with_meta; add tiered fetch helpers and optional Playwright fallback module used by jina_reader. Made-with: Cursor
Merge upstream main is already applied on the base branch. - Extend jina_reader with meta fetch, proxy parsing, and fetch fallbacks - Wire websearch/tool_manager and arxiv schema adjustments - Update deep_research v2 searcher config and benchmark script - Add jina_reader cascade tests; refresh search tests and gitignore Made-with: Cursor
There was a problem hiding this comment.
Code Review
This pull request introduces significant enhancements to the research agent, including a tiered content fetching system (Jina Reader, direct HTTP, and Playwright fallback), support for the Tavily Search API, and a mechanism to spill large web search payloads to disk to manage context limits. It also adds format cleanup for research reports and improves the robustness of the Exa search engine by implementing API key rotation. My review identified several areas for improvement, including the need to run blocking cleanup tasks in an executor to avoid blocking the event loop, refactoring duplicated configuration logic, and addressing inconsistencies in response fields.
| # yapf: disable | ||
| import os | ||
| import re | ||
| import shutil |
| logger.info( | ||
| 'ResearcherCallback: running format cleanup agent on ' | ||
| f'{self.report_filename}...') | ||
| self._run_format_cleanup(self._report_path) |
There was a problem hiding this comment.
The _run_format_cleanup method performs a synchronous network call to the LLM and blocking file I/O. Since on_task_end is an async method, calling it directly blocks the event loop. This should be executed in a thread pool using run_in_executor to maintain the responsiveness of the agent, especially if this callback is used in a server environment.
| self._run_format_cleanup(self._report_path) | |
| loop = asyncio.get_running_loop() | |
| await loop.run_in_executor(None, self._run_format_cleanup, self._report_path) |
| config = JinaReaderConfig( | ||
| timeout=kwargs.get('timeout', 30.0), | ||
| timeout=kwargs.get('timeout', 45.0), | ||
| retries=kwargs.get('retries', 3), | ||
| direct_fetch_fallback=bool(kwargs.get('direct_fetch_fallback', True)), | ||
| direct_fetch_timeout=float(kwargs.get('direct_fetch_timeout', 15.0)), | ||
| playwright_fetch_fallback=bool( | ||
| kwargs.get('playwright_fetch_fallback', True)), | ||
| playwright_retry_min_chars=int( | ||
| kwargs.get('playwright_retry_min_chars', 400) or 400), | ||
| playwright_timeout_ms=int( | ||
| kwargs.get('playwright_timeout_ms', 30_000) or 30_000), | ||
| playwright_settle_ms=int(kwargs.get('playwright_settle_ms', 350)), | ||
| use_system_proxy=bool(kwargs.get('use_system_proxy', True)), | ||
| ) | ||
| return JinaContentFetcher(config) | ||
| if fetcher_type == 'tavily_extract': | ||
| from ms_agent.tools.search.tavily.fetcher import TavilyExtractFetcher | ||
| return TavilyExtractFetcher( | ||
| api_key=kwargs.get('tavily_api_key'), | ||
| extract_depth=str(kwargs.get('tavily_extract_depth', 'advanced')), | ||
| format=str(kwargs.get('tavily_extract_format', 'markdown')), | ||
| timeout=float(kwargs.get('timeout', 45.0)), | ||
| chunks_per_source=int(kwargs.get('tavily_extract_chunks_per_source', 3)), | ||
| include_images=bool(kwargs.get('tavily_extract_include_images', False)), | ||
| include_favicon=bool(kwargs.get('tavily_extract_include_favicon', False)), | ||
| include_usage=bool(kwargs.get('tavily_extract_include_usage', False)), | ||
| ) | ||
| # Future: add more fetchers | ||
| # elif fetcher_type == 'docling': | ||
| # return DoclingContentFetcher(**kwargs) | ||
| else: | ||
| logger.warning( | ||
| f"Unknown fetcher type '{fetcher_type}', falling back to jina_reader" | ||
| ) | ||
| return JinaContentFetcher() | ||
| return JinaContentFetcher( | ||
| JinaReaderConfig( | ||
| timeout=kwargs.get('timeout', 45.0), | ||
| retries=kwargs.get('retries', 3), | ||
| direct_fetch_fallback=bool(kwargs.get('direct_fetch_fallback', | ||
| True)), | ||
| direct_fetch_timeout=float( | ||
| kwargs.get('direct_fetch_timeout', 15.0)), | ||
| playwright_fetch_fallback=bool( | ||
| kwargs.get('playwright_fetch_fallback', True)), | ||
| playwright_retry_min_chars=int( | ||
| kwargs.get('playwright_retry_min_chars', 400) or 400), | ||
| playwright_timeout_ms=int( | ||
| kwargs.get('playwright_timeout_ms', 30_000) or 30_000), | ||
| playwright_settle_ms=int( | ||
| kwargs.get('playwright_settle_ms', 350)), | ||
| use_system_proxy=bool(kwargs.get('use_system_proxy', True)), | ||
| )) |
There was a problem hiding this comment.
The initialization logic for JinaReaderConfig is duplicated between the jina_reader case and the fallback else block. This redundancy makes the code harder to maintain and increases the risk of inconsistencies when updating configuration parameters. Refactoring to define the configuration once before the conditional blocks would improve maintainability.
"""Factory function to get content fetcher by type."""
jina_config = JinaReaderConfig(
timeout=kwargs.get('timeout', 45.0),
retries=kwargs.get('retries', 3),
direct_fetch_fallback=bool(kwargs.get('direct_fetch_fallback', True)),
direct_fetch_timeout=float(kwargs.get('direct_fetch_timeout', 15.0)),
playwright_fetch_fallback=bool(
kwargs.get('playwright_fetch_fallback', True)),
playwright_retry_min_chars=int(
kwargs.get('playwright_retry_min_chars', 400) or 400),
playwright_timeout_ms=int(
kwargs.get('playwright_timeout_ms', 30_000) or 30_000),
playwright_settle_ms=int(kwargs.get('playwright_settle_ms', 350)),
use_system_proxy=bool(kwargs.get('use_system_proxy', True)),
)
if fetcher_type == 'jina_reader':
return JinaContentFetcher(jina_config)
if fetcher_type == 'tavily_extract':
from ms_agent.tools.search.tavily.fetcher import TavilyExtractFetcher
return TavilyExtractFetcher(
api_key=kwargs.get('tavily_api_key'),
extract_depth=str(kwargs.get('tavily_extract_depth', 'advanced')),
format=str(kwargs.get('tavily_extract_format', 'markdown')),
timeout=float(kwargs.get('timeout', 45.0)),
chunks_per_source=int(kwargs.get('tavily_extract_chunks_per_source', 3)),
include_images=bool(kwargs.get('tavily_extract_include_images', False)),
include_favicon=bool(kwargs.get('tavily_extract_include_favicon', False)),
include_usage=bool(kwargs.get('tavily_extract_include_usage', False)),
)
else:
logger.warning(
f"Unknown fetcher type '{fetcher_type}', falling back to jina_reader"
)
return JinaContentFetcher(jina_config)| @@ -1082,6 +1420,10 @@ async def fetch_page(self, url: str) -> str: | |||
| result.get('published_at', ''), | |||
There was a problem hiding this comment.
The published_at field in the fetch_page response will always be empty because the date extraction logic in _fetch_content_sync (previously at lines 871-872) has been commented out. If date extraction is no longer supported for this tool, the field should be removed from the response to avoid misleading the LLM with empty data. Otherwise, the extraction logic should be restored.
Change Summary
Related issue number
Checklist
pre-commit installandpre-commit run --all-filesbefore git commit, and passed lint check.