Feat/update search tool by suluyana · Pull Request #902 · modelscope/ms-agent

suluyana · 2026-04-21T09:02:18Z

Change Summary

Related issue number

Checklist

The pull request title is a good summary of the changes - it will be used in the changelog
Unit tests for the changes exist
Run pre-commit install and pre-commit run --all-files before git commit, and passed lint check.
Documentation reflects the changes where applicable

Add Tavily HTTP client, search/extract schema, WebSearchTool integration, optional large-result spill, researcher/searcher Tavily YAML presets, and run_benchmark env hooks for RESEARCHER_CONFIG / BENCH paths. Made-with: Cursor

Merge upstream main is already applied on the base branch.
- Extend jina_reader with meta fetch, proxy parsing, and fetch fallbacks
- Wire websearch/tool_manager and arxiv schema adjustments
- Update deep_research v2 searcher config and benchmark script
- Add jina_reader cascade tests; refresh search tests and gitignore
Made-with: Cursor

gemini-code-assist

Code Review

This pull request introduces significant enhancements to the research agent, including a tiered content fetching system (Jina Reader, direct HTTP, and Playwright fallback), support for the Tavily Search API, and a mechanism to spill large web search payloads to disk to manage context limits. It also adds format cleanup for research reports and improves the robustness of the Exa search engine by implementing API key rotation. My review identified several areas for improvement, including the need to run blocking cleanup tasks in an executor to avoid blocking the event loop, refactoring duplicated configuration logic, and addressing inconsistencies in response fields.
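The tiered fetching mentioned in the summary can be pictured as a cascade that tries each tier in order and falls through when a tier raises or returns too little text. The sketch below is only an illustration under assumptions: `fetch_with_fallback`, the `(name, fetcher)` tier tuples, and the `min_chars` threshold are hypothetical names, not the PR's actual helpers (the default of 400 loosely echoes `playwright_retry_min_chars` in the diff).

```python
from typing import Callable, Optional, Sequence, Tuple

# A fetcher takes a URL and returns extracted text (hypothetical signature).
Fetcher = Callable[[str], str]


def fetch_with_fallback(
    url: str,
    tiers: Sequence[Tuple[str, Fetcher]],
    min_chars: int = 400,
) -> Tuple[Optional[str], str]:
    """Try each (name, fetcher) tier in order and return (tier_name, text).

    A tier "wins" as soon as it yields at least `min_chars` characters.
    If no tier reaches the threshold, the longest text seen is returned.
    """
    best_name, best_text = None, ''
    for name, fetcher in tiers:
        try:
            text = fetcher(url) or ''
        except Exception:
            # A failing tier is skipped so the next one gets a chance.
            continue
        if len(text) >= min_chars:
            return name, text
        if len(text) > len(best_text):
            best_name, best_text = name, text
    return best_name, best_text
```

In such a design the tiers would be passed cheapest first, for example a reader API, then a direct HTTP fetch, then a headless browser.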

gemini-code-assist · 2026-04-21T09:06:40Z

+# yapf: disable
 import os
+import re
+import shutil


Add import asyncio to support running blocking cleanup tasks in an executor within the async callback methods.

Suggested change

 # yapf: disable
+import asyncio
 import os
 import re
 import shutil

gemini-code-assist · 2026-04-21T09:06:40Z

+                logger.info(
+                    'ResearcherCallback: running format cleanup agent on '
+                    f'{self.report_filename}...')
+                self._run_format_cleanup(self._report_path)


The _run_format_cleanup method performs a synchronous network call to the LLM and blocking file I/O. Since on_task_end is an async method, calling it directly blocks the event loop. This should be executed in a thread pool using run_in_executor to maintain the responsiveness of the agent, especially if this callback is used in a server environment.

Suggested change

-                self._run_format_cleanup(self._report_path)
+                loop = asyncio.get_running_loop()
+                await loop.run_in_executor(None, self._run_format_cleanup, self._report_path)
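The effect of moving a blocking call off the event loop can be seen in a small self-contained sketch. `ReportCallback` and its `time.sleep` body are hypothetical stand-ins for the callback and its LLM/file work; only `asyncio.get_running_loop()` and `loop.run_in_executor()` come from the suggestion itself. A heartbeat task keeps ticking while the cleanup runs, which would not happen if the method were called directly.

```python
import asyncio
import threading
import time


class ReportCallback:
    """Hypothetical stand-in for the callback under review."""

    def __init__(self):
        self.ran_off_main_thread = False

    def _run_format_cleanup(self, report_path: str) -> str:
        # Stands in for the blocking LLM call and file I/O.
        self.ran_off_main_thread = (
            threading.current_thread() is not threading.main_thread())
        time.sleep(0.05)
        return report_path

    async def on_task_end(self, report_path: str) -> str:
        loop = asyncio.get_running_loop()
        # Passing None selects the loop's default ThreadPoolExecutor.
        return await loop.run_in_executor(
            None, self._run_format_cleanup, report_path)


async def main():
    cb = ReportCallback()
    ticks = 0

    async def heartbeat():
        # Counts how often the event loop gets to run other work.
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.005)

    hb = asyncio.create_task(heartbeat())
    result = await cb.on_task_end('report.md')
    hb.cancel()
    return result, ticks, cb.ran_off_main_thread
```

Running `asyncio.run(main())` returns the report path, a non-zero tick count, and a flag showing the cleanup ran on a worker thread. On Python 3.9+, `asyncio.to_thread(...)` is an equivalent shorthand for the default executor.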

gemini-code-assist · 2026-04-21T09:06:40Z

        config = JinaReaderConfig(
-            timeout=kwargs.get('timeout', 30.0),
+            timeout=kwargs.get('timeout', 45.0),
            retries=kwargs.get('retries', 3),
+            direct_fetch_fallback=bool(kwargs.get('direct_fetch_fallback', True)),
+            direct_fetch_timeout=float(kwargs.get('direct_fetch_timeout', 15.0)),
+            playwright_fetch_fallback=bool(
+                kwargs.get('playwright_fetch_fallback', True)),
+            playwright_retry_min_chars=int(
+                kwargs.get('playwright_retry_min_chars', 400) or 400),
+            playwright_timeout_ms=int(
+                kwargs.get('playwright_timeout_ms', 30_000) or 30_000),
+            playwright_settle_ms=int(kwargs.get('playwright_settle_ms', 350)),
+            use_system_proxy=bool(kwargs.get('use_system_proxy', True)),
        )
        return JinaContentFetcher(config)
+    if fetcher_type == 'tavily_extract':
+        from ms_agent.tools.search.tavily.fetcher import TavilyExtractFetcher
+        return TavilyExtractFetcher(
+            api_key=kwargs.get('tavily_api_key'),
+            extract_depth=str(kwargs.get('tavily_extract_depth', 'advanced')),
+            format=str(kwargs.get('tavily_extract_format', 'markdown')),
+            timeout=float(kwargs.get('timeout', 45.0)),
+            chunks_per_source=int(kwargs.get('tavily_extract_chunks_per_source', 3)),
+            include_images=bool(kwargs.get('tavily_extract_include_images', False)),
+            include_favicon=bool(kwargs.get('tavily_extract_include_favicon', False)),
+            include_usage=bool(kwargs.get('tavily_extract_include_usage', False)),
+        )
    # Future: add more fetchers
    # elif fetcher_type == 'docling':
    #     return DoclingContentFetcher(**kwargs)
    else:
        logger.warning(
            f"Unknown fetcher type '{fetcher_type}', falling back to jina_reader"
        )
-        return JinaContentFetcher()
+        return JinaContentFetcher(
+            JinaReaderConfig(
+                timeout=kwargs.get('timeout', 45.0),
+                retries=kwargs.get('retries', 3),
+                direct_fetch_fallback=bool(kwargs.get('direct_fetch_fallback',
+                                                     True)),
+                direct_fetch_timeout=float(
+                    kwargs.get('direct_fetch_timeout', 15.0)),
+                playwright_fetch_fallback=bool(
+                    kwargs.get('playwright_fetch_fallback', True)),
+                playwright_retry_min_chars=int(
+                    kwargs.get('playwright_retry_min_chars', 400) or 400),
+                playwright_timeout_ms=int(
+                    kwargs.get('playwright_timeout_ms', 30_000) or 30_000),
+                playwright_settle_ms=int(
+                    kwargs.get('playwright_settle_ms', 350)),
+                use_system_proxy=bool(kwargs.get('use_system_proxy', True)),
+            ))


The initialization logic for JinaReaderConfig is duplicated between the jina_reader case and the fallback else block. This redundancy makes the code harder to maintain and increases the risk of inconsistencies when updating configuration parameters. Refactoring to define the configuration once before the conditional blocks would improve maintainability.

    """Factory function to get content fetcher by type."""
    jina_config = JinaReaderConfig(
        timeout=kwargs.get('timeout', 45.0),
        retries=kwargs.get('retries', 3),
        direct_fetch_fallback=bool(kwargs.get('direct_fetch_fallback', True)),
        direct_fetch_timeout=float(kwargs.get('direct_fetch_timeout', 15.0)),
        playwright_fetch_fallback=bool(
            kwargs.get('playwright_fetch_fallback', True)),
        playwright_retry_min_chars=int(
            kwargs.get('playwright_retry_min_chars', 400) or 400),
        playwright_timeout_ms=int(
            kwargs.get('playwright_timeout_ms', 30_000) or 30_000),
        playwright_settle_ms=int(kwargs.get('playwright_settle_ms', 350)),
        use_system_proxy=bool(kwargs.get('use_system_proxy', True)),
    )
    if fetcher_type == 'jina_reader':
        return JinaContentFetcher(jina_config)
    if fetcher_type == 'tavily_extract':
        from ms_agent.tools.search.tavily.fetcher import TavilyExtractFetcher
        return TavilyExtractFetcher(
            api_key=kwargs.get('tavily_api_key'),
            extract_depth=str(kwargs.get('tavily_extract_depth', 'advanced')),
            format=str(kwargs.get('tavily_extract_format', 'markdown')),
            timeout=float(kwargs.get('timeout', 45.0)),
            chunks_per_source=int(kwargs.get('tavily_extract_chunks_per_source', 3)),
            include_images=bool(kwargs.get('tavily_extract_include_images', False)),
            include_favicon=bool(kwargs.get('tavily_extract_include_favicon', False)),
            include_usage=bool(kwargs.get('tavily_extract_include_usage', False)),
        )
    else:
        logger.warning(
            f"Unknown fetcher type '{fetcher_type}', falling back to jina_reader"
        )
        return JinaContentFetcher(jina_config)
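An alternative way to remove the duplication is to move the kwargs parsing into one helper that both branches call. The sketch below assumes a simplified `ReaderConfig` dataclass with only a subset of the fields from the diff; it is a hypothetical stand-in for `JinaReaderConfig`, and `build_reader_config` is not a function from the PR. It also coerces `timeout` and `retries` explicitly, which the diff does not do.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass
class ReaderConfig:
    """Hypothetical stand-in with a subset of the fields shown in the diff."""
    timeout: float = 45.0
    retries: int = 3
    direct_fetch_fallback: bool = True
    playwright_retry_min_chars: int = 400


def build_reader_config(kwargs: Mapping[str, Any]) -> ReaderConfig:
    """Parse fetcher kwargs in one place so every branch shares the defaults."""
    return ReaderConfig(
        timeout=float(kwargs.get('timeout', 45.0)),
        retries=int(kwargs.get('retries', 3)),
        direct_fetch_fallback=bool(kwargs.get('direct_fetch_fallback', True)),
        # `or 400` also maps an explicit None or 0 back to the default.
        playwright_retry_min_chars=int(
            kwargs.get('playwright_retry_min_chars', 400) or 400),
    )
```

With a helper like this, both the `jina_reader` branch and the unknown-type fallback would reduce to a single call, so a changed default only needs to be edited once.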

gemini-code-assist · 2026-04-21T09:06:40Z

@@ -1082,6 +1420,10 @@ async def fetch_page(self, url: str) -> str:
            result.get('published_at', ''),


The published_at field in the fetch_page response will always be empty because the date extraction logic in _fetch_content_sync (previously at lines 871-872) has been commented out. If date extraction is no longer supported for this tool, the field should be removed from the response to avoid misleading the LLM with empty data. Otherwise, the extraction logic should be restored.
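If date extraction stays disabled, one option is to emit `published_at` only when a value is actually present. The following is a minimal sketch of that idea; `build_page_payload` and the `url`, `title`, and `content` keys are assumptions for illustration, since the real response format of `fetch_page` is not shown in the review.

```python
from typing import Any, Dict


def build_page_payload(result: Dict[str, Any]) -> Dict[str, Any]:
    """Build a response dict, omitting published_at when it is empty."""
    payload = {
        'url': result.get('url', ''),
        'title': result.get('title', ''),
        'content': result.get('content', ''),
    }
    # Treat None, '' and whitespace-only values as "no date available".
    published_at = (result.get('published_at') or '').strip()
    if published_at:
        payload['published_at'] = published_at
    return payload
```

This keeps the field available for fetchers that do supply a date while avoiding a permanently empty key in the output shown to the LLM.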

suluyan and others added 27 commits February 6, 2026 15:03

fix: video gen exclude edit_file

6baa994

Merge branch 'main' of https://github.com/modelscope/ms-agent

9c8205c

Merge branch 'main' of https://github.com/modelscope/ms-agent

94b833f

enhance deep research v2

11bcf06

Merge branch 'main' of https://github.com/modelscope/ms-agent

dd3f185

refactor: optimize architecture, restrict researcher report edits, up…

05cb676

…date reporter delivery flow, and improve the quality check module (54.51)

fix local code executor; refine workflow and prompt (03)

1dd49b6

fix timeout; support for running subagent in process; support for pos…

c27347f

…t-report guidance

refine readme for deep research; add run_benchmark.sh; fix counting c…

5920dc4

…haracter for report qa

Merge branch 'main' of https://github.com/modelscope/ms-agent into fe…

14cb873

…at/enhance_dsv2

enrich researcher's reflection strategy to enhance stability

986b34f

Merge branch 'main' of https://github.com/modelscope/ms-agent

e6ad3e9

thinking support beta; search_file_content fix

b437bdc

support API key pool construction; support for reasoning model

86a3ba8

fix lint

8a32ce5

support for vertex type anthropic llm; refine reasoning output

393f23c

Merge branch 'main' of https://github.com/modelscope/ms-agent

49919ab

support for response api; optimize log output formatting

f08a15e

Merge branch 'main' of https://github.com/modelscope/ms-agent into fe…

5825111

…at/dr_reasoning

fix lint

6a1ba9f

Merge branch 'main' of https://github.com/modelscope/ms-agent

c6b1e3b

Merge branch 'main' of https://github.com/modelscope/ms-agent

2f2dbb0

Merge branch 'feat/dr_reasoning' of https://github.com/alcholiclg/ms-…

9841444

…agent into hanzhou/0413

fix(jina): align reader with websearch (meta fetch + playwright fallb…

cb75af3

…ack) WebSearchTool imports fetch_single_text_with_meta; add tiered fetch helpers and optional Playwright fallback module used by jina_reader. Made-with: Cursor

Merge remote-tracking branch 'upstream/main' into hanzhou/0413

70eb30c

suluyana had a problem deploying to testci April 21, 2026 09:02 — with GitHub Actions Error

gemini-code-assist Bot reviewed Apr 21, 2026

suluyana wants to merge 27 commits into modelscope:main from suluyana:feat/jina-reader-search-updates
