Skip to content

Feat/update search tool #902

Open
suluyana wants to merge 27 commits intomodelscope:mainfrom
suluyana:feat/jina-reader-search-updates
Open

Feat/update search tool #902
suluyana wants to merge 27 commits intomodelscope:mainfrom
suluyana:feat/jina-reader-search-updates

Conversation

@suluyana
Copy link
Copy Markdown
Collaborator

Change Summary

Related issue number

Checklist

  • The pull request title is a good summary of the changes - it will be used in the changelog
  • Unit tests for the changes exist
  • Run pre-commit install and pre-commit run --all-files before git commit, and passed lint check.
  • Documentation reflects the changes where applicable

suluyan and others added 27 commits February 6, 2026 15:03
…date reporter delivery flow, and improve the quality check module (54.51)
Add Tavily HTTP client, search/extract schema, WebSearchTool integration,
optional large-result spill, researcher/searcher Tavily YAML presets, and
run_benchmark env hooks for RESEARCHER_CONFIG / BENCH paths.

Made-with: Cursor
…ack)

WebSearchTool imports fetch_single_text_with_meta; add tiered fetch helpers
and optional Playwright fallback module used by jina_reader.

Made-with: Cursor
Merge upstream main is already applied on the base branch.

- Extend jina_reader with meta fetch, proxy parsing, and fetch fallbacks
- Wire websearch/tool_manager and arxiv schema adjustments
- Update deep_research v2 searcher config and benchmark script
- Add jina_reader cascade tests; refresh search tests and gitignore

Made-with: Cursor
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces significant enhancements to the research agent, including a tiered content fetching system (Jina Reader, direct HTTP, and Playwright fallback), support for the Tavily Search API, and a mechanism to spill large web search payloads to disk to manage context limits. It also adds format cleanup for research reports and improves the robustness of the Exa search engine by implementing API key rotation. My review identified several areas for improvement, including the need to run blocking cleanup tasks in an executor to avoid blocking the event loop, refactoring duplicated configuration logic, and addressing inconsistencies in response fields.

Comment on lines +2 to +5
# yapf: disable
import os
import re
import shutil
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Add import asyncio to support running blocking cleanup tasks in an executor within the async callback methods.

Suggested change
# yapf: disable
import os
import re
import shutil
# yapf: disable
import asyncio
import os
import re
import shutil

logger.info(
'ResearcherCallback: running format cleanup agent on '
f'{self.report_filename}...')
self._run_format_cleanup(self._report_path)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The _run_format_cleanup method performs a synchronous network call to the LLM and blocking file I/O. Since on_task_end is an async method, calling it directly blocks the event loop. This should be executed in a thread pool using run_in_executor to maintain the responsiveness of the agent, especially if this callback is used in a server environment.

Suggested change
self._run_format_cleanup(self._report_path)
loop = asyncio.get_running_loop()
await loop.run_in_executor(None, self._run_format_cleanup, self._report_path)

Comment on lines 199 to +250
config = JinaReaderConfig(
timeout=kwargs.get('timeout', 30.0),
timeout=kwargs.get('timeout', 45.0),
retries=kwargs.get('retries', 3),
direct_fetch_fallback=bool(kwargs.get('direct_fetch_fallback', True)),
direct_fetch_timeout=float(kwargs.get('direct_fetch_timeout', 15.0)),
playwright_fetch_fallback=bool(
kwargs.get('playwright_fetch_fallback', True)),
playwright_retry_min_chars=int(
kwargs.get('playwright_retry_min_chars', 400) or 400),
playwright_timeout_ms=int(
kwargs.get('playwright_timeout_ms', 30_000) or 30_000),
playwright_settle_ms=int(kwargs.get('playwright_settle_ms', 350)),
use_system_proxy=bool(kwargs.get('use_system_proxy', True)),
)
return JinaContentFetcher(config)
if fetcher_type == 'tavily_extract':
from ms_agent.tools.search.tavily.fetcher import TavilyExtractFetcher
return TavilyExtractFetcher(
api_key=kwargs.get('tavily_api_key'),
extract_depth=str(kwargs.get('tavily_extract_depth', 'advanced')),
format=str(kwargs.get('tavily_extract_format', 'markdown')),
timeout=float(kwargs.get('timeout', 45.0)),
chunks_per_source=int(kwargs.get('tavily_extract_chunks_per_source', 3)),
include_images=bool(kwargs.get('tavily_extract_include_images', False)),
include_favicon=bool(kwargs.get('tavily_extract_include_favicon', False)),
include_usage=bool(kwargs.get('tavily_extract_include_usage', False)),
)
# Future: add more fetchers
# elif fetcher_type == 'docling':
# return DoclingContentFetcher(**kwargs)
else:
logger.warning(
f"Unknown fetcher type '{fetcher_type}', falling back to jina_reader"
)
return JinaContentFetcher()
return JinaContentFetcher(
JinaReaderConfig(
timeout=kwargs.get('timeout', 45.0),
retries=kwargs.get('retries', 3),
direct_fetch_fallback=bool(kwargs.get('direct_fetch_fallback',
True)),
direct_fetch_timeout=float(
kwargs.get('direct_fetch_timeout', 15.0)),
playwright_fetch_fallback=bool(
kwargs.get('playwright_fetch_fallback', True)),
playwright_retry_min_chars=int(
kwargs.get('playwright_retry_min_chars', 400) or 400),
playwright_timeout_ms=int(
kwargs.get('playwright_timeout_ms', 30_000) or 30_000),
playwright_settle_ms=int(
kwargs.get('playwright_settle_ms', 350)),
use_system_proxy=bool(kwargs.get('use_system_proxy', True)),
))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The initialization logic for JinaReaderConfig is duplicated between the jina_reader case and the fallback else block. This redundancy makes the code harder to maintain and increases the risk of inconsistencies when updating configuration parameters. Refactoring to define the configuration once before the conditional blocks would improve maintainability.

    """Factory function to get content fetcher by type."""
    jina_config = JinaReaderConfig(
        timeout=kwargs.get('timeout', 45.0),
        retries=kwargs.get('retries', 3),
        direct_fetch_fallback=bool(kwargs.get('direct_fetch_fallback', True)),
        direct_fetch_timeout=float(kwargs.get('direct_fetch_timeout', 15.0)),
        playwright_fetch_fallback=bool(
            kwargs.get('playwright_fetch_fallback', True)),
        playwright_retry_min_chars=int(
            kwargs.get('playwright_retry_min_chars', 400) or 400),
        playwright_timeout_ms=int(
            kwargs.get('playwright_timeout_ms', 30_000) or 30_000),
        playwright_settle_ms=int(kwargs.get('playwright_settle_ms', 350)),
        use_system_proxy=bool(kwargs.get('use_system_proxy', True)),
    )
    if fetcher_type == 'jina_reader':
        return JinaContentFetcher(jina_config)
    if fetcher_type == 'tavily_extract':
        from ms_agent.tools.search.tavily.fetcher import TavilyExtractFetcher
        return TavilyExtractFetcher(
            api_key=kwargs.get('tavily_api_key'),
            extract_depth=str(kwargs.get('tavily_extract_depth', 'advanced')),
            format=str(kwargs.get('tavily_extract_format', 'markdown')),
            timeout=float(kwargs.get('timeout', 45.0)),
            chunks_per_source=int(kwargs.get('tavily_extract_chunks_per_source', 3)),
            include_images=bool(kwargs.get('tavily_extract_include_images', False)),
            include_favicon=bool(kwargs.get('tavily_extract_include_favicon', False)),
            include_usage=bool(kwargs.get('tavily_extract_include_usage', False)),
        )
    else:
        logger.warning(
            f"Unknown fetcher type '{fetcher_type}', falling back to jina_reader"
        )
        return JinaContentFetcher(jina_config)

@@ -1082,6 +1420,10 @@ async def fetch_page(self, url: str) -> str:
result.get('published_at', ''),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The published_at field in the fetch_page response will always be empty because the date extraction logic in _fetch_content_sync (previously at lines 871-872) has been commented out. If date extraction is no longer supported for this tool, the field should be removed from the response to avoid misleading the LLM with empty data. Otherwise, the extraction logic should be restored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant