
feat(websearch): add the Exa search provider, make the Tavily/Exa API Base URL configurable, and expand the web search docs #7359

Open
piexian wants to merge 13 commits into AstrBotDevs:master from piexian:feat/exa-search-provider

Conversation

@piexian
Contributor

@piexian piexian commented Apr 4, 2026

Motivation

Changes

  • Exa search provider: add three @llm_tool tools

    • web_search_exa: semantic search supporting 5 search types (auto / neural / fast / instant / deep) and 6 vertical categories (company / people / research paper / news / personal site / financial report); see the payload sketch after this list
    • exa_extract_web_page: extract full page content via the /contents endpoint
    • exa_find_similar: find semantically similar pages via the /findSimilar endpoint
  • Configurable API Base URLs: the Tavily and Exa Base URLs can be customized in the WebUI; the change covers the web_searcher, url_parser, and kb_helper paths

  • Optional timeout: AstrBot's built-in web search tools accept an optional timeout parameter, defaulting to 30 seconds

  • Config metadata and i18n: default.py gains the new config entries and conditional-rendering metadata, with the en-US / ru-RU / zh-CN locales updated in sync

  • Tool management and shared utilities

    • Consolidate the web-search tool allowlist and common handling into shared utility functions
    • Unify Tavily / Exa Base URL normalization so multiple modules can reuse it
    • astr_agent_hooks.py appends a <ref>index</ref> citation hint for search tools in WebChat, helping the model emit traceable source markers
  • Citation source pipeline

    • The backend builds citation sources from structured search results instead of relying only on <ref> tags in the response text
    • chat.py / live_chat.py share the web-search citation extraction logic
    • <ref> parsing in MessageList.vue recognizes Exa / BoCha / exa_find_similar instead of only web_search_tavily
    • When the frontend receives a message with only tool results and no final text or no <ref> tags, it still degrades gracefully to showing the source list
  • Tests: add tests/unit/test_web_search_utils.py covering search-result mapping, favicon passthrough, explicit <ref> hits, and the no-<ref> fallback

  • Docs: the Chinese and English websearch.md now describe the tools and parameters for default / Tavily / Baidu AI Search / BoCha / Exa

  • This is not a breaking change
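For illustration, a minimal sketch of the /search payload that web_search_exa sends (field names follow the Exa API; the query value is invented):

```python
# Hypothetical payload for the Exa /search endpoint; "type" and "category"
# take the values listed above, and the PR clamps the result count to 1-100.
payload = {
    "query": "retrieval-augmented chat bots",  # example query (made up)
    "type": "auto",  # auto / neural / fast / instant / deep
    "category": "research paper",  # one of the six vertical categories
    "numResults": 10,
}
```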

Screenshots or test results

Local verification commands:

```bash
uv run ruff format .
uv run ruff check .
uv run pytest tests/unit/test_web_search_utils.py
pnpm --dir dashboard exec vue-tsc --noEmit
pnpm --dir dashboard run build
```
(Screenshots omitted.)

Checklist

  • New features in this PR were discussed with the maintainers beforehand via Issue / email
  • Necessary tests are done, with verification commands and screenshots provided above
  • No new dependencies were introduced; any added dependency has been synced to the relevant config files
  • This change contains no malicious code

Summary by Sourcery

Add Exa as a new configurable web search provider, unify web search tooling and reference handling, and enhance configurability and robustness of web search integrations.

New Features:

  • Introduce Exa-based semantic web search tools for general search, content extraction, and finding similar pages.
  • Allow configuring Tavily and Exa API base URLs to support proxies and self-hosted or relay deployments.
  • Expose optional per-call timeout parameters for built-in web search tools across providers.

Enhancements:

  • Refactor web search reference extraction into shared utilities and extend it to support additional tools and fallback behaviors when no explicit refs are present.
  • Normalize and validate web search provider base URLs and improve error messaging for misconfigured endpoints.
  • Unify web search tool registration and application in the main agent, including Exa tools, and ensure favicon caching is reused in reference displays.

Documentation:

  • Update Chinese and English web search guides to document all supported providers, including Exa, along with their configuration steps.

Tests:

  • Add unit tests for web search utilities covering reference collection, base URL normalization, and error cases.
  • Add unit tests for Exa web search tool behavior, Tavily URL extractor validation, and BoCha builtin config rule registration.

- Add the Exa search provider with three tools:
  - web_search_exa: semantic search with 5 search types and 6 vertical categories
  - exa_extract_web_page: extract full page text via the /contents endpoint
  - exa_find_similar: find semantically similar pages via the /findSimilar endpoint
- Tavily and Exa API Base URLs are configurable in the WebUI, for proxies / self-hosted instances
- All built-in web search tools gain a configurable timeout parameter (minimum 30s)
- <ref> parsing in MessageList.vue supports Exa/BoCha/findSimilar
- Update config metadata, i18n, routes, and hooks
- Update the Chinese and English user docs with tool parameter descriptions for Tavily/BoCha/Baidu AI Search
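The ref-related bullets boil down to: prefer the <ref> indices the model actually emitted, otherwise degrade to showing every collected source. A compressed sketch of that selection (the helper name here is invented; the real logic lives in build_web_search_refs(), shown in full in the review discussion below):

```python
import re

def pick_used_refs(text: str, ordered_refs: list[dict]) -> list[dict]:
    """Select the sources cited via <ref>index</ref>; fall back to all sources."""
    refs_by_index = {ref["index"]: ref for ref in ordered_refs}
    cited = [m.group(1).strip() for m in re.finditer(r"<ref>(.*?)</ref>", text)]
    used = [refs_by_index[i] for i in cited if i in refs_by_index]
    return used or ordered_refs  # no usable <ref> tags -> show everything
```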
Copilot AI review requested due to automatic review settings April 4, 2026 16:20
@dosubot dosubot Bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Apr 4, 2026
@dosubot dosubot Bot added area:provider The bug / feature is about AI Provider, Models, LLM Agent, LLM Agent Runner. area:webui The bug / feature is about webui(dashboard) of astrbot. labels Apr 4, 2026
Contributor

@sourcery-ai sourcery-ai Bot left a comment


Hey - I've found 2 issues, and left some high level feedback:

  • The minimum-timeout enforcement logic (if timeout < 30: timeout = 30) is duplicated across many tools (fetch_url, Tavily/BoCha/Exa helpers, etc.); consider extracting a small utility (or a module-level MIN_TIMEOUT constant plus helper) to centralize this behavior and avoid inconsistencies (e.g., _web_search_exa currently lacks the clamp).
  • The Exa API key missing error message is in Chinese in _get_exa_key while other user-facing errors in this module are English; aligning these messages to a consistent language will make debugging and UX more coherent.
  • The lists of supported web-search tools for reference extraction are now duplicated in multiple places (e.g., astr_agent_hooks._extract_web_search_refs, dashboard routes, and MessageList.vue), which makes it easy to miss a spot when adding new providers; consider centralizing this mapping or deriving it from a shared config to keep UI and backend behavior in sync.

## Individual Comments

### Comment 1
<location path="astrbot/builtin_stars/web_searcher/main.py" line_range="159-168" />
<code_context>
         self,
         cfg: AstrBotConfig,
         payload: dict,
+        timeout: int = 30,
     ) -> list[SearchResult]:
         """使用 Tavily 搜索引擎进行搜索"""
</code_context>
<issue_to_address>
**suggestion:** Normalize the timeout value inside `_web_search_exa` for consistency and safety.

Other helpers (`_get_from_url`, `_web_search_tavily`, `_extract_tavily`, `_web_search_bocha`, `_extract_exa`, `_find_similar_exa`) all enforce a minimum 30s timeout internally, while `_web_search_exa` relies on its caller (`search_from_exa`) to clamp the value. If `_web_search_exa` is reused elsewhere, it may see much smaller timeouts and behave inconsistently. Please add the same `if timeout < 30: timeout = 30` guard at the top of `_web_search_exa` to align behavior and avoid unexpectedly short timeouts.

Suggested implementation:

```python
    ) -> list[SearchResult]:
        """使用 Exa 搜索引擎进行搜索"""
        if timeout < 30:
            timeout = 30

```

If the `_web_search_exa` signature or docstring differ slightly (e.g., different Chinese text or no docstring), adjust the SEARCH pattern to match the actual function header and insert:

    if timeout < 30:
        timeout = 30

as the first statement in the function body, immediately after any docstring, to keep behavior consistent with `_get_from_url`, `_web_search_tavily`, `_extract_tavily`, `_web_search_bocha`, `_extract_exa`, and `_find_similar_exa`.
</issue_to_address>

### Comment 2
<location path="astrbot/builtin_stars/web_searcher/main.py" line_range="68" />
<code_context>
         """清理文本,去除空格、换行符等"""
         return text.strip().replace("\n", " ").replace("\r", " ").replace("  ", " ")

-    async def _get_from_url(self, url: str) -> str:
+    async def _get_from_url(self, url: str, timeout: int = 30) -> str:
         """获取网页内容"""
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting shared helpers for timeout handling, base-URL construction, and Exa HTTP requests to remove duplication and make the web search code easier to maintain.

You can keep all the new functionality while cutting a lot of duplication with a few small helpers. The main hot spots are timeout handling, base URL construction, and Exa HTTP calls.

### 1. Centralize timeout normalization

The `if timeout < 30: timeout = 30` pattern is repeated many times.

Add a helper:

```python
def _normalize_timeout(self, timeout: int | None, minimum: int = 30) -> aiohttp.ClientTimeout:
    if timeout is None:
        timeout = minimum
    elif timeout < minimum:
        timeout = minimum
    return aiohttp.ClientTimeout(total=timeout)
```

Then use it at call sites instead of repeating the logic:

```python
async def _web_search_tavily(self, cfg: AstrBotConfig, payload: dict, timeout: int = 30) -> list[SearchResult]:
    tavily_key = await self._get_tavily_key(cfg)
    base_url = self._tavily_base_url(cfg)
    url = f"{base_url}/search"
    header = {
        "Authorization": f"Bearer {tavily_key}",
        "Content-Type": "application/json",
    }
    timeout_obj = self._normalize_timeout(timeout)
    async with aiohttp.ClientSession(trust_env=True) as session:
        async with session.post(url, json=payload, headers=header, timeout=timeout_obj) as response:
            ...
```

And for tools you can drop the inline clamp:

```python
@llm_tool(name="fetch_url")
async def fetch_website_content(self, event: AstrMessageEvent, url: str, timeout: int = 30) -> str:
    timeout_obj = self._normalize_timeout(timeout)
    resp = await self._get_from_url(url, timeout_obj.total)
    return resp
```

(or just pass `timeout_obj` through if you adjust `_get_from_url`).

### 2. Extract base URL helpers for providers

The Tavily and Exa base URL logic is repeated.

Add:

```python
def _tavily_base_url(self, cfg: AstrBotConfig) -> str:
    return (
        cfg.get("provider_settings", {})
        .get("websearch_tavily_base_url", "https://api.tavily.com")
        .rstrip("/")
    )

def _exa_base_url(self, cfg: AstrBotConfig) -> str:
    return (
        cfg.get("provider_settings", {})
        .get("websearch_exa_base_url", "https://api.exa.ai")
        .rstrip("/")
    )
```

Then simplify call sites:

```python
base_url = self._tavily_base_url(cfg)
url = f"{base_url}/search"
```

```python
base_url = self._exa_base_url(cfg)
url = f"{base_url}/contents"
```

This removes duplication and keeps provider-specific config in one place.

### 3. Consolidate Exa HTTP request logic

`_web_search_exa`, `_extract_exa`, and `_find_similar_exa` all repeat the same HTTP boilerplate. You can pull that out into one internal helper that deals with key retrieval, base URL, headers, timeout, and error handling:

```python
async def _exa_request(
    self,
    cfg: AstrBotConfig,
    path: str,
    payload: dict,
    timeout: int = 30,
) -> dict:
    exa_key = await self._get_exa_key(cfg)
    base_url = self._exa_base_url(cfg)
    url = f"{base_url}/{path.lstrip('/')}"
    header = {
        "x-api-key": exa_key,
        "Content-Type": "application/json",
    }
    timeout_obj = self._normalize_timeout(timeout)

    async with aiohttp.ClientSession(trust_env=True) as session:
        async with session.post(url, json=payload, headers=header, timeout=timeout_obj) as response:
            if response.status != 200:
                reason = await response.text()
                raise Exception(
                    f"Exa request to {path} failed: {reason}, status: {response.status}",
                )
            return await response.json()
```

Then each high-level method only shapes payload and maps results:

```python
async def _web_search_exa(
    self,
    cfg: AstrBotConfig,
    payload: dict,
    timeout: int = 30,
) -> list[SearchResult]:
    data = await self._exa_request(cfg, "search", payload, timeout=timeout)
    results: list[SearchResult] = []
    for item in data.get("results", []):
        results.append(
            SearchResult(
                title=item.get("title", ""),
                url=item.get("url", ""),
                snippet=(item.get("text") or "")[:500],
            )
        )
    return results
```

```python
async def _extract_exa(
    self, cfg: AstrBotConfig, payload: dict, timeout: int = 30
) -> list[dict]:
    data = await self._exa_request(cfg, "contents", payload, timeout=timeout)
    results: list[dict] = data.get("results", [])
    if not results:
        raise ValueError("Error: Exa content extraction does not return any results.")
    return results
```

```python
async def _find_similar_exa(
    self, cfg: AstrBotConfig, payload: dict, timeout: int = 30
) -> list[SearchResult]:
    data = await self._exa_request(cfg, "findSimilar", payload, timeout=timeout)
    results: list[SearchResult] = []
    for item in data.get("results", []):
        results.append(
            SearchResult(
                title=item.get("title", ""),
                url=item.get("url", ""),
                snippet=(item.get("text") or "")[:500],
            )
        )
    return results
```

This way, if you change headers, auth, or error handling, you only touch `_exa_request`.

### 4. Optional: small helpers for repeated validation

If you want to further simplify the public tool methods, a couple of tiny validators can keep them focused on behavior rather than plumbing.

For example, Exa config check and clamping:

```python
def _ensure_exa_config(self, cfg: AstrBotConfig) -> None:
    if not cfg.get("provider_settings", {}).get("websearch_exa_key", []):
        raise ValueError("Error: Exa API key is not configured in AstrBot.")

def _clamp_results(self, value: int, minimum: int, maximum: int) -> int:
    return max(minimum, min(value, maximum))
```

Usage:

```python
@llm_tool("web_search_exa")
async def search_from_exa(..., max_results: int = 10, ...):
    ...
    cfg = self.context.get_config(umo=event.unified_msg_origin)
    self._ensure_exa_config(cfg)

    max_results = self._clamp_results(max_results, 1, 100)
    ...
```

These changes keep all functionality (timeouts, base URLs, Exa/Tavily features) but reduce repetition and make future changes safer and easier.
</issue_to_address>

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces the Exa search provider, adding tools for semantic search, web page extraction, and finding similar links. It also adds support for configurable base URLs for Tavily and Exa, and implements a minimum 30-second timeout across various search and extraction tools. Feedback includes addressing a potential IndexError during Exa API key rotation, reusing aiohttp sessions for efficiency, improving error handling when no extraction results are found, and refactoring tool management logic to reduce duplication.

Contributor

Copilot AI left a comment


Pull request overview

This PR extends AstrBot’s web search capabilities by adding an Exa provider (semantic search + extraction + similar-page discovery), making Tavily/Exa API base URLs configurable for proxy/self-hosted endpoints, and updating the dashboard and docs to reflect the expanded toolchain and citation parsing.

Changes:

  • Add Exa as a new websearch_provider, including web_search_exa, exa_extract_web_page, and exa_find_similar LLM tools.
  • Make Tavily/Exa API Base URL configurable and thread it through web search + URL extraction/KB upload flows.
  • Update WebUI citation/ref parsing and expand websearch documentation (ZH/EN) plus config metadata i18n.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| docs/zh/use/websearch.md | Expands ZH docs for default/Tavily/Exa/BoCha/Baidu tool parameters and configuration. |
| docs/en/use/websearch.md | Expands EN docs for provider options, tool parameters, and Base URL configuration. |
| dashboard/src/i18n/locales/zh-CN/features/config-metadata.json | Adds metadata strings for Tavily/Exa Base URL + Exa key. |
| dashboard/src/i18n/locales/en-US/features/config-metadata.json | Adds metadata strings for Tavily/Exa Base URL + Exa key. |
| dashboard/src/i18n/locales/ru-RU/features/config-metadata.json | Adds metadata strings for Tavily/Exa Base URL + Exa key. |
| dashboard/src/components/chat/MessageList.vue | Updates supported tool-call parsing to recognize Exa + findSimilar results for refs. |
| astrbot/dashboard/routes/live_chat.py | Extends supported tool list for extracting `<ref>` citations (Exa + findSimilar). |
| astrbot/dashboard/routes/chat.py | Extends supported tool list for extracting `<ref>` citations (Exa + findSimilar). |
| astrbot/core/knowledge_base/parsers/url_parser.py | Adds Tavily Base URL support to the KB URL extractor wrapper. |
| astrbot/core/knowledge_base/kb_helper.py | Plumbs Tavily Base URL into KB "upload from URL" extraction. |
| astrbot/core/config/default.py | Adds new provider settings defaults + metadata for Exa and Base URLs. |
| astrbot/core/astr_agent_hooks.py | Extends webchat citation-injection logic to Exa + findSimilar tools. |
| astrbot/builtin_stars/web_searcher/main.py | Implements Exa tools, adds configurable Base URLs, and adds per-tool optional timeout support. |

piexian and others added 3 commits April 5, 2026 00:36
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Refactor web_search_utils.py into a layered structure; add build_web_search_refs() and _extract_ref_indices() to pull reference indices from <ref> tags
- Simplify the ref extraction in chat.py / live_chat.py to a call to build_web_search_refs()
- Add getMessageRefs() to MessageList.vue so the frontend can fall back to extracting refs itself when the backend returns none
- Fix the message-save condition check in chat.py
@dosubot dosubot Bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Apr 4, 2026
@piexian piexian requested a review from Copilot April 6, 2026 03:28
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 5 comments.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 370167fb39



@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 96e15f79ad


```python
ret_ls = []
for result in results:
    ret_ls.append(f"URL: {result.get('url', 'No URL')}")
    text = await self._tidy_text(result.get("text", "No content"))
```

P2: Handle null text before tidying Exa extraction output

exa_extract_web_page passes result.get("text", "No content") directly into _tidy_text, but Exa responses can include "text": null (or other non-string values) for pages it cannot extract. In that case _tidy_text calls .strip() on a non-string and raises, causing the whole tool call to fail instead of returning partial results. Please coerce text to a safe string before calling _tidy_text.
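A guard along the lines the comment asks for (a sketch; the surrounding loop body is from the excerpt above):

```python
# Coerce null / non-string "text" values to a safe string before tidying.
raw_text = result.get("text")
text = await self._tidy_text(raw_text if isinstance(raw_text, str) else "No content")
```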


@zouyonghe
Member

@sourcery-ai review

Contributor

@sourcery-ai sourcery-ai Bot left a comment


Hey - I've found 2 issues, and left some high level feedback:

  • WEB_SEARCH_REFERENCE_TOOLS is now defined separately in Python (web_search_utils.py) and in the frontend (MessageList.vue); consider centralizing this list or generating the frontend list from the backend to avoid drift when adding/removing tools.
  • _normalize_timeout unconditionally casts timeout to int, so values like 30.9 or large numeric strings will be truncated rather than validated; if you expect non-integer or very large values, you might want stricter range checking and clearer error handling instead of silently coercing.

## Individual Comments

### Comment 1
<location path="astrbot/core/tools/web_search_tools.py" line_range="648-657" />
<code_context>
+        max_results = max(1, min(int(kwargs.get("max_results", 10)), 100))
</code_context>
<issue_to_address>
**issue (bug_risk):** Unvalidated int() casts for max_results can raise on malformed input and abort tool execution.

Both `ExaWebSearchTool` and `ExaFindSimilarTool` call `int(kwargs.get("max_results", 10))` directly, so a non-integer value (e.g. "10.5", "foo", or an unexpected type) will raise `ValueError` and abort the tool call. To match how other user-supplied numerics are handled, consider wrapping this in a helper or local try/except that falls back to 10 and then applies the 1–100 clamp, so invalid input degrades gracefully instead of throwing.
</issue_to_address>

### Comment 2
<location path="tests/unit/test_web_search_utils.py" line_range="96-105" />
<code_context>
+    assert [ref["index"] for ref in refs["used"]] == ["a152.1", "a152.2"]
+
+
+@pytest.mark.parametrize(
+    ("base_url", "expected_message"),
+    [
+        (
+            "exa.ai/search",
+            "Error: Exa API Base URL must start with http:// or https://. "
+            "Proxy base paths are allowed. Received: 'exa.ai/search'.",
+        ),
+    ],
+)
+def test_normalize_web_search_base_url_reports_invalid_value(
+    base_url: str, expected_message: str
+):
+    with pytest.raises(ValueError) as exc_info:
+        normalize_web_search_base_url(
+            base_url,
+            default="https://api.exa.ai",
+            provider_name="Exa",
+        )
+
</code_context>
<issue_to_address>
**suggestion (testing):** Extend normalize_web_search_base_url tests to cover empty/None values and multiple invalid schemes.

Key branches still aren’t covered:

- `base_url` is `None` or an empty string → should return the trimmed `default` value. Please add a parametrized test for this.
- Additional invalid values like `"ftp://exa.ai"`, `"https:///only-path"`, or `"http://"` (no netloc) → should raise `ValueError` with the same message format.

These cases will complete coverage for scheme/netloc validation and default handling.
</issue_to_address>


@dosubot dosubot Bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:XXL This PR changes 1000+ lines, ignoring generated files. labels Apr 19, 2026
@piexian piexian requested a review from Copilot April 19, 2026 07:11
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

```python
if search_type not in ("auto", "neural", "fast", "instant", "deep"):
    search_type = "auto"

max_results = max(1, min(int(kwargs.get("max_results", 10)), 100))
```

Copilot AI Apr 19, 2026


max_results is parsed with int(kwargs.get("max_results", 10)) without error handling. Tool arguments come from the model and may be strings/invalid; a ValueError here will bubble up and abort the tool execution. Handle non-numeric values gracefully (e.g., try/except with a default, or validate and return a clear error).

Suggested change:

```diff
-max_results = max(1, min(int(kwargs.get("max_results", 10)), 100))
+raw_max_results = kwargs.get("max_results", 10)
+try:
+    parsed_max_results = int(raw_max_results)
+except (TypeError, ValueError):
+    parsed_max_results = 10
+max_results = max(1, min(parsed_max_results, 100))
```

Comment on lines +783 to +787

```python
results = await _exa_find_similar(
    provider_settings,
    {
        "url": url,
        "numResults": max(1, min(int(kwargs.get("max_results", 10)), 100)),
```

Copilot AI Apr 19, 2026


numResults uses int(kwargs.get("max_results", 10)) without guarding against non-numeric input. Since tool args are model-provided, invalid types can raise ValueError and crash the tool call. Wrap the conversion in try/except (defaulting/clamping), or validate and return a tool error message.

Suggested change:

```diff
-results = await _exa_find_similar(
-    provider_settings,
-    {
-        "url": url,
-        "numResults": max(1, min(int(kwargs.get("max_results", 10)), 100)),
+try:
+    max_results = int(kwargs.get("max_results", 10))
+except (TypeError, ValueError):
+    return "Error: max_results must be an integer."
+max_results = max(1, min(max_results, 100))
+results = await _exa_find_similar(
+    provider_settings,
+    {
+        "url": url,
+        "numResults": max_results,
```
Comment on lines +148 to +152

```python
return normalize_web_search_base_url(
    provider_settings.get("websearch_tavily_base_url"),
    default="https://api.tavily.com",
    provider_name="Tavily",
)
```

Copilot AI Apr 19, 2026


_get_tavily_base_url() can raise ValueError via normalize_web_search_base_url(). That exception will propagate out of _tavily_search() and can abort the tool execution (and potentially the whole agent run) instead of returning a normal tool error result. Consider catching ValueError at the call site (or inside _get_tavily_base_url/_tavily_search) and returning a user-facing error string (or falling back to the default base URL with a warning).

Suggested change:

```diff
-return normalize_web_search_base_url(
-    provider_settings.get("websearch_tavily_base_url"),
-    default="https://api.tavily.com",
-    provider_name="Tavily",
-)
+default_base_url = "https://api.tavily.com"
+try:
+    return normalize_web_search_base_url(
+        provider_settings.get("websearch_tavily_base_url"),
+        default=default_base_url,
+        provider_name="Tavily",
+    )
+except ValueError as exc:
+    logger.warning(
+        "Invalid Tavily base URL configuration %r; falling back to %s. Error: %s",
+        provider_settings.get("websearch_tavily_base_url"),
+        default_base_url,
+        exc,
+    )
+    return default_base_url
```

Comment on lines +156 to +160

```python
return normalize_web_search_base_url(
    provider_settings.get("websearch_exa_base_url"),
    default="https://api.exa.ai",
    provider_name="Exa",
)
```

Copilot AI Apr 19, 2026


_get_exa_base_url() can raise ValueError via normalize_web_search_base_url(). Since callers (_exa_search/_exa_extract/_exa_find_similar) don't handle it, an invalid configured base URL can raise out of the tool call and abort execution. Catch ValueError and surface a stable tool error message (or fall back to the default base URL) so misconfiguration doesn't crash the run.

Suggested change:

```diff
-return normalize_web_search_base_url(
-    provider_settings.get("websearch_exa_base_url"),
-    default="https://api.exa.ai",
-    provider_name="Exa",
-)
+default_base_url = "https://api.exa.ai"
+try:
+    return normalize_web_search_base_url(
+        provider_settings.get("websearch_exa_base_url"),
+        default=default_base_url,
+        provider_name="Exa",
+    )
+except ValueError as exc:
+    logger.warning(
+        "Invalid Exa API Base URL configured; falling back to default base URL %s: %s",
+        default_base_url,
+        exc,
+    )
+    return default_base_url
```

@Soulter Soulter force-pushed the master branch 2 times, most recently from faf411f to 0068960 Compare April 19, 2026 09:50
Member

@zouyonghe zouyonghe left a comment


Thanks for the work here. Exa support and the ref-handling cleanup are useful, but I don't think this is ready to merge yet.

I found three issues that should be fixed first:

  1. Base URL handling is currently too permissive.
    normalize_web_search_base_url() now accepts endpoint URLs such as https://api.exa.ai/search, but the call sites still append endpoint suffixes themselves (/search, /contents, /findSimilar, /extract). That can produce broken URLs like .../search/search and .../extract/extract. This affects both the new Exa path and the Tavily extraction path. (A sketch of the intended suffix check follows this comment.)

  2. search_type allows an unsupported Exa value.
    ExaWebSearchTool currently accepts instant, but Exa's official docs list only auto, neural, fast, and deep for the search type. If instant is passed through, the upstream API will likely reject it.

  3. BochaWebSearchTool lost its builtin config registration.
    In this PR it is registered with plain @builtin_tool, while master uses @builtin_tool(config=_BOCHA_WEB_SEARCH_TOOL_CONFIG). That regresses builtin config status/tag rendering in the dashboard for Bocha.

CI is green, but the current tests do not cover these regression surfaces. I suggest fixing the above and adding focused tests for:

  • rejecting or correctly handling endpoint-level base URLs
  • the Exa search_type allowlist
  • Bocha builtin config metadata registration

After those are addressed, I think this can be re-reviewed quickly.
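For illustration, a sketch of the suffix rejection item 1 asks for, reusing the disallowed_path_suffixes parameter that a later commit in this PR adds and the error-message style from its tests (the actual implementation may differ):

```python
from urllib.parse import urlparse

def normalize_web_search_base_url(
    base_url: str | None,
    *,
    default: str,
    provider_name: str = "web search",
    disallowed_path_suffixes: tuple[str, ...] = (),
) -> str:
    value = (base_url or "").strip() or default
    parsed = urlparse(value)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(
            f"Error: {provider_name} API Base URL must start with http:// or https://. "
            f"Proxy base paths are allowed. Received: {value!r}."
        )
    normalized = value.rstrip("/")
    # Reject endpoint-level URLs so call sites can safely append /search, /contents, etc.
    for suffix in disallowed_path_suffixes:
        if normalized.endswith(suffix):
            raise ValueError(
                f"Error: {provider_name} API Base URL must not end with {suffix!r}; "
                "configure the base URL without the endpoint path."
            )
    return normalized
```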

@piexian
Contributor Author

piexian commented Apr 20, 2026

@zouyonghe On item 2: I re-checked Exa's current official docs before changing the allowlist. As of April 20, 2026, instant is documented as a supported search type, and the current public docs also list deep-lite and deep-reasoning.

Official links:

Both pages currently list auto, fast, deep, deep-lite, deep-reasoning, instant, and neural. So I'm not removing instant; instead I'm updating the allowlist to match the current official Exa docs and adding focused tests around accepted and fallback values. I'm still addressing items 1 and 3 separately.

@piexian
Contributor Author

piexian commented Apr 20, 2026

Replying to review #7359 (review), specifically item 2 (search_type / instant):

I re-checked Exa's current official docs before changing the allowlist. As of April 20, 2026, instant is documented as a supported search type, and the public docs also list deep-lite and deep-reasoning.

Official links:

So I'm not removing instant based on the current upstream docs. Instead, I'm aligning the allowlist with the current Exa documentation and adding focused tests around accepted values and fallback behavior. Items 1 and 3 are being fixed separately.

… Exa search types

- Reject specific API endpoint paths (e.g., /search, /extract) in base URL
  normalization via new disallowed_path_suffixes parameter to prevent
  misconfiguration errors
- Add deep-lite and deep-reasoning to valid Exa search types and normalize
  search_type input before validation
- Add missing config parameter to BochaWebSearchTool builtin_tool decorator
  so provider status checks are properly registered
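A sketch of the search-type normalization the commit message describes (_EXA_SEARCH_TYPES is named in the review discussion; the helper name is assumed):

```python
# Allowlist per the Exa docs cited by the author; unknown values fall back to
# "auto", matching the existing `search_type = "auto"` fallback in this PR.
_EXA_SEARCH_TYPES = {
    "auto", "neural", "fast", "instant", "deep", "deep-lite", "deep-reasoning",
}

def _normalize_exa_search_type(value: str | None) -> str:
    normalized = (value or "").strip().lower()
    return normalized if normalized in _EXA_SEARCH_TYPES else "auto"
```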
@zouyonghe
Member

@sourcery-ai review

Contributor

@sourcery-ai sourcery-ai Bot left a comment


Hey - I've found 4 issues, and left some high level feedback:

  • The allowed Exa search_type values and category options are hard-coded in both _EXA_SEARCH_TYPES and the ExaWebSearchTool.parameters description; consider centralizing these in shared constants (and reusing in the schema description) to avoid drift between validation and documentation.
  • normalize_web_search_base_url can now raise ValueError during request construction (e.g., _get_exa_base_url, _get_tavily_base_url, URLExtractor.__init__); if you expect misconfiguration in production, consider catching this closer to configuration load or surfacing a clearer, provider-level error instead of letting it bubble up as a runtime exception.

## Individual Comments

### Comment 1
<location path="tests/unit/test_web_search_utils.py" line_range="96-97" />
<code_context>
+
+
+@pytest.mark.asyncio
+@pytest.mark.parametrize(
+    ("search_type", "expected"),
+    [
</code_context>
<issue_to_address>
**suggestion (testing):** Add coverage for defaulting behavior of `normalize_web_search_base_url` when `base_url` is `None` or empty.

Current tests cover invalid schemes and disallowed suffixes. Please also add a parametrized test that passes `base_url=None` and `base_url="  "` and asserts that `normalize_web_search_base_url` returns the normalized `default` value. This will ensure callers like `_get_tavily_base_url`, `_get_exa_base_url`, and `URLExtractor` can rely on the default when the field is left blank.

```suggestion
@pytest.mark.parametrize(
    ("base_url", "default"),
    [
        (None, "https://exa.ai"),
        ("   ", "https://exa.ai"),
    ],
)
def test_normalize_web_search_base_url_defaults_when_blank(
    base_url, default: str
):
    normalized = normalize_web_search_base_url(base_url, default=default)

    # Should return the normalized value of the provided default when base_url is blank/None
    assert normalized == normalize_web_search_base_url(default, default=default)


@pytest.mark.parametrize(
    ("base_url", "expected_message"),
```
</issue_to_address>

### Comment 2
<location path="docs/en/use/websearch.md" line_range="43" />
<code_context>
+
+Go to [Exa](https://dashboard.exa.ai) to get an API Key, then fill it in the corresponding configuration item.
+
 If you use Tavily as your web search source, you will get a better experience optimization on AstrBot ChatUI, including citation source display and more:

 ![](https://files.astrbot.app/docs/source/images/websearch/image1.png)
</code_context>
<issue_to_address>
**suggestion (typo):** The phrase "a better experience optimization" reads unnaturally; a more idiomatic wording is recommended.

For example, "a better optimized experience" or simply "a better experience", as in: "you will get a better (optimized) experience on AstrBot ChatUI".

```suggestion
If you use Tavily as your web search source, you will get a better experience on AstrBot ChatUI, including citation source display and more:
```
</issue_to_address>

### Comment 3
<location path="astrbot/core/tools/web_search_tools.py" line_range="384" />
<code_context>
             ]


+async def _exa_search(
+    provider_settings: dict,
+    payload: dict,
</code_context>
<issue_to_address>
**issue (complexity):** Consider introducing shared helper functions for Exa HTTP calls and POST+timeout handling to remove duplication while keeping behavior unchanged.

You can reduce the new complexity without changing behavior by introducing two small helpers:

1. **Unify Exa HTTP calls**

The three Exa helpers are almost identical except for path, action, and result mapping. A small internal helper keeps all behavior but removes repetition and makes future changes safer:

```python
async def _exa_request(
    provider_settings: dict,
    path: str,
    payload: dict,
    action: str,
    timeout: int = MIN_WEB_SEARCH_TIMEOUT,
) -> dict:
    exa_key = await _EXA_KEY_ROTATOR.get(provider_settings)
    url = f"{_get_exa_base_url(provider_settings)}{path}"
    headers = {
        "x-api-key": exa_key,
        "Content-Type": "application/json",
    }
    async with aiohttp.ClientSession(trust_env=True) as session:
        async with session.post(
            url,
            json=payload,
            headers=headers,
            timeout=_normalize_timeout(timeout),
        ) as response:
            if response.status != 200:
                reason = await response.text()
                raise Exception(
                    _format_provider_request_error(
                        "Exa", action, url, reason, response.status
                    )
                )
            return await response.json()
```

Then the three public helpers become focused on payload + mapping:

```python
async def _exa_search(
    provider_settings: dict,
    payload: dict,
    timeout: int = MIN_WEB_SEARCH_TIMEOUT,
) -> list[SearchResult]:
    data = await _exa_request(
        provider_settings,
        "/search",
        payload,
        action="web search",
        timeout=timeout,
    )
    return [
        SearchResult(
            title=item.get("title", ""),
            url=item.get("url", ""),
            snippet=(item.get("text") or "")[:500],
        )
        for item in data.get("results", [])
    ]


async def _exa_extract(
    provider_settings: dict,
    payload: dict,
    timeout: int = MIN_WEB_SEARCH_TIMEOUT,
) -> list[dict]:
    data = await _exa_request(
        provider_settings,
        "/contents",
        payload,
        action="content extraction",
        timeout=timeout,
    )
    return data.get("results", [])


async def _exa_find_similar(
    provider_settings: dict,
    payload: dict,
    timeout: int = MIN_WEB_SEARCH_TIMEOUT,
) -> list[SearchResult]:
    data = await _exa_request(
        provider_settings,
        "/findSimilar",
        payload,
        action="find similar",
        timeout=timeout,
    )
    return [
        SearchResult(
            title=item.get("title", ""),
            url=item.get("url", ""),
            snippet=(item.get("text") or "")[:500],
        )
        for item in data.get("results", [])
    ]
```

2. **Deduplicate timeout plumbing for HTTP calls**

You now repeat `timeout=_normalize_timeout(timeout)` in many places. A very small wrapper keeps the new timeout behavior but centralizes it:

```python
async def _post_json(
    session: aiohttp.ClientSession,
    url: str,
    *,
    json: dict,
    headers: dict,
    timeout: int | float | str | None = None,
):
    return await session.post(
        url,
        json=json,
        headers=headers,
        timeout=_normalize_timeout(timeout),
    )
```

Usage example (Tavily/BoCha/Baidu/Exa):

```python
async with aiohttp.ClientSession(trust_env=True) as session:
    async with _post_json(
        session,
        url,
        json=payload,
        headers=header,
        timeout=timeout,
    ) as response:
        ...
```

This keeps all existing behavior (including the minimum timeout and trust_env) but removes a lot of repeated arguments and makes future changes to timeout handling or session usage localized.
</issue_to_address>

### Comment 4
<location path="astrbot/core/utils/web_search_utils.py" line_range="49" />
<code_context>
+    return normalized
+
+
+def _iter_web_search_result_items(
+    accumulated_parts: list[dict[str, Any]],
+):
</code_context>
<issue_to_address>
**issue (complexity):** Consider inlining the small helper functions and simplifying the adapter function to reduce indirection and keep closely related logic together.

You can trim a layer or two without losing clarity or behavior.

### 1. Inline `_iter_web_search_result_items` into `collect_web_search_ref_items`

`_iter_web_search_result_items` is only used once and its logic is straightforward. Inlining it makes the flow easier to follow without losing readability:

```python
def collect_web_search_ref_items(
    accumulated_parts: list[dict[str, Any]],
    favicon_cache: dict[str, str] | None = None,
) -> list[dict[str, Any]]:
    web_search_refs: list[dict[str, Any]] = []
    seen_indices: set[str] = set()

    for part in accumulated_parts:
        if part.get("type") != "tool_call" or not part.get("tool_calls"):
            continue

        for tool_call in part["tool_calls"]:
            if (
                tool_call.get("name") not in WEB_SEARCH_REFERENCE_TOOLS
                or not tool_call.get("result")
            ):
                continue

            result = tool_call["result"]
            try:
                result_data = json.loads(result) if isinstance(result, str) else result
            except json.JSONDecodeError:
                continue

            if not isinstance(result_data, dict):
                continue

            for item in result_data.get("results", []):
                if not isinstance(item, dict):
                    continue

                ref_index = item.get("index")
                if not ref_index or ref_index in seen_indices:
                    continue

                payload = {
                    "index": ref_index,
                    "url": item.get("url"),
                    "title": item.get("title"),
                    "snippet": item.get("snippet"),
                }
                if favicon_cache and payload["url"] in favicon_cache:
                    payload["favicon"] = favicon_cache[payload["url"]]

                web_search_refs.append(payload)
                seen_indices.add(ref_index)

    return web_search_refs
```

That removes one indirection while keeping the logic in a single, obvious place.

### 2. Collapse `_extract_ref_indices` into `build_web_search_refs`

The regex + ordering behavior is easier to understand if the extraction and selection live together:

```python
def build_web_search_refs(
    accumulated_text: str,
    accumulated_parts: list[dict[str, Any]],
    favicon_cache: dict[str, str] | None = None,
) -> dict:
    ordered_refs = collect_web_search_ref_items(accumulated_parts, favicon_cache)
    if not ordered_refs:
        return {}

    refs_by_index = {ref["index"]: ref for ref in ordered_refs}

    # Inline _extract_ref_indices
    ref_indices: list[str] = []
    seen_indices: set[str] = set()
    for match in re.finditer(r"<ref>(.*?)</ref>", accumulated_text):
        ref_index = match.group(1).strip()
        if not ref_index or ref_index in seen_indices:
            continue
        ref_indices.append(ref_index)
        seen_indices.add(ref_index)

    used_refs = [refs_by_index[idx] for idx in ref_indices if idx in refs_by_index]
    if not used_refs:
        used_refs = ordered_refs

    return {"used": used_refs}
```

This keeps the “how indices are extracted” and the “how refs are ordered/fallback” logic co-located.

### 3. Simplify `collect_web_search_results` (or consider removing it)

If you still need `collect_web_search_results`, it can be a very thin adapter, making the relationship to `collect_web_search_ref_items` obvious:

```python
def collect_web_search_results(accumulated_parts: list[dict[str, Any]]) -> dict:
    return {
        ref["index"]: {
            "url": ref.get("url"),
            "title": ref.get("title"),
            "snippet": ref.get("snippet"),
        }
        for ref in collect_web_search_ref_items(accumulated_parts)
    }
```

If the only caller is effectively doing this transformation, you could alternatively have that caller perform the mapping itself and drop `collect_web_search_results` entirely to reduce the public API surface.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines +96 to +97
@pytest.mark.parametrize(
("base_url", "expected_message"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Add coverage for defaulting behavior of normalize_web_search_base_url when base_url is None or empty.

Current tests cover invalid schemes and disallowed suffixes. Please also add a parametrized test that passes base_url=None and base_url=" " and asserts that normalize_web_search_base_url returns the normalized default value. This will ensure callers like _get_tavily_base_url, _get_exa_base_url, and URLExtractor can rely on the default when the field is left blank.

Suggested change
@pytest.mark.parametrize(
("base_url", "expected_message"),
@pytest.mark.parametrize(
("base_url", "default"),
[
(None, "https://exa.ai"),
(" ", "https://exa.ai"),
],
)
def test_normalize_web_search_base_url_defaults_when_blank(
base_url, default: str
):
normalized = normalize_web_search_base_url(base_url, default=default)
# Should return the normalized value of the provided default when base_url is blank/None
assert normalized == normalize_web_search_base_url(default, default=default)
@pytest.mark.parametrize(
("base_url", "expected_message"),

Comment thread docs/en/use/websearch.md

Go to [Exa](https://dashboard.exa.ai) to get an API Key, then fill it in the corresponding configuration item.

If you use Tavily as your web search source, you will get a better experience optimization on AstrBot ChatUI, including citation source display and more:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (typo): “a better experience optimization” 这一表达不太自然,建议调整为更地道的英语搭配。

建议将该短语改为更自然的表达,例如使用 “a better optimized experience” 或直接 “a better experience”,如:"you will get a better (optimized) experience on AstrBot ChatUI"。

Suggested change
If you use Tavily as your web search source, you will get a better experience optimization on AstrBot ChatUI, including citation source display and more:
If you use Tavily as your web search source, you will get a better experience on AstrBot ChatUI, including citation source display and more:

]


async def _exa_search(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (complexity): Consider introducing shared helper functions for Exa HTTP calls and POST+timeout handling to remove duplication while keeping behavior unchanged.

You can reduce the new complexity without changing behavior by introducing two small helpers:

  1. Unify Exa HTTP calls

The three Exa helpers are almost identical except for path, action, and result mapping. A small internal helper keeps all behavior but removes repetition and makes future changes safer:

async def _exa_request(
    provider_settings: dict,
    path: str,
    payload: dict,
    action: str,
    timeout: int = MIN_WEB_SEARCH_TIMEOUT,
) -> dict:
    exa_key = await _EXA_KEY_ROTATOR.get(provider_settings)
    url = f"{_get_exa_base_url(provider_settings)}{path}"
    headers = {
        "x-api-key": exa_key,
        "Content-Type": "application/json",
    }
    async with aiohttp.ClientSession(trust_env=True) as session:
        async with session.post(
            url,
            json=payload,
            headers=headers,
            timeout=_normalize_timeout(timeout),
        ) as response:
            if response.status != 200:
                reason = await response.text()
                raise Exception(
                    _format_provider_request_error(
                        "Exa", action, url, reason, response.status
                    )
                )
            return await response.json()

Then the three public helpers become focused on payload + mapping:

async def _exa_search(
    provider_settings: dict,
    payload: dict,
    timeout: int = MIN_WEB_SEARCH_TIMEOUT,
) -> list[SearchResult]:
    data = await _exa_request(
        provider_settings,
        "/search",
        payload,
        action="web search",
        timeout=timeout,
    )
    return [
        SearchResult(
            title=item.get("title", ""),
            url=item.get("url", ""),
            snippet=(item.get("text") or "")[:500],
        )
        for item in data.get("results", [])
    ]


async def _exa_extract(
    provider_settings: dict,
    payload: dict,
    timeout: int = MIN_WEB_SEARCH_TIMEOUT,
) -> list[dict]:
    data = await _exa_request(
        provider_settings,
        "/contents",
        payload,
        action="content extraction",
        timeout=timeout,
    )
    return data.get("results", [])


async def _exa_find_similar(
    provider_settings: dict,
    payload: dict,
    timeout: int = MIN_WEB_SEARCH_TIMEOUT,
) -> list[SearchResult]:
    data = await _exa_request(
        provider_settings,
        "/findSimilar",
        payload,
        action="find similar",
        timeout=timeout,
    )
    return [
        SearchResult(
            title=item.get("title", ""),
            url=item.get("url", ""),
            snippet=(item.get("text") or "")[:500],
        )
        for item in data.get("results", [])
    ]
  1. Deduplicate timeout plumbing for HTTP calls

You now repeat timeout=_normalize_timeout(timeout) in many places. A very small wrapper keeps the new timeout behavior but centralizes it:

async def _post_json(
    session: aiohttp.ClientSession,
    url: str,
    *,
    json: dict,
    headers: dict,
    timeout: int | float | str | None = None,
):
    return await session.post(
        url,
        json=json,
        headers=headers,
        timeout=_normalize_timeout(timeout),
    )

Usage example (Tavily/BoCha/Baidu/Exa):

async with aiohttp.ClientSession(trust_env=True) as session:
    async with _post_json(
        session,
        url,
        json=payload,
        headers=header,
        timeout=timeout,
    ) as response:
        ...

This keeps all existing behavior (including the minimum timeout and trust_env) but removes a lot of repeated arguments and makes future changes to timeout handling or session usage localized.

return normalized


def _iter_web_search_result_items(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (complexity): Consider inlining the small helper functions and simplifying the adapter function to reduce indirection and keep closely related logic together.

You can trim a layer or two without losing clarity or behavior.

1. Inline _iter_web_search_result_items into collect_web_search_ref_items

_iter_web_search_result_items is only used once and its logic is straightforward. Inlining it makes the flow easier to follow without losing readability:

def collect_web_search_ref_items(
    accumulated_parts: list[dict[str, Any]],
    favicon_cache: dict[str, str] | None = None,
) -> list[dict[str, Any]]:
    web_search_refs: list[dict[str, Any]] = []
    seen_indices: set[str] = set()

    for part in accumulated_parts:
        if part.get("type") != "tool_call" or not part.get("tool_calls"):
            continue

        for tool_call in part["tool_calls"]:
            if (
                tool_call.get("name") not in WEB_SEARCH_REFERENCE_TOOLS
                or not tool_call.get("result")
            ):
                continue

            result = tool_call["result"]
            try:
                result_data = json.loads(result) if isinstance(result, str) else result
            except json.JSONDecodeError:
                continue

            if not isinstance(result_data, dict):
                continue

            for item in result_data.get("results", []):
                if not isinstance(item, dict):
                    continue

                ref_index = item.get("index")
                if not ref_index or ref_index in seen_indices:
                    continue

                payload = {
                    "index": ref_index,
                    "url": item.get("url"),
                    "title": item.get("title"),
                    "snippet": item.get("snippet"),
                }
                if favicon_cache and payload["url"] in favicon_cache:
                    payload["favicon"] = favicon_cache[payload["url"]]

                web_search_refs.append(payload)
                seen_indices.add(ref_index)

    return web_search_refs

That removes one indirection while keeping the logic in a single, obvious place.

2. Collapse _extract_ref_indices into build_web_search_refs

The regex + ordering behavior is easier to understand if the extraction and selection live together:

def build_web_search_refs(
    accumulated_text: str,
    accumulated_parts: list[dict[str, Any]],
    favicon_cache: dict[str, str] | None = None,
) -> dict:
    ordered_refs = collect_web_search_ref_items(accumulated_parts, favicon_cache)
    if not ordered_refs:
        return {}

    refs_by_index = {ref["index"]: ref for ref in ordered_refs}

    # Inline _extract_ref_indices
    ref_indices: list[str] = []
    seen_indices: set[str] = set()
    for match in re.finditer(r"<ref>(.*?)</ref>", accumulated_text):
        ref_index = match.group(1).strip()
        if not ref_index or ref_index in seen_indices:
            continue
        ref_indices.append(ref_index)
        seen_indices.add(ref_index)

    used_refs = [refs_by_index[idx] for idx in ref_indices if idx in refs_by_index]
    if not used_refs:
        used_refs = ordered_refs

    return {"used": used_refs}

This keeps the “how indices are extracted” and the “how refs are ordered/fallback” logic co-located.
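To make the extract-then-fallback behavior concrete, here is an illustrative call; the tool name, indices, and URLs are made up, and it assumes web_search_tavily is listed in WEB_SEARCH_REFERENCE_TOOLS:

parts = [
    {
        "type": "tool_call",
        "tool_calls": [
            {
                "name": "web_search_tavily",
                "result": json.dumps({
                    "results": [
                        {"index": "1", "url": "https://a.example", "title": "A", "snippet": "..."},
                        {"index": "2", "url": "https://b.example", "title": "B", "snippet": "..."},
                    ]
                }),
            }
        ],
    }
]

build_web_search_refs("see <ref>2</ref>", parts)
# -> {"used": [the ref with index "2" only]}

build_web_search_refs("no explicit refs in this text", parts)
# -> {"used": [both refs, in tool-result order]} (fallback when nothing matches)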

3. Simplify collect_web_search_results (or consider removing it)

If you still need collect_web_search_results, it can be a very thin adapter, making the relationship to collect_web_search_ref_items obvious:

def collect_web_search_results(accumulated_parts: list[dict[str, Any]]) -> dict:
    return {
        ref["index"]: {
            "url": ref.get("url"),
            "title": ref.get("title"),
            "snippet": ref.get("snippet"),
        }
        for ref in collect_web_search_ref_items(accumulated_parts)
    }

If the only caller is effectively doing this transformation, you could alternatively have that caller perform the mapping itself and drop collect_web_search_results entirely to reduce the public API surface.

@zouyonghe zouyonghe self-requested a review April 21, 2026 01:16
@zouyonghe
Member

Quick re-check on the latest head:

  • CI is green, and the earlier base URL / Bocha config issues look resolved.
  • I still do not think this is merge-ready yet because live_chat now drops llm_checkpoint_id in the message_saved event after switching to build_message_saved_event(). In astrbot/dashboard/routes/live_chat.py, the payload built in the flush_pending_bot_message() path no longer includes llm_checkpoint_id, while the frontend still uses that field to bind the bot message to its checkpoint in dashboard/src/composables/useMessages.ts and dashboard/src/components/chat/ThreadPanel.vue.
  • That means follow-up operations for newly streamed live-chat bot messages can lose their checkpoint link until a full reload, which is a real regression.
  • There is also a smaller robustness issue in the new Exa tools: max_results is still parsed with raw int(...), so malformed tool args can raise instead of degrading gracefully.

Please fix the live_chat checkpoint regression before merge. The max_results handling can be tightened in the same pass.

surface Exa content extraction status errors with URL and error tag details; extract count validation into reusable _normalize_count helper; pass llm_checkpoint_id through build_message_saved_event parameter
@piexian
Contributor Author

piexian commented May 2, 2026

Quick re-check on the latest head:

  • CI is green, and the earlier base URL / Bocha config issues look resolved.
  • I still do not think this is merge-ready yet because live_chat now drops llm_checkpoint_id in the message_saved event after switching to build_message_saved_event(). In astrbot/dashboard/routes/live_chat.py, the payload built in the flush_pending_bot_message() path no longer includes llm_checkpoint_id, while the frontend still uses that field to bind the bot message to its checkpoint in dashboard/src/composables/useMessages.ts and dashboard/src/components/chat/ThreadPanel.vue.
  • That means follow-up operations for newly streamed live-chat bot messages can lose their checkpoint link until a full reload, which is a real regression.
  • There is also a smaller robustness issue in the new Exa tools: max_results is still parsed with raw int(...), so malformed tool args can raise instead of degrading gracefully.

Please fix the live_chat checkpoint regression before merge. The max_results handling can be tightened in the same pass.

live_chat checkpoint regression: build_message_saved_event() now takes an llm_checkpoint_id parameter (message_events.py:8), and the call site at live_chat.py:600-603 passes it through. The chat.py call was switched to the parameter form as well, instead of injecting the field into the dict afterwards.

Exa max_results: both ExaWebSearchTool and ExaFindSimilarTool now use _normalize_count() instead of a raw int() call, so invalid values (such as "not-a-number") fall back safely to the default rather than raising. The corresponding tests cover this.

One thing I noticed: BraveSearchTool still uses a raw int(). Should that be changed too, or left as-is?
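
For context, a count normalizer with the fallback behavior described above could look like the sketch below; the default and cap are invented here, and the PR's actual _normalize_count may differ:

def _normalize_count(value, default: int = 5, maximum: int = 20) -> int:
    # Parse a tool-supplied count defensively: malformed input such as
    # "not-a-number" or None falls back to the default instead of raising,
    # and the parsed value is clamped to a sane range.
    try:
        count = int(value)
    except (TypeError, ValueError):
        return default
    return max(1, min(count, maximum))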


Labels

area:provider  The bug / feature is about AI Provider, Models, LLM Agent, LLM Agent Runner.
area:webui  The bug / feature is about webui(dashboard) of astrbot.
size:XL  This PR changes 500-999 lines, ignoring generated files.


Development

Successfully merging this pull request may close these issues.

[Feature] Open up the Tavily API endpoint configuration to support community relay sites
[Feature] Add EXA MCP for web search

3 participants