
feat(websearch): add the Exa search provider, make the Tavily/Exa API Base URL configurable, and expand the web search docs #7359

Open
piexian wants to merge 13 commits into AstrBotDevs:master from piexian:feat/exa-search-provider

Conversation

@piexian
Contributor

@piexian piexian commented Apr 4, 2026

Motivation

Changes

  • Exa search provider: add three @llm_tool tools

    • web_search_exa: semantic search supporting 5 search types (auto / neural / fast / instant / deep) and 6 vertical categories (company / people / research paper / news / personal site / financial report); see the payload sketch after this list
    • exa_extract_web_page: extract full page content via the /contents endpoint
    • exa_find_similar: find semantically similar pages via the /findSimilar endpoint
  • Configurable API Base URLs: the Tavily and Exa Base URLs can be customized in the WebUI; the change covers the web_searcher, url_parser, and kb_helper paths

  • Optional timeout: AstrBot's built-in web search tools accept an optional timeout parameter, defaulting to 30 seconds

  • Config metadata and i18n: default.py gains the new config entries and conditional-rendering metadata, with the en-US / ru-RU / zh-CN locales updated in sync

  • Tool management and shared utilities

    • Consolidate the web-search tool allowlist and common handling into shared utility functions
    • Unify Tavily / Exa Base URL normalization so multiple modules can reuse it
    • astr_agent_hooks.py appends a <ref>index</ref> citation hint for search tools in WebChat, helping the model emit traceable source markers
  • Citation source pipeline

    • The backend builds citation sources from structured search results instead of relying only on <ref> tags in the response text
    • chat.py / live_chat.py share the web-search citation extraction logic
    • <ref> parsing in MessageList.vue recognizes Exa / BoCha / exa_find_similar instead of only web_search_tavily
    • When the frontend receives a message with only tool results and no final text or no <ref> tags, it still degrades gracefully to showing the source list
  • Tests: add tests/unit/test_web_search_utils.py covering search-result mapping, favicon passthrough, explicit <ref> hits, and the no-<ref> fallback

  • Docs: the Chinese and English websearch.md now describe the tools and parameters for default / Tavily / Baidu AI Search / BoCha / Exa

  • This is not a breaking change
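For illustration, a minimal sketch of the /search payload that web_search_exa sends (field names follow the Exa API; the query value is invented):

```python
# Hypothetical payload for the Exa /search endpoint; "type" and "category"
# take the values listed above, and the PR clamps the result count to 1-100.
payload = {
    "query": "retrieval-augmented chat bots",  # example query (made up)
    "type": "auto",  # auto / neural / fast / instant / deep
    "category": "research paper",  # one of the six vertical categories
    "numResults": 10,
}
```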

Screenshots or test results

Local verification commands:

```bash
uv run ruff format .
uv run ruff check .
uv run pytest tests/unit/test_web_search_utils.py
pnpm --dir dashboard exec vue-tsc --noEmit
pnpm --dir dashboard run build
```
(Screenshots omitted.)

Checklist

  • New features in this PR were discussed with the maintainers beforehand via Issue / email
  • Necessary tests are done, with verification commands and screenshots provided above
  • No new dependencies were introduced; any added dependency has been synced to the relevant config files
  • This change contains no malicious code

Summary by Sourcery

Add Exa as a new configurable web search provider, unify web search tooling and reference handling, and enhance configurability and robustness of web search integrations.

New Features:

  • Introduce Exa-based semantic web search tools for general search, content extraction, and finding similar pages.
  • Allow configuring Tavily and Exa API base URLs to support proxies and self-hosted or relay deployments.
  • Expose optional per-call timeout parameters for built-in web search tools across providers.

Enhancements:

  • Refactor web search reference extraction into shared utilities and extend it to support additional tools and fallback behaviors when no explicit refs are present.
  • Normalize and validate web search provider base URLs and improve error messaging for misconfigured endpoints.
  • Unify web search tool registration and application in the main agent, including Exa tools, and ensure favicon caching is reused in reference displays.

Documentation:

  • Update Chinese and English web search guides to document all supported providers, including Exa, along with their configuration steps.

Tests:

  • Add unit tests for web search utilities covering reference collection, base URL normalization, and error cases.
  • Add unit tests for Exa web search tool behavior, Tavily URL extractor validation, and BoCha builtin config rule registration.

- Add the Exa search provider with three tools:
  - web_search_exa: semantic search with 5 search types and 6 vertical categories
  - exa_extract_web_page: extract full page text via the /contents endpoint
  - exa_find_similar: find semantically similar pages via the /findSimilar endpoint
- Tavily and Exa API Base URLs are configurable in the WebUI, for proxies / self-hosted instances
- All built-in web search tools gain a configurable timeout parameter (minimum 30s)
- <ref> parsing in MessageList.vue supports Exa/BoCha/findSimilar
- Update config metadata, i18n, routes, and hooks
- Update the Chinese and English user docs with tool parameter descriptions for Tavily/BoCha/Baidu AI Search
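The ref-related bullets boil down to: prefer the <ref> indices the model actually emitted, otherwise degrade to showing every collected source. A compressed sketch of that selection (the helper name here is invented; the real logic lives in build_web_search_refs(), shown in full in the review discussion below):

```python
import re

def pick_used_refs(text: str, ordered_refs: list[dict]) -> list[dict]:
    """Select the sources cited via <ref>index</ref>; fall back to all sources."""
    refs_by_index = {ref["index"]: ref for ref in ordered_refs}
    cited = [m.group(1).strip() for m in re.finditer(r"<ref>(.*?)</ref>", text)]
    used = [refs_by_index[i] for i in cited if i in refs_by_index]
    return used or ordered_refs  # no usable <ref> tags -> show everything
```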
Copilot AI review requested due to automatic review settings April 4, 2026 16:20
@dosubot dosubot Bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Apr 4, 2026
@dosubot dosubot Bot added area:provider The bug / feature is about AI Provider, Models, LLM Agent, LLM Agent Runner. area:webui The bug / feature is about webui(dashboard) of astrbot. labels Apr 4, 2026
Contributor

@sourcery-ai sourcery-ai Bot left a comment


Hey - I've found 2 issues, and left some high level feedback:

  • The minimum-timeout enforcement logic (if timeout < 30: timeout = 30) is duplicated across many tools (fetch_url, Tavily/BoCha/Exa helpers, etc.); consider extracting a small utility (or a module-level MIN_TIMEOUT constant plus helper) to centralize this behavior and avoid inconsistencies (e.g., _web_search_exa currently lacks the clamp).
  • The Exa API key missing error message is in Chinese in _get_exa_key while other user-facing errors in this module are English; aligning these messages to a consistent language will make debugging and UX more coherent.
  • The lists of supported web-search tools for reference extraction are now duplicated in multiple places (e.g., astr_agent_hooks._extract_web_search_refs, dashboard routes, and MessageList.vue), which makes it easy to miss a spot when adding new providers; consider centralizing this mapping or deriving it from a shared config to keep UI and backend behavior in sync.

## Individual Comments

### Comment 1
<location path="astrbot/builtin_stars/web_searcher/main.py" line_range="159-168" />
<code_context>
         self,
         cfg: AstrBotConfig,
         payload: dict,
+        timeout: int = 30,
     ) -> list[SearchResult]:
         """使用 Tavily 搜索引擎进行搜索"""
</code_context>
<issue_to_address>
**suggestion:** Normalize the timeout value inside `_web_search_exa` for consistency and safety.

Other helpers (`_get_from_url`, `_web_search_tavily`, `_extract_tavily`, `_web_search_bocha`, `_extract_exa`, `_find_similar_exa`) all enforce a minimum 30s timeout internally, while `_web_search_exa` relies on its caller (`search_from_exa`) to clamp the value. If `_web_search_exa` is reused elsewhere, it may see much smaller timeouts and behave inconsistently. Please add the same `if timeout < 30: timeout = 30` guard at the top of `_web_search_exa` to align behavior and avoid unexpectedly short timeouts.

Suggested implementation:

```python
    ) -> list[SearchResult]:
        """使用 Exa 搜索引擎进行搜索"""
        if timeout < 30:
            timeout = 30

```

If the `_web_search_exa` signature or docstring differ slightly (e.g., different Chinese text or no docstring), adjust the SEARCH pattern to match the actual function header and insert:

    if timeout < 30:
        timeout = 30

as the first statement in the function body, immediately after any docstring, to keep behavior consistent with `_get_from_url`, `_web_search_tavily`, `_extract_tavily`, `_web_search_bocha`, `_extract_exa`, and `_find_similar_exa`.
</issue_to_address>

### Comment 2
<location path="astrbot/builtin_stars/web_searcher/main.py" line_range="68" />
<code_context>
         """清理文本,去除空格、换行符等"""
         return text.strip().replace("\n", " ").replace("\r", " ").replace("  ", " ")

-    async def _get_from_url(self, url: str) -> str:
+    async def _get_from_url(self, url: str, timeout: int = 30) -> str:
         """获取网页内容"""
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting shared helpers for timeout handling, base-URL construction, and Exa HTTP requests to remove duplication and make the web search code easier to maintain.

You can keep all the new functionality while cutting a lot of duplication with a few small helpers. The main hot spots are timeout handling, base URL construction, and Exa HTTP calls.

### 1. Centralize timeout normalization

The `if timeout < 30: timeout = 30` pattern is repeated many times.

Add a helper:

```python
def _normalize_timeout(self, timeout: int | None, minimum: int = 30) -> aiohttp.ClientTimeout:
    if timeout is None:
        timeout = minimum
    elif timeout < minimum:
        timeout = minimum
    return aiohttp.ClientTimeout(total=timeout)
```

Then use it at call sites instead of repeating the logic:

```python
async def _web_search_tavily(self, cfg: AstrBotConfig, payload: dict, timeout: int = 30) -> list[SearchResult]:
    tavily_key = await self._get_tavily_key(cfg)
    base_url = self._tavily_base_url(cfg)
    url = f"{base_url}/search"
    header = {
        "Authorization": f"Bearer {tavily_key}",
        "Content-Type": "application/json",
    }
    timeout_obj = self._normalize_timeout(timeout)
    async with aiohttp.ClientSession(trust_env=True) as session:
        async with session.post(url, json=payload, headers=header, timeout=timeout_obj) as response:
            ...
```

And for tools you can drop the inline clamp:

```python
@llm_tool(name="fetch_url")
async def fetch_website_content(self, event: AstrMessageEvent, url: str, timeout: int = 30) -> str:
    timeout_obj = self._normalize_timeout(timeout)
    resp = await self._get_from_url(url, timeout_obj.total)
    return resp
```

(or just pass `timeout_obj` through if you adjust `_get_from_url`).

### 2. Extract base URL helpers for providers

The Tavily and Exa base URL logic is repeated.

Add:

```python
def _tavily_base_url(self, cfg: AstrBotConfig) -> str:
    return (
        cfg.get("provider_settings", {})
        .get("websearch_tavily_base_url", "https://api.tavily.com")
        .rstrip("/")
    )

def _exa_base_url(self, cfg: AstrBotConfig) -> str:
    return (
        cfg.get("provider_settings", {})
        .get("websearch_exa_base_url", "https://api.exa.ai")
        .rstrip("/")
    )
```

Then simplify call sites:

```python
base_url = self._tavily_base_url(cfg)
url = f"{base_url}/search"
```

```python
base_url = self._exa_base_url(cfg)
url = f"{base_url}/contents"
```

This removes duplication and keeps provider-specific config in one place.

### 3. Consolidate Exa HTTP request logic

`_web_search_exa`, `_extract_exa`, and `_find_similar_exa` all repeat the same HTTP boilerplate. You can pull that out into one internal helper that deals with key retrieval, base URL, headers, timeout, and error handling:

```python
async def _exa_request(
    self,
    cfg: AstrBotConfig,
    path: str,
    payload: dict,
    timeout: int = 30,
) -> dict:
    exa_key = await self._get_exa_key(cfg)
    base_url = self._exa_base_url(cfg)
    url = f"{base_url}/{path.lstrip('/')}"
    header = {
        "x-api-key": exa_key,
        "Content-Type": "application/json",
    }
    timeout_obj = self._normalize_timeout(timeout)

    async with aiohttp.ClientSession(trust_env=True) as session:
        async with session.post(url, json=payload, headers=header, timeout=timeout_obj) as response:
            if response.status != 200:
                reason = await response.text()
                raise Exception(
                    f"Exa request to {path} failed: {reason}, status: {response.status}",
                )
            return await response.json()
```

Then each high-level method only shapes payload and maps results:

```python
async def _web_search_exa(
    self,
    cfg: AstrBotConfig,
    payload: dict,
    timeout: int = 30,
) -> list[SearchResult]:
    data = await self._exa_request(cfg, "search", payload, timeout=timeout)
    results: list[SearchResult] = []
    for item in data.get("results", []):
        results.append(
            SearchResult(
                title=item.get("title", ""),
                url=item.get("url", ""),
                snippet=(item.get("text") or "")[:500],
            )
        )
    return results
```

```python
async def _extract_exa(
    self, cfg: AstrBotConfig, payload: dict, timeout: int = 30
) -> list[dict]:
    data = await self._exa_request(cfg, "contents", payload, timeout=timeout)
    results: list[dict] = data.get("results", [])
    if not results:
        raise ValueError("Error: Exa content extraction does not return any results.")
    return results
```

```python
async def _find_similar_exa(
    self, cfg: AstrBotConfig, payload: dict, timeout: int = 30
) -> list[SearchResult]:
    data = await self._exa_request(cfg, "findSimilar", payload, timeout=timeout)
    results: list[SearchResult] = []
    for item in data.get("results", []):
        results.append(
            SearchResult(
                title=item.get("title", ""),
                url=item.get("url", ""),
                snippet=(item.get("text") or "")[:500],
            )
        )
    return results
```

This way, if you change headers, auth, or error handling, you only touch `_exa_request`.

### 4. Optional: small helpers for repeated validation

If you want to further simplify the public tool methods, a couple of tiny validators can keep them focused on behavior rather than plumbing.

For example, Exa config check and clamping:

```python
def _ensure_exa_config(self, cfg: AstrBotConfig) -> None:
    if not cfg.get("provider_settings", {}).get("websearch_exa_key", []):
        raise ValueError("Error: Exa API key is not configured in AstrBot.")

def _clamp_results(self, value: int, minimum: int, maximum: int) -> int:
    return max(minimum, min(value, maximum))
```

Usage:

```python
@llm_tool("web_search_exa")
async def search_from_exa(..., max_results: int = 10, ...):
    ...
    cfg = self.context.get_config(umo=event.unified_msg_origin)
    self._ensure_exa_config(cfg)

    max_results = self._clamp_results(max_results, 1, 100)
    ...
```

These changes keep all functionality (timeouts, base URLs, Exa/Tavily features) but reduce repetition and make future changes safer and easier.
</issue_to_address>

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces the Exa search provider, adding tools for semantic search, web page extraction, and finding similar links. It also adds support for configurable base URLs for Tavily and Exa, and implements a minimum 30-second timeout across various search and extraction tools. Feedback includes addressing a potential IndexError during Exa API key rotation, reusing aiohttp sessions for efficiency, improving error handling when no extraction results are found, and refactoring tool management logic to reduce duplication.

Contributor

Copilot AI left a comment


Pull request overview

This PR extends AstrBot’s web search capabilities by adding an Exa provider (semantic search + extraction + similar-page discovery), making Tavily/Exa API base URLs configurable for proxy/self-hosted endpoints, and updating the dashboard and docs to reflect the expanded toolchain and citation parsing.

Changes:

  • Add Exa as a new websearch_provider, including web_search_exa, exa_extract_web_page, and exa_find_similar LLM tools.
  • Make Tavily/Exa API Base URL configurable and thread it through web search + URL extraction/KB upload flows.
  • Update WebUI citation/ref parsing and expand websearch documentation (ZH/EN) plus config metadata i18n.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| docs/zh/use/websearch.md | Expands ZH docs for default/Tavily/Exa/BoCha/Baidu tool parameters and configuration. |
| docs/en/use/websearch.md | Expands EN docs for provider options, tool parameters, and Base URL configuration. |
| dashboard/src/i18n/locales/zh-CN/features/config-metadata.json | Adds metadata strings for Tavily/Exa Base URL + Exa key. |
| dashboard/src/i18n/locales/en-US/features/config-metadata.json | Adds metadata strings for Tavily/Exa Base URL + Exa key. |
| dashboard/src/i18n/locales/ru-RU/features/config-metadata.json | Adds metadata strings for Tavily/Exa Base URL + Exa key. |
| dashboard/src/components/chat/MessageList.vue | Updates supported tool-call parsing to recognize Exa + findSimilar results for refs. |
| astrbot/dashboard/routes/live_chat.py | Extends supported tool list for extracting `<ref>` citations (Exa + findSimilar). |
| astrbot/dashboard/routes/chat.py | Extends supported tool list for extracting `<ref>` citations (Exa + findSimilar). |
| astrbot/core/knowledge_base/parsers/url_parser.py | Adds Tavily Base URL support to the KB URL extractor wrapper. |
| astrbot/core/knowledge_base/kb_helper.py | Plumbs Tavily Base URL into KB "upload from URL" extraction. |
| astrbot/core/config/default.py | Adds new provider settings defaults + metadata for Exa and Base URLs. |
| astrbot/core/astr_agent_hooks.py | Extends webchat citation-injection logic to Exa + findSimilar tools. |
| astrbot/builtin_stars/web_searcher/main.py | Implements Exa tools, adds configurable Base URLs, and adds per-tool optional timeout support. |

piexian and others added 3 commits April 5, 2026 00:36
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Refactor web_search_utils.py into a layered structure; add build_web_search_refs() and _extract_ref_indices() to pull reference indices from <ref> tags
- Simplify the ref extraction in chat.py / live_chat.py to a call to build_web_search_refs()
- Add getMessageRefs() to MessageList.vue so the frontend can fall back to extracting refs itself when the backend returns none
- Fix the message-save condition check in chat.py
@dosubot dosubot Bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Apr 4, 2026
@piexian piexian requested a review from Copilot April 6, 2026 03:28
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 5 comments.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 370167fb39



@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 96e15f79ad


```python
ret_ls = []
for result in results:
    ret_ls.append(f"URL: {result.get('url', 'No URL')}")
    text = await self._tidy_text(result.get("text", "No content"))
```

P2: Handle null text before tidying Exa extraction output

exa_extract_web_page passes result.get("text", "No content") directly into _tidy_text, but Exa responses can include "text": null (or other non-string values) for pages it cannot extract. In that case _tidy_text calls .strip() on a non-string and raises, causing the whole tool call to fail instead of returning partial results. Please coerce text to a safe string before calling _tidy_text.
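A guard along the lines the comment asks for (a sketch; the surrounding loop body is from the excerpt above):

```python
# Coerce null / non-string "text" values to a safe string before tidying.
raw_text = result.get("text")
text = await self._tidy_text(raw_text if isinstance(raw_text, str) else "No content")
```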


@zouyonghe
Member

@sourcery-ai review

Contributor

@sourcery-ai sourcery-ai Bot left a comment


Hey - I've found 2 issues, and left some high level feedback:

  • WEB_SEARCH_REFERENCE_TOOLS is now defined separately in Python (web_search_utils.py) and in the frontend (MessageList.vue); consider centralizing this list or generating the frontend list from the backend to avoid drift when adding/removing tools.
  • _normalize_timeout unconditionally casts timeout to int, so values like 30.9 or large numeric strings will be truncated rather than validated; if you expect non-integer or very large values, you might want stricter range checking and clearer error handling instead of silently coercing.

## Individual Comments

### Comment 1
<location path="astrbot/core/tools/web_search_tools.py" line_range="648-657" />
<code_context>
+        max_results = max(1, min(int(kwargs.get("max_results", 10)), 100))
</code_context>
<issue_to_address>
**issue (bug_risk):** Unvalidated int() casts for max_results can raise on malformed input and abort tool execution.

Both `ExaWebSearchTool` and `ExaFindSimilarTool` call `int(kwargs.get("max_results", 10))` directly, so a non-integer value (e.g. "10.5", "foo", or an unexpected type) will raise `ValueError` and abort the tool call. To match how other user-supplied numerics are handled, consider wrapping this in a helper or local try/except that falls back to 10 and then applies the 1–100 clamp, so invalid input degrades gracefully instead of throwing.
</issue_to_address>

### Comment 2
<location path="tests/unit/test_web_search_utils.py" line_range="96-105" />
<code_context>
+    assert [ref["index"] for ref in refs["used"]] == ["a152.1", "a152.2"]
+
+
+@pytest.mark.parametrize(
+    ("base_url", "expected_message"),
+    [
+        (
+            "exa.ai/search",
+            "Error: Exa API Base URL must start with http:// or https://. "
+            "Proxy base paths are allowed. Received: 'exa.ai/search'.",
+        ),
+    ],
+)
+def test_normalize_web_search_base_url_reports_invalid_value(
+    base_url: str, expected_message: str
+):
+    with pytest.raises(ValueError) as exc_info:
+        normalize_web_search_base_url(
+            base_url,
+            default="https://api.exa.ai",
+            provider_name="Exa",
+        )
+
</code_context>
<issue_to_address>
**suggestion (testing):** Extend normalize_web_search_base_url tests to cover empty/None values and multiple invalid schemes.

Key branches still aren’t covered:

- `base_url` is `None` or an empty string → should return the trimmed `default` value. Please add a parametrized test for this.
- Additional invalid values like `"ftp://exa.ai"`, `"https:///only-path"`, or `"http://"` (no netloc) → should raise `ValueError` with the same message format.

These cases will complete coverage for scheme/netloc validation and default handling.
</issue_to_address>


@dosubot dosubot Bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:XXL This PR changes 1000+ lines, ignoring generated files. labels Apr 19, 2026
@piexian piexian requested a review from Copilot April 19, 2026 07:11
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

```python
if search_type not in ("auto", "neural", "fast", "instant", "deep"):
    search_type = "auto"

max_results = max(1, min(int(kwargs.get("max_results", 10)), 100))
```

Copilot AI Apr 19, 2026


max_results is parsed with int(kwargs.get("max_results", 10)) without error handling. Tool arguments come from the model and may be strings/invalid; a ValueError here will bubble up and abort the tool execution. Handle non-numeric values gracefully (e.g., try/except with a default, or validate and return a clear error).

Suggested change:

```diff
-max_results = max(1, min(int(kwargs.get("max_results", 10)), 100))
+raw_max_results = kwargs.get("max_results", 10)
+try:
+    parsed_max_results = int(raw_max_results)
+except (TypeError, ValueError):
+    parsed_max_results = 10
+max_results = max(1, min(parsed_max_results, 100))
```

Comment on lines +783 to +787

```python
results = await _exa_find_similar(
    provider_settings,
    {
        "url": url,
        "numResults": max(1, min(int(kwargs.get("max_results", 10)), 100)),
```

Copilot AI Apr 19, 2026


numResults uses int(kwargs.get("max_results", 10)) without guarding against non-numeric input. Since tool args are model-provided, invalid types can raise ValueError and crash the tool call. Wrap the conversion in try/except (defaulting/clamping), or validate and return a tool error message.

Suggested change:

```diff
-results = await _exa_find_similar(
-    provider_settings,
-    {
-        "url": url,
-        "numResults": max(1, min(int(kwargs.get("max_results", 10)), 100)),
+try:
+    max_results = int(kwargs.get("max_results", 10))
+except (TypeError, ValueError):
+    return "Error: max_results must be an integer."
+max_results = max(1, min(max_results, 100))
+results = await _exa_find_similar(
+    provider_settings,
+    {
+        "url": url,
+        "numResults": max_results,
```
Comment on lines +148 to +152

```python
return normalize_web_search_base_url(
    provider_settings.get("websearch_tavily_base_url"),
    default="https://api.tavily.com",
    provider_name="Tavily",
)
```

Copilot AI Apr 19, 2026


_get_tavily_base_url() can raise ValueError via normalize_web_search_base_url(). That exception will propagate out of _tavily_search() and can abort the tool execution (and potentially the whole agent run) instead of returning a normal tool error result. Consider catching ValueError at the call site (or inside _get_tavily_base_url/_tavily_search) and returning a user-facing error string (or falling back to the default base URL with a warning).

Suggested change:

```diff
-return normalize_web_search_base_url(
-    provider_settings.get("websearch_tavily_base_url"),
-    default="https://api.tavily.com",
-    provider_name="Tavily",
-)
+default_base_url = "https://api.tavily.com"
+try:
+    return normalize_web_search_base_url(
+        provider_settings.get("websearch_tavily_base_url"),
+        default=default_base_url,
+        provider_name="Tavily",
+    )
+except ValueError as exc:
+    logger.warning(
+        "Invalid Tavily base URL configuration %r; falling back to %s. Error: %s",
+        provider_settings.get("websearch_tavily_base_url"),
+        default_base_url,
+        exc,
+    )
+    return default_base_url
```

Comment on lines +156 to +160

```python
return normalize_web_search_base_url(
    provider_settings.get("websearch_exa_base_url"),
    default="https://api.exa.ai",
    provider_name="Exa",
)
```

Copilot AI Apr 19, 2026


_get_exa_base_url() can raise ValueError via normalize_web_search_base_url(). Since callers (_exa_search/_exa_extract/_exa_find_similar) don't handle it, an invalid configured base URL can raise out of the tool call and abort execution. Catch ValueError and surface a stable tool error message (or fall back to the default base URL) so misconfiguration doesn't crash the run.

Suggested change:

```diff
-return normalize_web_search_base_url(
-    provider_settings.get("websearch_exa_base_url"),
-    default="https://api.exa.ai",
-    provider_name="Exa",
-)
+default_base_url = "https://api.exa.ai"
+try:
+    return normalize_web_search_base_url(
+        provider_settings.get("websearch_exa_base_url"),
+        default=default_base_url,
+        provider_name="Exa",
+    )
+except ValueError as exc:
+    logger.warning(
+        "Invalid Exa API Base URL configured; falling back to default base URL %s: %s",
+        default_base_url,
+        exc,
+    )
+    return default_base_url
```

@Soulter Soulter force-pushed the master branch 2 times, most recently from faf411f to 0068960 Compare April 19, 2026 09:50
Member

@zouyonghe zouyonghe left a comment


Thanks for the work here. Exa support and the ref-handling cleanup are useful, but I don't think this is ready to merge yet.

I found three issues that should be fixed first:

  1. Base URL handling is currently too permissive.
    normalize_web_search_base_url() now accepts endpoint URLs such as https://api.exa.ai/search, but the call sites still append endpoint suffixes themselves (/search, /contents, /findSimilar, /extract). That can produce broken URLs like .../search/search and .../extract/extract. This affects both the new Exa path and the Tavily extraction path. (A sketch of the intended suffix check follows this comment.)

  2. search_type allows an unsupported Exa value.
    ExaWebSearchTool currently accepts instant, but Exa's official docs list only auto, neural, fast, and deep for the search type. If instant is passed through, the upstream API will likely reject it.

  3. BochaWebSearchTool lost its builtin config registration.
    In this PR it is registered with plain @builtin_tool, while master uses @builtin_tool(config=_BOCHA_WEB_SEARCH_TOOL_CONFIG). That regresses builtin config status/tag rendering in the dashboard for Bocha.

CI is green, but the current tests do not cover these regression surfaces. I suggest fixing the above and adding focused tests for:

  • rejecting or correctly handling endpoint-level base URLs
  • the Exa search_type allowlist
  • Bocha builtin config metadata registration

After those are addressed, I think this can be re-reviewed quickly.
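For illustration, a sketch of the suffix rejection item 1 asks for, reusing the disallowed_path_suffixes parameter that a later commit in this PR adds and the error-message style from its tests (the actual implementation may differ):

```python
from urllib.parse import urlparse

def normalize_web_search_base_url(
    base_url: str | None,
    *,
    default: str,
    provider_name: str = "web search",
    disallowed_path_suffixes: tuple[str, ...] = (),
) -> str:
    value = (base_url or "").strip() or default
    parsed = urlparse(value)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(
            f"Error: {provider_name} API Base URL must start with http:// or https://. "
            f"Proxy base paths are allowed. Received: {value!r}."
        )
    normalized = value.rstrip("/")
    # Reject endpoint-level URLs so call sites can safely append /search, /contents, etc.
    for suffix in disallowed_path_suffixes:
        if normalized.endswith(suffix):
            raise ValueError(
                f"Error: {provider_name} API Base URL must not end with {suffix!r}; "
                "configure the base URL without the endpoint path."
            )
    return normalized
```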

@piexian
Contributor Author

piexian commented Apr 20, 2026

@zouyonghe On item 2: I re-checked Exa's current official docs before changing the allowlist. As of April 20, 2026, instant is documented as a supported search type, and the current public docs also list deep-lite and deep-reasoning.

Official links:

Both pages currently list auto, fast, deep, deep-lite, deep-reasoning, instant, and neural. So I'm not removing instant; instead I'm updating the allowlist to match the current official Exa docs and adding focused tests around accepted and fallback values. I'm still addressing items 1 and 3 separately.

@piexian
Contributor Author

piexian commented Apr 20, 2026

Replying to review #7359 (review), specifically item 2 (search_type / instant):

I re-checked Exa's current official docs before changing the allowlist. As of April 20, 2026, instant is documented as a supported search type, and the public docs also list deep-lite and deep-reasoning.

Official links:

So I'm not removing instant based on the current upstream docs. Instead, I'm aligning the allowlist with the current Exa documentation and adding focused tests around accepted values and fallback behavior. Items 1 and 3 are being fixed separately.

… Exa search types

- Reject specific API endpoint paths (e.g., /search, /extract) in base URL
  normalization via new disallowed_path_suffixes parameter to prevent
  misconfiguration errors
- Add deep-lite and deep-reasoning to valid Exa search types and normalize
  search_type input before validation
- Add missing config parameter to BochaWebSearchTool builtin_tool decorator
  so provider status checks are properly registered
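A sketch of the search-type normalization the commit message describes (_EXA_SEARCH_TYPES is named in the review discussion; the helper name is assumed):

```python
# Allowlist per the Exa docs cited by the author; unknown values fall back to
# "auto", matching the existing `search_type = "auto"` fallback in this PR.
_EXA_SEARCH_TYPES = {
    "auto", "neural", "fast", "instant", "deep", "deep-lite", "deep-reasoning",
}

def _normalize_exa_search_type(value: str | None) -> str:
    normalized = (value or "").strip().lower()
    return normalized if normalized in _EXA_SEARCH_TYPES else "auto"
```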
@zouyonghe
Member

@sourcery-ai review

Contributor

@sourcery-ai sourcery-ai Bot left a comment


Hey - I've found 4 issues, and left some high level feedback:

  • The allowed Exa search_type values and category options are hard-coded in both _EXA_SEARCH_TYPES and the ExaWebSearchTool.parameters description; consider centralizing these in shared constants (and reusing in the schema description) to avoid drift between validation and documentation.
  • normalize_web_search_base_url can now raise ValueError during request construction (e.g., _get_exa_base_url, _get_tavily_base_url, URLExtractor.__init__); if you expect misconfiguration in production, consider catching this closer to configuration load or surfacing a clearer, provider-level error instead of letting it bubble up as a runtime exception.

## Individual Comments

### Comment 1
<location path="tests/unit/test_web_search_utils.py" line_range="96-97" />
<code_context>
+
+
+@pytest.mark.asyncio
+@pytest.mark.parametrize(
+    ("search_type", "expected"),
+    [
</code_context>
<issue_to_address>
**suggestion (testing):** Add coverage for defaulting behavior of `normalize_web_search_base_url` when `base_url` is `None` or empty.

Current tests cover invalid schemes and disallowed suffixes. Please also add a parametrized test that passes `base_url=None` and `base_url="  "` and asserts that `normalize_web_search_base_url` returns the normalized `default` value. This will ensure callers like `_get_tavily_base_url`, `_get_exa_base_url`, and `URLExtractor` can rely on the default when the field is left blank.

```suggestion
@pytest.mark.parametrize(
    ("base_url", "default"),
    [
        (None, "https://exa.ai"),
        ("   ", "https://exa.ai"),
    ],
)
def test_normalize_web_search_base_url_defaults_when_blank(
    base_url, default: str
):
    normalized = normalize_web_search_base_url(base_url, default=default)

    # Should return the normalized value of the provided default when base_url is blank/None
    assert normalized == normalize_web_search_base_url(default, default=default)


@pytest.mark.parametrize(
    ("base_url", "expected_message"),
```
</issue_to_address>

### Comment 2
<location path="docs/en/use/websearch.md" line_range="43" />
<code_context>
+
+Go to [Exa](https://dashboard.exa.ai) to get an API Key, then fill it in the corresponding configuration item.
+
 If you use Tavily as your web search source, you will get a better experience optimization on AstrBot ChatUI, including citation source display and more:

 ![](https://files.astrbot.app/docs/source/images/websearch/image1.png)
</code_context>
<issue_to_address>
**suggestion (typo):** The phrase "a better experience optimization" reads unnaturally; a more idiomatic wording is recommended.

For example, "a better optimized experience" or simply "a better experience", as in: "you will get a better (optimized) experience on AstrBot ChatUI".

```suggestion
If you use Tavily as your web search source, you will get a better experience on AstrBot ChatUI, including citation source display and more:
```
</issue_to_address>

### Comment 3
<location path="astrbot/core/tools/web_search_tools.py" line_range="384" />
<code_context>
             ]


+async def _exa_search(
+    provider_settings: dict,
+    payload: dict,
</code_context>
<issue_to_address>
**issue (complexity):** Consider introducing shared helper functions for Exa HTTP calls and POST+timeout handling to remove duplication while keeping behavior unchanged.

You can reduce the new complexity without changing behavior by introducing two small helpers:

1. **Unify Exa HTTP calls**

The three Exa helpers are almost identical except for path, action, and result mapping. A small internal helper keeps all behavior but removes repetition and makes future changes safer:

```python
async def _exa_request(
    provider_settings: dict,
    path: str,
    payload: dict,
    action: str,
    timeout: int = MIN_WEB_SEARCH_TIMEOUT,
) -> dict:
    exa_key = await _EXA_KEY_ROTATOR.get(provider_settings)
    url = f"{_get_exa_base_url(provider_settings)}{path}"
    headers = {
        "x-api-key": exa_key,
        "Content-Type": "application/json",
    }
    async with aiohttp.ClientSession(trust_env=True) as session:
        async with session.post(
            url,
            json=payload,
            headers=headers,
            timeout=_normalize_timeout(timeout),
        ) as response:
            if response.status != 200:
                reason = await response.text()
                raise Exception(
                    _format_provider_request_error(
                        "Exa", action, url, reason, response.status
                    )
                )
            return await response.json()
```

Then the three public helpers become focused on payload + mapping:

```python
async def _exa_search(
    provider_settings: dict,
    payload: dict,
    timeout: int = MIN_WEB_SEARCH_TIMEOUT,
) -> list[SearchResult]:
    data = await _exa_request(
        provider_settings,
        "/search",
        payload,
        action="web search",
        timeout=timeout,
    )
    return [
        SearchResult(
            title=item.get("title", ""),
            url=item.get("url", ""),
            snippet=(item.get("text") or "")[:500],
        )
        for item in data.get("results", [])
    ]


async def _exa_extract(
    provider_settings: dict,
    payload: dict,
    timeout: int = MIN_WEB_SEARCH_TIMEOUT,
) -> list[dict]:
    data = await _exa_request(
        provider_settings,
        "/contents",
        payload,
        action="content extraction",
        timeout=timeout,
    )
    return data.get("results", [])


async def _exa_find_similar(
    provider_settings: dict,
    payload: dict,
    timeout: int = MIN_WEB_SEARCH_TIMEOUT,
) -> list[SearchResult]:
    data = await _exa_request(
        provider_settings,
        "/findSimilar",
        payload,
        action="find similar",
        timeout=timeout,
    )
    return [
        SearchResult(
            title=item.get("title", ""),
            url=item.get("url", ""),
            snippet=(item.get("text") or "")[:500],
        )
        for item in data.get("results", [])
    ]
```

2. **Deduplicate timeout plumbing for HTTP calls**

You now repeat `timeout=_normalize_timeout(timeout)` in many places. A very small wrapper keeps the new timeout behavior but centralizes it:

```python
async def _post_json(
    session: aiohttp.ClientSession,
    url: str,
    *,
    json: dict,
    headers: dict,
    timeout: int | float | str | None = None,
):
    return await session.post(
        url,
        json=json,
        headers=headers,
        timeout=_normalize_timeout(timeout),
    )
```

Usage example (Tavily/BoCha/Baidu/Exa):

```python
async with aiohttp.ClientSession(trust_env=True) as session:
    async with _post_json(
        session,
        url,
        json=payload,
        headers=header,
        timeout=timeout,
    ) as response:
        ...
```

This keeps all existing behavior (including the minimum timeout and trust_env) but removes a lot of repeated arguments and makes future changes to timeout handling or session usage localized.
</issue_to_address>

### Comment 4
<location path="astrbot/core/utils/web_search_utils.py" line_range="49" />
<code_context>
+    return normalized
+
+
+def _iter_web_search_result_items(
+    accumulated_parts: list[dict[str, Any]],
+):
</code_context>
<issue_to_address>
**issue (complexity):** Consider inlining the small helper functions and simplifying the adapter function to reduce indirection and keep closely related logic together.

You can trim a layer or two without losing clarity or behavior.

### 1. Inline `_iter_web_search_result_items` into `collect_web_search_ref_items`

`_iter_web_search_result_items` is only used once and its logic is straightforward. Inlining it makes the flow easier to follow without losing readability:

```python
def collect_web_search_ref_items(
    accumulated_parts: list[dict[str, Any]],
    favicon_cache: dict[str, str] | None = None,
) -> list[dict[str, Any]]:
    web_search_refs: list[dict[str, Any]] = []
    seen_indices: set[str] = set()

    for part in accumulated_parts:
        if part.get("type") != "tool_call" or not part.get("tool_calls"):
            continue

        for tool_call in part["tool_calls"]:
            if (
                tool_call.get("name") not in WEB_SEARCH_REFERENCE_TOOLS
                or not tool_call.get("result")
            ):
                continue

            result = tool_call["result"]
            try:
                result_data = json.loads(result) if isinstance(result, str) else result
            except json.JSONDecodeError:
                continue

            if not isinstance(result_data, dict):
                continue

            for item in result_data.get("results", []):
                if not isinstance(item, dict):
                    continue

                ref_index = item.get("index")
                if not ref_index or ref_index in seen_indices:
                    continue

                payload = {
                    "index": ref_index,
                    "url": item.get("url"),
                    "title": item.get("title"),
                    "snippet": item.get("snippet"),
                }
                if favicon_cache and payload["url"] in favicon_cache:
                    payload["favicon"] = favicon_cache[payload["url"]]

                web_search_refs.append(payload)
                seen_indices.add(ref_index)

    return web_search_refs
```

That removes one indirection while keeping the logic in a single, obvious place.

### 2. Collapse `_extract_ref_indices` into `build_web_search_refs`

The regex + ordering behavior is easier to understand if the extraction and selection live together:

```python
def build_web_search_refs(
    accumulated_text: str,
    accumulated_parts: list[dict[str, Any]],
    favicon_cache: dict[str, str] | None = None,
) -> dict:
    ordered_refs = collect_web_search_ref_items(accumulated_parts, favicon_cache)
    if not ordered_refs:
        return {}

    refs_by_index = {ref["index"]: ref for ref in ordered_refs}

    # Inline _extract_ref_indices
    ref_indices: list[str] = []
    seen_indices: set[str] = set()
    for match in re.finditer(r"<ref>(.*?)</ref>", accumulated_text):
        ref_index = match.group(1).strip()
        if not ref_index or ref_index in seen_indices:
            continue
        ref_indices.append(ref_index)
        seen_indices.add(ref_index)

    used_refs = [refs_by_index[idx] for idx in ref_indices if idx in refs_by_index]
    if not used_refs:
        used_refs = ordered_refs

    return {"used": used_refs}
```

This keeps the “how indices are extracted” and the “how refs are ordered/fallback” logic co-located.

### 3. Simplify `collect_web_search_results` (or consider removing it)

If you still need `collect_web_search_results`, it can be a very thin adapter, making the relationship to `collect_web_search_ref_items` obvious:

```python
def collect_web_search_results(accumulated_parts: list[dict[str, Any]]) -> dict:
    return {
        ref["index"]: {
            "url": ref.get("url"),
            "title": ref.get("title"),
            "snippet": ref.get("snippet"),
        }
        for ref in collect_web_search_ref_items(accumulated_parts)
    }
```

If the only caller is effectively doing this transformation, you could alternatively have that caller perform the mapping itself and drop `collect_web_search_results` entirely to reduce the public API surface.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines +96 to +97
@pytest.mark.parametrize(
("base_url", "expected_message"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Add coverage for defaulting behavior of normalize_web_search_base_url when base_url is None or empty.

Current tests cover invalid schemes and disallowed suffixes. Please also add a parametrized test that passes base_url=None and base_url=" " and asserts that normalize_web_search_base_url returns the normalized default value. This will ensure callers like _get_tavily_base_url, _get_exa_base_url, and URLExtractor can rely on the default when the field is left blank.

Suggested change
@pytest.mark.parametrize(
("base_url", "expected_message"),
@pytest.mark.parametrize(
("base_url", "default"),
[
(None, "https://exa.ai"),
(" ", "https://exa.ai"),
],
)
def test_normalize_web_search_base_url_defaults_when_blank(
base_url, default: str
):
normalized = normalize_web_search_base_url(base_url, default=default)
# Should return the normalized value of the provided default when base_url is blank/None
assert normalized == normalize_web_search_base_url(default, default=default)
@pytest.mark.parametrize(
("base_url", "expected_message"),

Comment thread docs/en/use/websearch.md

Go to [Exa](https://dashboard.exa.ai) to get an API Key, then fill it in the corresponding configuration item.

If you use Tavily as your web search source, you will get a better experience optimization on AstrBot ChatUI, including citation source display and more:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (typo): “a better experience optimization” 这一表达不太自然,建议调整为更地道的英语搭配。

建议将该短语改为更自然的表达,例如使用 “a better optimized experience” 或直接 “a better experience”,如:"you will get a better (optimized) experience on AstrBot ChatUI"。

Suggested change
If you use Tavily as your web search source, you will get a better experience optimization on AstrBot ChatUI, including citation source display and more:
If you use Tavily as your web search source, you will get a better experience on AstrBot ChatUI, including citation source display and more:

]


async def _exa_search(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (complexity): Consider introducing shared helper functions for Exa HTTP calls and POST+timeout handling to remove duplication while keeping behavior unchanged.

You can reduce the new complexity without changing behavior by introducing two small helpers:

  1. Unify Exa HTTP calls

The three Exa helpers are almost identical except for path, action, and result mapping. A small internal helper keeps all behavior but removes repetition and makes future changes safer:

async def _exa_request(
    provider_settings: dict,
    path: str,
    payload: dict,
    action: str,
    timeout: int = MIN_WEB_SEARCH_TIMEOUT,
) -> dict:
    exa_key = await _EXA_KEY_ROTATOR.get(provider_settings)
    url = f"{_get_exa_base_url(provider_settings)}{path}"
    headers = {
        "x-api-key": exa_key,
        "Content-Type": "application/json",
    }
    async with aiohttp.ClientSession(trust_env=True) as session:
        async with session.post(
            url,
            json=payload,
            headers=headers,
            timeout=_normalize_timeout(timeout),
        ) as response:
            if response.status != 200:
                reason = await response.text()
                raise Exception(
                    _format_provider_request_error(
                        "Exa", action, url, reason, response.status
                    )
                )
            return await response.json()

Then the three public helpers become focused on payload + mapping:

async def _exa_search(
    provider_settings: dict,
    payload: dict,
    timeout: int = MIN_WEB_SEARCH_TIMEOUT,
) -> list[SearchResult]:
    data = await _exa_request(
        provider_settings,
        "/search",
        payload,
        action="web search",
        timeout=timeout,
    )
    return [
        SearchResult(
            title=item.get("title", ""),
            url=item.get("url", ""),
            snippet=(item.get("text") or "")[:500],
        )
        for item in data.get("results", [])
    ]


async def _exa_extract(
    provider_settings: dict,
    payload: dict,
    timeout: int = MIN_WEB_SEARCH_TIMEOUT,
) -> list[dict]:
    data = await _exa_request(
        provider_settings,
        "/contents",
        payload,
        action="content extraction",
        timeout=timeout,
    )
    return data.get("results", [])


async def _exa_find_similar(
    provider_settings: dict,
    payload: dict,
    timeout: int = MIN_WEB_SEARCH_TIMEOUT,
) -> list[SearchResult]:
    data = await _exa_request(
        provider_settings,
        "/findSimilar",
        payload,
        action="find similar",
        timeout=timeout,
    )
    return [
        SearchResult(
            title=item.get("title", ""),
            url=item.get("url", ""),
            snippet=(item.get("text") or "")[:500],
        )
        for item in data.get("results", [])
    ]
  1. Deduplicate timeout plumbing for HTTP calls

You now repeat timeout=_normalize_timeout(timeout) in many places. A very small wrapper keeps the new timeout behavior but centralizes it:

async def _post_json(
    session: aiohttp.ClientSession,
    url: str,
    *,
    json: dict,
    headers: dict,
    timeout: int | float | str | None = None,
):
    return await session.post(
        url,
        json=json,
        headers=headers,
        timeout=_normalize_timeout(timeout),
    )

Usage example (Tavily/BoCha/Baidu/Exa):

async with aiohttp.ClientSession(trust_env=True) as session:
    async with _post_json(
        session,
        url,
        json=payload,
        headers=header,
        timeout=timeout,
    ) as response:
        ...

This keeps all existing behavior (including the minimum timeout and trust_env) but removes a lot of repeated arguments and makes future changes to timeout handling or session usage localized.

return normalized


def _iter_web_search_result_items(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (complexity): Consider inlining the small helper functions and simplifying the adapter function to reduce indirection and keep closely related logic together.

You can trim a layer or two without losing clarity or behavior.

1. Inline _iter_web_search_result_items into collect_web_search_ref_items

_iter_web_search_result_items is only used once and its logic is straightforward. Inlining it makes the flow easier to follow without losing readability:

def collect_web_search_ref_items(
    accumulated_parts: list[dict[str, Any]],
    favicon_cache: dict[str, str] | None = None,
) -> list[dict[str, Any]]:
    web_search_refs: list[dict[str, Any]] = []
    seen_indices: set[str] = set()

    for part in accumulated_parts:
        if part.get("type") != "tool_call" or not part.get("tool_calls"):
            continue

        for tool_call in part["tool_calls"]:
            if (
                tool_call.get("name") not in WEB_SEARCH_REFERENCE_TOOLS
                or not tool_call.get("result")
            ):
                continue

            result = tool_call["result"]
            try:
                result_data = json.loads(result) if isinstance(result, str) else result
            except json.JSONDecodeError:
                continue

            if not isinstance(result_data, dict):
                continue

            for item in result_data.get("results", []):
                if not isinstance(item, dict):
                    continue

                ref_index = item.get("index")
                if not ref_index or ref_index in seen_indices:
                    continue

                payload = {
                    "index": ref_index,
                    "url": item.get("url"),
                    "title": item.get("title"),
                    "snippet": item.get("snippet"),
                }
                if favicon_cache and payload["url"] in favicon_cache:
                    payload["favicon"] = favicon_cache[payload["url"]]

                web_search_refs.append(payload)
                seen_indices.add(ref_index)

    return web_search_refs

That removes one indirection while keeping the logic in a single, obvious place.

2. Collapse _extract_ref_indices into build_web_search_refs

The regex + ordering behavior is easier to understand if the extraction and selection live together:

def build_web_search_refs(
    accumulated_text: str,
    accumulated_parts: list[dict[str, Any]],
    favicon_cache: dict[str, str] | None = None,
) -> dict:
    ordered_refs = collect_web_search_ref_items(accumulated_parts, favicon_cache)
    if not ordered_refs:
        return {}

    refs_by_index = {ref["index"]: ref for ref in ordered_refs}

    # Inline _extract_ref_indices
    ref_indices: list[str] = []
    seen_indices: set[str] = set()
    for match in re.finditer(r"<ref>(.*?)</ref>", accumulated_text):
        ref_index = match.group(1).strip()
        if not ref_index or ref_index in seen_indices:
            continue
        ref_indices.append(ref_index)
        seen_indices.add(ref_index)

    used_refs = [refs_by_index[idx] for idx in ref_indices if idx in refs_by_index]
    if not used_refs:
        used_refs = ordered_refs

    return {"used": used_refs}

This keeps the “how indices are extracted” and the “how refs are ordered/fallback” logic co-located.
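To make the extract-then-fallback behavior concrete, here is an illustrative call; the tool name, indices, and URLs are made up, and it assumes web_search_tavily is listed in WEB_SEARCH_REFERENCE_TOOLS:

parts = [
    {
        "type": "tool_call",
        "tool_calls": [
            {
                "name": "web_search_tavily",
                "result": json.dumps({
                    "results": [
                        {"index": "1", "url": "https://a.example", "title": "A", "snippet": "..."},
                        {"index": "2", "url": "https://b.example", "title": "B", "snippet": "..."},
                    ]
                }),
            }
        ],
    }
]

build_web_search_refs("see <ref>2</ref>", parts)
# -> {"used": [the ref with index "2" only]}

build_web_search_refs("no explicit refs in this text", parts)
# -> {"used": [both refs, in tool-result order]} (fallback when nothing matches)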

3. Simplify collect_web_search_results (or consider removing it)

If you still need collect_web_search_results, it can be a very thin adapter, making the relationship to collect_web_search_ref_items obvious:

def collect_web_search_results(accumulated_parts: list[dict[str, Any]]) -> dict:
    return {
        ref["index"]: {
            "url": ref.get("url"),
            "title": ref.get("title"),
            "snippet": ref.get("snippet"),
        }
        for ref in collect_web_search_ref_items(accumulated_parts)
    }

If the only caller is effectively doing this transformation, you could alternatively have that caller perform the mapping itself and drop collect_web_search_results entirely to reduce the public API surface.

@zouyonghe zouyonghe self-requested a review April 21, 2026 01:16
@zouyonghe
Member

Quick re-check on the latest head:

  • CI is green, and the earlier base URL / Bocha config issues look resolved.
  • I still do not think this is merge-ready yet because live_chat now drops llm_checkpoint_id in the message_saved event after switching to build_message_saved_event(). In astrbot/dashboard/routes/live_chat.py, the payload built in the flush_pending_bot_message() path no longer includes llm_checkpoint_id, while the frontend still uses that field to bind the bot message to its checkpoint in dashboard/src/composables/useMessages.ts and dashboard/src/components/chat/ThreadPanel.vue.
  • That means follow-up operations for newly streamed live-chat bot messages can lose their checkpoint link until a full reload, which is a real regression.
  • There is also a smaller robustness issue in the new Exa tools: max_results is still parsed with raw int(...), so malformed tool args can raise instead of degrading gracefully.

Please fix the live_chat checkpoint regression before merge. The max_results handling can be tightened in the same pass.

surface Exa content extraction status errors with URL and error tag details; extract count validation into reusable _normalize_count helper; pass llm_checkpoint_id through build_message_saved_event parameter
@piexian
Contributor Author

piexian commented May 2, 2026

Quick re-check on the latest head:

  • CI is green, and the earlier base URL / Bocha config issues look resolved.
  • I still do not think this is merge-ready yet because live_chat now drops llm_checkpoint_id in the message_saved event after switching to build_message_saved_event(). In astrbot/dashboard/routes/live_chat.py, the payload built in the flush_pending_bot_message() path no longer includes llm_checkpoint_id, while the frontend still uses that field to bind the bot message to its checkpoint in dashboard/src/composables/useMessages.ts and dashboard/src/components/chat/ThreadPanel.vue.
  • That means follow-up operations for newly streamed live-chat bot messages can lose their checkpoint link until a full reload, which is a real regression.
  • There is also a smaller robustness issue in the new Exa tools: max_results is still parsed with raw int(...), so malformed tool args can raise instead of degrading gracefully.

Please fix the live_chat checkpoint regression before merge. The max_results handling can be tightened in the same pass.

live_chat checkpoint regression: build_message_saved_event() now takes an llm_checkpoint_id parameter (message_events.py:8), and the call site at live_chat.py:600-603 passes it through. The chat.py call was switched to the parameter form as well, instead of injecting the field into the dict afterwards.

Exa max_results: both ExaWebSearchTool and ExaFindSimilarTool now use _normalize_count() instead of a raw int() call, so invalid values (such as "not-a-number") fall back safely to the default rather than raising. The corresponding tests cover this.

One thing I noticed: BraveSearchTool still uses a raw int(). Should that be changed too, or left as-is?
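
For context, a count normalizer with the fallback behavior described above could look like the sketch below; the default and cap are invented here, and the PR's actual _normalize_count may differ:

def _normalize_count(value, default: int = 5, maximum: int = 20) -> int:
    # Parse a tool-supplied count defensively: malformed input such as
    # "not-a-number" or None falls back to the default instead of raising,
    # and the parsed value is clamped to a sane range.
    try:
        count = int(value)
    except (TypeError, ValueError):
        return default
    return max(1, min(count, maximum))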


Labels

area:provider  The bug / feature is about AI Provider, Models, LLM Agent, LLM Agent Runner.
area:webui  The bug / feature is about webui(dashboard) of astrbot.
size:XL  This PR changes 500-999 lines, ignoring generated files.


Development

Successfully merging this pull request may close these issues.

[Feature] Open up the Tavily API endpoint configuration to support community relay sites
[Feature] Add EXA MCP for web search

3 participants