diff --git a/README.md b/README.md
index 4d03736..ca85496 100644
--- a/README.md
+++ b/README.md
@@ -39,7 +39,7 @@ Unlike benchmarks that test coding ability or factual recall, ResearchClawBench
 🔄
 Two-Stage Pipeline
 Autonomous research + rigorous peer-review-style evaluation
 🧪
 40 Real-Science Tasks
 10 disciplines, curated datasets from published papers
 👁️
 Expert-Annotated Data
 Tasks, checklists & datasets curated by domain experts
-🤖
-Multi-Agent Support
-Claude Code, Codex CLI, OpenClaw, Nanobot, EvoScientist, ResearchHarness & custom agents
+🤖
+Multi-Agent Support
+Claude Code, Codex CLI, OpenClaw, Nanobot, EvoScientist, ResearchClaw, ResearchHarness & custom agents
 🚀
 Re-Discovery to New-Discovery
 50 = match the paper, 70+ = surpass it
@@ -60,11 +60,12 @@ Most AI benchmarks evaluate what models **know**. We evaluate what agents can **
 - **Real science, not toy problems.** 40 tasks sourced from published papers across 10 disciplines, each with curated experimental datasets.
 - **Two-stage pipeline.** Autonomous research first, rigorous evaluation second - just like peer review.
 - **Fine-grained, multimodal scoring.** A weighted checklist with text and image criteria, judged by an LLM acting as a strict peer reviewer.
-- **Agent-agnostic.** Ships with built-in support for Claude Code, Codex CLI, OpenClaw, Nanobot, EvoScientist, and a lightweight ResearchHarness baseline. Bring your own agent in one line.
+- **Agent-agnostic.** Ships with built-in support for Claude Code, Codex CLI, OpenClaw, Nanobot, EvoScientist, ResearchClaw, and a lightweight ResearchHarness baseline. Bring your own agent in one line.
 - **From Re-Discovery to New-Discovery.** Scoring above 50 means matching the original paper; above 70 means *surpassing* it. The frontier is wide open.
 
 ### 📢 News
 
+- **2026-04-10** 🔬 Added built-in [ResearchClaw](https://github.com/researchclaw/researchclaw) support - an intelligent agent-powered research assistant with built-in skills for paper search, literature review, and data analysis.
 - **2026-04-07** 🧪 Added built-in [ResearchHarness](https://github.com/black-yt/ResearchHarness) support as a lightweight baseline agent for testing different LLMs under the same ResearchClawBench workflow.
 - **2026-03-30** 🧬 Added built-in [EvoScientist](https://github.com/EvoScientist/EvoScientist) support and clarified multimodal judge prompting so the first attached image is explicitly treated as the ground-truth figure.
 - **2026-03-27** 🤗 Released a Hugging Face dataset mirror at [InternScience/ResearchClawBench](https://huggingface.co/datasets/InternScience/ResearchClawBench), including 10 additional tasks from ResearchClawBench-Self and a task downloader script.
@@ -350,6 +351,7 @@ Install whichever agent(s) you plan to benchmark. You do not need all six.
 | **OpenClaw** | [OpenClaw](https://openclaw.ai/) | Official website and setup entry |
 | **Nanobot** | [HKUDS/nanobot](https://github.com/HKUDS/nanobot) | Official GitHub repository |
 | **EvoScientist** | [EvoScientist/EvoScientist](https://github.com/EvoScientist/EvoScientist) | Official GitHub repository |
+| **ResearchClaw** | [researchclaw/researchclaw](https://github.com/researchclaw/researchclaw) | `pip install researchclaw` |
 | **ResearchHarness** | [black-yt/ResearchHarness benchmark README](https://github.com/black-yt/ResearchHarness/blob/main/benchmarks/ResearchClawBench/README.md) | Lightweight baseline harness for testing different LLMs; replace `/abs/path/to/ResearchHarness` in `agents.json` |
 
 #### 5. Launch
@@ -366,7 +368,7 @@ After a run completes, switch to the **Evaluation** tab and click **Score**. The
 
 ### 🤖 Supported Agents
 
-ResearchClawBench ships with built-in support for five frontier coding agents plus a lightweight ResearchHarness baseline:
+ResearchClawBench ships with built-in support for five frontier coding agents, ResearchClaw, plus a lightweight ResearchHarness baseline:
 
 | Agent | Command | Notes |
 |:------|:--------|:------|
@@ -375,6 +377,7 @@ ResearchClawBench ships with built-in support for five frontier coding agents pl
 | **OpenClaw** | `openclaw agent ...` | Self-hosted gateway, 3600s timeout |
 | **Nanobot** | `nanobot agent -m ...` | Ultra-lightweight, reliable tool execution |
 | **EvoScientist** | `evosci --ui cli ...` | Self-evolving AI Scientists |
+| **ResearchClaw** | `researchclaw agent -m ...` | AI research assistant with built-in skills |
 | **ResearchHarness** | `python3 /abs/path/to/ResearchHarness/run_agent.py ...` | Lightweight baseline harness for testing different LLMs |
 
 #### 🔧 Add Your Own Agent
diff --git a/evaluation/agents.json b/evaluation/agents.json
index 85de0ed..afc68e4 100644
--- a/evaluation/agents.json
+++ b/evaluation/agents.json
@@ -34,5 +34,11 @@
     "icon": "H",
     "logo": "/static/logos/rh.svg",
     "cmd": "python3 /abs/path/to/ResearchHarness/run_agent.py --workspace-root --role-prompt-file /abs/path/to/ResearchHarness/benchmarks/ResearchClawBench/role_prompt.md --trace-dir "
+  },
+  "researchclaw": {
+    "label": "ResearchClaw",
+    "icon": "R",
+    "logo": "/static/logos/researchclaw.svg",
+    "cmd": "researchclaw agent -m -w "
   }
 }
diff --git a/evaluation/static/logos/researchclaw.svg b/evaluation/static/logos/researchclaw.svg
new file mode 100644
index 0000000..aa98b37
--- /dev/null
+++ b/evaluation/static/logos/researchclaw.svg
@@ -0,0 +1,4 @@
+
+
+  RC
+
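
Review note: the new `researchclaw` entry in `evaluation/agents.json` uses the same four keys as the existing ResearchHarness entry (`label`, `icon`, `logo`, `cmd`). A minimal sketch of a pre-merge sanity check follows, assuming those four keys are the full required set (the diff shows only two entries, so this is an inference, not a documented schema). The helper name `find_incomplete_agents` and the `broken` sample entry are hypothetical and not part of the repository.

```python
import json

# Keys observed on the entries visible in the diff. Assumed (not confirmed)
# to be the complete set every agent entry needs.
REQUIRED_KEYS = ("label", "icon", "logo", "cmd")


def find_incomplete_agents(agents):
    """Return {agent_id: [missing keys]} for entries lacking any observed key."""
    problems = {}
    for agent_id, entry in agents.items():
        missing = [key for key in REQUIRED_KEYS if key not in entry]
        if missing:
            problems[agent_id] = missing
    return problems


# Sample data: the first entry mirrors the one added in this diff (its "cmd"
# string is copied as shown there); "broken" is a made-up counterexample.
SAMPLE = json.loads("""
{
  "researchclaw": {
    "label": "ResearchClaw",
    "icon": "R",
    "logo": "/static/logos/researchclaw.svg",
    "cmd": "researchclaw agent -m -w "
  },
  "broken": {"label": "Broken", "cmd": "echo"}
}
""")

print(find_incomplete_agents(SAMPLE))  # -> {'broken': ['icon', 'logo']}
```

To run the same check against the real file, load it with `json.load(open("evaluation/agents.json"))` and pass the result to the helper; an empty dict means every entry carries all four keys.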