9 changes: 6 additions & 3 deletions README.md
@@ -39,7 +39,7 @@ Unlike benchmarks that test coding ability or factual recall, ResearchClawBench
<td align="center" width="25%">🔄<br/><b>Two-Stage Pipeline</b><br/><sub>Autonomous research + rigorous peer-review-style evaluation</sub></td>
<td align="center" width="25%">🧪<br/><b>40 Real-Science Tasks</b><br/><sub>10 disciplines, curated datasets from published papers</sub></td>
<td align="center" width="25%">👁️<br/><b>Expert-Annotated Data</b><br/><sub>Tasks, checklists & datasets curated by domain experts</sub></td>
<td align="center" width="25%">🤖<br/><b>Multi-Agent Support</b><br/><sub>Claude Code, Codex CLI, OpenClaw, Nanobot, EvoScientist, ResearchHarness & custom agents</sub></td>
<td align="center" width="25%">🤖<br/><b>Multi-Agent Support</b><br/><sub>Claude Code, Codex CLI, OpenClaw, Nanobot, EvoScientist, ResearchClaw, ResearchHarness & custom agents</sub></td>
</tr>
<tr>
<td align="center">🚀<br/><b>Re-Discovery to New-Discovery</b><br/><sub>50 = match the paper, 70+ = surpass it</sub></td>
@@ -60,11 +60,12 @@ Most AI benchmarks evaluate what models **know**. We evaluate what agents can **do**.
- **Real science, not toy problems.** 40 tasks sourced from published papers across 10 disciplines, each with curated experimental datasets.
- **Two-stage pipeline.** Autonomous research first, rigorous evaluation second — just like peer review.
- **Fine-grained, multimodal scoring.** A weighted checklist with text and image criteria, judged by an LLM acting as a strict peer reviewer.
- **Agent-agnostic.** Ships with built-in support for Claude Code, Codex CLI, OpenClaw, Nanobot, EvoScientist, and a lightweight ResearchHarness baseline. Bring your own agent in one line.
- **Agent-agnostic.** Ships with built-in support for Claude Code, Codex CLI, OpenClaw, Nanobot, EvoScientist, ResearchClaw, and a lightweight ResearchHarness baseline. Bring your own agent in one line.
- **From Re-Discovery to New-Discovery.** Scoring above 50 means matching the original paper; above 70 means *surpassing* it. The frontier is wide open.

### 📢 News

- **2026-04-10** 🔬 Added built-in [ResearchClaw](https://github.com/researchclaw/researchclaw) support, an agent-powered research assistant with built-in skills for paper search, literature review, and data analysis.
- **2026-04-07** 🧪 Added built-in [ResearchHarness](https://github.com/black-yt/ResearchHarness) support as a lightweight baseline agent for testing different LLMs under the same ResearchClawBench workflow.
- **2026-03-30** 🧬 Added built-in [EvoScientist](https://github.com/EvoScientist/EvoScientist) support and clarified multimodal judge prompting so the first attached image is explicitly treated as the ground-truth figure.
- **2026-03-27** 🤗 Released a Hugging Face dataset mirror at [InternScience/ResearchClawBench](https://huggingface.co/datasets/InternScience/ResearchClawBench), including 10 additional tasks from ResearchClawBench-Self and a task downloader script.
@@ -350,6 +351,7 @@ Install whichever agent(s) you plan to benchmark. You do not need all of them.
| **OpenClaw** | [OpenClaw](https://openclaw.ai/) | Official website and setup entry |
| **Nanobot** | [HKUDS/nanobot](https://github.com/HKUDS/nanobot) | Official GitHub repository |
| **EvoScientist** | [EvoScientist/EvoScientist](https://github.com/EvoScientist/EvoScientist) | Official GitHub repository |
| **ResearchClaw** | [researchclaw/researchclaw](https://github.com/researchclaw/researchclaw) | `pip install researchclaw` |
| **ResearchHarness** | [black-yt/ResearchHarness benchmark README](https://github.com/black-yt/ResearchHarness/blob/main/benchmarks/ResearchClawBench/README.md) | Lightweight baseline harness for testing different LLMs; replace `/abs/path/to/ResearchHarness` in `agents.json` |

#### 5. Launch
@@ -366,7 +368,7 @@ After a run completes, switch to the **Evaluation** tab and click **Score**. The

### 🤖 Supported Agents

ResearchClawBench ships with built-in support for five frontier coding agents plus a lightweight ResearchHarness baseline:
ResearchClawBench ships with built-in support for five frontier coding agents, ResearchClaw, plus a lightweight ResearchHarness baseline:
**Review comment (medium):** The count of supported frontier coding agents should be updated to "six" now that ResearchClaw has been added. The current phrasing "five frontier coding agents, ResearchClaw," is grammatically inconsistent and less concise than the previous format.

Suggested change:

Old: ResearchClawBench ships with built-in support for five frontier coding agents, ResearchClaw, plus a lightweight ResearchHarness baseline:
New: ResearchClawBench ships with built-in support for six frontier coding agents plus a lightweight ResearchHarness baseline:


| Agent | Command | Notes |
|:------|:--------|:------|
@@ -375,6 +377,7 @@ ResearchClawBench ships with built-in support for five frontier coding agents plus a lightweight ResearchHarness baseline:
| <img src="evaluation/static/logos/openclaw.svg" width="16" /> **OpenClaw** | `openclaw agent ...` | Self-hosted gateway, 3600s timeout |
| <img src="evaluation/static/logos/nanobot.svg" width="16" /> **Nanobot** | `nanobot agent -m ...` | Ultra-lightweight, reliable tool execution |
| <img src="evaluation/static/logos/evo.svg" width="16" /> **EvoScientist** | `evosci --ui cli ...` | Self-evolving AI Scientists |
| <img src="evaluation/static/logos/researchclaw.svg" width="16" /> **ResearchClaw** | `researchclaw agent -m ...` | AI research assistant with built-in skills |
| <img src="evaluation/static/logos/rh.svg" width="16" /> **ResearchHarness** | `python3 /abs/path/to/ResearchHarness/run_agent.py ...` | Lightweight baseline harness for testing different LLMs |

#### 🔧 Add Your Own Agent
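The body of this section is collapsed in the diff. Based on the built-in entries shown in the `evaluation/agents.json` change below, a custom registration would plausibly follow the same shape; the field names mirror the diff, while the agent name, logo path, and command here are hypothetical placeholders:

```json
{
  "myagent": {
    "label": "MyAgent",
    "icon": "M",
    "logo": "/static/logos/myagent.svg",
    "cmd": "myagent run -m <PROMPT> -w <WORKSPACE>"
  }
}
```

The `<PROMPT>` and `<WORKSPACE>` placeholders are substituted by the benchmark launcher at run time, matching the `cmd` templates of the built-in agents.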
6 changes: 6 additions & 0 deletions evaluation/agents.json
@@ -34,5 +34,11 @@
"icon": "H",
"logo": "/static/logos/rh.svg",
"cmd": "python3 /abs/path/to/ResearchHarness/run_agent.py <PROMPT> --workspace-root <WORKSPACE> --role-prompt-file /abs/path/to/ResearchHarness/benchmarks/ResearchClawBench/role_prompt.md --trace-dir <WORKSPACE>"
},
"researchclaw": {
"label": "ResearchClaw",
"icon": "R",
"logo": "/static/logos/researchclaw.svg",
"cmd": "researchclaw agent -m <PROMPT> -w <WORKSPACE>"
}
}
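Each `cmd` entry above is a shell-style template with `<PROMPT>` and `<WORKSPACE>` placeholders. The sketch below shows how such a template could be expanded into an argv list; the `expand_cmd` helper and the shlex-based tokenization are illustrative assumptions, not the benchmark's actual launcher code:

```python
import json
import shlex

# The "researchclaw" entry from evaluation/agents.json, as added in this diff.
AGENTS_JSON = """
{
  "researchclaw": {
    "label": "ResearchClaw",
    "icon": "R",
    "logo": "/static/logos/researchclaw.svg",
    "cmd": "researchclaw agent -m <PROMPT> -w <WORKSPACE>"
  }
}
"""

def expand_cmd(template: str, prompt: str, workspace: str) -> list[str]:
    # Tokenize first, then substitute, so a prompt containing spaces
    # stays a single argv element instead of being re-split by a shell.
    return [
        tok.replace("<PROMPT>", prompt).replace("<WORKSPACE>", workspace)
        for tok in shlex.split(template)
    ]

agents = json.loads(AGENTS_JSON)
argv = expand_cmd(
    agents["researchclaw"]["cmd"],
    "Reproduce the paper's main result",
    "/tmp/bench-ws",
)
print(argv)
```

Substituting after tokenization keeps a multi-word prompt as a single `-m` argument, which is why the templates can be run without extra quoting.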
4 changes: 4 additions & 0 deletions evaluation/static/logos/researchclaw.svg