
Commit efbc0c0 (1 parent: 9058565)

Clarify abstract metrics and MCP framing

1 file changed (+1, -1)

docs/technical_reports/TECHNICAL_REPORT.md

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@
 
 ## Abstract
 
-CodeScaleBench (CSB) is a benchmark suite of **370 software engineering tasks** spanning the full Software Development Lifecycle (SDLC), designed to evaluate AI coding agents on large-codebase and cross-repository software engineering work. This report evaluates one context-retrieval condition using Model Context Protocol (MCP) tools against a baseline condition with local source access and no external retrieval tools. In the analysis set used here (generated March 3, 2026 from `runs/analysis`), there are **1,281 valid scored rows**, **1,822 total rows**, and **370 paired baseline/MCP tasks** after averaging multiple runs per task/config. The overall paired reward delta is **+0.0349** (MCP minus baseline), with **+0.0363** on SDLC and **+0.0339** on Org, and the report provides suite-level reward breakdowns across all 20 suites. Retrieval evaluation on the same analysis set yields **799** event files, **311** computable tasks, and aggregate file-level metrics of **0.4598 file recall** and **0.3644 MRR**. The report also includes pair-normalized cost and time calculations from matched baseline/MCP tasks, including average cost-per-task and wall-clock deltas. This report documents the benchmark design, construction, retrieval evaluation pipeline, verifier architecture, and current findings.
+CodeScaleBench (CSB) is a benchmark suite of **370 software engineering tasks** spanning the full Software Development Lifecycle (SDLC), designed to evaluate AI coding agents on large-codebase and cross-repository software engineering work. This report evaluates one context retrieval condition using **Model Context Protocol (MCP)** tools against a baseline condition with local source access and no external retrieval tools. Results are reported using pair-normalized baseline vs MCP comparisons on matched tasks, with suite-level reward breakdowns across 20 suites. The overall paired reward delta is **+0.0349** (MCP minus baseline), with **+0.0363** on SDLC and **+0.0339** on Org. For retrieval quality on the curated analysis set (329 paired tasks), combined metrics improve from baseline to MCP as follows: **Precision@10 0.095 -> 0.313**, **Recall@10 0.120 -> 0.272**, and **F1@10 0.091 -> 0.240**. The report also includes pair-normalized cost and time calculations from matched baseline/MCP tasks, including average cost-per-task and wall-clock deltas. This report documents the benchmark design, construction, retrieval evaluation pipeline, verifier architecture, and current findings.
 
 ---
 
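The revised abstract reports top-k retrieval metrics (Precision@10, Recall@10, F1@10) computed per task from matched baseline/MCP runs. A minimal sketch of how such file-level metrics are typically computed is below; the function name, inputs, and example file paths are illustrative assumptions, not taken from the CSB pipeline.

```python
def precision_recall_f1_at_k(retrieved, relevant, k=10):
    """Hypothetical sketch: file-level Precision@k, Recall@k, and F1@k for one task.

    retrieved: ranked list of file paths the agent surfaced.
    relevant:  gold set of files that should have been retrieved.
    """
    top_k = retrieved[:k]
    hits = len(set(top_k) & set(relevant))  # gold files found in the top k
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = (2 * precision * recall / denom) if denom else 0.0
    return precision, recall, f1

# Illustrative example: 3 of 5 gold files appear in the agent's top 10.
retrieved = [f"src/file_{i}.py" for i in range(10)]
relevant = ["src/file_0.py", "src/file_3.py", "src/file_7.py",
            "src/gold_a.py", "src/gold_b.py"]
p, r, f1 = precision_recall_f1_at_k(retrieved, relevant)
print(round(p, 3), round(r, 3), round(f1, 3))  # → 0.3 0.6 0.4
```

Suite-level figures like those in the abstract would then be averages of these per-task values over the paired tasks in the analysis set.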
