CodeScaleBench (CSB) is a benchmark suite of **370 software engineering tasks** spanning the full Software Development Lifecycle (SDLC), designed to evaluate AI coding agents on large-codebase and cross-repository work. This report compares a context-retrieval condition that uses Model Context Protocol (MCP) tools against a baseline condition with local source access and no external retrieval tools.

In the analysis set used here (generated March 3, 2026 from `runs/analysis`), there are **1,281 valid scored rows**, **1,822 total rows**, and **370 paired baseline/MCP tasks** after averaging multiple runs per task/config. The overall paired reward delta is **+0.0349** (MCP minus baseline), with **+0.0363** on SDLC tasks and **+0.0339** on Org tasks; the report provides suite-level reward breakdowns across all 20 suites. Retrieval evaluation on the same analysis set yields **799** event files and **311** computable tasks, with aggregate file-level metrics of **0.4598 file recall** and **0.3644 MRR**. The report also includes pair-normalized cost and time calculations from matched baseline/MCP tasks, including average cost-per-task and wall-clock deltas.

This report documents the benchmark design, construction, retrieval evaluation pipeline, verifier architecture, and current findings.
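For readers reproducing these numbers, the aggregation follows the definitions above: per-task rewards are averaged over repeated runs within each config before pairing, and retrieval is scored per task with file-level recall and mean reciprocal rank (MRR). The sketch below illustrates this under assumed data shapes; the row fields (`task_id`, `config`, `reward`) and the `retrieved`/`relevant` file lists are hypothetical names, not the report's actual schema.

```python
from collections import defaultdict
from statistics import mean

def paired_reward_delta(rows):
    """Average repeated runs per (task, config), then average the
    per-task (MCP - baseline) deltas over tasks that have both configs."""
    by_key = defaultdict(list)
    for r in rows:  # assumed row shape: {"task_id": str, "config": "mcp" | "baseline", "reward": float}
        by_key[(r["task_id"], r["config"])].append(r["reward"])
    per_task = {k: mean(v) for k, v in by_key.items()}
    deltas = [
        per_task[(task, "mcp")] - per_task[(task, "baseline")]
        for (task, cfg) in per_task
        if cfg == "mcp" and (task, "baseline") in per_task
    ]
    return mean(deltas) if deltas else None

def file_recall(retrieved, relevant):
    """Fraction of relevant files appearing anywhere in the retrieved list."""
    return len(set(retrieved) & set(relevant)) / len(relevant) if relevant else 0.0

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant file in the ranked retrieval list."""
    for rank, f in enumerate(retrieved, start=1):
        if f in relevant:
            return 1.0 / rank
    return 0.0
```

Under this reading, the aggregate **0.4598** file recall and **0.3644** MRR would be the means of these per-task values over the **311** computable tasks.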