CodeScaleBench (CSB) is a benchmark suite of **370 software engineering tasks** spanning the full Software Development Lifecycle (SDLC), designed to evaluate AI coding agents on large-codebase and cross-repository work. This report compares a context-retrieval condition that uses Model Context Protocol (MCP) tools against a baseline condition with local source access and no external retrieval tools.

In the analysis set used here (generated March 3, 2026 from `runs/analysis`), there are **1,281 valid scored rows**, **1,822 total rows**, and **370 paired baseline/MCP tasks** after averaging multiple runs per task/config. The overall paired reward delta is **+0.0349** (MCP minus baseline), with **+0.0363** on SDLC tasks and **+0.0339** on Org tasks; the report provides suite-level reward breakdowns across all 20 suites. Retrieval evaluation on the same analysis set yields **799** event files and **311** computable tasks, with aggregate file-level metrics of **0.4598 file recall** and **0.3644 MRR**. The report also includes pair-normalized cost and time calculations from matched baseline/MCP tasks, including average cost-per-task and wall-clock deltas.

This report documents the benchmark design, construction, retrieval evaluation pipeline, verifier architecture, and current findings.
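For readers reproducing these numbers, the aggregation follows the definitions above: per-task rewards are averaged over repeated runs within each config before pairing, and retrieval is scored per task with file-level recall and mean reciprocal rank (MRR). The sketch below illustrates this under assumed data shapes; the row fields (`task_id`, `config`, `reward`) and the `retrieved`/`relevant` file lists are hypothetical names, not the report's actual schema.

```python
from collections import defaultdict
from statistics import mean

def paired_reward_delta(rows):
    """Average repeated runs per (task, config), then average the
    per-task (MCP - baseline) deltas over tasks that have both configs."""
    by_key = defaultdict(list)
    for r in rows:  # assumed row shape: {"task_id": str, "config": "mcp" | "baseline", "reward": float}
        by_key[(r["task_id"], r["config"])].append(r["reward"])
    per_task = {k: mean(v) for k, v in by_key.items()}
    deltas = [
        per_task[(task, "mcp")] - per_task[(task, "baseline")]
        for (task, cfg) in per_task
        if cfg == "mcp" and (task, "baseline") in per_task
    ]
    return mean(deltas) if deltas else None

def file_recall(retrieved, relevant):
    """Fraction of relevant files appearing anywhere in the retrieved list."""
    return len(set(retrieved) & set(relevant)) / len(relevant) if relevant else 0.0

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant file in the ranked retrieval list."""
    for rank, f in enumerate(retrieved, start=1):
        if f in relevant:
            return 1.0 / rank
    return 0.0
```

Under this reading, the aggregate **0.4598** file recall and **0.3644** MRR would be the means of these per-task values over the **311** computable tasks.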