Commit 07735ed

sjarmak and claude committed
fix: correct stale metadata + complete SDLC variance coverage

- Fix ccb_understand per_suite count 34→20 (was stale after narrowing)
- Fix total_selected 400→398 to match actual task count
- Add 3rd variance pass for openlibrary-search-query-fix-001 and pytorch-tracer-graph-cleanup-fix-001 (promoted fix_haiku_20260301_190026)
- All 178 SDLC tasks now have >= 3 runs in both baseline and MCP configs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

1 parent 0c856b7 commit 07735ed
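
The coverage claim in the commit message (every SDLC task has at least 3 runs in both configs) can be checked mechanically. The following is a minimal sketch, not the repository's actual tooling: the record shape (`task_id`, `config` keys) and the function name `undercovered_tasks` are assumptions made for illustration, and the example records are hypothetical rather than taken from `runs/official/`.

```python
from collections import Counter

# Config names as they appear in the results tables of this commit.
CONFIGS = ("baseline-local-direct", "mcp-remote-direct")
MIN_RUNS = 3  # threshold stated in the commit message


def undercovered_tasks(runs, task_ids, configs=CONFIGS, min_runs=MIN_RUNS):
    """Return {task_id: {config: run_count}} for every task that has
    fewer than min_runs scored runs in at least one config.

    `runs` is an iterable of dicts with hypothetical keys
    "task_id" and "config" (one dict per scored run).
    """
    counts = Counter((r["task_id"], r["config"]) for r in runs)
    gaps = {}
    for task_id in task_ids:
        # Counter returns 0 for (task, config) pairs that never ran.
        short = {
            c: counts[(task_id, c)]
            for c in configs
            if counts[(task_id, c)] < min_runs
        }
        if short:
            gaps[task_id] = short
    return gaps


# Hypothetical example: one task fully covered, one missing an MCP run.
example_tasks = [
    "openlibrary-search-query-fix-001",
    "pytorch-tracer-graph-cleanup-fix-001",
]
example_runs = (
    [{"task_id": example_tasks[0], "config": c} for c in CONFIGS for _ in range(3)]
    + [{"task_id": example_tasks[1], "config": CONFIGS[0]} for _ in range(3)]
    + [{"task_id": example_tasks[1], "config": CONFIGS[1]} for _ in range(2)]
)
example_gaps = undercovered_tasks(example_runs, example_tasks)
```

An empty result would correspond to the state the commit message describes; any non-empty entry names the task and config that still needs another variance pass.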

1,258 files changed, +1420384 -39331 lines changed

configs/selected_benchmark_tasks.json

Lines changed: 2 additions & 2 deletions
@@ -5,7 +5,7 @@
  "generated_by": "SDLC suite migration from migration_map.json",
  "generated_date": "2026-02-18",
  "total_available": 835,
- "total_selected": 400,
+ "total_selected": 398,
  "migration_source": "migration_map.json (157 mapped tasks across 8 SDLC suites)",
  "target_total": 294,
  "target_note": "178 SDLC tasks (9 suites, ccb_test=18 after TAC task removal) + 220 MCP-unique tasks (11 suites x 20) = 398 active. 2 llamacpp TAC tasks dropped (need external RocketChat server).",
@@ -167,7 +167,7 @@
  "ccb_refactor": 20,
  "ccb_secure": 20,
  "ccb_test": 18,
- "ccb_understand": 34
+ "ccb_understand": 20
  }
  },
  "tasks": [

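Both metadata fixes above are instances of the same drift: a hand-maintained count that no longer matches the task list. A small consistency check along the following lines could catch this before commit. This is a sketch under assumptions: `total_selected`, `per_suite`, and `tasks` are names from the diff and commit message, but the per-task key holding the suite name (`suite_key`, defaulting to `"benchmark"`) and the function name are hypothetical, and the values are passed in already extracted because the file's nesting is not visible in the hunks.

```python
from collections import Counter


def check_selection_metadata(total_selected, per_suite, tasks, suite_key="benchmark"):
    """Compare declared counts with the actual task list.

    total_selected: declared number of selected tasks.
    per_suite:      dict mapping suite name -> declared task count.
    tasks:          list of task dicts; suite_key (hypothetical) names
                    the field that holds each task's suite.
    Returns a list of human-readable mismatch descriptions.
    """
    problems = []
    if total_selected != len(tasks):
        problems.append(
            f"total_selected={total_selected} but {len(tasks)} tasks listed"
        )
    actual = Counter(t[suite_key] for t in tasks)
    # Check suites that appear in either the declaration or the task list.
    for suite in sorted(set(per_suite) | set(actual)):
        declared = per_suite.get(suite, 0)
        found = actual.get(suite, 0)
        if declared != found:
            problems.append(f"{suite}: per_suite={declared} but {found} tasks listed")
    return problems


# The arithmetic stated in target_note: 178 SDLC + 11 suites x 20 MCP-unique.
assert 178 + 11 * 20 == 398
```

With the pre-commit values (`total_selected` of 400, `ccb_understand` declared as 34), a check like this would presumably have reported both mismatches that the commit corrects.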
docs/official_results/README.md

Lines changed: 53 additions & 15 deletions
@@ -2,7 +2,7 @@

  This bundle is generated from `runs/official/` and includes only valid scored tasks (`passed`/`failed` with numeric reward).

- Generated: `2026-03-01T13:28:24.423549+00:00`
+ Generated: `2026-03-01T19:12:02.665318+00:00`

  ## Local Browse

@@ -19,16 +19,16 @@ Historical reruns/backfills remain available in `data/official_results.json` und
  |---|---|---:|---:|---:|---:|---|
  | [ccb_build](suites/ccb_build.md) | `baseline-local-direct` | 23 | 23 | 0.580 | 0.783 | ok |
  | [ccb_build](suites/ccb_build.md) | `mcp-remote-direct` | 20 | 23 | 0.592 | 0.800 | FLAG: below minimum |
- | [ccb_debug](suites/ccb_debug.md) | `baseline-local-direct` | 16 | 20 | 0.739 | 1.000 | FLAG: below minimum |
- | [ccb_debug](suites/ccb_debug.md) | `mcp-remote-direct` | 16 | 20 | 0.559 | 0.688 | FLAG: below minimum |
- | [ccb_design](suites/ccb_design.md) | `baseline-local-direct` | 20 | 20 | 0.766 | 1.000 | ok |
- | [ccb_design](suites/ccb_design.md) | `mcp-remote-direct` | 33 | 20 | 0.741 | 1.000 | ok |
- | [ccb_document](suites/ccb_document.md) | `baseline-local-direct` | 20 | 20 | 0.890 | 1.000 | ok |
- | [ccb_document](suites/ccb_document.md) | `mcp-remote-direct` | 44 | 20 | 0.841 | 1.000 | ok |
+ | [ccb_debug](suites/ccb_debug.md) | `baseline-local-direct` | 20 | 20 | 0.688 | 1.000 | ok |
+ | [ccb_debug](suites/ccb_debug.md) | `mcp-remote-direct` | 25 | 20 | 0.510 | 0.720 | ok |
+ | [ccb_design](suites/ccb_design.md) | `baseline-local-direct` | 20 | 20 | 0.770 | 1.000 | ok |
+ | [ccb_design](suites/ccb_design.md) | `mcp-remote-direct` | 33 | 20 | 0.720 | 0.970 | ok |
+ | [ccb_document](suites/ccb_document.md) | `baseline-local-direct` | 20 | 20 | 0.845 | 1.000 | ok |
+ | [ccb_document](suites/ccb_document.md) | `mcp-remote-direct` | 44 | 20 | 0.875 | 1.000 | ok |
  | [ccb_feature](suites/ccb_feature.md) | `baseline-local-direct` | 20 | 20 | 0.631 | 0.850 | ok |
  | [ccb_feature](suites/ccb_feature.md) | `mcp-remote-direct` | 20 | 20 | 0.553 | 0.800 | ok |
  | [ccb_fix](suites/ccb_fix.md) | `baseline-local-direct` | 25 | 25 | 0.450 | 0.600 | ok |
- | [ccb_fix](suites/ccb_fix.md) | `mcp-remote-direct` | 70 | 25 | 0.572 | 0.714 | ok |
+ | [ccb_fix](suites/ccb_fix.md) | `mcp-remote-direct` | 72 | 25 | 0.556 | 0.694 | ok |
  | [ccb_mcp_compliance](suites/ccb_mcp_compliance.md) | `baseline-local-artifact` | 1 | 28 | 0.375 | 1.000 | FLAG: below minimum |
  | [ccb_mcp_compliance](suites/ccb_mcp_compliance.md) | `baseline-local-direct` | 7 | 28 | 0.648 | 1.000 | FLAG: below minimum |
  | [ccb_mcp_compliance](suites/ccb_mcp_compliance.md) | `mcp-remote-artifact` | 1 | 28 | 0.742 | 1.000 | FLAG: below minimum |
@@ -65,13 +65,13 @@ Historical reruns/backfills remain available in `data/official_results.json` und
  | [ccb_mcp_security](suites/ccb_mcp_security.md) | `baseline-local-direct` | 10 | 25 | 0.524 | 0.800 | FLAG: below minimum |
  | [ccb_mcp_security](suites/ccb_mcp_security.md) | `mcp-remote-artifact` | 6 | 25 | 0.792 | 1.000 | FLAG: below minimum |
  | [ccb_mcp_security](suites/ccb_mcp_security.md) | `mcp-remote-direct` | 25 | 25 | 0.719 | 1.000 | ok |
- | [ccb_refactor](suites/ccb_refactor.md) | `baseline-local-direct` | 20 | 20 | 0.755 | 1.000 | ok |
- | [ccb_refactor](suites/ccb_refactor.md) | `mcp-remote-direct` | 20 | 20 | 0.671 | 1.000 | ok |
- | [ccb_secure](suites/ccb_secure.md) | `baseline-local-direct` | 20 | 20 | 0.669 | 0.950 | ok |
- | [ccb_secure](suites/ccb_secure.md) | `mcp-remote-direct` | 24 | 20 | 0.637 | 0.917 | ok |
- | [ccb_test](suites/ccb_test.md) | `baseline-local-direct` | 20 | 20 | 0.480 | 0.750 | ok |
- | [ccb_test](suites/ccb_test.md) | `mcp-remote-direct` | 35 | 20 | 0.475 | 0.686 | ok |
- | [ccb_understand](suites/ccb_understand.md) | `baseline-local-direct` | 34 | 20 | 0.771 | 0.853 | ok |
+ | [ccb_refactor](suites/ccb_refactor.md) | `baseline-local-direct` | 20 | 20 | 0.789 | 0.950 | ok |
+ | [ccb_refactor](suites/ccb_refactor.md) | `mcp-remote-direct` | 20 | 20 | 0.703 | 1.000 | ok |
+ | [ccb_secure](suites/ccb_secure.md) | `baseline-local-direct` | 20 | 20 | 0.712 | 1.000 | ok |
+ | [ccb_secure](suites/ccb_secure.md) | `mcp-remote-direct` | 24 | 20 | 0.707 | 0.958 | ok |
+ | [ccb_test](suites/ccb_test.md) | `baseline-local-direct` | 20 | 20 | 0.484 | 0.700 | ok |
+ | [ccb_test](suites/ccb_test.md) | `mcp-remote-direct` | 35 | 20 | 0.468 | 0.686 | ok |
+ | [ccb_understand](suites/ccb_understand.md) | `baseline-local-direct` | 34 | 20 | 0.902 | 0.971 | ok |
  | [ccb_understand](suites/ccb_understand.md) | `mcp-remote-direct` | 48 | 20 | 0.873 | 0.979 | ok |

  <details>
@@ -317,13 +317,35 @@ Historical reruns/backfills remain available in `data/official_results.json` und
  | [debug_haiku_20260228_231033](runs/debug_haiku_20260228_231033.md) | `ccb_debug` | `mcp-remote-direct` | 10 | 0.804 | 1.000 |
  | [debug_haiku_20260301_021540](runs/debug_haiku_20260301_021540.md) | `ccb_debug` | `baseline-local-direct` | 11 | 0.847 | 1.000 |
  | [debug_haiku_20260301_021540](runs/debug_haiku_20260301_021540.md) | `ccb_debug` | `mcp-remote-direct` | 11 | 0.813 | 1.000 |
+ | [debug_haiku_20260301_030159](runs/debug_haiku_20260301_030159.md) | `ccb_debug` | `baseline-local-direct` | 11 | 0.837 | 1.000 |
+ | [debug_haiku_20260301_030159](runs/debug_haiku_20260301_030159.md) | `ccb_debug` | `mcp-remote-direct` | 11 | 0.801 | 1.000 |
+ | [debug_haiku_20260301_031844](runs/debug_haiku_20260301_031844.md) | `ccb_debug` | `baseline-local-direct` | 11 | 0.806 | 1.000 |
+ | [debug_haiku_20260301_031844](runs/debug_haiku_20260301_031844.md) | `ccb_debug` | `mcp-remote-direct` | 11 | 0.750 | 1.000 |
+ | [debug_haiku_20260301_033225](runs/debug_haiku_20260301_033225.md) | `ccb_debug` | `baseline-local-direct` | 9 | 0.444 | 0.889 |
+ | [debug_haiku_20260301_033225](runs/debug_haiku_20260301_033225.md) | `ccb_debug` | `mcp-remote-direct` | 9 | 0.389 | 0.778 |
+ | [debug_haiku_20260301_035030](runs/debug_haiku_20260301_035030.md) | `ccb_debug` | `baseline-local-direct` | 9 | 0.333 | 0.667 |
+ | [debug_haiku_20260301_035030](runs/debug_haiku_20260301_035030.md) | `ccb_debug` | `mcp-remote-direct` | 9 | 0.278 | 0.556 |
+ | [debug_haiku_20260301_040300](runs/debug_haiku_20260301_040300.md) | `ccb_debug` | `baseline-local-direct` | 9 | 0.500 | 1.000 |
+ | [debug_haiku_20260301_040300](runs/debug_haiku_20260301_040300.md) | `ccb_debug` | `mcp-remote-direct` | 9 | 0.389 | 0.778 |
+ | [debug_haiku_20260301_071226](runs/debug_haiku_20260301_071226.md) | `ccb_debug` | `baseline-local-direct` | 11 | 0.842 | 1.000 |
+ | [debug_haiku_20260301_071226](runs/debug_haiku_20260301_071226.md) | `ccb_debug` | `mcp-remote-direct` | 11 | 0.841 | 1.000 |
  | [design_haiku_20260223_124652](runs/design_haiku_20260223_124652.md) | `ccb_design` | `baseline-local-direct` | 13 | 0.770 | 1.000 |
  | [design_haiku_20260223_124652](runs/design_haiku_20260223_124652.md) | `ccb_design` | `mcp-remote-direct` | 20 | 0.718 | 1.000 |
  | [design_haiku_20260301_022406](runs/design_haiku_20260301_022406.md) | `ccb_design` | `baseline-local-direct` | 20 | 0.766 | 1.000 |
  | [design_haiku_20260301_022406](runs/design_haiku_20260301_022406.md) | `ccb_design` | `mcp-remote-direct` | 20 | 0.734 | 1.000 |
+ | [design_haiku_20260301_031030](runs/design_haiku_20260301_031030.md) | `ccb_design` | `baseline-local-direct` | 20 | 0.762 | 0.950 |
+ | [design_haiku_20260301_031030](runs/design_haiku_20260301_031030.md) | `ccb_design` | `mcp-remote-direct` | 20 | 0.747 | 1.000 |
+ | [design_haiku_20260301_031845](runs/design_haiku_20260301_031845.md) | `ccb_design` | `baseline-local-direct` | 20 | 0.807 | 1.000 |
+ | [design_haiku_20260301_031845](runs/design_haiku_20260301_031845.md) | `ccb_design` | `mcp-remote-direct` | 19 | 0.701 | 1.000 |
+ | [design_haiku_20260301_071227](runs/design_haiku_20260301_071227.md) | `ccb_design` | `baseline-local-direct` | 20 | 0.770 | 1.000 |
+ | [design_haiku_20260301_071227](runs/design_haiku_20260301_071227.md) | `ccb_design` | `mcp-remote-direct` | 20 | 0.699 | 0.950 |
  | [document_haiku_20260223_164240](runs/document_haiku_20260223_164240.md) | `ccb_document` | `baseline-local-direct` | 19 | 0.851 | 1.000 |
  | [document_haiku_20260223_164240](runs/document_haiku_20260223_164240.md) | `ccb_document` | `mcp-remote-direct` | 20 | 0.822 | 1.000 |
  | [document_haiku_20260226_013910](runs/document_haiku_20260226_013910.md) | `ccb_document` | `baseline-local-direct` | 1 | 1.000 | 1.000 |
+ | [document_haiku_20260301_031846](runs/document_haiku_20260301_031846.md) | `ccb_document` | `baseline-local-direct` | 20 | 0.875 | 1.000 |
+ | [document_haiku_20260301_031846](runs/document_haiku_20260301_031846.md) | `ccb_document` | `mcp-remote-direct` | 20 | 0.908 | 1.000 |
+ | [document_haiku_20260301_071228](runs/document_haiku_20260301_071228.md) | `ccb_document` | `baseline-local-direct` | 20 | 0.845 | 1.000 |
+ | [document_haiku_20260301_071228](runs/document_haiku_20260301_071228.md) | `ccb_document` | `mcp-remote-direct` | 20 | 0.898 | 1.000 |
  | [feature_haiku_20260228_190114](runs/feature_haiku_20260228_190114.md) | `ccb_feature` | `baseline-local-direct` | 5 | 0.507 | 0.600 |
  | [feature_haiku_20260228_190114](runs/feature_haiku_20260228_190114.md) | `ccb_feature` | `mcp-remote-direct` | 6 | 0.550 | 0.833 |
  | [feature_haiku_20260228_211127](runs/feature_haiku_20260228_211127.md) | `ccb_feature` | `baseline-local-direct` | 17 | 0.664 | 0.941 |
@@ -343,6 +365,8 @@ Historical reruns/backfills remain available in `data/official_results.json` und
  | [feature_haiku_20260301_071229](runs/feature_haiku_20260301_071229.md) | `ccb_feature` | `baseline-local-direct` | 20 | 0.631 | 0.850 |
  | [feature_haiku_20260301_071229](runs/feature_haiku_20260301_071229.md) | `ccb_feature` | `mcp-remote-direct` | 19 | 0.582 | 0.842 |
  | [feature_haiku_vscode_rerun_20260301_023018](runs/feature_haiku_vscode_rerun_20260301_023018.md) | `ccb_feature` | `baseline-local-direct` | 1 | 0.500 | 1.000 |
+ | [fix_haiku_20260301_190026](runs/fix_haiku_20260301_190026.md) | `ccb_fix` | `baseline-local-direct` | 2 | 0.000 | 0.000 |
+ | [fix_haiku_20260301_190026](runs/fix_haiku_20260301_190026.md) | `ccb_fix` | `mcp-remote-direct` | 2 | 0.000 | 0.000 |
  | [refactor_haiku_20260228_210652](runs/refactor_haiku_20260228_210652.md) | `ccb_refactor` | `baseline-local-direct` | 1 | 0.750 | 1.000 |
  | [refactor_haiku_20260228_210652](runs/refactor_haiku_20260228_210652.md) | `ccb_refactor` | `mcp-remote-direct` | 1 | 0.790 | 1.000 |
  | [refactor_haiku_20260228_230116](runs/refactor_haiku_20260228_230116.md) | `ccb_refactor` | `mcp-remote-direct` | 6 | 0.585 | 1.000 |
@@ -355,19 +379,33 @@ Historical reruns/backfills remain available in `data/official_results.json` und
  | [refactor_haiku_20260301_023530](runs/refactor_haiku_20260301_023530.md) | `ccb_refactor` | `mcp-remote-direct` | 10 | 0.717 | 0.900 |
  | [refactor_haiku_20260301_031849](runs/refactor_haiku_20260301_031849.md) | `ccb_refactor` | `baseline-local-direct` | 20 | 0.755 | 1.000 |
  | [refactor_haiku_20260301_031849](runs/refactor_haiku_20260301_031849.md) | `ccb_refactor` | `mcp-remote-direct` | 20 | 0.671 | 1.000 |
+ | [refactor_haiku_20260301_071230](runs/refactor_haiku_20260301_071230.md) | `ccb_refactor` | `baseline-local-direct` | 20 | 0.789 | 0.950 |
+ | [refactor_haiku_20260301_071230](runs/refactor_haiku_20260301_071230.md) | `ccb_refactor` | `mcp-remote-direct` | 19 | 0.713 | 1.000 |
  | [secure_haiku_20260223_232545](runs/secure_haiku_20260223_232545.md) | `ccb_secure` | `baseline-local-direct` | 20 | 0.669 | 0.950 |
  | [secure_haiku_20260223_232545](runs/secure_haiku_20260223_232545.md) | `ccb_secure` | `mcp-remote-direct` | 18 | 0.705 | 1.000 |
  | [secure_haiku_20260224_011825](runs/secure_haiku_20260224_011825.md) | `ccb_secure` | `mcp-remote-direct` | 2 | 0.500 | 0.500 |
+ | [secure_haiku_20260301_031850](runs/secure_haiku_20260301_031850.md) | `ccb_secure` | `baseline-local-direct` | 20 | 0.737 | 0.950 |
+ | [secure_haiku_20260301_031850](runs/secure_haiku_20260301_031850.md) | `ccb_secure` | `mcp-remote-direct` | 20 | 0.728 | 1.000 |
+ | [secure_haiku_20260301_071231](runs/secure_haiku_20260301_071231.md) | `ccb_secure` | `baseline-local-direct` | 20 | 0.712 | 1.000 |
+ | [secure_haiku_20260301_071231](runs/secure_haiku_20260301_071231.md) | `ccb_secure` | `mcp-remote-direct` | 20 | 0.767 | 1.000 |
  | [test_haiku_20260223_235732](runs/test_haiku_20260223_235732.md) | `ccb_test` | `baseline-local-direct` | 10 | 0.492 | 0.800 |
  | [test_haiku_20260223_235732](runs/test_haiku_20260223_235732.md) | `ccb_test` | `mcp-remote-direct` | 19 | 0.495 | 0.684 |
  | [test_haiku_20260224_011816](runs/test_haiku_20260224_011816.md) | `ccb_test` | `baseline-local-direct` | 11 | 0.295 | 0.545 |
  | [test_haiku_20260224_011816](runs/test_haiku_20260224_011816.md) | `ccb_test` | `mcp-remote-direct` | 11 | 0.262 | 0.455 |
  | [test_haiku_20260228_230654](runs/test_haiku_20260228_230654.md) | `ccb_test` | `mcp-remote-direct` | 1 | 0.000 | 0.000 |
  | [test_haiku_20260228_231039](runs/test_haiku_20260228_231039.md) | `ccb_test` | `mcp-remote-direct` | 1 | 0.200 | 1.000 |
+ | [test_haiku_20260301_031851](runs/test_haiku_20260301_031851.md) | `ccb_test` | `baseline-local-direct` | 17 | 0.571 | 0.824 |
+ | [test_haiku_20260301_031851](runs/test_haiku_20260301_031851.md) | `ccb_test` | `mcp-remote-direct` | 8 | 0.769 | 1.000 |
+ | [test_haiku_20260301_071232](runs/test_haiku_20260301_071232.md) | `ccb_test` | `baseline-local-direct` | 17 | 0.569 | 0.824 |
+ | [test_haiku_20260301_071232](runs/test_haiku_20260301_071232.md) | `ccb_test` | `mcp-remote-direct` | 8 | 0.780 | 1.000 |
  | [understand_haiku_20260224_001815](runs/understand_haiku_20260224_001815.md) | `ccb_understand` | `baseline-local-direct` | 20 | 0.533 | 0.650 |
  | [understand_haiku_20260224_001815](runs/understand_haiku_20260224_001815.md) | `ccb_understand` | `mcp-remote-direct` | 20 | 0.679 | 0.850 |
  | [understand_haiku_20260225_211346](runs/understand_haiku_20260225_211346.md) | `ccb_understand` | `baseline-local-direct` | 7 | 0.789 | 1.000 |
  | [understand_haiku_20260225_211346](runs/understand_haiku_20260225_211346.md) | `ccb_understand` | `mcp-remote-direct` | 7 | 0.870 | 1.000 |
+ | [understand_haiku_20260301_031852](runs/understand_haiku_20260301_031852.md) | `ccb_understand` | `baseline-local-direct` | 20 | 0.728 | 0.850 |
+ | [understand_haiku_20260301_031852](runs/understand_haiku_20260301_031852.md) | `ccb_understand` | `mcp-remote-direct` | 20 | 0.832 | 0.950 |
+ | [understand_haiku_20260301_071233](runs/understand_haiku_20260301_071233.md) | `ccb_understand` | `baseline-local-direct` | 20 | 0.884 | 1.000 |
+ | [understand_haiku_20260301_071233](runs/understand_haiku_20260301_071233.md) | `ccb_understand` | `mcp-remote-direct` | 20 | 0.850 | 1.000 |

  </details>

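
The README diff states that the bundle includes only valid scored tasks (`passed`/`failed` with numeric reward), and the suite rows end in either `ok` or `FLAG: below minimum`. The table header is not visible in these hunks, so the sketch below rests on an inferred reading of the columns: scored-task count, required minimum, mean reward, pass rate, status. The record keys (`status`, `reward`) and the function name `summarize` are hypothetical, and the flag rule (count below minimum) is a guess that happens to agree with every row shown, not the generator's confirmed logic.

```python
def summarize(results, minimum):
    """Aggregate one suite/config row from raw task results.

    results: list of dicts with hypothetical keys "status" and "reward".
    minimum: required number of valid scored tasks for the suite.
    """
    # Keep only valid scored tasks: passed/failed with a numeric reward.
    scored = [
        r for r in results
        if r.get("status") in ("passed", "failed")
        and isinstance(r.get("reward"), (int, float))
        and not isinstance(r.get("reward"), bool)
    ]
    n = len(scored)
    mean_reward = sum(r["reward"] for r in scored) / n if n else None
    pass_rate = sum(r["status"] == "passed" for r in scored) / n if n else None
    # Assumed rule: flag the row when fewer tasks were scored than required.
    status = "ok" if n >= minimum else "FLAG: below minimum"
    return {
        "count": n,
        "minimum": minimum,
        "mean_reward": mean_reward,
        "pass_rate": pass_rate,
        "status": status,
    }
```

Under this reading, a row such as `16 | 20` is flagged while `25 | 20` is not, which matches the `ccb_debug` rows before and after this commit.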