This table summarizes the current comparator evidence from `results/comparator-evidence.json`. It is first and foremost a setup-status table, not a marketing scoreboard.
| Comparator | Intended role in gate | Current status | Evidence summary |
|---|---|---|---|
| raw Claude Code | Baseline for payload cost and at least one usefulness comparison | comparator artifact: ok; gate: `pending_evidence` | The Haiku-backed Claude CLI runner now returns current payloads, but the checked-in baseline still has `averageFirstRelevantHit: null`, so the gate still records missing baseline metrics. |
| GrepAI | Named MCP comparator | `setup_failed` | Requires the GrepAI binary plus a local Ollama embedding setup that is not present in this proof environment. |
| jCodeMunch | Named MCP comparator | `setup_failed` | The MCP server still closes on startup during the current rerun, so no comparable discovery metrics were produced. |
| codebase-memory-mcp | Named MCP comparator | comparator artifact: ok; gate: `failed` | The repaired graph-backed runner now produces real current metrics, but the frozen gate still fails this lane because codebase-context does not stay within tolerance on every required usefulness metric. |
| CodeGraphContext | Graph-native comparator in the relaunch frame | `setup_failed` | The MCP server still closes on startup during the current rerun, so this lane remains missing evidence. |
- `setup_failed` means the lane was attempted and did not reach a credible metric-producing state.
- `pending_evidence` in the gate means the lane is still missing one or more required metrics.
- `failed` in the gate means the lane has real metrics, but the frozen comparison rule still does not pass.
- A missing metric is not treated as a win for `codebase-context`.
- The combined gate in `results/gate-evaluation.json` remains `pending_evidence`, and `claimAllowed` stays `false`, until these lanes produce real metrics.
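The status rules above can be sketched in a few lines. This is a minimal illustration only: the field names (`metrics`, the required-metric set) and the placeholder tolerance check are assumptions, not the real schema or the frozen comparison rule from the gate artifact.

```python
# Hypothetical sketch of the gate rules described above.
# Metric names and the 0.7 threshold are illustrative assumptions.

REQUIRED_METRICS = {"averageUsefulness", "averagePayloadBytes", "averageFirstRelevantHit"}

def lane_status(metrics: dict) -> str:
    """Classify one comparator lane.

    A missing required metric yields pending_evidence, never a win:
    the lane simply cannot be compared yet.
    """
    present = {k for k, v in metrics.items() if v is not None}
    if REQUIRED_METRICS - present:
        return "pending_evidence"
    # Real metrics exist; apply the comparison rule (stand-in here).
    return "failed" if metrics["averageUsefulness"] < 0.7 else "passed"

def combined_gate(lanes: dict) -> dict:
    """claimAllowed stays False until every lane has real, passing metrics."""
    statuses = [lane_status(m) for m in lanes.values()]
    if "pending_evidence" in statuses:
        return {"status": "pending_evidence", "claimAllowed": False}
    all_pass = all(s == "passed" for s in statuses)
    return {"status": "passed" if all_pass else "failed", "claimAllowed": all_pass}
```

The key design point is the ordering: missing evidence is checked before the pass/fail rule, so an absent metric can never be converted into a win.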
For reference, the current combined discovery output across angular-spotify and excalidraw is:
| Metric | codebase-context |
|---|---|
| `totalTasks` | 24 |
| `averageUsefulness` | 0.75 |
| `averagePayloadBytes` | 7306.4583 |
| `averageEstimatedTokens` | 1827.0833 |
| `bestExampleUsefulnessRate` | 0.125 |
| `gate.status` | `pending_evidence` |
These numbers are not presented as head-to-head wins because the comparator lanes above did not produce matching metrics to compare against.
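For readers wondering where the averages above come from, they can be sketched as simple per-task aggregation. The per-task record shape used here (`usefulness`, `payloadBytes`, `estimatedTokens`, `bestExample`) is an assumption for illustration; the real artifact schema in `results/comparator-evidence.json` may differ.

```python
# Illustrative aggregation of per-task discovery records into the
# summary metrics shown in the table. Record field names are assumed.

def summarize(tasks: list[dict]) -> dict:
    n = len(tasks)
    return {
        "totalTasks": n,
        "averageUsefulness": sum(t["usefulness"] for t in tasks) / n,
        "averagePayloadBytes": sum(t["payloadBytes"] for t in tasks) / n,
        "averageEstimatedTokens": sum(t["estimatedTokens"] for t in tasks) / n,
        # Fraction of tasks where the best example surfaced was useful.
        "bestExampleUsefulnessRate":
            sum(1 for t in tasks if t.get("bestExample")) / n,
    }
```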