ci(bench): gate release benchmark on engine parity thresholds#1014
carlos-alm merged 3 commits into `main`
Conversation
The Benchmark workflow already runs after every release (on `workflow_run` completion of Publish). Add a parity gate to the `build-benchmark` job so drift between the native and wasm engines fails the workflow, catching regressions the benchmark data would otherwise silently record. The gate runs after the benchmark doc PR is created, so the raw numbers still land in `generated/benchmarks/BUILD-BENCHMARKS.md` even when parity regresses; only the workflow status goes red to alert maintainers.

Thresholds reference the currently-open parity bugs on v3.9.5:

- File-set gap `|wasm − native|` ≤ 2 (#1011)
- DB size ratio native/wasm ≤ 1.02 (#1010)
- Full-build edges-phase ratio ≤ 1.30 (#1013)
- Full-build roles-phase ratio ≤ 1.30 (#1013)
- 1-file incremental ratio ≤ 1.50 (#1012)

The gate writes a markdown table to `$GITHUB_STEP_SUMMARY` showing pass/fail per threshold with a direct link to the tracking issue, so reviewers see the regression at a glance without digging through logs. No behavior change on the passing path: when both engines are within thresholds the step exits 0 silently.

Impact: 2 functions changed, 2 affected
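The ceiling-style pass/fail logic the gate applies can be sketched roughly like this. The engine objects, field names, and numbers below are illustrative stand-ins, not the real `benchmark-result.json` schema:

```javascript
// Rough sketch of the gate's threshold logic. Field names and values
// here are hypothetical, not the actual benchmark-result.json schema.
const wasm = { files: 743, dbSizeBytes: 10_000_000, incrementalMs: 203 };
const native = { files: 741, dbSizeBytes: 10_100_000, incrementalMs: 290 };

// Each check is a ceiling: the gate passes only while actual <= limit.
const checks = [
  { name: 'File-set gap (|wasm - native|)', actual: Math.abs(wasm.files - native.files), limit: 2, tracks: '#1011' },
  { name: 'DB size ratio (native / wasm)', actual: native.dbSizeBytes / Math.max(wasm.dbSizeBytes ?? 1, 1), limit: 1.02, tracks: '#1010' },
  { name: '1-file incremental ratio', actual: native.incrementalMs / Math.max(wasm.incrementalMs ?? 1, 1), limit: 1.5, tracks: '#1012' },
];

const failures = checks.filter((c) => !(c.actual <= c.limit));
for (const c of checks) {
  console.log(`${c.name}: ${c.actual.toFixed(3)} / limit ${c.limit} ${c.actual <= c.limit ? 'PASS' : 'FAIL'}`);
}
// A nonzero exit code is what turns the workflow red.
process.exitCode = failures.length > 0 ? 1 : 0;
```

With these sample numbers every check passes, so the sketch exits 0; bump any `actual` past its `limit` and the process exits nonzero, which is all a CI gate needs.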
Claude finished @carlos-alm's task. Reviewing PR #1014: ci(bench): gate release benchmark on engine parity thresholds
**Root Cause Analysis**

**Problem Being Solved:** The benchmark workflow runs after releases but provides no automated feedback on engine parity regressions. The v3.9.5 benchmark failed with "Both engines failed. No results to report." due to timeouts, and even when successful, maintainers must manually check for parity issues across 5+ metrics between native and WASM engines.

**Why This Approach:** Adding a parity gate as the final step preserves the existing documentation behavior (benchmark data still gets recorded in PRs) while providing automated failure signals to maintainers. The thresholds are calibrated tighter than current v3.9.5 numbers, ensuring the gate will surface known issues (#1010-#1013) until they're fixed.

**Risk Assessment:** Low-risk addition. The gate runs after all documentation is complete, so failures don't prevent historical record-keeping. The script is defensive (handles missing engines gracefully) and the thresholds are well-documented with issue links.

**Configuration Change Detection (HIGH SCRUTINY)**

No concerning changes detected. The workflow addition is purely additive:
**Test Integrity**

No test modifications. This PR adds CI validation but doesn't touch the test suite. The gate validates production benchmark results rather than test outputs, which is appropriate.

**Verification approach:** Author smoke-tested locally against v3.9.5 data and confirmed all 5 checks fail as expected (showing the gate works). This is adequate for a simple validation script.

**Code Quality**

`benchmark.yml` (lines 204-206): `scripts/benchmark-parity-gate.mjs`:
- ✅ Clear data structures: threshold checks are well-organized with descriptive names, limits, formatters, and issue tracking (lines 60-98)
- ✅ Appropriate output: uses GitHub Actions summary alongside console output (lines 34-42), proper exit codes
- ✅ Security: no external dependencies, no shell execution, straightforward file I/O

**Backlog Compliance**

- Zero-dep: ✅ Uses only Node.js built-ins
- Foundation-aligned: ✅ Fully compliant with all principles:
**Problem-fit:** 4/5. Directly supports the core problem by preventing engine parity regressions that would cause AI assistants to get inconsistent results when switching engines. The CLAUDE.md explicitly states "Both engines must produce identical results"; this enforces that requirement.

**Breaking:** No. Pure addition to CI workflow, no API or behavior changes.

**Tier:** 1 (zero-dep ✓ + foundation-aligned ✓)

**Critical Concerns**

None identified. The implementation is well-designed and follows project conventions.

**Final Recommendation**

**Rating:** ⭐⭐⭐⭐☆ (4/5)
**Action:** APPROVE

**Reasoning:** This is a high-quality, well-thought-out addition that directly addresses a real operational need. The implementation is robust, well-documented, and follows project conventions. The gate design (running after doc updates) cleverly preserves historical data while providing automated feedback. The thresholds are appropriate and reference the specific issues they track.

**Why 4/5 instead of 5/5:** While excellent overall, this is fundamentally infrastructure automation rather than breakthrough functionality. The minor division-by-zero handling could be cleaner, and the approach is straightforward rather than innovative. The PR strongly supports the project's dual-engine architecture goals and will provide valuable continuous integration feedback on engine parity, exactly what's needed as related issues #1009-#1013 are resolved.
**Greptile Summary**

This PR adds an engine parity gate to the existing release benchmark workflow: after the doc PR is opened, a new step runs the parity-gate script against the merged benchmark result and fails the job if any threshold is breached.

**Confidence Score: 4/5.** Safe to merge; the two remaining findings are both P2 quality/resilience suggestions that don't affect correctness on the happy path. No P0 or P1 issues remain.

Two P2 findings: the implicit

No files require special attention beyond the two P2 suggestions above.

**Important Files Changed**
**Sequence Diagram**

```mermaid
sequenceDiagram
    participant P as Publish workflow
    participant B as build-benchmark job
    participant G as benchmark-parity-gate.mjs
    participant S as $GITHUB_STEP_SUMMARY
    P->>B: workflow_run (completed)
    B->>B: Run build benchmark → benchmark-result.json
    B->>B: Merge resolution result into benchmark-result.json
    B->>B: Update build report
    B->>B: Commit & push via PR (if changes exist)
    B->>G: node benchmark-parity-gate.mjs benchmark-result.json
    G->>G: Read & parse benchmark-result.json
    alt wasm or native missing
        G->>S: SKIP — gate passes (exit 0)
    else both engines present
        G->>G: Compute 5 parity checks
        G->>S: Write markdown table
        alt any threshold breached
            G-->>B: exit 1 → job fails (alert maintainers)
        else all pass
            G-->>B: exit 0 → job succeeds
        end
    end
```
Reviews (2). Last reviewed commit: "fix: guard parity gate against missing/z..."
```js
line(`## Engine parity gate — v${version}`);
line('');

if (!wasm || !native) {
  const missing = [!wasm && 'wasm', !native && 'native'].filter(Boolean).join(', ');
  line(`**FAIL:** missing engine result for: ${missing}. Benchmark cannot assert parity.`);
  process.exit(1);
```
**Missing-engine case exits 1, contradicting the inline comment**

The script-level JSDoc on line 19 explicitly states: "If one engine failed, we leave the gate passing…" but `process.exit(1)` on line 50 does the opposite: it fails the gate. Because the gate runs after the doc PR step, a red job here won't block any data from being recorded, but it will fire a false alarm in every run where only one engine finishes (e.g., during the window right before #1009 is fully deployed).
Suggested change:

```diff
 line(`## Engine parity gate — v${version}`);
 line('');
 if (!wasm || !native) {
   const missing = [!wasm && 'wasm', !native && 'native'].filter(Boolean).join(', ');
-  line(`**FAIL:** missing engine result for: ${missing}. Benchmark cannot assert parity.`);
-  process.exit(1);
+  line(`**SKIP:** missing engine result for: ${missing}. Cannot assert parity — gate passes.`);
+  process.exit(0);
 }
```
Fixed in b789411 — missing-engine branch now logs SKIP and exits 0, matching the documented pass-through intent in the file-level JSDoc.
```js
{
  name: 'File-set gap (|wasm − native|)',
  actual: Math.abs(wasm.files - native.files),
  limit: 2,
  formatter: (v) => String(v),
  tracks: '#1011',
},
```
**`wasm.dbSizeBytes` has no zero-guard, unlike every timing check**

All four timing-based ratio checks defensively use `Math.max(... ?? 1, 1)` as the denominator. The DB-size ratio does not. If `wasm.dbSizeBytes` is 0 (an empty DB from a partial wasm run), the result is `Infinity`, which fails the `<= 1.02` threshold: a false positive. If it is `undefined`, the result is `NaN`, which also fails.
Suggested change:

```js
{
  name: 'File-set gap (|wasm − native|)',
  actual: Math.abs(wasm.files - native.files),
  limit: 2,
  formatter: (v) => String(v),
  tracks: '#1011',
},
{
  name: 'DB size ratio (native / wasm)',
  actual: native.dbSizeBytes / Math.max(wasm.dbSizeBytes ?? 1, 1),
  limit: 1.02,
  formatter: (v) => v.toFixed(3),
  tracks: '#1010',
},
```
Fixed in b789411 — DB size ratio now uses (native.dbSizeBytes ?? 0) / Math.max(wasm.dbSizeBytes ?? 1, 1), matching the zero-guard pattern used by all four timing checks. Verified locally with wasm.dbSizeBytes = 0 — no more Infinity.
```js
// limit — ceiling; actual must be ≤ limit
// formatter — how to render the value
// tracks — related issue link shown on failure
const checks = [
  {
```
**File-set gap check doesn't guard against `undefined` fields**

If either `wasm.files` or `native.files` is `undefined` (e.g., a partial result), `Math.abs(undefined - undefined)` produces `NaN`. `NaN <= 2` is false, so the check silently fails as a false positive with no indication in the summary of why. Worth adding a `?? 0` fallback consistent with how the timing checks handle missing data.
Suggested change:

```js
// limit — ceiling; actual must be ≤ limit
// formatter — how to render the value
// tracks — related issue link shown on failure
const checks = [
  {
    name: 'File-set gap (|wasm − native|)',
    actual: Math.abs((wasm.files ?? 0) - (native.files ?? 0)),
    limit: 2,
    formatter: (v) => String(v),
    tracks: '#1011',
  },
```
Fixed in b789411 — file-set gap now uses Math.abs((wasm.files ?? 0) - (native.files ?? 0)) so undefined fields no longer collapse to NaN.
```js
}

const result = JSON.parse(fs.readFileSync(resultFile, 'utf8'));
```
**`readFileSync` throws an opaque stack trace on missing file**

If `benchmark-result.json` doesn't exist (e.g., the build benchmark step was skipped by a prior failure), Node throws a raw `ENOENT` stack trace. Wrapping this in a try/catch gives maintainers a cleaner diagnostic message that matches the script's own error-reporting style.
Suggested change:

```js
let result;
try {
  result = JSON.parse(fs.readFileSync(resultFile, 'utf8'));
} catch (err) {
  console.error(`Failed to read ${resultFile}: ${err.message}`);
  process.exit(2);
}
const { wasm, native, version } = result;
```
Fixed in b789411 — readFileSync/JSON.parse wrapped in try/catch, exits 2 with a clean Failed to read <file>: <err.message> diagnostic instead of a raw ENOENT stack trace.
**Codegraph Impact Analysis**

5 functions changed → 6 callers affected across 3 files
- Missing engine now exits 0 (gate passes) to match the documented behavior in the script-level JSDoc, instead of firing a false-alarm red job when only one engine finishes.
- Add nullish/zero guards to the file-set gap and DB-size ratio checks so `undefined` or zero fields in a partial result no longer produce `NaN` or `Infinity`.
- Wrap the result-file read in try/catch to print a clean diagnostic instead of a raw `ENOENT` stack trace.
- Drop the unused `path` import.
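Taken together, the guards in that fix list follow one pattern. A minimal sketch, with hypothetical helper names rather than the script's actual identifiers (and returning `null` where the real fix exits 2, so it stays self-contained):

```javascript
// Guard 1: ?? 0 on the numerator and Math.max(?? 1, 1) on the
// denominator keep undefined or zero fields out of NaN / Infinity.
// (safeRatio is an illustrative name, not one from the script.)
function safeRatio(numerator, denominator) {
  return (numerator ?? 0) / Math.max(denominator ?? 1, 1);
}

// Guard 2: wrap parsing so bad input yields one clean diagnostic line
// instead of an uncaught exception. The real fix exits with code 2;
// this sketch returns null to stay runnable standalone.
function parseResult(text, label) {
  try {
    return JSON.parse(text);
  } catch (err) {
    console.error(`Failed to read ${label}: ${err.message}`);
    return null;
  }
}

console.log(safeRatio(undefined, 0)); // 0, not NaN
console.log(safeRatio(10, undefined)); // 10, not Infinity
console.log(parseResult('{"version":"3.9.5"}', 'benchmark-result.json').version); // 3.9.5
```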
Summary
The `Benchmark` workflow already fires after every release via `workflow_run` on `Publish` completion. It ran for v3.9.5 on 2026-04-23 but failed silently with `Both engines failed. No results to report.` (both workers hit the 600 s timeout, the hang fixed by #1009).

This PR extends the release-triggered benchmark with an engine parity gate: after both engines run and the benchmark doc PR is opened, a new step validates parity thresholds against the merged `benchmark-result.json` and fails the workflow if any threshold is breached.

Behavior

- The benchmark doc PR still lands the raw numbers in `generated/benchmarks/BUILD-BENCHMARKS.md` on every release — parity regressions don't wipe the historical record.
- The gate writes a markdown table to `$GITHUB_STEP_SUMMARY` with pass/fail per check and a link to the tracking issue, so reviewers see the regression at a glance.
- The gate exits `0` silently when all thresholds pass. No churn on healthy releases.

Thresholds
- File-set gap (`|wasm − native|`) ≤ 2 (#1011)
- DB size ratio (native / wasm) ≤ 1.02 (#1010)
- Full-build edges-phase ratio ≤ 1.30 (#1013)
- Full-build roles-phase ratio ≤ 1.30 (#1013)
- 1-file incremental ratio ≤ 1.50 (#1012)

Limits are tighter than the current v3.9.5 numbers, so the gate will fail on the first release after #1009 merges; that's intentional. The failure is the dashboard: maintainers see which of the four open parity bugs are still biting each release. As each bug is fixed, the next release's gate steps closer to green.
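As a rough sanity check on "tighter than the current v3.9.5 numbers", plugging the figures quoted in the linked issues into the limits shows every check over its ceiling. The edges/roles ratios are derived from the quoted +73% / +334% figures, and the file-set gap assumes wasm indexes all 743 files, so treat these as approximations:

```javascript
// Observed v3.9.5 numbers from the linked issues, checked against the
// gate limits. Ratios are approximations derived from quoted percentages.
const observed = [
  { name: 'File-set gap',             actual: 49,        limit: 2,    tracks: '#1011' },
  { name: 'DB size ratio',            actual: 1.075,     limit: 1.02, tracks: '#1010' },
  { name: 'Full-build edges ratio',   actual: 1.73,      limit: 1.30, tracks: '#1013' },
  { name: 'Full-build roles ratio',   actual: 4.34,      limit: 1.30, tracks: '#1013' },
  { name: '1-file incremental ratio', actual: 876 / 203, limit: 1.50, tracks: '#1012' },
];

const failing = observed.filter((c) => c.actual > c.limit);
console.log(`${failing.length}/${observed.length} checks over limit on v3.9.5`);
// prints "5/5 checks over limit on v3.9.5"
```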
Verification
Smoke-tested the gate script locally against the v3.9.5 benchmark numbers — all 5 checks correctly fail with the right issue links:
Passing case also verified with a hypothetical "all engines aligned" input — exits 0, summary shows all green.
Test plan
- `gh workflow run benchmark.yml -f version=3.9.5` confirms gate fails with expected table

Related
- `string` AST nodes vs WASM — 7.5% DB bloat in 3.9.5 (#1010)
- bug(parity): native orchestrator still drops 49/743 files on 3.9.5 self-build (re-occurrence of #967) (#1011)
- perf(native): 1-file incremental rebuild is 4.3x slower than WASM on 3.9.5 (876ms vs 203ms — regressed from #882) (#1012)
- perf(native): full-build `edges` +73% and `roles` +334% slower than WASM on 3.9.5 despite producing fewer outputs (#1013)

These are the four parity bugs the gate thresholds reference.