ci(bench): gate release benchmark on engine parity thresholds#1014
carlos-alm merged 3 commits into `main`
Conversation
The Benchmark workflow already runs after every release (on `workflow_run` completion of Publish). Add a parity gate to the `build-benchmark` job so drift between the native and wasm engines fails the workflow, catching regressions the benchmark data would otherwise silently record. The gate runs after the benchmark doc PR is created, so the raw numbers still land in `generated/benchmarks/BUILD-BENCHMARKS.md` even when parity regresses; only the workflow status goes red to alert maintainers.

Thresholds reference the currently-open parity bugs on v3.9.5:

- File-set gap `|wasm − native|` ≤ 2 (#1011)
- DB size ratio native/wasm ≤ 1.02 (#1010)
- Full-build edges-phase ratio ≤ 1.30 (#1013)
- Full-build roles-phase ratio ≤ 1.30 (#1013)
- 1-file incremental ratio ≤ 1.50 (#1012)

The gate writes a markdown table to `$GITHUB_STEP_SUMMARY` showing pass/fail per threshold with a direct link to the tracking issue, so reviewers see the regression at a glance without digging through logs. No behavior change on the passing path: when both engines are within thresholds the step exits 0 silently.

Impact: 2 functions changed, 2 affected
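The ceiling-style pass/fail logic the gate applies can be sketched roughly like this. The engine objects, field names, and numbers below are illustrative stand-ins, not the real `benchmark-result.json` schema:

```javascript
// Rough sketch of the gate's threshold logic. Field names and values
// here are hypothetical, not the actual benchmark-result.json schema.
const wasm = { files: 743, dbSizeBytes: 10_000_000, incrementalMs: 203 };
const native = { files: 741, dbSizeBytes: 10_100_000, incrementalMs: 290 };

// Each check is a ceiling: the gate passes only while actual <= limit.
const checks = [
  { name: 'File-set gap (|wasm - native|)', actual: Math.abs(wasm.files - native.files), limit: 2, tracks: '#1011' },
  { name: 'DB size ratio (native / wasm)', actual: native.dbSizeBytes / Math.max(wasm.dbSizeBytes ?? 1, 1), limit: 1.02, tracks: '#1010' },
  { name: '1-file incremental ratio', actual: native.incrementalMs / Math.max(wasm.incrementalMs ?? 1, 1), limit: 1.5, tracks: '#1012' },
];

const failures = checks.filter((c) => !(c.actual <= c.limit));
for (const c of checks) {
  console.log(`${c.name}: ${c.actual.toFixed(3)} / limit ${c.limit} ${c.actual <= c.limit ? 'PASS' : 'FAIL'}`);
}
// A nonzero exit code is what turns the workflow red.
process.exitCode = failures.length > 0 ? 1 : 0;
```

With these sample numbers every check passes, so the sketch exits 0; bump any `actual` past its `limit` and the process exits nonzero, which is all a CI gate needs.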
Claude finished @carlos-alm's task. Reviewing PR #1014: ci(bench): gate release benchmark on engine parity thresholds
**Root Cause Analysis**

**Problem Being Solved:** The benchmark workflow runs after releases but provides no automated feedback on engine parity regressions. The v3.9.5 benchmark failed with "Both engines failed. No results to report." due to timeouts, and even when successful, maintainers must manually check for parity issues across 5+ metrics between native and WASM engines.

**Why This Approach:** Adding a parity gate as the final step preserves the existing documentation behavior (benchmark data still gets recorded in PRs) while providing automated failure signals to maintainers. The thresholds are calibrated tighter than current v3.9.5 numbers, ensuring the gate will surface known issues (#1010-#1013) until they're fixed.

**Risk Assessment:** Low-risk addition. The gate runs after all documentation is complete, so failures don't prevent historical record-keeping. The script is defensive (handles missing engines gracefully) and the thresholds are well-documented with issue links.

**Configuration Change Detection (HIGH SCRUTINY)**

No concerning changes detected. The workflow addition is purely additive:
**Test Integrity**

No test modifications. This PR adds CI validation but doesn't touch the test suite. The gate validates production benchmark results rather than test outputs, which is appropriate.

**Verification approach:** Author smoke-tested locally against v3.9.5 data and confirmed all 5 checks fail as expected (showing the gate works). This is adequate for a simple validation script.

**Code Quality**

`benchmark.yml` (lines 204-206): `scripts/benchmark-parity-gate.mjs`:
- ✅ Clear data structures: threshold checks are well-organized with descriptive names, limits, formatters, and issue tracking (lines 60-98)
- ✅ Appropriate output: uses GitHub Actions summary alongside console output (lines 34-42), proper exit codes
- ✅ Security: no external dependencies, no shell execution, straightforward file I/O

**Backlog Compliance**

- Zero-dep: ✅ Uses only Node.js built-ins
- Foundation-aligned: ✅ Fully compliant with all principles:
**Problem-fit:** 4/5. Directly supports the core problem by preventing engine parity regressions that would cause AI assistants to get inconsistent results when switching engines. The CLAUDE.md explicitly states "Both engines must produce identical results"; this enforces that requirement.

**Breaking:** No. Pure addition to CI workflow, no API or behavior changes.

**Tier:** 1 (zero-dep ✓ + foundation-aligned ✓)

**Critical Concerns**

None identified. The implementation is well-designed and follows project conventions.

**Final Recommendation**

**Rating:** ⭐⭐⭐⭐☆ (4/5)
**Action:** APPROVE

**Reasoning:** This is a high-quality, well-thought-out addition that directly addresses a real operational need. The implementation is robust, well-documented, and follows project conventions. The gate design (running after doc updates) cleverly preserves historical data while providing automated feedback. The thresholds are appropriate and reference the specific issues they track.

**Why 4/5 instead of 5/5:** While excellent overall, this is fundamentally infrastructure automation rather than breakthrough functionality. The minor division-by-zero handling could be cleaner, and the approach is straightforward rather than innovative. The PR strongly supports the project's dual-engine architecture goals and will provide valuable continuous integration feedback on engine parity, exactly what's needed as related issues #1009-#1013 are resolved.
**Greptile Summary**

This PR adds an engine parity gate to the existing release benchmark workflow: after the doc PR is opened, a new step runs the parity-gate script against the merged benchmark result and fails the job if any threshold is breached.

**Confidence Score: 4/5.** Safe to merge; the two remaining findings are both P2 quality/resilience suggestions that don't affect correctness on the happy path. No P0 or P1 issues remain.

Two P2 findings: the implicit

No files require special attention beyond the two P2 suggestions above.

**Important Files Changed**
**Sequence Diagram**

```mermaid
sequenceDiagram
    participant P as Publish workflow
    participant B as build-benchmark job
    participant G as benchmark-parity-gate.mjs
    participant S as $GITHUB_STEP_SUMMARY
    P->>B: workflow_run (completed)
    B->>B: Run build benchmark → benchmark-result.json
    B->>B: Merge resolution result into benchmark-result.json
    B->>B: Update build report
    B->>B: Commit & push via PR (if changes exist)
    B->>G: node benchmark-parity-gate.mjs benchmark-result.json
    G->>G: Read & parse benchmark-result.json
    alt wasm or native missing
        G->>S: SKIP — gate passes (exit 0)
    else both engines present
        G->>G: Compute 5 parity checks
        G->>S: Write markdown table
        alt any threshold breached
            G-->>B: exit 1 → job fails (alert maintainers)
        else all pass
            G-->>B: exit 0 → job succeeds
        end
    end
```
Reviews (2). Last reviewed commit: "fix: guard parity gate against missing/z..."
```js
line(`## Engine parity gate — v${version}`);
line('');

if (!wasm || !native) {
  const missing = [!wasm && 'wasm', !native && 'native'].filter(Boolean).join(', ');
  line(`**FAIL:** missing engine result for: ${missing}. Benchmark cannot assert parity.`);
  process.exit(1);
```
**Missing-engine case exits 1, contradicting the inline comment**

The script-level JSDoc on line 19 explicitly states: "If one engine failed, we leave the gate passing…" but `process.exit(1)` on line 50 does the opposite: it fails the gate. Because the gate runs after the doc PR step, a red job here won't block any data from being recorded, but it will fire a false alarm in every run where only one engine finishes (e.g., during the window right before #1009 is fully deployed).
Suggested change:

```diff
 line(`## Engine parity gate — v${version}`);
 line('');
 if (!wasm || !native) {
   const missing = [!wasm && 'wasm', !native && 'native'].filter(Boolean).join(', ');
-  line(`**FAIL:** missing engine result for: ${missing}. Benchmark cannot assert parity.`);
-  process.exit(1);
+  line(`**SKIP:** missing engine result for: ${missing}. Cannot assert parity — gate passes.`);
+  process.exit(0);
 }
```
Fixed in b789411 — missing-engine branch now logs SKIP and exits 0, matching the documented pass-through intent in the file-level JSDoc.
```js
{
  name: 'File-set gap (|wasm − native|)',
  actual: Math.abs(wasm.files - native.files),
  limit: 2,
  formatter: (v) => String(v),
  tracks: '#1011',
},
```
**`wasm.dbSizeBytes` has no zero-guard, unlike every timing check**

All four timing-based ratio checks defensively use `Math.max(... ?? 1, 1)` as the denominator. The DB-size ratio does not. If `wasm.dbSizeBytes` is 0 (an empty DB from a partial wasm run), the result is `Infinity`, which fails the `<= 1.02` threshold: a false positive. If it is `undefined`, the result is `NaN`, which also fails.
Suggested change:

```js
{
  name: 'File-set gap (|wasm − native|)',
  actual: Math.abs(wasm.files - native.files),
  limit: 2,
  formatter: (v) => String(v),
  tracks: '#1011',
},
{
  name: 'DB size ratio (native / wasm)',
  actual: native.dbSizeBytes / Math.max(wasm.dbSizeBytes ?? 1, 1),
  limit: 1.02,
  formatter: (v) => v.toFixed(3),
  tracks: '#1010',
},
```
Fixed in b789411 — DB size ratio now uses (native.dbSizeBytes ?? 0) / Math.max(wasm.dbSizeBytes ?? 1, 1), matching the zero-guard pattern used by all four timing checks. Verified locally with wasm.dbSizeBytes = 0 — no more Infinity.
```js
// limit — ceiling; actual must be ≤ limit
// formatter — how to render the value
// tracks — related issue link shown on failure
const checks = [
  {
```
**File-set gap check doesn't guard against `undefined` fields**

If either `wasm.files` or `native.files` is `undefined` (e.g., a partial result), `Math.abs(undefined - undefined)` produces `NaN`. `NaN <= 2` is false, so the check silently fails as a false positive with no indication in the summary of why. Worth adding a `?? 0` fallback consistent with how the timing checks handle missing data.
Suggested change:

```js
// limit — ceiling; actual must be ≤ limit
// formatter — how to render the value
// tracks — related issue link shown on failure
const checks = [
  {
    name: 'File-set gap (|wasm − native|)',
    actual: Math.abs((wasm.files ?? 0) - (native.files ?? 0)),
    limit: 2,
    formatter: (v) => String(v),
    tracks: '#1011',
  },
```
Fixed in b789411 — file-set gap now uses Math.abs((wasm.files ?? 0) - (native.files ?? 0)) so undefined fields no longer collapse to NaN.
```js
}

const result = JSON.parse(fs.readFileSync(resultFile, 'utf8'));
```
**`readFileSync` throws an opaque stack trace on missing file**

If `benchmark-result.json` doesn't exist (e.g., the build benchmark step was skipped by a prior failure), Node throws a raw `ENOENT` stack trace. Wrapping this in a try/catch gives maintainers a cleaner diagnostic message that matches the script's own error-reporting style.
Suggested change:

```js
let result;
try {
  result = JSON.parse(fs.readFileSync(resultFile, 'utf8'));
} catch (err) {
  console.error(`Failed to read ${resultFile}: ${err.message}`);
  process.exit(2);
}
const { wasm, native, version } = result;
```
Fixed in b789411 — readFileSync/JSON.parse wrapped in try/catch, exits 2 with a clean Failed to read <file>: <err.message> diagnostic instead of a raw ENOENT stack trace.
**Codegraph Impact Analysis**

5 functions changed → 6 callers affected across 3 files
- Missing engine now exits 0 (gate passes) to match the documented behavior in the script-level JSDoc, instead of firing a false-alarm red job when only one engine finishes.
- Add nullish/zero guards to the file-set gap and DB-size ratio checks so `undefined` or zero fields in a partial result no longer produce `NaN` or `Infinity`.
- Wrap the result-file read in try/catch to print a clean diagnostic instead of a raw `ENOENT` stack trace.
- Drop the unused `path` import.
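Taken together, the guards in that fix list follow one pattern. A minimal sketch, with hypothetical helper names rather than the script's actual identifiers (and returning `null` where the real fix exits 2, so it stays self-contained):

```javascript
// Guard 1: ?? 0 on the numerator and Math.max(?? 1, 1) on the
// denominator keep undefined or zero fields out of NaN / Infinity.
// (safeRatio is an illustrative name, not one from the script.)
function safeRatio(numerator, denominator) {
  return (numerator ?? 0) / Math.max(denominator ?? 1, 1);
}

// Guard 2: wrap parsing so bad input yields one clean diagnostic line
// instead of an uncaught exception. The real fix exits with code 2;
// this sketch returns null to stay runnable standalone.
function parseResult(text, label) {
  try {
    return JSON.parse(text);
  } catch (err) {
    console.error(`Failed to read ${label}: ${err.message}`);
    return null;
  }
}

console.log(safeRatio(undefined, 0)); // 0, not NaN
console.log(safeRatio(10, undefined)); // 10, not Infinity
console.log(parseResult('{"version":"3.9.5"}', 'benchmark-result.json').version); // 3.9.5
```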
Summary
The `Benchmark` workflow already fires after every release via `workflow_run` on `Publish` completion. It ran for v3.9.5 on 2026-04-23 but failed silently with `Both engines failed. No results to report.` (both workers hit the 600 s timeout, the hang fixed by #1009).

This PR extends the release-triggered benchmark with an engine parity gate: after both engines run and the benchmark doc PR is opened, a new step validates parity thresholds against the merged `benchmark-result.json` and fails the workflow if any threshold is breached.

Behavior

- The benchmark doc PR still lands the raw numbers in `generated/benchmarks/BUILD-BENCHMARKS.md` on every release — parity regressions don't wipe the historical record.
- The gate writes a markdown table to `$GITHUB_STEP_SUMMARY` with pass/fail per check and a link to the tracking issue, so reviewers see the regression at a glance.
- The gate exits `0` silently when all thresholds pass. No churn on healthy releases.

Thresholds
- File-set gap (`|wasm − native|`) ≤ 2 (#1011)
- DB size ratio (native / wasm) ≤ 1.02 (#1010)
- Full-build edges-phase ratio ≤ 1.30 (#1013)
- Full-build roles-phase ratio ≤ 1.30 (#1013)
- 1-file incremental ratio ≤ 1.50 (#1012)

Limits are tighter than the current v3.9.5 numbers, so the gate will fail on the first release after #1009 merges; that's intentional. The failure is the dashboard: maintainers see which of the four open parity bugs are still biting each release. As each bug is fixed, the next release's gate steps closer to green.
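As a rough sanity check on "tighter than the current v3.9.5 numbers", plugging the figures quoted in the linked issues into the limits shows every check over its ceiling. The edges/roles ratios are derived from the quoted +73% / +334% figures, and the file-set gap assumes wasm indexes all 743 files, so treat these as approximations:

```javascript
// Observed v3.9.5 numbers from the linked issues, checked against the
// gate limits. Ratios are approximations derived from quoted percentages.
const observed = [
  { name: 'File-set gap',             actual: 49,        limit: 2,    tracks: '#1011' },
  { name: 'DB size ratio',            actual: 1.075,     limit: 1.02, tracks: '#1010' },
  { name: 'Full-build edges ratio',   actual: 1.73,      limit: 1.30, tracks: '#1013' },
  { name: 'Full-build roles ratio',   actual: 4.34,      limit: 1.30, tracks: '#1013' },
  { name: '1-file incremental ratio', actual: 876 / 203, limit: 1.50, tracks: '#1012' },
];

const failing = observed.filter((c) => c.actual > c.limit);
console.log(`${failing.length}/${observed.length} checks over limit on v3.9.5`);
// prints "5/5 checks over limit on v3.9.5"
```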
Verification
Smoke-tested the gate script locally against the v3.9.5 benchmark numbers — all 5 checks correctly fail with the right issue links:
Passing case also verified with a hypothetical "all engines aligned" input — exits 0, summary shows all green.
Test plan
- `gh workflow run benchmark.yml -f version=3.9.5` confirms gate fails with expected table

Related
- `string` AST nodes vs WASM — 7.5% DB bloat in 3.9.5 (#1010)
- bug(parity): native orchestrator still drops 49/743 files on 3.9.5 self-build (re-occurrence of #967) (#1011)
- perf(native): 1-file incremental rebuild is 4.3x slower than WASM on 3.9.5 (876ms vs 203ms — regressed from #882) (#1012)
- perf(native): full-build `edges` +73% and `roles` +334% slower than WASM on 3.9.5 despite producing fewer outputs (#1013)

These are the four parity bugs the gate thresholds reference.