
ci(bench): gate release benchmark on engine parity thresholds#1014

Merged
carlos-alm merged 3 commits into main from ci/benchmark-parity-gate on Apr 24, 2026

Conversation

@carlos-alm
Contributor

Summary

The Benchmark workflow already fires after every release via workflow_run on Publish completion. It ran for v3.9.5 on 2026-04-23 but failed silently with Both engines failed. No results to report. (both workers hit the 600 s timeout — the hang fixed by #1009).

This PR extends the release-triggered benchmark with an engine parity gate: after both engines run and the benchmark doc PR is opened, a new step validates parity thresholds against the merged benchmark-result.json and fails the workflow if any threshold is breached.

Behavior

  • Runs after the doc PR is created, so raw numbers still land in generated/benchmarks/BUILD-BENCHMARKS.md on every release — parity regressions don't wipe the historical record.
  • Writes a markdown table to $GITHUB_STEP_SUMMARY with pass/fail per check and a link to the tracking issue, so reviewers see the regression at a glance.
  • Exits 0 silently when all thresholds pass. No churn on healthy releases.

Thresholds

| Check | Limit | Tracks |
| --- | --- | --- |
| File-set gap (\|wasm − native\|) | ≤ 2 | #1011 |
| DB size ratio (native / wasm) | ≤ 1.02 | #1010 |
| Full-build edges-phase ratio | ≤ 1.30 | #1013 |
| Full-build roles-phase ratio | ≤ 1.30 | #1013 |
| 1-file incremental ratio | ≤ 1.50 | #1012 |

Limits are tighter than the current v3.9.5 numbers, so the gate will fail on the first release after #1009 merges — that's intentional. The failure is the dashboard: maintainers see which of the four open parity bugs are still biting each release. As each bug is fixed, the next release's gate steps closer to green.
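The gate's mechanics reduce to a small pass/fail loop. A minimal sketch, assuming the check shape (name, actual, limit, tracks) quoted in the review snippets; this is an illustration, not the actual scripts/benchmark-parity-gate.mjs:

```javascript
// Minimal sketch of the gate's evaluation: every limit is a ceiling,
// a check passes when actual <= limit, and any failure flips the
// exit code to 1 so the workflow goes red.
function runChecks(checks) {
  const rows = checks.map((c) => ({ ...c, ok: c.actual <= c.limit }));
  return { rows, exitCode: rows.every((r) => r.ok) ? 0 : 1 };
}

// v3.9.5-style input: both checks breach their ceilings.
const { rows, exitCode } = runChecks([
  { name: 'File-set gap (|wasm − native|)', actual: 49, limit: 2, tracks: '#1011' },
  { name: 'DB size ratio (native / wasm)', actual: 1.075, limit: 1.02, tracks: '#1010' },
]);
// exitCode is 1; each row keeps its ok flag and tracking issue for the summary table
```

In the real workflow the non-zero exit is what fails the job; per the PR description the same rows are also rendered as a markdown table into $GITHUB_STEP_SUMMARY.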

Verification

Smoke-tested the gate script locally against the v3.9.5 benchmark numbers — all 5 checks correctly fail with the right issue links:

| Check | Actual | Limit | Result | Tracks |
| --- | --- | --- | --- | --- |
| File-set gap (\|wasm − native\|) | 49 | 2 | :x: fail | #1011 |
| DB size ratio (native / wasm) | 1.075 | 1.020 | :x: fail | #1010 |
| Full-build edges-phase ratio | 1.74 | 1.30 | :x: fail | #1013 |
| Full-build roles-phase ratio | 4.35 | 1.30 | :x: fail | #1013 |
| 1-file incremental ratio | 4.32 | 1.50 | :x: fail | #1012 |

Passing case also verified with a hypothetical "all engines aligned" input — exits 0, summary shows all green.
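That passing case can be reproduced with a synthetic fixture. The field names (files, dbSizeBytes, phases.edgesMs) are inferred from the review snippets and the numbers are invented; with identical engine results every ratio is exactly 1 and the file-set gap is 0, so all five ceilings pass:

```javascript
// Hypothetical "all engines aligned" fixture: wasm and native share
// the same result object, so every parity check sits at its floor.
const engine = { files: 1200, dbSizeBytes: 5_000_000, phases: { edgesMs: 800 } };
const result = { version: '3.9.6', wasm: engine, native: engine };

const gap = Math.abs(result.wasm.files - result.native.files); // 0, under the ≤ 2 ceiling
const dbRatio = result.native.dbSizeBytes / Math.max(result.wasm.dbSizeBytes ?? 1, 1); // 1, under ≤ 1.02
const edgesRatio = result.native.phases.edgesMs / Math.max(result.wasm.phases?.edgesMs ?? 1, 1); // 1, under ≤ 1.30
```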

Test plan

Related

The Benchmark workflow already runs after every release (on workflow_run
completion of Publish). Add a parity gate to the build-benchmark job so
drift between the native and wasm engines fails the workflow — catching
regressions the benchmark data would otherwise silently record.

The gate runs after the benchmark doc PR is created, so the raw numbers
still land in generated/benchmarks/BUILD-BENCHMARKS.md even when parity
regresses; only the workflow status goes red to alert maintainers.

Thresholds reference the currently-open parity bugs on v3.9.5:
  - File-set gap |wasm - native| ≤ 2            (#1011)
  - DB size ratio native/wasm ≤ 1.02            (#1010)
  - Full-build edges-phase ratio ≤ 1.30         (#1013)
  - Full-build roles-phase ratio ≤ 1.30         (#1013)
  - 1-file incremental ratio ≤ 1.50             (#1012)

The gate writes a markdown table to \$GITHUB_STEP_SUMMARY showing pass/fail
per threshold with a direct link to the tracking issue, so reviewers see
the regression at a glance without digging through logs.

No behavior change on the passing path — when both engines are within
thresholds the step exits 0 silently.

Impact: 2 functions changed, 2 affected
@claude

claude Bot commented Apr 24, 2026

Claude finished @carlos-alm's task — View job


Reviewing PR #1014: ci(bench): gate release benchmark on engine parity thresholds

Todo List

  • Read repository CLAUDE.md for context and guidelines
  • Examine changed files: .github/workflows/benchmark.yml and scripts/benchmark-parity-gate.mjs
  • Conduct Phase 1: Root Cause Analysis
  • Conduct Phase 2: Configuration Change Detection
  • Conduct Phase 3: Test Integrity
  • Conduct Phase 4: Code Quality
  • Conduct Phase 5: Backlog Compliance
  • Provide final assessment and rating

Root Cause Analysis

Problem Being Solved: The benchmark workflow runs after releases but provides no automated feedback on engine parity regressions. The v3.9.5 benchmark failed with "Both engines failed. No results to report." due to timeouts, and even when successful, maintainers must manually check for parity issues across 5+ metrics between native and WASM engines.

Why This Approach: Adding a parity gate as the final step preserves the existing documentation behavior (benchmark data still gets recorded in PRs) while providing automated failure signals to maintainers. The thresholds are calibrated tighter than current v3.9.5 numbers, ensuring the gate will surface known issues (#1010-#1013) until they're fixed.

Risk Assessment: Low-risk addition. The gate runs after all documentation is complete, so failures don't prevent historical record-keeping. The script is defensive (handles missing engines gracefully) and the thresholds are well-documented with issue links.

Configuration Change Detection (HIGH SCRUTINY)

No concerning changes detected. The workflow addition is purely additive:

  • New step added after existing doc PR creation (lines 204-206)
  • No existing rules, timeouts, or thresholds relaxed
  • No test configuration changes
  • The script hardcodes reasonable thresholds that align with project goals

Test Integrity

No test modifications. This PR adds CI validation but doesn't touch the test suite. The gate validates production benchmark results rather than test outputs, which is appropriate.

Verification approach: Author smoke-tested locally against v3.9.5 data and confirmed all 5 checks fail as expected (showing the gate works). This is adequate for a simple validation script.

Code Quality

benchmark.yml (lines 204-206):
Proper placement: Runs after doc PR creation so historical data is preserved regardless of gate outcome
Defensive conditions: Respects the same if: steps.existing.outputs.skip != 'true' condition as other steps
Clear intent: Comments explain the design decision (lines 201-203)

scripts/benchmark-parity-gate.mjs:
Robust error handling:

  • Validates command line args (lines 25-29)
  • Handles missing engine results gracefully (lines 47-51)
  • Defensive calculations with Math.max(..., 1) to prevent division by zero (lines 77, 84, 93)

Clear data structures: Threshold checks are well-organized with descriptive names, limits, formatters, and issue tracking (lines 60-98)

Appropriate output: Uses GitHub Actions summary alongside console output (lines 34-42), proper exit codes

⚠️ Minor concern: Division by zero protection could be more explicit. While Math.max(wasm.phases?.edgesMs ?? 1, 1) works, using dedicated checks for zero denominators would be clearer.
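The more explicit handling the review has in mind could look like this hypothetical helper (not code from the PR): surface a zero or missing denominator as a sentinel rather than clamping it to 1 inline.

```javascript
// Hypothetical helper: make the zero/missing-denominator case explicit.
function safeRatio(numerator, denominator) {
  const n = numerator ?? 0;
  const d = denominator ?? 0;
  // A missing or zero denominator can never legitimately pass a
  // ceiling check, so surface it as Infinity rather than a quiet NaN.
  if (d <= 0) return Number.POSITIVE_INFINITY;
  return n / d;
}

// e.g. safeRatio(native.phases.edgesMs, wasm.phases?.edgesMs)
```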

Security: No external dependencies, no shell execution, straightforward file I/O

Backlog Compliance

Zero-dep: ✅ Uses only Node.js built-ins (fs, path). No new dependencies.

Foundation-aligned: ✅ Fully compliant with all principles:

  • P1 (always-current): Validates live benchmark data from actual builds
  • P4 (zero-cost core): Pure validation, no runtime overhead
  • P8 (not an agent): Passive gate that reports but doesn't take action
  • P5 (embeddable-first): Doesn't affect the programmatic API

Problem-fit: 4/5 — Directly supports the core problem by preventing engine parity regressions that would cause AI assistants to get inconsistent results when switching engines. The CLAUDE.md explicitly states "Both engines must produce identical results" — this enforces that requirement.

Breaking: No — Pure addition to CI workflow, no API or behavior changes

Tier: 1 (zero-dep ✓ + foundation-aligned ✓)

Critical Concerns

None identified. The implementation is well-designed and follows project conventions.

Final Recommendation

Rating: ⭐⭐⭐⭐☆ (4/5)

Action: APPROVE

Reasoning: This is a high-quality, well-thought-out addition that directly addresses a real operational need. The implementation is robust, well-documented, and follows project conventions. The gate design (running after doc updates) cleverly preserves historical data while providing automated feedback. The thresholds are appropriate and reference the specific issues they track.

Why 4/5 instead of 5/5: While excellent overall, this is fundamentally infrastructure automation rather than breakthrough functionality. The minor division-by-zero handling could be cleaner, and the approach is straightforward rather than innovative.

The PR strongly supports the project's dual-engine architecture goals and will provide valuable continuous integration feedback on engine parity — exactly what's needed as related issues #1009-#1013 are resolved.


@greptile-apps
Contributor

greptile-apps Bot commented Apr 24, 2026

Greptile Summary

This PR adds an engine parity gate to the existing release benchmark workflow: after the doc PR is opened, a new step runs benchmark-parity-gate.mjs against benchmark-result.json and fails the job if any of five wasm/native divergence thresholds are breached. The script is well-structured, the previously-flagged issues (zero-guards, error handling, exit codes) have all been addressed, and the step ordering correctly preserves the historical record even on gate failure.

Confidence Score: 4/5

Safe to merge; the two remaining findings are both P2 quality/resilience suggestions that don't affect correctness on the happy path.

No P0 or P1 issues remain. Two P2 findings: the implicit success() in the gate's if condition (gate silently skipped on PR-step failure) and the missing version fallback producing vundefined in the summary header. Neither blocks the intended alerting behaviour on a normal run.

No files require special attention beyond the two P2 suggestions above.
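The vundefined finding has a one-line remedy. A sketch of the fallback, as a hypothetical helper rather than the PR's code:

```javascript
// Sketch of the version fallback: avoid rendering "vundefined" in the
// summary header when the result file has no version field.
function summaryHeader(result) {
  const v = result.version ? `v${result.version}` : '(unknown version)';
  return `## Engine parity gate — ${v}`;
}
```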

Important Files Changed

Filename Overview
scripts/benchmark-parity-gate.mjs New gate script — file I/O and error handling look solid after prior round of fixes; minor nit around missing version fallback in the summary header.
.github/workflows/benchmark.yml Adds one new step at the end of build-benchmark; step ordering and skip guard are correct, but implicit success() means the gate is silently skipped when the PR-creation step fails.

Sequence Diagram

sequenceDiagram
    participant P as Publish workflow
    participant B as build-benchmark job
    participant G as benchmark-parity-gate.mjs
    participant S as $GITHUB_STEP_SUMMARY

    P->>B: workflow_run (completed)
    B->>B: Run build benchmark → benchmark-result.json
    B->>B: Merge resolution result into benchmark-result.json
    B->>B: Update build report
    B->>B: Commit & push via PR (if changes exist)
    B->>G: node benchmark-parity-gate.mjs benchmark-result.json
    G->>G: Read & parse benchmark-result.json
    alt wasm or native missing
        G->>S: SKIP — gate passes (exit 0)
    else both engines present
        G->>G: Compute 5 parity checks
        G->>S: Write markdown table
        alt any threshold breached
            G-->>B: exit 1 → job fails (alert maintainers)
        else all pass
            G-->>B: exit 0 → job succeeds
        end
    end


Reviews (2): Last reviewed commit: "fix: guard parity gate against missing/z..."

Comment thread scripts/benchmark-parity-gate.mjs Outdated
Comment on lines +43 to +50

line(`## Engine parity gate — v${version}`);
line('');

if (!wasm || !native) {
const missing = [!wasm && 'wasm', !native && 'native'].filter(Boolean).join(', ');
line(`**FAIL:** missing engine result for: ${missing}. Benchmark cannot assert parity.`);
process.exit(1);
Contributor


P1 Missing-engine case exits 1, contradicting the inline comment

The script-level JSDoc on line 19 explicitly states: "If one engine failed, we leave the gate passing…" but process.exit(1) on line 50 does the opposite — it fails the gate. Because the gate runs after the doc PR step, a red job here won't block any data from being recorded, but it will fire a false-alarm in every run where only one engine finishes (e.g., during the window right before #1009 is fully deployed).

Suggested change
line(`## Engine parity gate — v${version}`);
line('');
if (!wasm || !native) {
const missing = [!wasm && 'wasm', !native && 'native'].filter(Boolean).join(', ');
line(`**FAIL:** missing engine result for: ${missing}. Benchmark cannot assert parity.`);
process.exit(1);
if (!wasm || !native) {
const missing = [!wasm && 'wasm', !native && 'native'].filter(Boolean).join(', ');
line(`**SKIP:** missing engine result for: ${missing}. Cannot assert parity — gate passes.`);
process.exit(0);
}


Contributor Author


Fixed in b789411 — missing-engine branch now logs SKIP and exits 0, matching the documented pass-through intent in the file-level JSDoc.

Comment on lines +62 to +66
name: 'File-set gap (|wasm − native|)',
actual: Math.abs(wasm.files - native.files),
limit: 2,
formatter: (v) => String(v),
tracks: '#1011',
Contributor


P1 wasm.dbSizeBytes has no zero-guard, unlike every timing check

All four timing-based ratio checks defensively use Math.max(... ?? 1, 1) as the denominator. The DB-size ratio does not. If wasm.dbSizeBytes is 0 (empty DB from a partial wasm run), the result is Infinity, which fails the <= 1.02 threshold — a false positive. If it is undefined, the result is NaN, which also fails.

Suggested change
name: 'File-set gap (|wasm − native|)',
actual: Math.abs(wasm.files - native.files),
limit: 2,
formatter: (v) => String(v),
tracks: '#1011',
{
name: 'DB size ratio (native / wasm)',
actual: native.dbSizeBytes / Math.max(wasm.dbSizeBytes ?? 1, 1),
limit: 1.02,
formatter: (v) => v.toFixed(3),
tracks: '#1010',
},


Contributor Author


Fixed in b789411 — DB size ratio now uses (native.dbSizeBytes ?? 0) / Math.max(wasm.dbSizeBytes ?? 1, 1), matching the zero-guard pattern used by all four timing checks. Verified locally with wasm.dbSizeBytes = 0 — no more Infinity.

Comment on lines +57 to +61
// limit — ceiling; actual must be ≤ limit
// formatter — how to render the value
// tracks — related issue link shown on failure
const checks = [
{
Contributor


P2 File-set gap check doesn't guard against undefined fields

If either wasm.files or native.files is undefined (e.g., a partial result), Math.abs(undefined - undefined) produces NaN. NaN <= 2 is false, so the check silently fails as a false positive with no indication in the summary of why. Worth adding a ?? 0 fallback consistent with how the timing checks handle missing data.

Suggested change
// limit — ceiling; actual must be ≤ limit
// formatter — how to render the value
// tracks — related issue link shown on failure
const checks = [
{
{
name: 'File-set gap (|wasm − native|)',
actual: Math.abs((wasm.files ?? 0) - (native.files ?? 0)),
limit: 2,
formatter: (v) => String(v),
tracks: '#1011',
},


Contributor Author


Fixed in b789411 — file-set gap now uses Math.abs((wasm.files ?? 0) - (native.files ?? 0)) so undefined fields no longer collapse to NaN.

Comment thread scripts/benchmark-parity-gate.mjs Outdated
Comment on lines +29 to +31
}

const result = JSON.parse(fs.readFileSync(resultFile, 'utf8'));
Contributor


P2 readFileSync throws an opaque stack trace on missing file

If benchmark-result.json doesn't exist (e.g., the build benchmark step was skipped by a prior failure), Node throws a raw ENOENT stack trace. Wrapping this in a try/catch gives maintainers a cleaner diagnostic message that matches the script's own error-reporting style.

Suggested change
}
const result = JSON.parse(fs.readFileSync(resultFile, 'utf8'));
let result;
try {
result = JSON.parse(fs.readFileSync(resultFile, 'utf8'));
} catch (err) {
console.error(`Failed to read ${resultFile}: ${err.message}`);
process.exit(2);
}
const { wasm, native, version } = result;


Contributor Author


Fixed in b789411 — readFileSync/JSON.parse wrapped in try/catch; exits 2 with a clean Failed to read <file>: <err.message> diagnostic instead of a raw ENOENT stack trace.

@github-actions
Contributor

github-actions Bot commented Apr 24, 2026

Codegraph Impact Analysis

5 functions changed, 6 callers affected across 3 files

  • wasm in scripts/benchmark-parity-gate.mjs:37 (4 transitive callers)
  • native in scripts/benchmark-parity-gate.mjs:37 (4 transitive callers)
  • version in scripts/benchmark-parity-gate.mjs:37 (0 transitive callers)
  • writeSummary in scripts/benchmark-parity-gate.mjs:40 (2 transitive callers)
  • line in scripts/benchmark-parity-gate.mjs:44 (1 transitive caller)

- Missing engine now exits 0 (gate passes) to match the documented
  behavior in the script-level JSDoc, instead of firing a false-alarm
  red job when only one engine finishes.
- Add nullish/zero guards to file-set gap and DB-size ratio checks so
  undefined or zero fields in a partial result no longer produce NaN
  or Infinity.
- Wrap the result-file read in try/catch to print a clean diagnostic
  instead of a raw ENOENT stack trace.
- Drop the unused path import.
@carlos-alm
Contributor Author

@greptileai

@carlos-alm carlos-alm merged commit 2e86f56 into main Apr 24, 2026
14 checks passed
@carlos-alm carlos-alm deleted the ci/benchmark-parity-gate branch April 24, 2026 21:54
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 24, 2026