ci: unified benchmark suite with full baselines and regression gate (#121)
* ci: unified benchmark suite with full baselines and regression gate
* fix(bench): address review feedback and seed ubuntu baselines
* fix(bench): gate all baseline benchmarks and validate finite ratios
* fix(bench): harden reduce_baselines and fix Python 3.10 CI
* fix(bench): address bradjin8 review on search, cache, and baselines
* chore(bench): refresh baselines from ubuntu CI run 28206463463
* bench: exclude round_trip from gate; refresh baselines from latest CI
test_summary_cache_round_trip calls set/get each round; OS page-cache
state causes 3-5x variation between CI runs (0.000314s vs 0.001137s).
Add to EXCLUDED_FROM_GATE with comment; baseline kept for observation.
Regenerate baselines.json from run 28206913751 (ubuntu-latest, 1.5x
slack). Update README to document the exclusion and rationale.
Co-authored-by: Cursor <cursoragent@cursor.com>
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
| export | `POST /api/export` (ZIP) over 10 / 50 composer corpora (capped at 50 for CI runtime; parse goes to 200) |
| search | `GET /api/search` over a 50-composer corpus: **live-scan** (`test_search_full_corpus_live_scan`, `NO_SEARCH_INDEX=1`) and **FTS index** (`test_search_full_corpus_indexed`, pre-built index) |

Synthetic corpora are built in `tests/benchmarks/conftest.py`, with no real Cursor storage dependency.
### Adding a benchmark group
Every `@pytest.mark.benchmark(group="...")` name must appear in `GATED_GROUPS` inside `scripts/reduce_baselines.py`. Otherwise `reduce_baselines.py` fails at refresh time with an unknown-group error. Update both the test marker and `GATED_GROUPS` when introducing a new group.
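The unknown-group check described above can be sketched roughly as follows. This is an illustrative sketch, not the code in `scripts/reduce_baselines.py`: the group names placed in `GATED_GROUPS`, the helper `find_unknown_groups`, and the benchmark name `test_new_feature` are hypothetical, and the input is assumed to follow the pytest-benchmark JSON layout in which each entry under `benchmarks` carries a `group` field.

```python
# Hypothetical group names; the real set lives in scripts/reduce_baselines.py.
GATED_GROUPS = {"parse", "export", "search"}


def find_unknown_groups(results, gated_groups=GATED_GROUPS):
    """Return sorted group names used in results but missing from gated_groups.

    Benchmarks without a group are ignored in this sketch.
    """
    seen = {
        bench["group"]
        for bench in results.get("benchmarks", [])
        if bench.get("group")
    }
    return sorted(seen - set(gated_groups))


# Assumed shape of a pytest-benchmark JSON report, reduced to the relevant keys.
results = {
    "benchmarks": [
        {"name": "test_search_full_corpus_indexed", "group": "search"},
        {"name": "test_new_feature", "group": "new-group"},  # hypothetical
    ]
}

unknown = find_unknown_groups(results)
if unknown:
    print(f"unknown benchmark group(s): {', '.join(unknown)}")
```

Under these assumptions, adding `"new-group"` to `GATED_GROUPS` in the same change as the new test marker makes the check pass.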
## CI gate
The `benchmarks` job on **ubuntu-latest** runs the full `tests/benchmarks/` suite (`--benchmark-json=benchmark-results.json`), then `scripts/check_benchmark_regression.py benchmark-results.json benchmarks/baselines.json`.
- **Fail** when a gated mean exceeds its baseline by **>20%**
- **Fail** when a gated mean is **<50%** of baseline (stale baseline; refresh after intentional speedups)
- **Fail** when a gated baseline name has no current result
- **Warn** for benchmarks without a baseline entry
- All benchmarks listed in `baselines.json` are gated unless named in `EXCLUDED_FROM_GATE` in `scripts/check_benchmark_regression.py`
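The gate rules above can be sketched as a small function. This is a minimal sketch under stated assumptions, not the contents of `scripts/check_benchmark_regression.py`: the function `check_gate`, the constant names, the flat `name -> mean` dictionaries, and the sample benchmark names `test_a`, `test_b`, `test_c`, and `test_new` are all hypothetical. The finite-value guard is an assumption based on the commit message's mention of validating finite ratios.

```python
import math

REGRESSION_LIMIT = 1.20  # fail when the current mean exceeds baseline by more than 20%
STALE_LIMIT = 0.50       # fail when the current mean is under 50% of baseline


def check_gate(current, baselines, excluded=frozenset()):
    """Return (failures, warnings) as lists of messages.

    current and baselines map benchmark name -> mean time in seconds.
    """
    failures, warnings = [], []
    for name, base in baselines.items():
        if name in excluded:
            continue  # recorded for observation only, never gated
        if name not in current:
            failures.append(f"{name}: no current result")
            continue
        mean = current[name]
        if not (math.isfinite(mean) and math.isfinite(base) and base > 0):
            failures.append(f"{name}: non-finite or non-positive timing")
            continue
        ratio = mean / base
        if ratio > REGRESSION_LIMIT:
            failures.append(f"{name}: regression (ratio {ratio:.2f})")
        elif ratio < STALE_LIMIT:
            failures.append(f"{name}: stale baseline (ratio {ratio:.2f})")
    for name in current:
        if name not in baselines:
            warnings.append(f"{name}: no baseline entry")
    return failures, warnings


baselines = {
    "test_a": 0.010,
    "test_b": 0.010,
    "test_c": 0.010,
    "test_summary_cache_round_trip": 0.001,
}
current = {
    "test_a": 0.013,  # 30% slower than baseline
    "test_b": 0.004,  # 40% of baseline
    "test_summary_cache_round_trip": 0.004,
    "test_new": 0.002,
}
failures, warnings = check_gate(
    current, baselines, excluded={"test_summary_cache_round_trip"}
)
for message in failures:
    print("FAIL", message)
for message in warnings:
    print("WARN", message)
```

In this sketch the excluded benchmark is skipped entirely, so a large swing in its timing never produces a failure.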
Sub-millisecond benches (e.g. `test_summary_cache_lookup`, `test_composer_map_cache_lookup`) can be high-variance on shared runners. If the gate becomes flaky, raise `--slack` for those entries or add targeted exclusions in `EXCLUDED_FROM_GATE`.

`test_summary_cache_round_trip` is intentionally excluded from the gate: it calls `set_cached_projects` (file write) + `get_cached_projects` (file read) each round, so OS page-cache state on shared runners causes 3–5x variation between consecutive CI runs. The baseline entry is kept for observation only.
## Refresh baselines
After intentional performance work, capture on **ubuntu-latest** (same OS as the gated CI job). Download `benchmark-results.json` from a CI artifact when possible:
For a quick local snapshot only (may not match CI timings):
```bash
make seed-baselines-local
# writes benchmarks/_raw.json only; does not overwrite benchmarks/baselines.json
make seed-baselines-local FORCE=1 # also runs reduce_baselines into benchmarks/baselines.json
```
`make update-baselines` is a deprecated alias for `seed-baselines-local`. Do not commit baselines from macOS/Windows unless you accept cross-OS gate skew.
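As a rough illustration of the reduce step, the sketch below turns a pytest-benchmark report into slack-adjusted baseline means. It is not the implementation of `scripts/reduce_baselines.py`: the function `reduce_to_baselines` and the flat `name -> mean` output shape are hypothetical, the 1.5x multiplier is taken from the commit message, and the input is assumed to use the pytest-benchmark JSON layout with `stats.mean` per benchmark.

```python
import json
import math

SLACK = 1.5  # multiplier applied at generation time, per the commit message


def reduce_to_baselines(raw, slack=SLACK):
    """Map benchmark name -> mean * slack from a pytest-benchmark JSON dict.

    Raises ValueError on a non-finite or non-positive mean so that a bad
    capture cannot end up in the baselines file.
    """
    baselines = {}
    for bench in raw.get("benchmarks", []):
        mean = bench["stats"]["mean"]
        if not math.isfinite(mean) or mean <= 0:
            raise ValueError(f"{bench['name']}: invalid mean {mean!r}")
        baselines[bench["name"]] = mean * slack
    return baselines


# Made-up timing for illustration; real values come from a CI artifact.
raw = {
    "benchmarks": [
        {"name": "test_summary_cache_lookup", "stats": {"mean": 0.0002}},
    ]
}
print(json.dumps(reduce_to_baselines(raw), indent=2))
```

Rejecting invalid means at generation time keeps the later gate comparison from dividing by zero or comparing against `nan`.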
## Makefile targets
| Target | Purpose |
|--------|---------|
| `make check-benchmarks` | Run suite + regression gate locally |
| `make seed-baselines-local` | Capture local timings to `benchmarks/_raw.json` (use `FORCE=1` to update `baselines.json`) |
| `make clean-benchmark-artifacts` | Remove `benchmark-results.json` and `benchmarks/_raw.json` |

-  "_note": "Gated means from ubuntu-latest CI benchmark-results.json (PR #120, run 28123677675). Refresh after intentional perf changes: download benchmark-results.json from the CI artifacts job, then `python scripts/check_benchmark_regression.py benchmark-results.json benchmarks/baselines.json` (re-seed with reduce_baselines or edit means). Local capture: `pytest tests/benchmarks/ --benchmark-only --benchmark-json=benchmark-results.json -o addopts=` on ubuntu-latest.",
-  "updated": "2026-06-24T19:20:27Z",
+  "_note": "Gated means from ubuntu-latest CI benchmark-results.json. Values multiplied by 1.5x slack at generation time. Excluded from gate (recorded for reference): test_summary_cache_round_trip. Refresh after intentional speedups via reduce_baselines.py.",