Commit fb97d03

ci: unified benchmark suite with full baselines and regression gate (#121)
* ci: unified benchmark suite with full baselines and regression gate
* fix(bench): address review feedback and seed ubuntu baselines
* fix(bench): gate all baseline benchmarks and validate finite ratios
* fix(bench): harden reduce_baselines and fix Python 3.10 CI
* fix(bench): address bradjin8 review on search, cache, and baselines
* chore(bench): refresh baselines from ubuntu CI run 28206463463
* bench: exclude round_trip from gate; refresh baselines from latest CI

  test_summary_cache_round_trip calls set/get each round; OS page-cache state causes 3-5x variation between CI runs (0.000314s vs 0.001137s). Add to EXCLUDED_FROM_GATE with comment; baseline kept for observation.

  Regenerate baselines.json from run 28206913751 (ubuntu-latest, 1.5x slack). Update README to document the exclusion and rationale.

  Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent ecd08c6 commit fb97d03

15 files changed

Lines changed: 915 additions & 37 deletions
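
As a worked check of the numbers quoted in the commit message, the sketch below compares the two `test_summary_cache_round_trip` timings (0.000314s and 0.001137s) against the ratios used in this commit (1.5x slack, 1.20 regression threshold, 0.50 stale floor). Which of the two runs seeds the baseline is treated as a hypothetical here, and all variable names are illustrative only.

```python
# Timings quoted in the commit message for test_summary_cache_round_trip.
fast_run = 0.000314
slow_run = 0.001137

# Ratios used elsewhere in this commit.
SLACK = 1.5          # multiplier applied when baselines are generated
THRESHOLD = 1.20     # fail when current / baseline exceeds this
STALE_FLOOR = 0.50   # fail when current / baseline drops below this

# Raw run-to-run spread, independent of any baseline.
spread = slow_run / fast_run

# Hypothetical case A: baseline seeded from the fast run, next run is slow.
ratio_if_seeded_fast = slow_run / (fast_run * SLACK)

# Hypothetical case B: baseline seeded from the slow run, next run is fast.
ratio_if_seeded_slow = fast_run / (slow_run * SLACK)

print(f"spread between runs: {spread:.2f}x")
print(f"seeded fast, then slow: {ratio_if_seeded_fast:.2f}x (regression: {ratio_if_seeded_fast > THRESHOLD})")
print(f"seeded slow, then fast: {ratio_if_seeded_slow:.2f}x (stale: {ratio_if_seeded_slow < STALE_FLOOR})")
```

Under these assumptions the gate trips in both directions, which is consistent with the decision to exclude this benchmark rather than widen the slack.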

.github/workflows/tests.yml

Lines changed: 3 additions & 3 deletions
@@ -115,7 +115,7 @@ jobs:
         # exercise Flask routes via app.test_client(). Only listed files — not
         # `pytest tests/` — to avoid re-collecting unittest.TestCase classes above.
         # -o addopts= avoids inheriting benchmark-only options from pyproject.toml.
-        run: python -m pytest tests/test_api_search.py tests/test_api_workspaces.py tests/test_api_export.py tests/test_pdf_export.py tests/test_search_helpers.py tests/test_check_benchmark_regression.py -v --tb=short -o addopts=
+        run: python -m pytest tests/test_api_search.py tests/test_api_workspaces.py tests/test_api_export.py tests/test_pdf_export.py tests/test_search_helpers.py tests/test_check_benchmark_regression.py tests/test_reduce_baselines.py -v --tb=short -o addopts=

   # ── PyInstaller desktop build (Windows only, once per workflow) ────────
   # Closes #44. Builds the onedir bundle and smoke-tests --help so the
@@ -215,7 +215,7 @@ jobs:
             --redact \
             --exit-code 1

-  # ── Performance benchmarks: summary cache (issue #115) ─────────────────────
+  # ── Performance benchmarks: unified suite (issues #115, #110) ──────────────
   benchmarks:
     name: Performance benchmarks (gated)
     needs: [unittest]
@@ -236,7 +236,7 @@ jobs:
           python -m pip install -r requirements-lock.txt
           python -m pip install 'pytest>=8,<9' 'pytest-benchmark==4.0.0'

-      - name: Run summary-cache benchmarks
+      - name: Run benchmark suite
         run: >
           python -m pytest tests/benchmarks/
           --benchmark-only

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -46,3 +46,5 @@ coverage.xml
 .hypothesis/
 benchmark-results.json
 benchmarks/_raw.json
+benchmarks/_merged.json
+benchmarks/_ci/

Makefile

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+.PHONY: seed-baselines-local update-baselines check-benchmarks clean-benchmark-artifacts
+
+# WARNING: captures timings on THIS machine. Production baselines must match ubuntu-latest CI.
+# Prefer downloading benchmark-results.json from a CI artifact, then:
+# python scripts/reduce_baselines.py benchmark-results.json benchmarks/baselines.json --slack 1.5
+seed-baselines-local:
+	@echo "WARNING: seed-baselines-local uses this host's timings; CI gates on ubuntu-latest." >&2
+	python -m pytest tests/benchmarks/ --benchmark-only --benchmark-json=benchmarks/_raw.json -o addopts=
+	python -c "import os, subprocess, sys; \
+	cmd = [sys.executable, 'scripts/reduce_baselines.py', 'benchmarks/_raw.json', 'benchmarks/baselines.json', '--slack', '1.5', '--source', 'local']; \
+	(subprocess.run(cmd, check=True), print('Updated benchmarks/baselines.json', file=sys.stderr)) if os.environ.get('FORCE') == '1' else print('Wrote benchmarks/_raw.json only. Set FORCE=1 to overwrite benchmarks/baselines.json.', file=sys.stderr)"
+
+# Deprecated alias — kept for muscle memory; see seed-baselines-local warning above.
+update-baselines: seed-baselines-local
+
+check-benchmarks:
+	python -m pytest tests/benchmarks/ --benchmark-only --benchmark-json=benchmark-results.json -o addopts=
+	python scripts/check_benchmark_regression.py benchmark-results.json benchmarks/baselines.json
+
+clean-benchmark-artifacts:
+	python -c "import pathlib; [p.unlink(missing_ok=True) for p in (pathlib.Path('benchmarks/_raw.json'), pathlib.Path('benchmark-results.json'))]"
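
The `FORCE` guard in `seed-baselines-local` is packed into a single `python -c` expression, which makes the branch hard to see. A minimal sketch of the same decision, unrolled into two small functions, might look like the following. The function names `reduce_command` and `should_overwrite` are hypothetical and do not exist in the repository; the paths and flags are copied from the Makefile recipe.

```python
import sys
from typing import Mapping


def reduce_command(python: str = sys.executable) -> list[str]:
    """Build the reduce_baselines.py invocation used by the Makefile recipe."""
    return [
        python,
        "scripts/reduce_baselines.py",
        "benchmarks/_raw.json",
        "benchmarks/baselines.json",
        "--slack", "1.5",
        "--source", "local",
    ]


def should_overwrite(env: Mapping[str, str]) -> bool:
    """Only the exact string '1' enables the overwrite, as in the recipe."""
    return env.get("FORCE") == "1"


if __name__ == "__main__":
    import os
    import subprocess

    if should_overwrite(os.environ):
        subprocess.run(reduce_command(), check=True)
        print("Updated benchmarks/baselines.json", file=sys.stderr)
    else:
        print("Wrote benchmarks/_raw.json only. Set FORCE=1 to overwrite benchmarks/baselines.json.", file=sys.stderr)
```

Note that the comparison is against the literal `'1'`, so values such as `FORCE=true` leave `baselines.json` untouched.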

benchmarks/README.md

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
+# Performance benchmarks
+
+Test files live under `tests/benchmarks/`; this directory holds documentation and `baselines.json` for the CI regression gate.
+
+Repeatable local measurements for workspace listing, export, search, and summary-cache hot paths.
+
+## Run locally
+
+```bash
+pip install -r requirements-lock.txt
+pip install 'pytest>=8,<9' 'pytest-benchmark==4.0.0'
+pytest tests/benchmarks/ --benchmark-only -o addopts= -v
+```
+
+## Scenarios
+
+| Group | What |
+|-------|------|
+| parse | `list_workspace_projects(..., nocache=True)` over 10 / 50 / 200 synthetic composers |
+| export | `POST /api/export` (ZIP) over 10 / 50 composer corpora (capped at 50 for CI runtime; parse goes to 200) |
+| search | `GET /api/search` over a 50-composer corpus — **live-scan** (`test_search_full_corpus_live_scan`, `NO_SEARCH_INDEX=1`) and **FTS index** (`test_search_full_corpus_indexed`, pre-built index) |
+| summary-cache | projects lookup (hit/miss), composer-map lookup (hit/miss), fingerprint (10/50/200), round-trip, tab-summary lookup |
+
+Synthetic corpora are built in `tests/benchmarks/conftest.py` — no real Cursor storage dependency.
+
+### Adding a benchmark group
+
+Every `@pytest.mark.benchmark(group="...")` name must appear in `GATED_GROUPS` inside `scripts/reduce_baselines.py`. Otherwise `reduce_baselines.py` fails at refresh time with an unknown-group error. Update both the test marker and `GATED_GROUPS` when introducing a new group.
+
+## CI gate
+
+The `benchmarks` job on **ubuntu-latest** runs the full `tests/benchmarks/` suite (`--benchmark-json=benchmark-results.json`), then `scripts/check_benchmark_regression.py benchmark-results.json benchmarks/baselines.json`.
+
+- **Fail** when a gated mean exceeds its baseline by **>20%**
+- **Fail** when a gated mean is **<50%** of baseline (stale — refresh after intentional speedups)
+- **Fail** when a gated baseline name has no current result
+- **Warn** for benchmarks without a baseline entry
+- All benchmarks listed in `baselines.json` are gated unless named in `EXCLUDED_FROM_GATE` in `scripts/check_benchmark_regression.py`
+
+Pinned runner: `ubuntu-latest`, `--benchmark-min-rounds=5`.
+
+Sub-millisecond benches (e.g. `test_summary_cache_lookup`, `test_composer_map_cache_lookup`) can be high-variance on shared runners. If the gate becomes flaky, raise `--slack` for those entries or add targeted exclusions in `EXCLUDED_FROM_GATE`.
+
+`test_summary_cache_round_trip` is intentionally excluded from the gate: it calls `set_cached_projects` (file write) + `get_cached_projects` (file read) each round, so OS page-cache state on shared runners causes 3–5x variation between consecutive CI runs. The baseline entry is kept for observation only.
+
+## Refresh baselines
+
+After intentional performance work, capture on **ubuntu-latest** (same OS as the gated CI job). Download `benchmark-results.json` from a CI artifact when possible:
+
+```bash
+python scripts/reduce_baselines.py benchmark-results.json benchmarks/baselines.json --slack 1.5 --source ubuntu-latest-ci
+```
+
+For a quick local snapshot only (may not match CI timings):
+
+```bash
+make seed-baselines-local
+# writes benchmarks/_raw.json only; does not overwrite benchmarks/baselines.json
+make seed-baselines-local FORCE=1 # also runs reduce_baselines into benchmarks/baselines.json
+```
+
+`make update-baselines` is a deprecated alias for `seed-baselines-local`. Do not commit baselines from macOS/Windows unless you accept cross-OS gate skew.
+
+## Makefile targets
+
+| Target | Purpose |
+|--------|---------|
+| `make check-benchmarks` | Run suite + regression gate locally |
+| `make seed-baselines-local` | Capture local timings to `benchmarks/_raw.json` (use `FORCE=1` to update `baselines.json`) |
+| `make clean-benchmark-artifacts` | Remove `benchmark-results.json` and `benchmarks/_raw.json` |
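
The README's gate percentages apply to the stored baseline, not to the raw measured mean. Assuming `reduce_baselines.py` stores `raw_mean * slack` (as the `_note` field in `baselines.json` states, though the script itself is not shown in this diff), the passing window relative to the raw mean can be sketched as follows. The helper name `gate_window` is hypothetical.

```python
import math

SLACK = 1.5          # --slack value used in the README refresh command
THRESHOLD = 1.20     # regression threshold from check_benchmark_regression.py
STALE_FLOOR = 0.50   # stale floor from check_benchmark_regression.py


def gate_window(
    raw_mean: float,
    slack: float = SLACK,
    threshold: float = THRESHOLD,
    stale_floor: float = STALE_FLOOR,
) -> tuple[float, float]:
    """Return (lowest, highest) current mean that still passes the gate."""
    baseline = raw_mean * slack  # assumed storage rule: mean times slack
    return baseline * stale_floor, baseline * threshold


# Example: a benchmark whose raw CI mean was 10 ms.
low, high = gate_window(0.010)
print(f"passes between {low * 1000:.1f} ms and {high * 1000:.1f} ms")
```

With the default values this works out to a window of 0.75x to 1.8x of the originally measured mean, so a run can be up to 80% slower than the measurement that seeded the baseline before the gate fails.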

benchmarks/baselines.json

Lines changed: 25 additions & 8 deletions
@@ -1,15 +1,32 @@
 {
-  "_note": "Gated means from ubuntu-latest CI benchmark-results.json (PR #120, run 28123677675). Refresh after intentional perf changes: download benchmark-results.json from the CI artifacts job, then `python scripts/check_benchmark_regression.py benchmark-results.json benchmarks/baselines.json` (re-seed with reduce_baselines or edit means). Local capture: `pytest tests/benchmarks/ --benchmark-only --benchmark-json=benchmark-results.json -o addopts=` on ubuntu-latest.",
-  "updated": "2026-06-24T19:20:27Z",
+  "_note": "Gated means from ubuntu-latest CI benchmark-results.json. Values multiplied by 1.5x slack at generation time. Excluded from gate (recorded for reference): test_summary_cache_round_trip. Refresh after intentional speedups via reduce_baselines.py.",
+  "updated": "2026-06-25T23:36:11Z",
   "machine": "Linux",
   "groups": {
+    "parse": {
+      "test_list_workspace_projects_nocache[composers-10]": 0.016421750017237738,
+      "test_list_workspace_projects_nocache[composers-50]": 0.07185380692856874,
+      "test_list_workspace_projects_nocache[composers-200]": 0.2388664538571439
+    },
+    "export": {
+      "test_post_export_zip[composers-10]": 0.010621589857140498,
+      "test_post_export_zip[composers-50]": 0.03968703356250458
+    },
+    "search": {
+      "test_search_full_corpus_live_scan": 0.04461661563157736,
+      "test_search_full_corpus_indexed": 0.05512249660713918
+    },
     "summary-cache": {
-      "test_summary_cache_hit": 6.3e-05,
-      "test_summary_cache_miss": 6.3e-05,
-      "test_fingerprint_workspace_entries[10]": 0.001844,
-      "test_fingerprint_workspace_entries[50]": 0.007759,
-      "test_fingerprint_workspace_entries[200]": 0.022231,
-      "test_summary_cache_round_trip": 0.000351
+      "test_summary_cache_lookup[hit]": 7.249851343825762e-05,
+      "test_summary_cache_lookup[miss]": 7.193702095574013e-05,
+      "test_composer_map_cache_lookup[hit]": 7.151645086519804e-05,
+      "test_composer_map_cache_lookup[miss]": 7.112598943352091e-05,
+      "test_fingerprint_workspace_entries[10]": 0.0024127972424549185,
+      "test_fingerprint_workspace_entries[50]": 0.010196820941858245,
+      "test_fingerprint_workspace_entries[200]": 0.029070524094341035,
+      "test_summary_cache_round_trip": 0.0004703680658560554,
+      "test_tab_summary_cache_lookup[hit]": 7.844850562859133e-05,
+      "test_tab_summary_cache_lookup[miss]": 7.843399021512e-05
     }
   }
 }
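
The regression script looks benchmarks up by name, while `baselines.json` nests them under `groups`. The body of `load_baseline_means` is not part of this diff, so the following is only a guess at the shape of that flattening step, written as a standalone function with the hypothetical name `flatten_groups`. The sample values are copied from the file above.

```python
import json


def flatten_groups(doc: dict) -> dict[str, float]:
    """Collapse {"groups": {group: {name: mean}}} into {name: mean}.

    Assumption: benchmark names are unique across groups. If two groups
    shared a name, the later one would silently win in this sketch.
    """
    means: dict[str, float] = {}
    for entries in doc.get("groups", {}).values():
        for name, mean in entries.items():
            means[name] = float(mean)
    return means


sample = json.loads(
    """
    {
      "machine": "Linux",
      "groups": {
        "export": {"test_post_export_zip[composers-10]": 0.010621589857140498},
        "search": {"test_search_full_corpus_indexed": 0.05512249660713918}
      }
    }
    """
)
flat_sample = flatten_groups(sample)
print(sorted(flat_sample))
```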

scripts/check_benchmark_regression.py

Lines changed: 53 additions & 4 deletions
@@ -4,10 +4,24 @@

 import argparse
 import json
+import math
 import sys
 from pathlib import Path

 THRESHOLD = 1.20
+STALE_FLOOR = 0.50
+
+# Benchmarks recorded in baselines.json but excluded from the regression gate.
+# Use sparingly — only for benches whose timing is inherently noisy across CI runs
+# (e.g. file I/O operations that depend on OS page-cache state).
+EXCLUDED_FROM_GATE: frozenset[str] = frozenset(
+    {
+        # round_trip calls set_cached_projects (file write) + get_cached_projects (file read)
+        # each round. OS page-cache state on shared runners causes 3–5x variation between
+        # consecutive CI runs, making this ungatable with any reasonable slack.
+        "test_summary_cache_round_trip",
+    }
+)


 class BenchmarkDataError(ValueError):
class BenchmarkDataError(ValueError):
@@ -97,19 +111,35 @@ def load_baseline_means(baselines_path: str | Path) -> dict[str, float]:
97111
return means
98112

99113

114+
def _validate_gate_ratios(threshold: float, stale_floor: float) -> None:
115+
if not math.isfinite(threshold):
116+
raise BenchmarkDataError("threshold must be finite")
117+
if threshold <= 1:
118+
raise BenchmarkDataError("threshold must be greater than 1")
119+
if not math.isfinite(stale_floor):
120+
raise BenchmarkDataError("stale_floor must be finite")
121+
if not 0 < stale_floor < 1:
122+
raise BenchmarkDataError("stale_floor must be between 0 and 1 (exclusive)")
123+
124+
100125
def check_regression(
101126
results_path: str | Path,
102127
baselines_path: str | Path,
103128
*,
104129
threshold: float = THRESHOLD,
130+
stale_floor: float = STALE_FLOOR,
105131
) -> int:
106-
"""Return 0 when within threshold; 1 when any gated benchmark regresses."""
132+
"""Return 0 when within threshold; 1 when any gated benchmark regresses or is stale."""
133+
_validate_gate_ratios(threshold, stale_floor)
107134
flat = load_results(results_path)
108135
baseline_means = load_baseline_means(baselines_path)
109136

110137
failures: list[str] = []
138+
stale: list[str] = []
111139
missing: list[str] = []
112140
for name, base in baseline_means.items():
141+
if name in EXCLUDED_FROM_GATE:
142+
continue
113143
cur = flat.get(name)
114144
if cur is None:
115145
print(f"FAIL: no current result for gated baseline {name!r}")
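
One plausible reason for the explicit `math.isfinite` check on `threshold` is that a range comparison alone does not reject NaN or infinity: every ordered comparison with NaN is false, and no finite ratio exceeds infinity. Since `argparse` with `type=float` accepts the strings `nan` and `inf`, such a value could otherwise disable the gate silently. The sketch below illustrates the difference; the helper names are hypothetical and not taken from the script.

```python
import math


def naive_threshold_ok(threshold: float) -> bool:
    """Range check only: mirrors `if threshold <= 1: raise ...`."""
    return not (threshold <= 1)


def strict_threshold_ok(threshold: float) -> bool:
    """Finite check first, then the range check."""
    return math.isfinite(threshold) and threshold > 1


def regresses(ratio: float, threshold: float) -> bool:
    """The comparison the gate uses to flag a regression."""
    return ratio > threshold


nan = float("nan")
inf = float("inf")

# NaN and infinity both slip past the naive range check...
print(naive_threshold_ok(nan), naive_threshold_ok(inf))
# ...and with either value, even a 100x slowdown is never flagged.
print(regresses(100.0, nan), regresses(100.0, inf))
# The finite check rejects both up front.
print(strict_threshold_ok(nan), strict_threshold_ok(inf))
```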
@@ -119,20 +149,32 @@ def check_regression(
             print(f"WARN: baseline for {name!r} is zero; skipping ratio check")
             continue
         ratio = cur / base
-        tag = "FAIL" if ratio > threshold else "ok"
-        print(f"[{tag}] {name}: {cur:.6f}s vs {base:.6f}s ({ratio:.2f}x)")
         if ratio > threshold:
+            tag = "FAIL"
             failures.append(name)
+        elif ratio < stale_floor:
+            tag = "STALE"
+            stale.append(name)
+        else:
+            tag = "ok"
+        print(f"[{tag}] {name}: {cur:.6f}s vs {base:.6f}s ({ratio:.2f}x)")

     for name in flat:
+        if name in EXCLUDED_FROM_GATE:
+            continue
         if name not in baseline_means:
             print(f"WARN: {name!r} has no baseline yet; not gated")

     if failures:
         print(f"\nREGRESSION: {len(failures)} benchmark(s) exceeded {threshold:.0%}")
+    if stale:
+        print(
+            f"\nSTALE: {len(stale)} benchmark(s) are faster than {stale_floor:.0%} of baseline "
+            "(refresh baselines after intentional speedups)"
+        )
     if missing:
         print(f"\nMISSING: {len(missing)} gated benchmark(s) absent from current results")
-    if failures or missing:
+    if failures or stale or missing:
         return 1
     return 0

@@ -147,12 +189,19 @@ def main(argv: list[str] | None = None) -> int:
         default=THRESHOLD,
         help="fail when current mean exceeds baseline by more than this ratio (default: 1.20)",
     )
+    parser.add_argument(
+        "--stale-floor",
+        type=float,
+        default=STALE_FLOOR,
+        help="fail when current mean is below this fraction of baseline (default: 0.50)",
+    )
     args = parser.parse_args(argv)
     try:
         return check_regression(
             args.results_path,
             args.baselines_path,
             threshold=args.threshold,
+            stale_floor=args.stale_floor,
         )
     except BenchmarkDataError as exc:
         print(f"ERROR: {exc}", file=sys.stderr)
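
The new flag is declared as `--stale-floor` but read back as `args.stale_floor`. This works because `argparse` derives the attribute name from the long option by stripping the leading dashes and converting the remaining dashes to underscores. A minimal standalone sketch, with a hypothetical `build_parser` helper that registers only this one flag:

```python
import argparse

STALE_FLOOR = 0.50  # same default as in check_benchmark_regression.py


def build_parser() -> argparse.ArgumentParser:
    """Parser with only the new flag; the real script defines more arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--stale-floor",
        type=float,
        default=STALE_FLOOR,
        help="fail when current mean is below this fraction of baseline",
    )
    return parser


default_args = build_parser().parse_args([])
custom_args = build_parser().parse_args(["--stale-floor", "0.4"])
print(default_args.stale_floor, custom_args.stale_floor)
```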
