
feat: canonical H2O coverage — q6/q8/q9 adapters + engine-only timing + DataFusion memtable fix #4

Open

ser-vasilich wants to merge 6 commits into RayforceDB:master from ser-vasilich:prototype

Conversation

@ser-vasilich (Contributor) commented May 15, 2026

Summary

Six commits bringing canonical H2O (h2oai/db-benchmark) coverage to rayforce-bench: engine-only timing for SQL adapters, rayforce wrappers for q6/q8/q9, fairness fixes across adapters, and dashboard polish.

Headline changes

Engine-only timing across SQL adapters (`20f915a`)

Replace `fetchall()` / IPC materialization with server-side draining or `CREATE TEMPORARY TABLE` patterns so each adapter is timed on engine work only, not on Arrow IPC / Python conversion. Affects DuckDB, chDB, DataFusion, QuestDB, and TimescaleDB; the pattern is sketched below.
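A minimal sketch of the timing pattern for the DuckDB adapter (the table, query, and row counts here are illustrative, not the benchmark's actual harness):

```python
import time
import duckdb

con = duckdb.connect()
con.execute(
    "CREATE TABLE g1 AS "
    "SELECT (range % 100)::VARCHAR AS id1, random() AS v1 FROM range(1000000)"
)
sql = "SELECT id1, sum(v1) AS v1 FROM g1 GROUP BY id1"

# Old timing: fetchall() folds Arrow IPC / Python conversion into the number.
t0 = time.perf_counter()
con.execute(sql).fetchall()
print("client-inclusive:", (time.perf_counter() - t0) * 1e3, "ms")

# Engine-only timing: CREATE TEMPORARY TABLE forces full execution
# server-side; nothing crosses the client boundary inside the timed region.
t0 = time.perf_counter()
con.execute(f"CREATE TEMPORARY TABLE _out AS {sql}")
print("engine-only:", (time.perf_counter() - t0) * 1e3, "ms")
con.execute("DROP TABLE _out")
```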

Rayforce q6 / q8 / q9 adapters (`a50ab48`, `611bcb3`, `626cd34`, `99ae025`)

  • q6 (median + stddev by id4,id5): native `Column.median()` + `Column.std()` via the new engine `OP_MEDIAN` and the existing stddev op
  • q8 (largest 2 v3 by id6): `Column.top(2)` via engine `OP_TOP_N`, then `OP_GROUP_TOPK_ROWFORM` (row-form emit, no LIST intermediate)
  • q9 (pearson² by id2,id4): two-stage adapter, sketched below: `Column.pearson_corr(...)` followed by arithmetic squaring; required because applying `** 2` at the top level would block the DAG hash-agg lowering
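
Illustrative shape of the q9 two-stage adapter. `Column.pearson_corr` is the API added in this PR; the `tbl.group_by(...)` / `g.col(...)` accessors are hypothetical stand-ins, not the real rayforce-bench surface:

```python
def run_groupby_q9(tbl):
    # Stage 1: per-(id2, id4) Pearson correlation, lowered to the engine's
    # DAG hash-agg.  (group_by/col are hypothetical accessors.)
    g = tbl.group_by("id2", "id4")
    r = g.col("v1").pearson_corr(g.col("v2"))
    # Stage 2: square with plain column arithmetic *after* the aggregation;
    # writing `... ** 2` at the top level would block the hash-agg lowering.
    return r * r
```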

Engine-side explode for q8 (raze + indexed gather) keeps the timed query in row form (200k rows), matching DuckDB's `ROW_NUMBER() OVER (PARTITION BY ...)` shape and the SQL adapters' default materialization.

DataFusion memtable fix (`eae3261`)

`register_csv` produced a listing table that re-parsed the CSV on every timed query (the page cache avoided disk I/O, but the parse cost remained). Replaced with `register_record_batches` after a one-shot `collect()`, making DataFusion apples-to-apples with duckdb/chdb/polars/pandas/rayforce, which all hold native columnar storage. q4 154→17 ms, q6 312→148 ms, q8 367→262 ms. A sketch of the fix follows.
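A minimal sketch of the fix with the datafusion Python bindings (the table name and CSV path are placeholders):

```python
import datafusion

ctx = datafusion.SessionContext()

# Before: a listing table that re-parses the CSV on every timed query.
ctx.register_csv("g1_csv", "data/G1_1e7_1e2.csv")

# One-shot materialization, outside the timed region: collect() returns
# pyarrow RecordBatches, which register_record_batches wraps in a MemTable.
batches = ctx.sql("SELECT * FROM g1_csv").collect()
ctx.register_record_batches("g1", [batches])

# Timed queries now scan in-memory columnar batches; no CSV parsing.
df = ctx.sql("SELECT id1, sum(v1) AS v1 FROM g1 GROUP BY id1")
```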

Dashboard / framework polish (multiple)

  • Canonical H2O suite (groupby q1..q10 + canonical-join q1..q5 + sort_single/sort_multi)
  • Bonus suite (3-key joins, full-row sorts) under separate `bench-bonus` target
  • Per-adapter QUERY_STRINGS shown on the compare panel
  • Scaling sweep with operations panel split into groupby/join/sort
  • Histogram split fast/heavy
  • `make check` — cross-adapter result equivalence at all sizes 10..10m (comparison core sketched after this list)
  • Bench snapshot refreshed after `OP_GROUP_TOPK_ROWFORM` (`d354496`)
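
The core of the equivalence check, sketched (the function name and frame handling are hypothetical; the tolerances are the ones `make check` reports):

```python
import numpy as np

def numeric_columns_match(got: np.ndarray, want: np.ndarray,
                          rtol: float = 1e-06, atol: float = 1e-09) -> bool:
    """Compare one adapter's sorted numeric output against the polars
    reference, NaN-safe, within the suite's tolerances."""
    if got.shape != want.shape:
        return False
    return bool(np.allclose(got, want, rtol=rtol, atol=atol, equal_nan=True))
```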

Perf snapshot (10M rows, k=100 cardinality, engine-only timing)

| query | rayforce | duckdb | datafusion | polars  |
|-------|---------:|-------:|-----------:|--------:|
| q1    | 5.5 ms   | 37 ms  | 17 ms      | 30 ms   |
| q4    | 10 ms    | 9 ms   | 19 ms      | 29 ms   |
| q6    | 121 ms   | 186 ms | 148 ms     | 254 ms  |
| q8    | 40 ms    | 157 ms | 258 ms     | 503 ms  |
| q9    | 49 ms    | 76 ms  | 72 ms      | 405 ms  |
| q10   | 164 ms   | 380 ms | 419 ms     | 1961 ms |

Rayforce wins 9 of the 10 groupby queries (q4 is within 1 ms of duckdb).

Related

  • Companion engine branch: rayforce#203 (needed to build; see test plan)

Test plan

  • `make check LOCAL=1` → `pass — 665/665 comparisons matched polars, 0 NYI (rtol=1e-06, atol=1e-09)` across all 7 sizes × all 19 ops × all 6 adapters
  • `make bench LOCAL=1` reproduces the perf numbers above
  • Reviewer: build with companion branches (`RAYFORCE_LOCAL_PATH` pointing at rayforce#203 checkout) and re-run `make check` + `make bench`

ser-vasilich and others added 6 commits May 10, 2026 19:30
…karounds

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
register_csv produces a listing table that re-parses CSV on every
timed query.  register_record_batches with the collected batches
caches the columnar layout in memory.  q4 154→17ms, q6 312→148ms,
q8 367→262ms — DataFusion now apples-to-apples with adapters that
hold native columnar storage.
q8's natural rayforce shape is 100k rows with LIST<F64>[2] cells —
duckdb's ROW_NUMBER() <= 2 SQL emits 200k exploded rows.  Timed bench
was unfair: rayforce skipped the row-materialisation cost that SQL
adapters pay.  Move the explode into the timed engine query
via raze + indexed gather (vectorised, no per-element lambda) so
both sides materialise 200k.  q8 163ms (100k rows) → 215ms (200k
rows) vs duckdb 198ms — ~apples-to-apples now.

Bundles the q9 two-stage adapter form already in the working tree.
run_groupby_q8's fast vectorised explode assumes K=2 everywhere (true
for canonical 10m k100, where every id6 group has ≥2 non-null v3).
Small check sizes (10..1m) hit groups with K=1 cells; the K=2-uniform
formula produces row-count mismatch.  Split: timed path keeps the
fast formula; materialize() reverts to a per-cell Python explode for
correctness across all check sizes.
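
An illustrative numpy sketch of the split; names are hypothetical, and the real timed path runs inside the engine via raze + indexed gather rather than in Python:

```python
import numpy as np

def explode_k2_uniform(ids: np.ndarray, top2: np.ndarray):
    # Timed path: assumes every group holds exactly K=2 values (true for the
    # canonical 10m/k100 data).  ids: (G,) group keys; top2: (G, 2) top values.
    return np.repeat(ids, 2), top2.ravel()

def explode_per_cell(ids, cells):
    # materialize() path: honours ragged groups (K=1 cells at small check
    # sizes) at the cost of a per-cell Python loop.
    out_ids, out_vals = [], []
    for gid, cell in zip(ids, cells):
        out_ids.extend([gid] * len(cell))
        out_vals.extend(cell)
    return np.asarray(out_ids), np.asarray(out_vals)
```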
