feat: canonical H2O groundwork — top/bot/pearson_corr/pow + 4 fixes by ser-vasilich · Pull Request #202 · RayforceDB/rayforce

ser-vasilich · 2026-05-15T20:06:02Z

Summary

Foundational primitives and bug fixes required to run the canonical H2O benchmark suite. 9 commits. No perf work here — that lands in a follow-up PR against this branch.

Aggregate / arithmetic primitives

`feat(sort): add top / bot — partial top-N / bottom-N` (3a9d70f)
`feat(agg): add pearson_corr — Pearson correlation coefficient` (8f974a6)
`feat(arith): add pow as binary atomic` (a4bdba6)

Bug fixes uncovered while building the canonical H2O suite

`fix(collection): atom_eq RAY_LIST does structural compare, not memcmp on pointers` (bf7d560) — composite-LIST group keys never matched
`fix(query): multi-key + non-agg routes through eval-level group` (bbd2c72)
`fix(query): nested aggregates in non-agg expr evaluate per group` (2647ea6) — q7 `(- (max v1) (min v2))` was collapsing globally
`fix(group): O(1) lookup + grow for LIST path` (3c6e5c0) + regression test (27a85ea) — >1024 unique LIST keys

Test + bench files

`bench(h2o): add q9 — pearson² per group (id2, id4)` (4d6d9ae)

feat(types): canonical H2O column wrappers — top/bot/pearson_corr/std/__pow__ rayforce-py#12 — Python wrappers (`.top`, `.bot`, `.pearson_corr`, `.std`, `.median`, `pow`)
Follow-up engine PR (`feat/canonical-h2o-median` → this branch) — perf work for q6/q8/q9 (OP_MEDIAN, OP_PEARSON_CORR, OP_TOP_N, OP_GROUP_TOPK_ROWFORM)
RayforceDB/rayforce-bench PR — bench adapter using these primitives

Test plan

`make test` passes (2406/2407, 1 skipped)
Reviewer: `make check LOCAL=1` in rayforce-bench → `pass — 665/665` (requires all 3 companion branches checked out)

… on pointers For RAY_LIST a->type fell through atom_eq's default branch which does memcmp on ray_data — i.e. on the ray_t** pointer array, not the elements. Two structurally-identical lists with different element pointers (the common case after construction) compared not-equal, silently breaking ray_group_fn / ray_dict / distinct fallback for any code that built composite-list keys. Concretely: (group (list (list 1 2) (list 1 2) (list 3 4))) returned three buckets instead of two, and the eval-level multi-key group-by path (used for non-agg expressions) put every row in its own group. Add a RAY_LIST case that recurses element-wise. Vector LIST keys are still bounded by ngroups (caller-side). Tests in test/test_atom.c cover: - basic same-shape compare across different pointers - mixed-type elements (i64 + f64 + str) - nested LIST-of-LIST - per-element null short-circuit - empty lists - sym-atom rows (the q6 multi-key composite-key shape)

Two related bugs blocked canonical H2O groupby queries on multi-key by-clauses: 1. The planner had a guard that rejected non-agg expressions with multi-key by outright (nyi error). The eval-level multi-key path already implements grouping correctly — drop the guard and let the path take it. Closes q6 (median + multi-key) and the multi-key shape of q7 (arith-of-aggregates). 2. bind_col_slice resolved per-group slices via ray_at_fn, which boxes (typed-vec idx-vec) into a RAY_LIST of atoms. desc/asc/ take then refused with "type: desc expects a vector". Slice directly via gather_by_idx for typed-vec + I64-idx-vec; fall back to ray_at_fn for LIST inputs and other shapes the gather kernel doesn't cover. Unblocks q8 (per-group top-N via `(take (desc v) n)`). Single-key (- (max v1) (min v2)) by id3 still broadcasts global — that path goes through the DAG fast-scatter, not eval, and the arith-of-aggregates handling there is a separate fix. Updated test/test_lang.c::test_eval_select_by_multi_nonagg from asserting the nyi error to asserting the new working behaviour, and added test/rfl/integration/canonical_h2o.rfl with q6 / q7 (multi-key shape) / q8 / atom_eq composite-key regressions.

(- (max v1) (min v2)) inside a single-key (by:) projection collapsed both inner aggregates globally — the post-DAG scatter ran one full- table ray_eval, got an atom (= global max - global min), then broadcast that atom to every group cell. The classifier expr_refs_row_column short-circuits on is_agg_expr subtrees because aggregating a column collapses it to a scalar. That's correct for `(max col)` standalone but masks "non-agg outer + agg inner" shapes from the row-alignment check, which then takes the constant-broadcast branch. Add expr_contains_agg (recursive walk) and route any non-agg expr that contains an aggregate subexpression through nonagg_eval_per_group_buf — the same per-group eval path the eval- level multi-key fix uses. Each nested agg now reduces inside its group's slice. Updates test_eval_select_by_nonagg_with_agg_subexpr from asserting the old broadcast (m=[211,211]) to the canonical SQL/k semantic (m=[91,121] for (+ 1 (sum p))), and adds a single-key q7 case in test/rfl/integration/canonical_h2o.rfl.

Engine has sqrt/log/exp but no pow. Needed for q9 (pearson_corr manual reconstruction) and generally useful — closes the gap with polars/numpy/pandas Column.pow(). Returns F64 regardless of input types; libm pow() handles fractional exponents (e.g. (pow 2 0.5) → 1.41…). Null in either operand propagates to typed F64 null. Registered as RAY_FN_ATOMIC so vec broadcasts go through the existing per-element dispatch — no DAG opcode yet (perf follow-up). Tests in test/rfl/arith/pow.rfl cover atom/atom, vec/atom, atom/vec, vec/vec, null propagation, the (pow x 2) ≡ (* x x) identity, the pow-then-root round-trip, and type-error paths.

(top v n) returns the n largest values from v in descending order; (bot v n) returns the n smallest in ascending order. Per-group use inside select closes q8 (largest-N per id6) without sorting the full group: (select {top2: (top v 2) by: id6 from: t}) Implementation routes through topk_indices_single — the same bounded-heap O(N log K) path that powers ray_topk_table, falling back to ray_desc/ray_asc + take for STR/GUID/LIST/SYM and the n>=len edge case. Output type matches the input type. Tests in test/rfl/arith/top_bot.rfl cover narrow ints (I16/I32/U8), F64, negative values, n-edge cases (0, len, > len, negative), the (top v 1) == (max v) and (top v len) == (desc v) identities, the prefix invariant against full sort, per-group usage with single and multi-key by-clauses, and the type-error path.

(pearson_corr x y) returns the Pearson correlation coefficient between two numeric vectors of equal length: r = (n·Σxy − Σx·Σy) / sqrt((n·Σx² − Σx²)(n·Σy² − Σy²)) Single-pass formulation with F64 accumulators; nulls in either side skip the row from BOTH sums (pairwise complete-case deletion, matching polars / pandas pearson_corr default). Returns F64 NaN when n < 2 or either column has zero variance (correlation undefined). Per-group usage routes through the eval-level scatter — the planner sees a non-agg expression with column refs that collapses to a scalar on the full table, and the non-row-aligned fallback re-runs per group. This unblocks q9 of the canonical H2O benchmark: (select {r2: (pow (pearson_corr v1 v2) 2) by: {id2 id4} from: df}) Tests in test/rfl/agg/pearson_corr.rfl cover perfect ±1 cases, F64 return type, narrow integer coercion (I32/I16/U8), n<2 / zero-variance NaN paths, symmetry r(x,y)==r(y,x), self-correlation == 1.0, the |r| <= 1.0 bound, error paths (length / type / non-numeric), and the canonical q9 single-key + multi-key shapes end-to-end. First-pass implementation goes through collection_elem + as_f64 for type-agnostic numeric reads — type-specialised inner loops are a perf follow-up.

The LIST path in ray_group_fn historically capped unique groups at the initial 1024-slot kblock with no resizing — once `ngroups >= max_groups` it returned `error: limit`. Multi-key non-agg select-by builds a composite-LIST-of-LISTs key and routes through this path; on H2O K=100 datasets the cartesian product of two 100-cardinality keys reaches up to 10k groups, so the cap fires on real workloads. Add direct-call coverage: (group <2000 unique 2-element list keys>) plus a select-shaped variant on a 1500-row table whose (k1, k2) pairs are all unique. Both fail with `error: limit` against the unfixed cap; the fix in the next commit makes them pass.

ray_group_fn's RAY_LIST branch had two coupled limitations that bit multi-key non-agg select-by on H2O-scale workloads: 1. Hard 1024 cap on unique groups (`max_groups = n < 1024 ? n : 1024`, no resizing) — returned `error: limit` once exceeded. 2. O(N²) linear scan: every row probed every existing group key via atom_eq. At 10M rows × 10k groups this is ~10^11 atom_eq calls — 8+ minutes per call before the cap was hit; now after lifting the cap it'd be that slow on real data. Replace the linear scan with an open-addressed hash table on atom_hash, mirroring the scalar / RAY_GUID paths. atom_hash is a new helper that walks an atom recursively and produces a hash consistent with atom_eq's structural compare — composite multi-key composites hash by combining their cell hashes via ray_hash_combine, so two [A, 7] composites collide and the equality check on the slot disambiguates. Existing patterns this aligns with: - ray_hash_* from ops/hash.h (wyhash) — same as pivot.c, datalog.c, join.c, collection.c::hs_hash_row. - group_ht_t open-addressing — same shape as the GUID and scalar paths in the same function (group_ht_init / _grow / _free, GHT_EMPTY sentinel, load factor 0.5 grow trigger). - group_grow_listkeys mirrors group_grow but also resizes the ray_t* keys block; replaces the previous limit-error. Note collection.c::hs_hash_row's RAY_LIST branch handles atom kinds at one level only — its default case folds nested-list rows to the same hash, so distinct/intersect over list-of-lists is also degenerate. That's outside this fix's scope; this commit only changes ray_group_fn. Measured on bench/h2o/q9.rfl (G1_1e7_1e2, 10M rows, by {id2 id4} → 10000 groups, pearson_corr v1 v2): pre-fix (cap): error: limit after 2.6s pre-fix (cap-grow): 484.7s per query (linear scan) this fix (HT): 4.8s per query — ~100× faster Makes the test added in 27a85ea pass.

Canonical H2O benchmark q9: regression metric per (id2, id4) group. Polars reference: df.groupby(["id2","id4"]).agg((pl.pearson_corr("v1","v2")**2).alias("r2")) Engine equivalent: (select {r2: (pow (pearson_corr v1 v2) 2) by: {id2: id2 id4: id4} from: df}) Closes the q9 gap in REQUIREMENTS_CANONICAL_H2O.md — needed pearson_corr (added 8f974a6), pow (a4bdba6), and the group LIST- path hash fix (3c6e5c0) to run end-to-end on the K=100 dataset (100×100 = 10k unique groups exceeds the historical 1024 cap). Same harness shape as q1/q2/q3/q5/q7: 3 warmup iterations, 5 timed runs via timeit, exit.

# Conflicts: # src/ops/builtins.c

ser-vasilich added 9 commits May 8, 2026 17:44

This was referenced May 15, 2026

perf: canonical H2O q6/q8/q9 — OP_MEDIAN, OP_PEARSON_CORR, OP_TOP_N, OP_GROUP_TOPK_ROWFORM #203

Open

feat: canonical H2O coverage — q6/q8/q9 adapters + engine-only timing + DataFusion memtable fix RayforceDB/rayforce-bench#4

Open

Merge remote-tracking branch 'origin/master' into feat/canonical-h2o

0ab4b93

# Conflicts: # src/ops/builtins.c

This was referenced May 15, 2026

feat(types): canonical H2O column wrappers — top/bot/pearson_corr/std/__pow__ RayforceDB/rayforce-py#11

Closed

feat(types): canonical H2O column wrappers — top/bot/pearson_corr/std/__pow__ RayforceDB/rayforce-py#12

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: canonical H2O groundwork — top/bot/pearson_corr/pow + 4 fixes#202

feat: canonical H2O groundwork — top/bot/pearson_corr/pow + 4 fixes#202
ser-vasilich wants to merge 10 commits into
masterfrom
feat/canonical-h2o

ser-vasilich commented May 15, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ser-vasilich commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Aggregate / arithmetic primitives

Bug fixes uncovered while building the canonical H2O suite

Test + bench files

Related

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ser-vasilich commented May 15, 2026 •

edited

Loading