Skip to content

feat: canonical H2O groundwork — top/bot/pearson_corr/pow + 4 fixes#202

Open
ser-vasilich wants to merge 10 commits into
masterfrom
feat/canonical-h2o
Open

feat: canonical H2O groundwork — top/bot/pearson_corr/pow + 4 fixes#202
ser-vasilich wants to merge 10 commits into
masterfrom
feat/canonical-h2o

Conversation

@ser-vasilich
Copy link
Copy Markdown
Collaborator

@ser-vasilich ser-vasilich commented May 15, 2026

Summary

Foundational primitives and bug fixes required to run the canonical H2O benchmark suite. 9 commits. No perf work here — that lands in a follow-up PR against this branch.

Aggregate / arithmetic primitives

  • `feat(sort): add top / bot — partial top-N / bottom-N` (3a9d70f)
  • `feat(agg): add pearson_corr — Pearson correlation coefficient` (8f974a6)
  • `feat(arith): add pow as binary atomic` (a4bdba6)

Bug fixes uncovered while building the canonical H2O suite

  • `fix(collection): atom_eq RAY_LIST does structural compare, not memcmp on pointers` (bf7d560) — composite-LIST group keys never matched
  • `fix(query): multi-key + non-agg routes through eval-level group` (bbd2c72)
  • `fix(query): nested aggregates in non-agg expr evaluate per group` (2647ea6) — q7 `(- (max v1) (min v2))` was collapsing globally
  • `fix(group): O(1) lookup + grow for LIST path` (3c6e5c0) + regression test (27a85ea) — >1024 unique LIST keys

Test + bench files

  • `bench(h2o): add q9 — pearson² per group (id2, id4)` (4d6d9ae)

Related

Companion PRs:

Test plan

  • `make test` passes (2406/2407, 1 skipped)
  • Reviewer: `make check LOCAL=1` in rayforce-bench → `pass — 665/665` (requires all 3 companion branches checked out)

… on pointers

For RAY_LIST a->type fell through atom_eq's default branch which does
memcmp on ray_data — i.e. on the ray_t** pointer array, not the
elements.  Two structurally-identical lists with different element
pointers (the common case after construction) compared not-equal,
silently breaking ray_group_fn / ray_dict / distinct fallback for any
code that built composite-list keys.  Concretely: (group (list (list
1 2) (list 1 2) (list 3 4))) returned three buckets instead of two,
and the eval-level multi-key group-by path (used for non-agg
expressions) put every row in its own group.

Add a RAY_LIST case that recurses element-wise.  Vector LIST keys
are still bounded by ngroups (caller-side).

Tests in test/test_atom.c cover:
  - basic same-shape compare across different pointers
  - mixed-type elements (i64 + f64 + str)
  - nested LIST-of-LIST
  - per-element null short-circuit
  - empty lists
  - sym-atom rows (the q6 multi-key composite-key shape)
Two related bugs blocked canonical H2O groupby queries on multi-key
by-clauses:

  1. The planner had a guard that rejected non-agg expressions with
     multi-key by outright (nyi error).  The eval-level multi-key
     path already implements grouping correctly — drop the guard and
     let the path take it.  Closes q6 (median + multi-key) and the
     multi-key shape of q7 (arith-of-aggregates).

  2. bind_col_slice resolved per-group slices via ray_at_fn, which
     boxes (typed-vec idx-vec) into a RAY_LIST of atoms.  desc/asc/
     take then refused with "type: desc expects a vector".  Slice
     directly via gather_by_idx for typed-vec + I64-idx-vec; fall
     back to ray_at_fn for LIST inputs and other shapes the gather
     kernel doesn't cover.  Unblocks q8 (per-group top-N via
     `(take (desc v) n)`).

Single-key (- (max v1) (min v2)) by id3 still broadcasts global —
that path goes through the DAG fast-scatter, not eval, and the
arith-of-aggregates handling there is a separate fix.

Updated test/test_lang.c::test_eval_select_by_multi_nonagg from
asserting the nyi error to asserting the new working behaviour, and
added test/rfl/integration/canonical_h2o.rfl with q6 / q7 (multi-key
shape) / q8 / atom_eq composite-key regressions.
(- (max v1) (min v2)) inside a single-key (by:) projection collapsed
both inner aggregates globally — the post-DAG scatter ran one full-
table ray_eval, got an atom (= global max - global min), then
broadcast that atom to every group cell.

The classifier expr_refs_row_column short-circuits on is_agg_expr
subtrees because aggregating a column collapses it to a scalar.
That's correct for `(max col)` standalone but masks "non-agg outer
+ agg inner" shapes from the row-alignment check, which then takes
the constant-broadcast branch.

Add expr_contains_agg (recursive walk) and route any non-agg expr
that contains an aggregate subexpression through
nonagg_eval_per_group_buf — the same per-group eval path the eval-
level multi-key fix uses.  Each nested agg now reduces inside its
group's slice.

Updates test_eval_select_by_nonagg_with_agg_subexpr from asserting
the old broadcast (m=[211,211]) to the canonical SQL/k semantic
(m=[91,121] for (+ 1 (sum p))), and adds a single-key q7 case in
test/rfl/integration/canonical_h2o.rfl.
Engine has sqrt/log/exp but no pow.  Needed for q9 (pearson_corr
manual reconstruction) and generally useful — closes the gap with
polars/numpy/pandas Column.pow().

Returns F64 regardless of input types; libm pow() handles fractional
exponents (e.g. (pow 2 0.5) → 1.41…).  Null in either operand
propagates to typed F64 null.  Registered as RAY_FN_ATOMIC so vec
broadcasts go through the existing per-element dispatch — no DAG
opcode yet (perf follow-up).

Tests in test/rfl/arith/pow.rfl cover atom/atom, vec/atom, atom/vec,
vec/vec, null propagation, the (pow x 2) ≡ (* x x) identity, the
pow-then-root round-trip, and type-error paths.
(top v n) returns the n largest values from v in descending order;
(bot v n) returns the n smallest in ascending order.  Per-group use
inside select closes q8 (largest-N per id6) without sorting the full
group:

  (select {top2: (top v 2) by: id6 from: t})

Implementation routes through topk_indices_single — the same
bounded-heap O(N log K) path that powers ray_topk_table, falling
back to ray_desc/ray_asc + take for STR/GUID/LIST/SYM and the n>=len
edge case.  Output type matches the input type.

Tests in test/rfl/arith/top_bot.rfl cover narrow ints (I16/I32/U8),
F64, negative values, n-edge cases (0, len, > len, negative), the
(top v 1) == (max v) and (top v len) == (desc v) identities, the
prefix invariant against full sort, per-group usage with single and
multi-key by-clauses, and the type-error path.
(pearson_corr x y) returns the Pearson correlation coefficient between
two numeric vectors of equal length:

  r = (n·Σxy − Σx·Σy) / sqrt((n·Σx² − Σx²)(n·Σy² − Σy²))

Single-pass formulation with F64 accumulators; nulls in either side
skip the row from BOTH sums (pairwise complete-case deletion, matching
polars / pandas pearson_corr default).  Returns F64 NaN when n < 2 or
either column has zero variance (correlation undefined).

Per-group usage routes through the eval-level scatter — the planner
sees a non-agg expression with column refs that collapses to a scalar
on the full table, and the non-row-aligned fallback re-runs per group.
This unblocks q9 of the canonical H2O benchmark:

  (select {r2: (pow (pearson_corr v1 v2) 2) by: {id2 id4} from: df})

Tests in test/rfl/agg/pearson_corr.rfl cover perfect ±1 cases, F64
return type, narrow integer coercion (I32/I16/U8), n<2 / zero-variance
NaN paths, symmetry r(x,y)==r(y,x), self-correlation == 1.0, the
|r| <= 1.0 bound, error paths (length / type / non-numeric), and the
canonical q9 single-key + multi-key shapes end-to-end.

First-pass implementation goes through collection_elem + as_f64 for
type-agnostic numeric reads — type-specialised inner loops are a perf
follow-up.
The LIST path in ray_group_fn historically capped unique groups at the
initial 1024-slot kblock with no resizing — once `ngroups >= max_groups`
it returned `error: limit`.  Multi-key non-agg select-by builds a
composite-LIST-of-LISTs key and routes through this path; on H2O K=100
datasets the cartesian product of two 100-cardinality keys reaches up
to 10k groups, so the cap fires on real workloads.

Add direct-call coverage:
  (group <2000 unique 2-element list keys>)
plus a select-shaped variant on a 1500-row table whose (k1, k2) pairs
are all unique.  Both fail with `error: limit` against the unfixed
cap; the fix in the next commit makes them pass.
ray_group_fn's RAY_LIST branch had two coupled limitations that bit
multi-key non-agg select-by on H2O-scale workloads:

  1. Hard 1024 cap on unique groups (`max_groups = n < 1024 ? n : 1024`,
     no resizing) — returned `error: limit` once exceeded.
  2. O(N²) linear scan: every row probed every existing group key via
     atom_eq.  At 10M rows × 10k groups this is ~10^11 atom_eq calls —
     8+ minutes per call before the cap was hit; now after lifting the
     cap it'd be that slow on real data.

Replace the linear scan with an open-addressed hash table on
atom_hash, mirroring the scalar / RAY_GUID paths.  atom_hash is a new
helper that walks an atom recursively and produces a hash consistent
with atom_eq's structural compare — composite multi-key composites
hash by combining their cell hashes via ray_hash_combine, so two
[A, 7] composites collide and the equality check on the slot
disambiguates.

Existing patterns this aligns with:
  - ray_hash_* from ops/hash.h (wyhash) — same as pivot.c, datalog.c,
    join.c, collection.c::hs_hash_row.
  - group_ht_t open-addressing — same shape as the GUID and scalar
    paths in the same function (group_ht_init / _grow / _free, GHT_EMPTY
    sentinel, load factor 0.5 grow trigger).
  - group_grow_listkeys mirrors group_grow but also resizes the
    ray_t* keys block; replaces the previous limit-error.

Note collection.c::hs_hash_row's RAY_LIST branch handles atom kinds at
one level only — its default case folds nested-list rows to the same
hash, so distinct/intersect over list-of-lists is also degenerate.
That's outside this fix's scope; this commit only changes ray_group_fn.

Measured on bench/h2o/q9.rfl (G1_1e7_1e2, 10M rows, by {id2 id4} →
10000 groups, pearson_corr v1 v2):

  pre-fix (cap):       error: limit after 2.6s
  pre-fix (cap-grow):  484.7s per query (linear scan)
  this fix (HT):         4.8s per query — ~100× faster

Makes the test added in 27a85ea pass.
Canonical H2O benchmark q9: regression metric per (id2, id4) group.
Polars reference:
  df.groupby(["id2","id4"]).agg((pl.pearson_corr("v1","v2")**2).alias("r2"))

Engine equivalent:
  (select {r2: (pow (pearson_corr v1 v2) 2) by: {id2: id2 id4: id4}
           from: df})

Closes the q9 gap in REQUIREMENTS_CANONICAL_H2O.md — needed
pearson_corr (added 8f974a6), pow (a4bdba6), and the group LIST-
path hash fix (3c6e5c0) to run end-to-end on the K=100 dataset
(100×100 = 10k unique groups exceeds the historical 1024 cap).

Same harness shape as q1/q2/q3/q5/q7: 3 warmup iterations, 5 timed
runs via timeit, exit.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant