perf: canonical H2O q6/q8/q9 — OP_MEDIAN, OP_PEARSON_CORR, OP_TOP_N, OP_GROUP_TOPK_ROWFORM#203

Open
ser-vasilich wants to merge 17 commits into feat/canonical-h2o from feat/canonical-h2o-median

Conversation


@ser-vasilich ser-vasilich commented May 15, 2026

Summary

17 commits closing the perf gap to DuckDB on canonical H2O queries q6 (median+stddev), q8 (top-2 per group), q9 (pearson²). New DAG opcodes + dedicated post-radix passes + supporting infra fixes.

Stacks on top of #202.

New opcodes (DAG-routed aggregates)

| query | opcode | algorithm |
|---|---|---|
| q9 (pearson_corr) | `OP_PEARSON_CORR` | vectorized single-pass hash-agg (n·Σxy − Σx·Σy) / sqrt(...) inline in radix HT |
| q6 (median) | `OP_MEDIAN` | post-radix bucket-scatter (idx_buf from row_gid + grp_cnt) + per-group parallel quickselect |
| q8 (top-N) | `OP_TOP_N` / `OP_BOT_N` | post-radix bounded-heap per group → LIST[K] cells |
| q8 (alternate fast path) | `OP_GROUP_TOPK_ROWFORM` | dedicated morsel-driven scatter + L2-hot per-partition HTs with inline K-slot heap → direct row-form emit (skips radix HT, idx_buf, LIST intermediate) |

Supporting fixes

  • `perf(raze): O(N) fast path for same-typed numeric vectors` (bed7f7a) — was O(N²) pairwise concat
  • `fix(group): per-group dispatch survives n_groups > 65536` (f2b5f15) — `ray_pool_dispatch_n` silently clamps tasks at MAX_RING_CAP=65536
  • `perf(group): cap histscat tasks at worker count` (9771a01) — hist matrix sized [n_tasks × n_groups] was 1GB+ for 100k groups; cap to ~n_workers

Perf numbers (10M rows, k=100 cardinality)

| query | rayforce before | rayforce after | duckdb | ratio |
|---|---|---|---|---|
| q6 | ~5550 ms (eval-fallback) | 120 ms | 178 ms | 1.5× faster |
| q8 | ~400 ms (eval-fallback) → 215 ms (DAG + explode) | 40 ms | 157 ms | 4× faster |
| q9 | ~7000 ms (eval-fallback) | 49 ms | 78 ms | 1.6× faster |

q8 win comes from the dedicated `OP_GROUP_TOPK_ROWFORM` operator: per-worker scatter into (worker, partition) buffers in phase 1; phase 2 builds L2-hot per-partition HTs (~400 entries each, fits L1) with inline K-slot bounded-heap and emits row-form directly. Skips the radix HT, idx_buf materialization, LIST[K] intermediate, and explode step. Pattern follows Siddiqui et al. VLDB 2024 ("Cache-Efficient Top-k Aggregation over High Cardinality Large Datasets").

Test plan

  • `make test` passes (2406/2407, 1 skipped, 0 failed)
  • Cross-adapter correctness via `make check LOCAL=1` → `pass — 665/665 comparisons matched polars, 0 NYI`
  • Bench numbers above reproduced with rayforce-bench harness at 10M k=100 (3 warmup + 7 iters)

Known limitations

  • Parted columns + OP_GROUP_TOPK_ROWFORM: the new row-form top/bot operator reads source columns via ray_data(col) assuming contiguous data. If col is parted (segments via .csv.parted / .csv.splayed loaders introduced in #5944632e), ray_data returns the segment-array pointer, not contiguous values. Planner gate currently does NOT check RAY_IS_PARTED(col->type) before routing to this opcode. Current bench (Table.from_csv regular path) is not affected. Follow-up needed before parted-column production use: either add a parted-aware fallback to OP_GROUP in the planner gate, or extend the kernel to iterate per-segment.

ser-vasilich and others added 17 commits May 16, 2026 01:25
Foundation only — group.c hash-agg path not yet implemented.  ray_group2
+ OP_PEARSON_CORR DAG nodes are emitted by the planner for `(select
(pearson_corr x y) by ...)` shapes, but exec_group will panic on the
unknown opcode until Phase B lands.

Files:
- src/ops/ops.h: OP_PEARSON_CORR=79, agg_ins2 field in OP_GROUP ext
- src/ops/internal.h: GHT_NEED_PEARSON, off_sum_y/off_sumsq_y/off_sumxy,
  agg_is_binary in ght_layout_t
- src/lang/eval.c: pearson_corr promoted to RAY_FN_AGGR | RAY_FN_LAZY_AWARE
- src/ops/graph.c: ray_pearson_corr DAG-builder, ray_group2 (variant
  accepting agg_ins2 sibling array), pointer-fixup for agg_ins2
- src/ops/query.c: resolve_agg_opcode("pearson_corr"); two planner sites
  collect agg_ins2 and dispatch to ray_group2 when any agg is binary
- src/ops/dump.c + test/test_dump.c: opcode name "PEARSON_CORR"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…CORR

Additive changes only — compiles cleanly, no behavioural impact for
existing code paths (no agg uses GHT_NEED_PEARSON yet because the
phase1 packing, accumulator update, and phase3 finalize sites are
still to-do).

- ght_compute_layout: detect OP_PEARSON_CORR via agg_ops, set
  agg_is_binary bit, allocate two consecutive agg_vals slots per binary
  agg (x at s, y at s+1), allocate off_sum_y/off_sumsq_y/off_sumxy
  blocks when GHT_NEED_PEARSON is set.
- ht_path ght_need computation: OP_PEARSON_CORR sets SUM | SUMSQ |
  PEARSON.

Remaining Phase B sites (chain is interdependent — must land together):
  * agg input resolution: read ext->agg_ins2[a] → agg_vecs2[a]
  * radix_phase1_ctx_t.agg_vecs2 + dispatch ctx plumbing
  * radix_phase1_fn + group_rows_range: pack y after x in entry agg_vals
  * init_accum_from_entry + accum_from_entry: write Σy, Σy², Σxy
  * radix phase3 finalize: OP_PEARSON_CORR arm → r = (n·Σxy − Σx·Σy) /
    sqrt((n·Σx² − (Σx)²)(n·Σy² − (Σy)²))
  * dense-array bypass: route OP_PEARSON_CORR → ht_path
  * exec.c scalar dispatch (n_keys=0) or lower to OP_GROUP

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires OP_PEARSON_CORR into the radix-partitioned + single-HT group-by
pipeline.  Single-pass two-moments formula matches ray_pearson_corr_fn
(see comment).  All 2406 existing tests pass; pearson_corr.rfl groupby
+ multi-key cases pass through the new opcode path.

Touch list:
- ght_compute_layout: detect OP_PEARSON_CORR via agg_ops, set
  agg_is_binary bit, reserve 2 consecutive agg_vals slots per binary
  agg (x at s, y at s+1); allocate off_sum_y/off_sumsq_y/off_sumxy
  blocks when GHT_NEED_PEARSON.
- ht_path ght_need: OP_PEARSON_CORR → SUM|SUMSQ|PEARSON.
- Agg input resolution: read ext->agg_ins2[a] via the same OP_SCAN /
  OP_CONST / expr_compile ladder used for the x-side.
- All 7 agg_vecs cleanup sites: release agg_vecs2[a] alongside.
- radix_phase1_ctx_t: new agg_vecs2 field, plumbed through both
  call sites + single-HT group_rows_range signature update.
- radix_phase1_fn + group_rows_range: pack y after x in entry agg_vals.
- init_accum_from_entry: seed Σy, Σy², Σxy (both f64 and i64 inputs).
- accum_from_entry: incremental update of Σy, Σy², Σxy in both branches.
- Radix phase-3 finalize: OP_PEARSON_CORR arm —
    r = (n·Σxy − Σx·Σy) / sqrt((n·Σx² − (Σx)²)(n·Σy² − (Σy)²))
  Emits NaN for n<2 or constant-side; canonicalize folds → null.
- Dense-array bypass: OP_PEARSON_CORR forces ht_path (da_accum_t
  doesn't have per-worker y-side state yet).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds OP_PEARSON_CORR to two more finalize sites missed in the earlier
Phase B pass: the single-HT (non-radix) path's per-group emit at
group.c:4915 and the two out_type switches at 4644/4861.  Without
these, the single-HT code path fell through to `default: v = 0.0`,
which is why `make check` saw r²=0 instead of 1.0 for groups where
n>=2 whenever the planner chose single-HT over radix.

Still WIP — q9 bench at 10m hasn't been re-run since this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dbl_inplace

Adds aggr_med_per_group_buf in query.c that recognises `(med col)` in
the eval-fallback path and replaces the per-group ray_at_fn slice +
ray_med_fn scratch allocations with a single reusable scratch buffer
(sized at max_grp_cnt) and an exported in-place quickselect helper
ray_median_dbl_inplace in agg.c.

Skips two ray-vector allocations per group; for q6's 10k-group case
the allocator savings dominate (median compute itself is O(n) and
unchanged).  Reverts to aggr_unary_per_group_buf for non-numeric
inputs (LIST/STR/etc).

OP_MEDIAN opcode + ray_median DAG-builder + prototype are added too,
but not yet wired into the planner — that's a follow-up if we want
median in the OP_GROUP fast path; for now `med` continues to land in
the eval-fallback streaming branch where the new fast path picks it up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the bucket-scatter median pattern from query.c:3582 into the
second non-agg eval site at line 4028.  Modest improvement on q6
(9023→7253ms on 10m); the dominant cost is now per-group random
access into the 80MB v3 column (10000 groups × ~1000 cache-missing
reads each).  Closing the gap with DuckDB needs a real bucket-scatter
OP_MEDIAN that materialises group values into contiguous memory
before quickselect — a separate epic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The [n_tasks × n_groups]-sized hist/cursor matrices and the serial
cumsum that walks them scale with the dispatch grain, not the worker
count.  With 10M rows × 100k groups (q8), the default 8K-morsel grain
inflated hist to ~1GB and the cumsum to ~120M cache-strided ops
(~1.4s).  Cap n_tasks at total_workers via ray_pool_dispatch_n; q8
1540ms→162ms, q6 241ms→121ms, both now faster than DuckDB.
ray_pool_dispatch_n silently clamps task count at MAX_RING_CAP=65536,
so per-group median/topk on >65k groups dropped the tail.  q8 at 10M
rows × 100k id6 groups returned 65536 cells instead of 100000.  Fall
back to elements-based ray_pool_dispatch above the cap (auto-grows
grain), keep dispatch_n below it (best parallelism for small per-group
work).
Pairwise concat loop was O(N²) — for 100k LIST<F64>[2] cells (q8
post-explode) it spent 2s allocating and copying cumulatively-sized
intermediates.  Pre-size one output vector and memcpy each item's
data when all inputs are same-typed fixed-width numerics with no
nulls; q8 explode 2200ms→52ms.
@ser-vasilich force-pushed the feat/canonical-h2o-median branch from 4e6d5f5 to c0e2605 on May 16, 2026 06:32