perf: canonical H2O q6/q8/q9 — OP_MEDIAN, OP_PEARSON_CORR, OP_TOP_N, OP_GROUP_TOPK_ROWFORM#203

Open
ser-vasilich wants to merge 17 commits into feat/canonical-h2o from feat/canonical-h2o-median

Conversation


@ser-vasilich ser-vasilich commented May 15, 2026

Summary

17 commits closing the perf gap to DuckDB on canonical H2O queries q6 (median+stddev), q8 (top-2 per group), q9 (pearson²). New DAG opcodes + dedicated post-radix passes + supporting infra fixes.

Stacks on top of #202.

New opcodes (DAG-routed aggregates)

| query | opcode | algorithm |
|---|---|---|
| q9 (pearson_corr) | `OP_PEARSON_CORR` | vectorized single-pass hash-agg (n·Σxy − Σx·Σy) / sqrt(...) inline in radix HT |
| q6 (median) | `OP_MEDIAN` | post-radix bucket-scatter (idx_buf from row_gid + grp_cnt) + per-group parallel quickselect |
| q8 (top-N) | `OP_TOP_N` / `OP_BOT_N` | post-radix bounded-heap per group → LIST[K] cells |
| q8 (alternate fast path) | `OP_GROUP_TOPK_ROWFORM` | dedicated morsel-driven scatter + L2-hot per-partition HTs with inline K-slot heap → direct row-form emit (skips radix HT, idx_buf, LIST intermediate) |

Supporting fixes

  • `perf(raze): O(N) fast path for same-typed numeric vectors` (bed7f7a) — was O(N²) pairwise concat
  • `fix(group): per-group dispatch survives n_groups > 65536` (f2b5f15) — `ray_pool_dispatch_n` silently clamps tasks at MAX_RING_CAP=65536
  • `perf(group): cap histscat tasks at worker count` (9771a01) — hist matrix sized [n_tasks × n_groups] was 1GB+ for 100k groups; cap to ~n_workers

Perf numbers (10M rows, k=100 cardinality)

| query | rayforce before | rayforce after | duckdb | ratio |
|---|---|---|---|---|
| q6 | ~5550 ms (eval-fallback) | 120 ms | 178 ms | 1.5× faster |
| q8 | ~400 ms (eval-fallback) → 215 ms (DAG + explode) | 40 ms | 157 ms | 4× faster |
| q9 | ~7000 ms (eval-fallback) | 49 ms | 78 ms | 1.6× faster |

q8 win comes from the dedicated `OP_GROUP_TOPK_ROWFORM` operator: per-worker scatter into (worker, partition) buffers in phase 1; phase 2 builds L2-hot per-partition HTs (~400 entries each, fits L1) with inline K-slot bounded-heap and emits row-form directly. Skips the radix HT, idx_buf materialization, LIST[K] intermediate, and explode step. Pattern follows Siddiqui et al. VLDB 2024 ("Cache-Efficient Top-k Aggregation over High Cardinality Large Datasets").

Test plan

  • `make test` passes (2406/2407, 1 skipped, 0 failed)
  • Cross-adapter correctness via `make check LOCAL=1` → `pass — 665/665 comparisons matched polars, 0 NYI`
  • Bench numbers above reproduced with rayforce-bench harness at 10M k=100 (3 warmup + 7 iters)

Known limitations

  • Parted columns + OP_GROUP_TOPK_ROWFORM: the new row-form top/bot operator reads source columns via ray_data(col) assuming contiguous data. If col is parted (segments via .csv.parted / .csv.splayed loaders introduced in #5944632e), ray_data returns the segment-array pointer, not contiguous values. Planner gate currently does NOT check RAY_IS_PARTED(col->type) before routing to this opcode. Current bench (Table.from_csv regular path) is not affected. Follow-up needed before parted-column production use: either add a parted-aware fallback to OP_GROUP in the planner gate, or extend the kernel to iterate per-segment.

ser-vasilich and others added 17 commits May 16, 2026 01:25
Foundation only — group.c hash-agg path not yet implemented.  ray_group2
+ OP_PEARSON_CORR DAG nodes are emitted by the planner for `(select
(pearson_corr x y) by ...)` shapes, but exec_group will panic on the
unknown opcode until Phase B lands.

Files:
- src/ops/ops.h: OP_PEARSON_CORR=79, agg_ins2 field in OP_GROUP ext
- src/ops/internal.h: GHT_NEED_PEARSON, off_sum_y/off_sumsq_y/off_sumxy,
  agg_is_binary in ght_layout_t
- src/lang/eval.c: pearson_corr promoted to RAY_FN_AGGR | RAY_FN_LAZY_AWARE
- src/ops/graph.c: ray_pearson_corr DAG-builder, ray_group2 (variant
  accepting agg_ins2 sibling array), pointer-fixup for agg_ins2
- src/ops/query.c: resolve_agg_opcode("pearson_corr"); two planner sites
  collect agg_ins2 and dispatch to ray_group2 when any agg is binary
- src/ops/dump.c + test/test_dump.c: opcode name "PEARSON_CORR"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…CORR

Additive changes only — compiles cleanly, no behavioural impact for
existing code paths (no agg uses GHT_NEED_PEARSON yet because the
phase1 packing, accumulator update, and phase3 finalize sites are
still to-do).

- ght_compute_layout: detect OP_PEARSON_CORR via agg_ops, set
  agg_is_binary bit, allocate two consecutive agg_vals slots per binary
  agg (x at s, y at s+1), allocate off_sum_y/off_sumsq_y/off_sumxy
  blocks when GHT_NEED_PEARSON is set.
- ht_path ght_need computation: OP_PEARSON_CORR sets SUM | SUMSQ |
  PEARSON.

Remaining Phase B sites (chain is interdependent — must land together):
  * agg input resolution: read ext->agg_ins2[a] → agg_vecs2[a]
  * radix_phase1_ctx_t.agg_vecs2 + dispatch ctx plumbing
  * radix_phase1_fn + group_rows_range: pack y after x in entry agg_vals
  * init_accum_from_entry + accum_from_entry: write Σy, Σy², Σxy
  * radix phase3 finalize: OP_PEARSON_CORR arm → r = (n·Σxy − Σx·Σy) /
    sqrt((n·Σx² − (Σx)²)(n·Σy² − (Σy)²))
  * dense-array bypass: route OP_PEARSON_CORR → ht_path
  * exec.c scalar dispatch (n_keys=0) or lower to OP_GROUP

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires OP_PEARSON_CORR into the radix-partitioned + single-HT group-by
pipeline.  Single-pass two-moments formula matches ray_pearson_corr_fn
(see comment).  All 2406 existing tests pass; pearson_corr.rfl groupby
+ multi-key cases pass through the new opcode path.

Touch list:
- ght_compute_layout: detect OP_PEARSON_CORR via agg_ops, set
  agg_is_binary bit, reserve 2 consecutive agg_vals slots per binary
  agg (x at s, y at s+1); allocate off_sum_y/off_sumsq_y/off_sumxy
  blocks when GHT_NEED_PEARSON.
- ht_path ght_need: OP_PEARSON_CORR → SUM|SUMSQ|PEARSON.
- Agg input resolution: read ext->agg_ins2[a] via the same OP_SCAN /
  OP_CONST / expr_compile ladder used for the x-side.
- All 7 agg_vecs cleanup sites: release agg_vecs2[a] alongside.
- radix_phase1_ctx_t: new agg_vecs2 field, plumbed through both
  call sites + single-HT group_rows_range signature update.
- radix_phase1_fn + group_rows_range: pack y after x in entry agg_vals.
- init_accum_from_entry: seed Σy, Σy², Σxy (both f64 and i64 inputs).
- accum_from_entry: incremental update of Σy, Σy², Σxy in both branches.
- Radix phase-3 finalize: OP_PEARSON_CORR arm —
    r = (n·Σxy − Σx·Σy) / sqrt((n·Σx² − (Σx)²)(n·Σy² − (Σy)²))
  Emits NaN for n<2 or constant-side; canonicalize folds → null.
- Dense-array bypass: OP_PEARSON_CORR forces ht_path (da_accum_t
  doesn't have per-worker y-side state yet).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds OP_PEARSON_CORR to two more finalize sites missed in the earlier
Phase B pass: the single-HT (non-radix) path's per-group emit at
group.c:4915 and the two out_type switches at 4644/4861.  Without
these, the single-HT code path fell through to `default: v = 0.0`,
which is why `make check` saw r²=0 instead of 1.0 for groups where
n>=2 whenever the planner chose single-HT over radix.

Still WIP — q9 bench at 10m hasn't been re-run since this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dbl_inplace

Adds aggr_med_per_group_buf in query.c that recognises `(med col)` in
the eval-fallback path and replaces the per-group ray_at_fn slice +
ray_med_fn scratch allocations with a single reusable scratch buffer
(sized at max_grp_cnt) and an exported in-place quickselect helper
ray_median_dbl_inplace in agg.c.

Skips two ray-vector allocations per group; for q6's 10k-group case
the allocator savings dominate (median compute itself is O(n) and
unchanged).  Reverts to aggr_unary_per_group_buf for non-numeric
inputs (LIST/STR/etc).

OP_MEDIAN opcode + ray_median DAG-builder + prototype are added too,
but not yet wired into the planner — that's a follow-up if we want
median in the OP_GROUP fast path; for now `med` continues to land in
the eval-fallback streaming branch where the new fast path picks it up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the bucket-scatter median pattern from query.c:3582 into the
second non-agg eval site at line 4028.  Modest improvement on q6
(9023→7253ms on 10m); the dominant cost is now per-group random
access into the 80MB v3 column (10000 groups × ~1000 cache-missing
reads each).  Closing the gap with DuckDB needs a real bucket-scatter
OP_MEDIAN that materialises group values into contiguous memory
before quickselect — a separate epic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The [n_tasks × n_groups]-sized hist/cursor matrices and the serial
cumsum that walks them scale with the dispatch grain, not the worker
count.  With 10M rows × 100k groups (q8), the default 8K-morsel grain
inflated hist to ~1GB and the cumsum to ~120M cache-strided ops
(~1.4s).  Cap n_tasks at total_workers via ray_pool_dispatch_n; q8
1540ms→162ms, q6 241ms→121ms, both now faster than DuckDB.
ray_pool_dispatch_n silently clamps task count at MAX_RING_CAP=65536,
so per-group median/topk on >65k groups dropped the tail.  q8 at 10M
rows × 100k id6 groups returned 65536 cells instead of 100000.  Fall
back to elements-based ray_pool_dispatch above the cap (auto-grows
grain), keep dispatch_n below it (best parallelism for small per-group
work).
Pairwise concat loop was O(N²) — for 100k LIST<F64>[2] cells (q8
post-explode) it spent 2s allocating and copying cumulatively-sized
intermediates.  Pre-size one output vector and memcpy each item's
data when all inputs are same-typed fixed-width numerics with no
nulls; q8 explode 2200ms→52ms.
@ser-vasilich force-pushed the feat/canonical-h2o-median branch from 4e6d5f5 to c0e2605 on May 16, 2026 06:32