perf: canonical H2O q6/q8/q9 — OP_MEDIAN, OP_PEARSON_CORR, OP_TOP_N, OP_GROUP_TOPK_ROWFORM #203
Open
ser-vasilich wants to merge 17 commits into
Conversation
Foundation only — group.c hash-agg path not yet implemented. ray_group2
+ OP_PEARSON_CORR DAG nodes are emitted by the planner for `(select
(pearson_corr x y) by ...)` shapes, but exec_group will panic on the
unknown opcode until Phase B lands.
Files:
- src/ops/ops.h: OP_PEARSON_CORR=79, agg_ins2 field in OP_GROUP ext
- src/ops/internal.h: GHT_NEED_PEARSON, off_sum_y/off_sumsq_y/off_sumxy,
agg_is_binary in ght_layout_t
- src/lang/eval.c: pearson_corr promoted to RAY_FN_AGGR | RAY_FN_LAZY_AWARE
- src/ops/graph.c: ray_pearson_corr DAG-builder, ray_group2 (variant
accepting agg_ins2 sibling array), pointer-fixup for agg_ins2
- src/ops/query.c: resolve_agg_opcode("pearson_corr"); two planner sites
collect agg_ins2 and dispatch to ray_group2 when any agg is binary
- src/ops/dump.c + test/test_dump.c: opcode name "PEARSON_CORR"
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…CORR
Additive changes only — compiles cleanly, no behavioural impact for
existing code paths (no agg uses GHT_NEED_PEARSON yet because the
phase1 packing, accumulator update, and phase3 finalize sites are
still to-do).
- ght_compute_layout: detect OP_PEARSON_CORR via agg_ops, set
agg_is_binary bit, allocate two consecutive agg_vals slots per binary
agg (x at s, y at s+1), allocate off_sum_y/off_sumsq_y/off_sumxy
blocks when GHT_NEED_PEARSON is set.
- ht_path ght_need computation: OP_PEARSON_CORR sets SUM | SUMSQ |
PEARSON.
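For orientation, here is a minimal sketch of the two-slot reservation described above. The struct and field names are illustrative stand-ins, not the real ght_layout_t from src/ops/internal.h; only the OP_PEARSON_CORR value comes from ops.h.

```c
/* Hedged sketch of binary-agg slot reservation; names are illustrative,
 * not the real ght_layout_t from src/ops/internal.h. */
#include <stdint.h>

enum { OP_PEARSON_CORR = 79 };            /* opcode value per src/ops/ops.h */

typedef struct {
    int      n_aggs;
    int      agg_ops[16];                 /* aggregate opcode per output column */
    uint32_t agg_is_binary;               /* bit a set => agg a consumes (x, y) */
    int      agg_slot[16];                /* first agg_vals slot for agg a */
    int      n_slots;                     /* total agg_vals slots per hash entry */
} layout_sketch_t;

static void layout_agg_slots(layout_sketch_t *l)
{
    int s = 0;
    for (int a = 0; a < l->n_aggs; a++) {
        l->agg_slot[a] = s;
        if (l->agg_ops[a] == OP_PEARSON_CORR) {
            l->agg_is_binary |= 1u << a;  /* x packed at slot s, y at slot s+1 */
            s += 2;
        } else {
            s += 1;
        }
    }
    l->n_slots = s;
}
```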
Remaining Phase B sites (chain is interdependent — must land together):
* agg input resolution: read ext->agg_ins2[a] → agg_vecs2[a]
* radix_phase1_ctx_t.agg_vecs2 + dispatch ctx plumbing
* radix_phase1_fn + group_rows_range: pack y after x in entry agg_vals
* init_accum_from_entry + accum_from_entry: write Σy, Σy², Σxy
* radix phase3 finalize: OP_PEARSON_CORR arm → r = (n·Σxy − Σx·Σy) /
sqrt((n·Σx² − (Σx)²)(n·Σy² − (Σy)²))
* dense-array bypass: route OP_PEARSON_CORR → ht_path
* exec.c scalar dispatch (n_keys=0) or lower to OP_GROUP
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires OP_PEARSON_CORR into the radix-partitioned + single-HT group-by
pipeline. Single-pass two-moments formula matches ray_pearson_corr_fn
(see comment). All 2406 existing tests pass; pearson_corr.rfl groupby
+ multi-key cases pass through the new opcode path.
Touch list:
- ght_compute_layout: detect OP_PEARSON_CORR via agg_ops, set
agg_is_binary bit, reserve 2 consecutive agg_vals slots per binary
agg (x at s, y at s+1); allocate off_sum_y/off_sumsq_y/off_sumxy
blocks when GHT_NEED_PEARSON.
- ht_path ght_need: OP_PEARSON_CORR → SUM|SUMSQ|PEARSON.
- Agg input resolution: read ext->agg_ins2[a] via the same OP_SCAN /
OP_CONST / expr_compile ladder used for the x-side.
- All 7 agg_vecs cleanup sites: release agg_vecs2[a] alongside.
- radix_phase1_ctx_t: new agg_vecs2 field, plumbed through both
call sites + single-HT group_rows_range signature update.
- radix_phase1_fn + group_rows_range: pack y after x in entry agg_vals.
- init_accum_from_entry: seed Σy, Σy², Σxy (both f64 and i64 inputs).
- accum_from_entry: incremental update of Σy, Σy², Σxy in both branches.
- Radix phase-3 finalize: OP_PEARSON_CORR arm —
  r = (n·Σxy − Σx·Σy) / sqrt((n·Σx² − (Σx)²)(n·Σy² − (Σy)²))
  (see the sketch after this commit message).
  Emits NaN for n<2 or a constant side; canonicalize folds it to null.
- Dense-array bypass: OP_PEARSON_CORR forces ht_path (da_accum_t
doesn't have per-worker y-side state yet).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
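For reference, a minimal sketch of the finalize arithmetic in the phase-3 arm above, assuming the per-group moments are available as plain doubles — the accumulator struct here is hypothetical; in the real code the moments live at the ght layout offsets (off_sum_y/off_sumsq_y/off_sumxy).

```c
/* Hedged sketch of the single-pass Pearson finalize. The accumulator struct is
 * hypothetical; the real state lives at the ght layout offsets. */
#include <math.h>

typedef struct {
    double n, sum_x, sum_y, sumsq_x, sumsq_y, sum_xy;
} pearson_acc_sketch_t;

static double pearson_finalize(const pearson_acc_sketch_t *a)
{
    if (a->n < 2.0)
        return NAN;                                          /* n<2: NaN, folded to null */
    double cov  = a->n * a->sum_xy  - a->sum_x * a->sum_y;   /* n·Σxy − Σx·Σy */
    double varx = a->n * a->sumsq_x - a->sum_x * a->sum_x;   /* n·Σx² − (Σx)² */
    double vary = a->n * a->sumsq_y - a->sum_y * a->sum_y;   /* n·Σy² − (Σy)² */
    double den  = sqrt(varx * vary);
    if (den == 0.0)
        return NAN;                                          /* constant-side input */
    return cov / den;                                        /* q9 then squares this */
}
```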
Adds OP_PEARSON_CORR to two more finalize sites missed in the earlier Phase B pass: the single-HT (non-radix) path's per-group emit at group.c:4915 and the two out_type switches at 4644/4861. Without these, the single-HT code path falls through to `default: v = 0.0`, which is why `make check` saw r²=0 instead of 1.0 for groups where n>=2 but the planner chose single-HT over radix. Still WIP — the q9 bench at 10m hasn't been re-run since this commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dbl_inplace
Adds aggr_med_per_group_buf in query.c, which recognises `(med col)` in the eval-fallback path and replaces the per-group ray_at_fn slice + ray_med_fn scratch allocations with a single reusable scratch buffer (sized at max_grp_cnt) and an exported in-place quickselect helper, ray_median_dbl_inplace, in agg.c. This skips two ray-vector allocations per group; for q6's 10k-group case the allocator savings dominate (the median compute itself is O(n) and unchanged). Falls back to aggr_unary_per_group_buf for non-numeric inputs (LIST/STR/etc).
The OP_MEDIAN opcode + ray_median DAG-builder + prototype are added too, but not yet wired into the planner — that's a follow-up if we want median in the OP_GROUP fast path; for now `med` continues to land in the eval-fallback streaming branch, where the new fast path picks it up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
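For orientation, an in-place double-median quickselect could look roughly like this — a sketch only; the exported ray_median_dbl_inplace in agg.c may differ in signature and edge-case handling.

```c
/* Hedged sketch of an in-place median via Hoare-style quickselect, selecting
 * directly on the caller's scratch buffer with no extra allocation. */
#include <stddef.h>

static double quickselect_dbl(double *buf, ptrdiff_t n, ptrdiff_t k)
{
    ptrdiff_t lo = 0, hi = n - 1;
    while (lo < hi) {
        double pivot = buf[lo + (hi - lo) / 2];
        ptrdiff_t i = lo, j = hi;
        while (i <= j) {
            while (buf[i] < pivot) i++;
            while (buf[j] > pivot) j--;
            if (i <= j) {
                double t = buf[i]; buf[i] = buf[j]; buf[j] = t;
                i++; j--;
            }
        }
        if (k <= j)      hi = j;         /* k-th element is in the low partition */
        else if (k >= i) lo = i;         /* ... or in the high partition */
        else             return buf[k];  /* j < k < i: buf[k] already equals the pivot */
    }
    return buf[lo];
}

double median_dbl_sketch(double *buf, ptrdiff_t n)
{
    if (n <= 0) return 0.0;              /* caller decides the empty-group value */
    double upper = quickselect_dbl(buf, n, n / 2);
    if (n & 1) return upper;
    double lower = quickselect_dbl(buf, n, n / 2 - 1);
    return 0.5 * (lower + upper);
}
```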
Mirrors the bucket-scatter median pattern from query.c:3582 into the second non-agg eval site at line 4028. Modest improvement on q6 (9023→7253ms on 10m); the dominant cost is now per-group random access into the 80MB v3 column (10000 groups × ~1000 cache-missing reads each). Closing the gap with DuckDB needs a real bucket-scatter OP_MEDIAN that materialises group values into contiguous memory before quickselect — a separate epic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
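The bucket-scatter direction mentioned above could look roughly like this — a sketch of the follow-up idea, not code in this PR; median_dbl_sketch refers to the quickselect sketch above, and group ids are assumed dense in [0, n_groups).

```c
/* Hedged sketch of a bucket-scatter per-group median: counting-sort-style
 * scatter of each group's values into contiguous scratch, then quickselect. */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

double median_dbl_sketch(double *buf, ptrdiff_t n);   /* quickselect median, sketched above */

void median_per_group_scatter(const double *vals, const uint32_t *grp,
                              size_t n_rows, size_t n_groups, double *out)
{
    size_t *off     = calloc(n_groups + 1, sizeof *off);
    size_t *cur     = malloc(n_groups * sizeof *cur);
    double *scratch = malloc(n_rows * sizeof *scratch);

    for (size_t r = 0; r < n_rows; r++) off[grp[r] + 1]++;        /* group sizes */
    for (size_t g = 0; g < n_groups; g++) off[g + 1] += off[g];   /* prefix sums = offsets */
    for (size_t g = 0; g < n_groups; g++) cur[g] = off[g];
    for (size_t r = 0; r < n_rows; r++)                           /* one linear scatter pass */
        scratch[cur[grp[r]]++] = vals[r];

    for (size_t g = 0; g < n_groups; g++)                         /* contiguous per-group select */
        out[g] = median_dbl_sketch(scratch + off[g],
                                   (ptrdiff_t)(off[g + 1] - off[g]));

    free(off); free(cur); free(scratch);
}
```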
The [n_tasks * n_groups]-sized hist/cursor matrices and the serial cumsum that walks them scale with the dispatch grain, not the worker count. With 10M rows × 100k groups (q8), the default 8K-morsel grain inflated hist to ~1GB and the cumsum to ~120M cache-strided ops (~1.4s). Cap n_tasks at total_workers via ray_pool_dispatch_n; q8 1540ms→162ms, q6 241ms→121ms, both now faster than DuckDB.
ray_pool_dispatch_n silently clamps task count at MAX_RING_CAP=65536, so per-group median/topk on >65k groups dropped the tail. q8 at 10M rows × 100k id6 groups returned 65536 cells instead of 100000. Fall back to elements-based ray_pool_dispatch above the cap (auto-grows grain), keep dispatch_n below it (best parallelism for small per-group work).
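Taken together, the two fixes above amount to a task-count decision along these lines. The ray_pool_dispatch_n / ray_pool_dispatch names are from the source, but their prototypes and the wrapper below are assumptions for illustration.

```c
/* Hedged sketch: cap the explicit task count at the worker count, and fall
 * back to elements-based dispatch above the dispatcher's hard ring cap. */
#include <stddef.h>

/* Assumed prototypes — the real ones live in the thread-pool header and may differ. */
void ray_pool_dispatch_n(void (*fn)(void *ctx, size_t lo, size_t hi), void *ctx,
                         size_t n_elems, size_t n_tasks);
void ray_pool_dispatch(void (*fn)(void *ctx, size_t lo, size_t hi), void *ctx,
                       size_t n_elems);

#define MAX_RING_CAP 65536    /* hard task-count cap inside the dispatcher (per above) */

void dispatch_per_group(size_t n_groups, size_t total_workers,
                        void (*task_fn)(void *ctx, size_t lo, size_t hi), void *ctx)
{
    if (n_groups <= MAX_RING_CAP) {
        /* Explicit task count: one task per worker keeps the [n_tasks * n_groups]
         * hist/cursor matrices and the serial cumsum proportional to the worker
         * count, not the morsel grain. */
        size_t n_tasks = n_groups < total_workers ? n_groups : total_workers;
        ray_pool_dispatch_n(task_fn, ctx, n_groups, n_tasks);
    } else {
        /* Above the cap, elements-based dispatch auto-grows the grain instead of
         * silently truncating the task list at MAX_RING_CAP. */
        ray_pool_dispatch(task_fn, ctx, n_groups);
    }
}
```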
Pairwise concat loop was O(N²) — for 100k LIST<F64>[2] cells (q8 post-explode) it spent 2s allocating and copying cumulatively-sized intermediates. Pre-size one output vector and memcpy each item's data when all inputs are same-typed fixed-width numerics with no nulls; q8 explode 2200ms→52ms.
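Roughly, the fast path amounts to the following; the vector type and accessors are stand-ins, not the real ray vector API.

```c
/* Hedged sketch: linear-time concat for same-typed fixed-width numeric inputs
 * with no nulls — one pre-sized output plus a memcpy per input. */
#include <stdlib.h>
#include <string.h>

typedef struct { double *data; size_t len; } f64_vec_sketch_t;  /* stand-in vector type */

f64_vec_sketch_t concat_f64(const f64_vec_sketch_t *items, size_t n_items)
{
    size_t total = 0;
    for (size_t i = 0; i < n_items; i++)
        total += items[i].len;

    f64_vec_sketch_t out = { malloc(total * sizeof(double)), total };  /* one allocation */
    size_t off = 0;
    for (size_t i = 0; i < n_items; i++) {
        memcpy(out.data + off, items[i].data, items[i].len * sizeof(double));
        off += items[i].len;
    }
    return out;   /* O(total) copies instead of O(total²) pairwise intermediates */
}
```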
Summary
17 commits closing the perf gap to DuckDB on canonical H2O queries q6 (median+stddev), q8 (top-2 per group), q9 (pearson²). New DAG opcodes + dedicated post-radix passes + supporting infra fixes.
Stacks on top of #202.
New opcodes (DAG-routed aggregates)
Supporting fixes
Perf numbers (10M rows, k=100 cardinality)
q8 win comes from the dedicated `OP_GROUP_TOPK_ROWFORM` operator: per-worker scatter into (worker, partition) buffers in phase 1; phase 2 builds L2-hot per-partition HTs (~400 entries each, fits L1) with inline K-slot bounded-heap and emits row-form directly. Skips the radix HT, idx_buf materialization, LIST[K] intermediate, and explode step. Pattern follows Siddiqui et al. VLDB 2024 ("Cache-Efficient Top-k Aggregation over High Cardinality Large Datasets").
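A minimal sketch of the inline K-slot bounded heap mentioned above — for q8's K=2 a sorted-slot insertion is enough; the struct and names here are illustrative, not the operator's actual entry layout.

```c
/* Hedged sketch: K-slot bounded top-K kept inline in each HT entry.
 * Slots are held in descending order; insertion-sort shift per new candidate. */
#include <stdint.h>

#define TOPK_K 2    /* q8 asks for the top 2 values per group */

typedef struct {
    double  vals[TOPK_K];   /* kept in descending order */
    int64_t rows[TOPK_K];   /* matching row ids, for direct row-form emit */
    int     count;
} topk_slot_sketch_t;

static void topk_insert(topk_slot_sketch_t *s, double v, int64_t row)
{
    if (s->count == TOPK_K && v <= s->vals[TOPK_K - 1])
        return;                                   /* not in the current top K */
    int i = (s->count < TOPK_K) ? s->count++ : TOPK_K - 1;
    while (i > 0 && s->vals[i - 1] < v) {         /* shift smaller entries down */
        s->vals[i] = s->vals[i - 1];
        s->rows[i] = s->rows[i - 1];
        i--;
    }
    s->vals[i] = v;
    s->rows[i] = row;
}
```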
Related
Test plan
Known limitations
OP_GROUP_TOPK_ROWFORM: the new row-form top/bot operator reads source columns via `ray_data(col)`, assuming contiguous data. If `col` is parted (segments via the `.csv.parted`/`.csv.splayed` loaders introduced in #5944632e), `ray_data` returns the segment-array pointer, not contiguous values. The planner gate currently does NOT check `RAY_IS_PARTED(col->type)` before routing to this opcode. The current bench (`Table.from_csv` regular path) is not affected. Follow-up needed before parted-column production use: either add a parted-aware fallback to `OP_GROUP` in the planner gate, or extend the kernel to iterate per-segment.