Optimize GroupValuesPrimitive::intern with vectorized hashing and inline values #21344
Dandandan wants to merge 1 commit into apache:main from optimize-primitive-group-values-intern
Conversation
Commit: Optimize GroupValuesPrimitive::intern with vectorized hashing and inline values

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
run benchmarks

🤖 Benchmark running (GKE): comparing optimize-primitive-group-values-intern (05b874d) to 1e93a67 (merge-base) using clickbench_partitioned

🤖 Benchmark running (GKE): comparing optimize-primitive-group-values-intern (05b874d) to 1e93a67 (merge-base) using tpcds

🤖 Benchmark running (GKE): comparing optimize-primitive-group-values-intern (05b874d) to 1e93a67 (merge-base) using tpch
🤖 Benchmark completed (GKE): tpch, base (merge-base) vs branch (results collapsed)

🤖 Benchmark completed (GKE): clickbench_partitioned, base (merge-base) vs branch (results collapsed)

🤖 Benchmark completed (GKE): tpcds, base (merge-base) vs branch (results collapsed)
run benchmark clickbench_extended clickbench

run benchmark tpch_mem

🤖 Benchmark running (GKE): comparing optimize-primitive-group-values-intern (05b874d) to 1e93a67 (merge-base) using clickbench_extended

🤖 Criterion benchmark running (GKE): comparing optimize-primitive-group-values-intern (05b874d) to 1e93a67 (merge-base)

🤖 Benchmark running (GKE): comparing optimize-primitive-group-values-intern (05b874d) to 1e93a67 (merge-base) using tpch_mem

Benchmark for this request failed. Last 20 lines of output: (collapsed)
🤖 Benchmark completed (GKE): tpch_mem, base (merge-base) vs branch (results collapsed)

🤖 Benchmark completed (GKE): clickbench_extended, base (merge-base) vs branch (results collapsed)
Which issue does this PR close?
Related to #15961
Rationale for this change
Profiling SELECT COUNT(DISTINCT "UserID") FROM hits (ClickBench) showed GroupValuesPrimitive::intern as a hot spot, with hashbrown::raw::RawTable::reserve_rehash and GroupValuesPrimitive::intern dominating the flamegraph.

What changes are included in this PR?
Two optimizations for the single-column primitive GROUP BY hot path:

1. Vectorized hashing: Split intern into two phases — batch hash computation via with_hashes (tight loop, better CPU pipelining) followed by hash table probing with pre-computed hashes. The original code interleaved hash computation with hash table probing on every row, preventing the CPU from pipelining the hash computation.
2. Inline values in hash table: Store the actual value in each hash table entry as (usize, T::Native) instead of (usize, u64) with an indirect lookup into a separate values vec. This eliminates one cache miss per probe (no pointer chase from hash table entry → values array) and removes the need to store the hash — the value can be rehashed from the inline copy when needed (rare, only during table growth).

Are these changes tested?
Existing tests cover this code path.
Are there any user-facing changes?
No, this is a performance optimization only.
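The two optimizations above can be seen together in a simplified, std-only sketch. This is hypothetical illustration, not the DataFusion implementation: the real code works over hashbrown's raw table and generic Arrow primitive types, while the `PrimitiveGroups` name, the linear-probing scheme, and the i64 specialization here are assumptions made for brevity.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

struct PrimitiveGroups {
    // Each occupied slot stores (group index, value) inline, so a probe
    // compares against the value without a lookup into a separate vec.
    slots: Vec<Option<(usize, i64)>>,
    len: usize,
    hasher: RandomState,
}

impl PrimitiveGroups {
    fn new() -> Self {
        Self { slots: vec![None; 16], len: 0, hasher: RandomState::new() }
    }

    fn hash_value(&self, v: i64) -> u64 {
        let mut h = self.hasher.build_hasher();
        h.write_i64(v);
        h.finish()
    }

    /// Maps each input value to a group id, creating groups as needed.
    fn intern(&mut self, values: &[i64]) -> Vec<usize> {
        // Phase 1: vectorized hashing. One tight loop with no table
        // probes in between, so the CPU can pipeline the hash work.
        let hashes: Vec<u64> =
            values.iter().map(|&v| self.hash_value(v)).collect();

        // Phase 2: probe the table with the pre-computed hashes.
        let mut groups = Vec::with_capacity(values.len());
        for (&v, &h) in values.iter().zip(&hashes) {
            if self.len * 2 >= self.slots.len() {
                self.grow();
            }
            let mask = self.slots.len() - 1;
            let mut idx = (h as usize) & mask;
            loop {
                match self.slots[idx] {
                    // Inline value: equality check with no pointer chase.
                    Some((g, stored)) if stored == v => {
                        groups.push(g);
                        break;
                    }
                    Some(_) => idx = (idx + 1) & mask, // linear probing
                    None => {
                        let g = self.len;
                        self.slots[idx] = Some((g, v));
                        self.len += 1;
                        groups.push(g);
                        break;
                    }
                }
            }
        }
        groups
    }

    /// On growth, entries are re-hashed from the inline value,
    /// so no per-entry hash needs to be stored.
    fn grow(&mut self) {
        let new_cap = self.slots.len() * 2;
        let old = std::mem::replace(&mut self.slots, vec![None; new_cap]);
        let mask = new_cap - 1;
        for (g, v) in old.into_iter().flatten() {
            let mut idx = (self.hash_value(v) as usize) & mask;
            while self.slots[idx].is_some() {
                idx = (idx + 1) & mask;
            }
            self.slots[idx] = Some((g, v));
        }
    }
}

fn main() {
    let mut groups = PrimitiveGroups::new();
    let ids = groups.intern(&[10, 20, 10, 30, 20, 10]);
    assert_eq!(ids, vec![0, 1, 0, 2, 1, 0]); // equal values share a group id
    assert_eq!(groups.len, 3); // three distinct groups
    println!("{:?}", ids);
}
```

The trade-off the sketch illustrates: growth (and hence re-hashing from the inline copy) happens rarely and is amortized, while the inline value saves a cache miss on every single probe.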
🤖 Generated with Claude Code