Use BooleanArray::has_true/has_false for short-circuit checks #21337
Dandandan wants to merge 2 commits into apache:main
Replace `true_count() == 0`, `true_count() > 0`, `false_count() == 0`, and similar patterns with `has_true()`/`has_false()` where the actual count is not needed. These methods can short-circuit on the first matching value instead of counting all bits. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
In multi-branch CASE expressions, later branches commonly match no rows. Using has_true() avoids a full popcount in these cases by short-circuiting on the first true value found. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
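A rough sketch of the guard described in this commit message, not the actual `case.rs` code. `eval_case`, the `Vec<bool>` masks, and the free functions `has_true`/`true_count` are hypothetical stand-ins for the real arrow and DataFusion types. Building the `effective` mask here is itself a full pass, so the sketch only illustrates the control flow, not the performance gain.

```rust
/// Hypothetical stand-in for `BooleanArray::has_true`: stops at the first `true`.
fn has_true(mask: &[bool]) -> bool {
    mask.iter().any(|&b| b)
}

/// Hypothetical stand-in for `BooleanArray::true_count`: always scans everything.
fn true_count(mask: &[bool]) -> usize {
    mask.iter().filter(|&&b| b).count()
}

/// For each row, returns the index of the first WHEN branch whose mask is
/// true for that row, or `None` if no branch matched (the ELSE case).
/// Assumes every mask has exactly `num_rows` entries.
fn eval_case(num_rows: usize, when_masks: &[Vec<bool>]) -> Vec<Option<usize>> {
    let mut result = vec![None; num_rows];
    let mut remaining = num_rows;

    for (branch, mask) in when_masks.iter().enumerate() {
        if remaining == 0 {
            break; // every row is already claimed by an earlier branch
        }

        // Only rows not claimed by an earlier branch can match this one.
        let effective: Vec<bool> = mask
            .iter()
            .zip(&result)
            .map(|(&hit, claimed)| hit && claimed.is_none())
            .collect();

        // Fast guard: a branch that matches nothing is skipped without counting.
        if !has_true(&effective) {
            continue;
        }

        // The exact count is only computed for branches that matched something.
        remaining -= true_count(&effective);
        for (row, &hit) in effective.iter().enumerate() {
            if hit {
                result[row] = Some(branch);
            }
        }
    }
    result
}
```

With three branches where the second matches no rows, the second branch takes the `continue` path and `true_count` is never called for it.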
run benchmarks

🤖 Benchmark running (GKE): comparing has-true-replace (d53febf) to 1e93a67 (merge-base) using tpch
🤖 Benchmark running (GKE): comparing has-true-replace (d53febf) to 1e93a67 (merge-base) using tpcds
🤖 Benchmark running (GKE): comparing has-true-replace (d53febf) to 1e93a67 (merge-base) using clickbench_partitioned
🤖 Benchmark completed (GKE): tpch, base (merge-base) vs. branch
🤖 Benchmark completed (GKE): clickbench_partitioned, base (merge-base) vs. branch
🤖 Benchmark completed (GKE): tpcds, base (merge-base) vs. branch
Which issue does this PR close?
N/A - minor optimization
Rationale for this change
`BooleanArray::has_true()` and `has_false()` (added in arrow 58.1.0) can short-circuit on the first matching value instead of counting all set bits. This is more efficient when only checking for the presence or absence of true/false values.
What changes are included in this PR?
Replace `true_count() == 0`, `true_count() > 0`, `false_count() == 0`, `true_count() == len`, and `false_count() == len` patterns with `has_true()`/`has_false()` in 7 files where the actual count value is not needed:
- `nested_loop_join.rs` - 2 sites
- `array_has.rs` - 1 site
- `metadata.rs` - 5 sites
- `replace.rs` - 1 site
- `sort_merge_join/filter.rs` - 1 site
- `array_contains.rs` (spark) - 1 site
- `case.rs` - 2 sites: restructured CASE WHEN evaluation to use `has_true()` as a fast guard before `true_count()`. In multi-branch CASE expressions, later branches commonly match no rows, so short-circuiting avoids a full popcount in these cases.

Cases where the count is used as a value (arithmetic, passed to functions, etc.) are left unchanged.
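For reviewers, the rewrite rules can be illustrated on a simplified bit-packed mask. This is only a sketch: `Mask` and its `u64` word layout are hypothetical and not arrow's `BooleanArray` implementation, and nulls are not modeled at all. In a nullable array the `== len` forms also depend on how nulls are treated, which this sketch does not cover.

```rust
/// Hypothetical bit-packed boolean mask (no nulls); NOT arrow's `BooleanArray`.
struct Mask {
    words: Vec<u64>,
    len: usize,
}

impl Mask {
    fn from_bools(values: &[bool]) -> Self {
        let mut words = vec![0u64; (values.len() + 63) / 64];
        for (i, &v) in values.iter().enumerate() {
            if v {
                words[i / 64] |= 1u64 << (i % 64);
            }
        }
        Mask { words, len: values.len() }
    }

    /// Full popcount: always visits every word.
    fn true_count(&self) -> usize {
        self.words.iter().map(|w| w.count_ones() as usize).sum()
    }

    fn false_count(&self) -> usize {
        self.len - self.true_count()
    }

    /// Short-circuits at the first word with any bit set.
    fn has_true(&self) -> bool {
        self.words.iter().any(|&w| w != 0)
    }

    /// Short-circuits at the first word with a cleared bit inside the valid range.
    fn has_false(&self) -> bool {
        let full_words = self.len / 64;
        if self.words[..full_words].iter().any(|&w| w != u64::MAX) {
            return true;
        }
        // The last word may be only partially used; ignore its padding bits.
        let rem = self.len % 64;
        if rem > 0 {
            let valid = (1u64 << rem) - 1;
            return self.words[full_words] & valid != valid;
        }
        false
    }
}

// Equivalences used by the rewrite (null-free case only):
//   true_count() == 0      <=>  !has_true()
//   true_count() > 0       <=>   has_true()
//   false_count() == 0     <=>  !has_false()
//   true_count() == len    <=>  !has_false()
//   false_count() == len   <=>  !has_true()
```

The count-based forms always scan the whole mask, while the `has_*` forms can return as soon as one word settles the question.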
Are these changes tested?
Existing tests cover these code paths.
Are there any user-facing changes?
No.
🤖 Generated with Claude Code