perf: sort-merge join (SMJ) batch deferred filtering and move mark joins to bitwise stream. Near-unique LEFT and FULL SMJ 20-50x faster by mbutrovich · Pull Request #21184 · apache/datafusion

mbutrovich · 2026-03-26T18:00:18Z

Which issue does this PR close?

Partially addresses #20910. Fixes #21197.

Rationale for this change

Sort-merge join with a filter on outer joins (LEFT/RIGHT/FULL) runs process_filtered_batches() on every key transition in the Init state. With near-unique keys (1:1 cardinality), this means running the full deferred filtering pipeline (concat + get_corrected_filter_mask + filter_record_batch_by_join_type) once per row — making filtered LEFT/RIGHT/FULL 55x slower than INNER for 10M unique keys.

Additionally, mark join logic in MaterializingSortMergeJoinStream materializes full (streamed, buffered) pairs only to discard most of them via get_corrected_filter_mask(). Mark joins are structurally identical to semi joins (one output row per outer row with a boolean result) and belong in BitwiseSortMergeJoinStream, which avoids pair materialization entirely using a per-outer-batch bitset.

What changes are included in this PR?

Three areas of improvement, building on the specialized semi/anti stream from #20806:

1. Move mark joins to BitwiseSortMergeJoinStream

Match on join type; emit_outer_batch() emits all rows with the match bitset as a boolean column (vs semi's filter / anti's invert-and-filter)
Route LeftMark/RightMark from SortMergeJoinExec::execute() to the bitwise stream
Remove all mark-specific logic from MaterializingSortMergeJoinStream (mark_row_as_match, is_not_null column generation, mark arms in filter correction)

2. Batch filter evaluation in freeze_streamed()

Split freeze_streamed() into null-joined classification + freeze_streamed_matched() for batched materialization
Collect indices across chunks, materialize left/right columns once using tiered Arrow kernels (slice → take → interleave)
Single RecordBatch construction and single expression.evaluate() per freeze instead of per chunk
Vectorize append_filter_metadata() using builder extend() instead of per-element loop

3. Batch deferred filtering in Init state (this is the big win for Q22 and Q23)

Gate process_filtered_batches() on accumulated rows >= batch_size instead of running on every Init entry
Accumulated data bounded to ~2×batch_size (one from freeze_dequeuing_buffered, one accumulating toward next freeze) — does not reintroduce unbounded buffering fixed by PR fix: SortMergeJoin don't wait for all input before emitting #20482
Exhausted state flushes any remainder

Cleanup:

Rename SortMergeJoinStream → MaterializingSortMergeJoinStream (materializes explicit row pairs for join output) and SemiAntiMarkSortMergeJoinStream → BitwiseSortMergeJoinStream (tracks matches via boolean bitset)
Consolidate semi_anti_mark_sort_merge_join/ into sort_merge_join/ as bitwise_stream.rs / bitwise_tests.rs; rename stream.rs → materializing_stream.rs and tests.rs → materializing_tests.rs
Consolidate SpillManager construction into SortMergeJoinExec::execute() (shared across both streams); move peak_mem_used gauge into BitwiseSortMergeJoinStream::try_new
MaterializingSortMergeJoinStream now handles only Inner/Left/Right/Full — all semi/anti/mark branching removed
get_corrected_filter_mask(): merge identical Left/Right/Full branches; add null-metadata passthrough for already-null-joined rows
filter_record_batch_by_join_type(): rewrite from filter(true) + filter(false) + concat to zip() for in-place null-joining — preserves row ordering and removes create_null_joined_batch() entirely; add early return for empty batches
filter_record_batch_by_join_type(): use compute::filter() directly on BooleanArray instead of wrapping in temporary RecordBatch

Benchmarks

cargo run --release --bin dfbench -- smj

Query	Join Type	Rows	Keys	Filter	Main (ms)	PR (ms)	Speedup
Q1	INNER	1M×1M	1:1	—	16.3	14.4	1.1x
Q2	INNER	1M×10M	1:10	—	117.4	120.1	1.0x
Q3	INNER	1M×1M	1:100	—	74.2	66.6	1.1x
Q4	INNER	1M×10M	1:10	1%	17.1	15.1	1.1x
Q5	INNER	1M×1M	1:100	10%	18.4	14.4	1.3x
Q6	LEFT	1M×10M	1:10	—	129.3	122.7	1.1x
Q7	LEFT	1M×10M	1:10	50%	150.2	142.2	1.1x
Q8	FULL	1M×1M	1:10	—	16.6	16.7	1.0x
Q9	FULL	1M×10M	1:10	10%	153.5	136.2	1.1x
Q10	LEFT SEMI	1M×10M	1:10	—	53.1	53.1	1.0x
Q11	LEFT SEMI	1M×10M	1:10	1%	15.5	14.7	1.1x
Q12	LEFT SEMI	1M×10M	1:10	50%	65.0	67.3	1.0x
Q13	LEFT SEMI	1M×10M	1:10	90%	105.7	109.8	1.0x
Q14	LEFT ANTI	1M×10M	1:10	—	54.3	53.9	1.0x
Q15	LEFT ANTI	1M×10M	1:10	partial	51.5	50.5	1.0x
Q16	LEFT ANTI	1M×1M	1:1	—	10.3	11.3	0.9x
Q17	INNER	1M×50M	1:50	5%	75.9	79.0	1.0x
Q18	LEFT SEMI	1M×50M	1:50	2%	50.2	49.0	1.0x
Q19	LEFT ANTI	1M×50M	1:50	partial	336.4	344.2	1.0x
Q20	INNER	1M×10M	1:100	GROUP BY	763.7	803.9	1.0x
Q21	INNER	10M×10M	1:1	50%	186.1	187.8	1.0x
Q22	LEFT	10M×10M	1:1	50%	10,193.8	185.8	54.9x
Q23	FULL	10M×10M	1:1	50%	10,194.7	233.6	43.6x
Q24	LEFT MARK	1M×10M	1:10	1%	FAILS	15.1	—
Q25	LEFT MARK	1M×10M	1:10	50%	FAILS	67.3	—
Q26	LEFT MARK	1M×10M	1:10	90%	FAILS	110.0	—

General workload (Q1-Q20, various join types/cardinalities/selectivities): no regressions.

Are these changes tested?

In addition to existing unit and sqllogictests:

I ran 50 iterations of the fuzz tests (modified to only test against hash join as the baseline because nested loop join takes too long) cargo test -p datafusion --features extended_tests --test fuzz -- join_fuzz
One new sqllogictest for bug: sort-merge join (SMJ) LeftMark join with join filter crashes on non-nullable columns #21197 that fails on main
Four new unit tests: three for full join with filter that spills
One new fuzz test to exercise full join with filter that spills
New benchmark queries Q21-Q23: 10M×10M unique keys with 50% join filter for INNER/LEFT/FULL — exercises the degenerate case this PR fixes
New benchmark queries Q24-Q26 duplicated Q11-Q13 but for Mark joins, showing that they have the same performance as other joins (LeftSemi) that use this stream

Are there any user-facing changes?

No.

mbutrovich · 2026-03-26T18:26:47Z

Tagging folks who had feedback on recent SMJ changes @comphead @rluvaton @stuhood. Thank you!

rluvaton · 2026-03-26T18:27:26Z

run benchmarks sort_merge_join

mbutrovich · 2026-03-26T18:29:37Z

run benchmarks sort_merge_join

Note that the 2 queries I expect a speedup on in the smj suite are new in this PR, so I don't think we'll see their performance against main. I had to hoist the benchmark to main and run it locally for the comparison in the PR description.

adriangbot · 2026-03-26T18:31:10Z

🤖 Criterion benchmark running (GKE) | trigger
Linux bench-c4137272954-565-stzm5 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux
Comparing simplify_smj_full_opt (1c1bec5) to ba399a8 (merge-base) diff
BENCH_NAME=sort_merge_join
BENCH_COMMAND=cargo bench --features=parquet --bench sort_merge_join
BENCH_FILTER=
Results will be posted here when complete

File an issue against this benchmark runner

adriangbot · 2026-03-26T18:31:16Z

Benchmark for this request failed.

Last 20 lines of output:

Click to expand

Cloning into '/workspace/datafusion-branch'...
simplify_smj_full_opt
From https://github.com/apache/datafusion
 * [new ref]         refs/pull/21184/head -> simplify_smj_full_opt
 * branch            main                 -> FETCH_HEAD
Switched to branch 'simplify_smj_full_opt'
ba399a80f9ffcb0563adf2b67add13d0476f6291
Cloning into '/workspace/datafusion-base'...
HEAD is now at ba399a8 docs: add KalamDB to known users (#21181)
rustc 1.94.0 (4a4ef493e 2026-03-02)
1c1bec5e7a217c366e704d1fd5bf8594a9e9540e
ba399a80f9ffcb0563adf2b67add13d0476f6291
    Blocking waiting for file lock on package cache
    Blocking waiting for file lock on package cache
    Blocking waiting for file lock on package cache
error: target `sort_merge_join` in package `datafusion-physical-plan` requires the features: `test_utils`
Consider enabling them by passing, e.g., `--features="test_utils"`

File an issue against this benchmark runner

mbutrovich · 2026-03-26T18:35:19Z

Also I'm now confused where I should add benchmarks. #20464 added Criterion SMJ benchmarks for sort-merge join , but it's missing scenarios from dfbench's smj benchmarks, which I further extend here. Any help?

Dandandan · 2026-03-26T18:52:35Z

adriangb/datafusion-benchmarking#2

…emiAntiMark) into the same SMJ folder.

mbutrovich · 2026-03-27T19:16:51Z

    /// Calculated as sum of peak memory values across partitions
    peak_mem_used: Gauge,
-    /// Metrics related to spilling
-    spill_metrics: SpillMetrics,


Moved SpillMetrics construction from SortMergeJoinMetrics into exec.rs where the SpillManager is now built (shared across both streams). The metrics are still registered into the same ExecutionPlanMetricsSet and reported via metrics() — just constructed in a different place.

comphead

Thanks @mbutrovich numbers looks great, lets wait some time if anyone wants to take a look as well

Dandandan · 2026-03-28T04:52:46Z

run benchmark tpch tpcds

env:
   PREFER_HASH_JOIN: false

adriangbot · 2026-03-28T04:55:14Z

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4147028819-593-wfzrl 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing simplify_smj_full_opt (fa4e99d) to a910b03 (merge-base) diff using: tpcds
Results will be posted here when complete

File an issue against this benchmark runner

adriangbot · 2026-03-28T04:55:25Z

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4147028819-592-sw9rm 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing simplify_smj_full_opt (fa4e99d) to a910b03 (merge-base) diff using: tpch
Results will be posted here when complete

File an issue against this benchmark runner

adriangbot · 2026-03-28T05:07:41Z

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Details

Comparing HEAD and simplify_smj_full_opt
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query     ┃                              HEAD ┃             simplify_smj_full_opt ┃    Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 1  │    45.63 / 46.44 ±0.82 / 47.78 ms │    45.30 / 45.93 ±0.65 / 47.14 ms │ no change │
│ QQuery 2  │    62.48 / 63.64 ±1.31 / 66.20 ms │    62.30 / 62.97 ±0.51 / 63.62 ms │ no change │
│ QQuery 3  │    67.78 / 70.25 ±1.89 / 72.65 ms │    66.12 / 67.75 ±1.49 / 70.57 ms │ no change │
│ QQuery 4  │    50.13 / 51.48 ±1.17 / 53.07 ms │    49.43 / 50.95 ±1.39 / 52.82 ms │ no change │
│ QQuery 5  │ 105.13 / 109.49 ±2.52 / 112.29 ms │ 107.32 / 111.77 ±3.73 / 116.49 ms │ no change │
│ QQuery 6  │    17.70 / 18.06 ±0.49 / 19.00 ms │    17.39 / 17.67 ±0.29 / 18.19 ms │ no change │
│ QQuery 7  │ 132.32 / 134.39 ±2.11 / 137.66 ms │ 137.21 / 138.97 ±1.14 / 140.31 ms │ no change │
│ QQuery 8  │ 124.22 / 133.23 ±6.89 / 141.93 ms │ 125.87 / 128.78 ±1.97 / 130.82 ms │ no change │
│ QQuery 9  │ 174.33 / 180.92 ±4.40 / 185.62 ms │ 175.87 / 182.16 ±3.38 / 186.10 ms │ no change │
│ QQuery 10 │    87.20 / 89.97 ±2.19 / 92.93 ms │    86.94 / 88.81 ±1.17 / 90.39 ms │ no change │
│ QQuery 11 │    46.00 / 46.62 ±0.65 / 47.87 ms │    45.72 / 46.36 ±0.73 / 47.77 ms │ no change │
│ QQuery 12 │    38.28 / 39.08 ±0.55 / 39.87 ms │    37.02 / 38.54 ±0.80 / 39.34 ms │ no change │
│ QQuery 13 │    49.96 / 50.36 ±0.35 / 50.98 ms │    49.61 / 51.69 ±2.37 / 56.32 ms │ no change │
│ QQuery 14 │    29.51 / 30.09 ±0.55 / 30.85 ms │    29.36 / 29.66 ±0.20 / 30.00 ms │ no change │
│ QQuery 15 │    29.54 / 30.91 ±0.79 / 31.69 ms │    31.21 / 31.78 ±0.53 / 32.70 ms │ no change │
│ QQuery 16 │    21.68 / 22.33 ±0.46 / 22.79 ms │    21.79 / 22.33 ±0.47 / 22.94 ms │ no change │
│ QQuery 17 │ 119.16 / 120.99 ±0.96 / 121.93 ms │ 119.76 / 122.27 ±1.92 / 125.04 ms │ no change │
│ QQuery 18 │ 143.60 / 145.36 ±1.01 / 146.59 ms │ 145.78 / 147.90 ±1.33 / 149.03 ms │ no change │
│ QQuery 19 │    39.58 / 39.98 ±0.36 / 40.43 ms │    39.19 / 39.91 ±0.58 / 40.69 ms │ no change │
│ QQuery 20 │    62.05 / 62.76 ±0.74 / 64.16 ms │    58.76 / 61.35 ±1.96 / 63.94 ms │ no change │
│ QQuery 21 │ 268.13 / 271.90 ±3.45 / 276.05 ms │ 266.61 / 272.13 ±3.70 / 277.01 ms │ no change │
│ QQuery 22 │    37.68 / 39.33 ±1.12 / 40.85 ms │    37.40 / 39.42 ±1.62 / 41.55 ms │ no change │
└───────────┴───────────────────────────────────┴───────────────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                    ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                    │ 1797.58ms │
│ Total Time (simplify_smj_full_opt)   │ 1799.10ms │
│ Average Time (HEAD)                  │   81.71ms │
│ Average Time (simplify_smj_full_opt) │   81.78ms │
│ Queries Faster                       │         0 │
│ Queries Slower                       │         0 │
│ Queries with No Change               │        22 │
│ Queries with Failure                 │         0 │
└──────────────────────────────────────┴───────────┘

Resource Usage

tpch — base (merge-base)

Metric	Value
Wall time	9.3s
Peak memory	5.0 GiB
Avg memory	4.0 GiB
CPU user	74.2s
CPU sys	4.5s
Disk read	0 B
Disk write	136.0 KiB

tpch — branch

Metric	Value
Wall time	9.2s
Peak memory	4.9 GiB
Avg memory	4.0 GiB
CPU user	74.6s
CPU sys	4.3s
Disk read	0 B
Disk write	60.0 KiB

File an issue against this benchmark runner

adriangbot · 2026-03-28T05:14:32Z

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Details

Comparing HEAD and simplify_smj_full_opt
--------------------
Benchmark tpcds_sf1.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃                                   HEAD ┃                  simplify_smj_full_opt ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │         51.09 / 52.02 ±1.02 / 54.00 ms │         50.68 / 51.50 ±0.97 / 53.25 ms │     no change │
│ QQuery 2  │      146.08 / 147.34 ±0.73 / 148.13 ms │      145.33 / 146.63 ±0.93 / 147.70 ms │     no change │
│ QQuery 3  │      168.50 / 170.52 ±1.17 / 171.81 ms │      166.08 / 167.54 ±1.04 / 168.82 ms │     no change │
│ QQuery 4  │  1678.84 / 1733.94 ±28.18 / 1756.36 ms │  1675.86 / 1727.75 ±34.00 / 1780.90 ms │     no change │
│ QQuery 5  │      277.11 / 284.62 ±4.23 / 289.19 ms │      281.69 / 286.76 ±3.96 / 293.55 ms │     no change │
│ QQuery 6  │      235.56 / 240.12 ±3.54 / 244.19 ms │      237.09 / 242.92 ±4.56 / 249.06 ms │     no change │
│ QQuery 7  │      387.71 / 390.52 ±2.19 / 393.37 ms │      384.31 / 385.45 ±0.73 / 386.51 ms │     no change │
│ QQuery 8  │      186.30 / 188.29 ±1.52 / 190.86 ms │      184.83 / 189.43 ±3.90 / 194.36 ms │     no change │
│ QQuery 9  │      103.13 / 109.17 ±6.07 / 120.77 ms │      100.61 / 105.25 ±2.51 / 107.59 ms │     no change │
│ QQuery 10 │      253.26 / 258.16 ±4.20 / 264.32 ms │      257.72 / 259.50 ±1.36 / 260.88 ms │     no change │
│ QQuery 11 │      833.80 / 848.81 ±8.47 / 858.80 ms │     847.20 / 861.23 ±10.41 / 877.97 ms │     no change │
│ QQuery 12 │         52.83 / 56.20 ±3.90 / 63.20 ms │         52.71 / 54.52 ±2.27 / 58.88 ms │     no change │
│ QQuery 13 │      401.34 / 405.13 ±2.96 / 409.40 ms │      402.53 / 404.12 ±2.46 / 408.96 ms │     no change │
│ QQuery 14 │  2068.66 / 2107.96 ±23.38 / 2131.81 ms │   2103.86 / 2111.03 ±7.52 / 2121.23 ms │     no change │
│ QQuery 15 │         91.61 / 92.99 ±1.79 / 96.52 ms │         91.36 / 92.42 ±1.08 / 94.46 ms │     no change │
│ QQuery 16 │      158.84 / 160.22 ±1.47 / 162.87 ms │      154.89 / 159.38 ±2.35 / 161.57 ms │     no change │
│ QQuery 17 │      383.52 / 385.64 ±2.25 / 388.88 ms │      379.81 / 383.80 ±3.29 / 387.18 ms │     no change │
│ QQuery 18 │      290.97 / 295.59 ±3.63 / 300.59 ms │      285.76 / 291.02 ±3.19 / 295.72 ms │     no change │
│ QQuery 19 │      238.58 / 242.31 ±6.29 / 254.86 ms │      239.04 / 241.21 ±2.51 / 245.99 ms │     no change │
│ QQuery 20 │         90.42 / 91.95 ±1.09 / 93.23 ms │         89.59 / 92.42 ±1.65 / 94.06 ms │     no change │
│ QQuery 21 │      832.94 / 847.21 ±8.99 / 859.10 ms │     818.46 / 839.22 ±13.64 / 856.22 ms │     no change │
│ QQuery 22 │      394.04 / 399.91 ±3.33 / 404.25 ms │     398.70 / 415.59 ±12.12 / 431.87 ms │     no change │
│ QQuery 23 │  1132.98 / 1154.83 ±13.13 / 1173.82 ms │  1140.54 / 1155.67 ±13.73 / 1178.18 ms │     no change │
│ QQuery 24 │      905.33 / 913.43 ±8.64 / 929.16 ms │      900.79 / 908.16 ±3.81 / 911.13 ms │     no change │
│ QQuery 25 │      432.61 / 439.01 ±5.32 / 445.52 ms │      437.99 / 444.54 ±4.26 / 450.16 ms │     no change │
│ QQuery 26 │      201.25 / 207.81 ±4.17 / 214.25 ms │      204.37 / 207.38 ±2.39 / 210.20 ms │     no change │
│ QQuery 27 │      379.90 / 385.50 ±3.71 / 391.07 ms │      379.63 / 381.98 ±2.47 / 386.34 ms │     no change │
│ QQuery 28 │      151.83 / 154.23 ±1.98 / 157.24 ms │      150.19 / 154.14 ±2.07 / 155.91 ms │     no change │
│ QQuery 29 │      382.02 / 385.53 ±3.80 / 392.70 ms │      378.04 / 380.53 ±2.03 / 383.25 ms │     no change │
│ QQuery 30 │         54.54 / 56.29 ±1.72 / 59.39 ms │         55.22 / 57.61 ±1.80 / 60.23 ms │     no change │
│ QQuery 31 │      641.56 / 648.88 ±4.18 / 654.53 ms │      645.50 / 647.48 ±1.57 / 649.87 ms │     no change │
│ QQuery 32 │      146.88 / 148.23 ±0.69 / 148.74 ms │      147.31 / 149.40 ±1.96 / 152.25 ms │     no change │
│ QQuery 33 │      218.15 / 222.32 ±2.22 / 224.15 ms │      217.01 / 220.73 ±2.41 / 223.97 ms │     no change │
│ QQuery 34 │      187.32 / 190.65 ±2.55 / 195.20 ms │      187.58 / 191.30 ±2.73 / 195.80 ms │     no change │
│ QQuery 35 │      243.92 / 245.95 ±1.75 / 248.69 ms │      244.46 / 247.95 ±2.58 / 251.99 ms │     no change │
│ QQuery 36 │      280.48 / 285.19 ±3.53 / 289.81 ms │      276.39 / 281.67 ±3.99 / 287.02 ms │     no change │
│ QQuery 37 │      287.66 / 292.57 ±3.29 / 298.02 ms │      281.98 / 286.52 ±3.50 / 292.43 ms │     no change │
│ QQuery 38 │      198.56 / 200.58 ±2.19 / 204.77 ms │      198.91 / 200.68 ±1.05 / 201.91 ms │     no change │
│ QQuery 39 │  3910.56 / 3954.99 ±24.64 / 3983.12 ms │  3877.10 / 3919.83 ±28.00 / 3958.97 ms │     no change │
│ QQuery 40 │      170.76 / 174.47 ±2.90 / 179.19 ms │      173.40 / 176.33 ±2.25 / 180.02 ms │     no change │
│ QQuery 41 │         14.34 / 15.55 ±1.31 / 18.05 ms │         14.42 / 16.65 ±2.31 / 19.61 ms │  1.07x slower │
│ QQuery 42 │      171.23 / 173.47 ±1.53 / 176.04 ms │      168.91 / 170.75 ±1.41 / 172.50 ms │     no change │
│ QQuery 43 │      150.30 / 153.90 ±2.29 / 156.73 ms │      149.10 / 150.74 ±1.53 / 152.78 ms │     no change │
│ QQuery 44 │         13.28 / 14.93 ±1.11 / 16.72 ms │         13.76 / 14.68 ±0.98 / 16.47 ms │     no change │
│ QQuery 45 │         67.84 / 69.09 ±0.78 / 69.93 ms │         66.93 / 68.92 ±1.61 / 71.61 ms │     no change │
│ QQuery 46 │      351.49 / 357.46 ±3.45 / 360.46 ms │      350.01 / 356.31 ±6.20 / 367.61 ms │     no change │
│ QQuery 47 │      776.52 / 784.04 ±4.29 / 788.85 ms │      787.27 / 792.40 ±4.71 / 799.70 ms │     no change │
│ QQuery 48 │      289.62 / 293.65 ±3.19 / 299.01 ms │      289.92 / 293.71 ±4.30 / 301.57 ms │     no change │
│ QQuery 49 │      263.06 / 266.08 ±1.87 / 268.50 ms │      263.10 / 266.34 ±2.30 / 269.53 ms │     no change │
│ QQuery 50 │      223.20 / 228.19 ±3.45 / 233.34 ms │      222.64 / 227.46 ±2.60 / 229.49 ms │     no change │
│ QQuery 51 │      233.45 / 235.97 ±1.63 / 238.16 ms │      232.33 / 235.86 ±1.82 / 237.57 ms │     no change │
│ QQuery 52 │      166.40 / 169.32 ±2.91 / 174.69 ms │      165.89 / 169.02 ±2.81 / 173.25 ms │     no change │
│ QQuery 53 │      172.54 / 177.10 ±3.42 / 181.45 ms │      172.57 / 174.58 ±2.66 / 179.48 ms │     no change │
│ QQuery 54 │      263.87 / 268.90 ±3.64 / 275.08 ms │      262.69 / 266.41 ±2.69 / 269.82 ms │     no change │
│ QQuery 55 │      166.15 / 170.75 ±3.88 / 175.36 ms │      167.40 / 169.89 ±2.02 / 172.17 ms │     no change │
│ QQuery 56 │      218.43 / 221.87 ±2.44 / 225.25 ms │      216.48 / 220.21 ±2.63 / 224.45 ms │     no change │
│ QQuery 57 │      365.47 / 370.76 ±2.75 / 373.52 ms │      367.57 / 370.27 ±2.09 / 373.98 ms │     no change │
│ QQuery 58 │      407.10 / 411.32 ±3.97 / 418.41 ms │      401.55 / 406.93 ±4.33 / 412.65 ms │     no change │
│ QQuery 59 │      302.54 / 305.96 ±3.53 / 312.55 ms │      300.57 / 304.10 ±3.01 / 309.13 ms │     no change │
│ QQuery 60 │      213.66 / 221.18 ±3.87 / 224.54 ms │      213.89 / 218.79 ±2.70 / 221.75 ms │     no change │
│ QQuery 61 │      283.22 / 289.00 ±3.08 / 291.65 ms │     287.53 / 295.20 ±10.14 / 314.96 ms │     no change │
│ QQuery 62 │         85.89 / 86.47 ±0.81 / 88.06 ms │         85.05 / 86.48 ±1.23 / 87.82 ms │     no change │
│ QQuery 63 │      174.83 / 177.13 ±1.68 / 178.94 ms │      169.32 / 174.33 ±2.63 / 176.99 ms │     no change │
│ QQuery 64 │   1503.74 / 1515.47 ±8.12 / 1525.24 ms │  1484.54 / 1501.82 ±13.93 / 1523.57 ms │     no change │
│ QQuery 65 │      345.94 / 350.83 ±2.80 / 353.72 ms │      341.12 / 346.42 ±4.39 / 353.53 ms │     no change │
│ QQuery 66 │      300.60 / 307.20 ±4.00 / 312.86 ms │      297.48 / 304.50 ±5.78 / 312.40 ms │     no change │
│ QQuery 67 │      278.96 / 281.77 ±1.67 / 283.97 ms │      275.92 / 277.54 ±1.95 / 279.96 ms │     no change │
│ QQuery 68 │      415.41 / 421.22 ±3.79 / 427.04 ms │      407.15 / 417.74 ±6.70 / 427.53 ms │     no change │
│ QQuery 69 │      250.27 / 253.62 ±2.82 / 258.10 ms │      249.46 / 252.23 ±2.85 / 256.73 ms │     no change │
│ QQuery 70 │      448.70 / 454.63 ±4.68 / 460.17 ms │      448.12 / 452.61 ±3.36 / 456.13 ms │     no change │
│ QQuery 71 │      208.27 / 213.23 ±3.89 / 217.10 ms │      210.67 / 214.01 ±2.34 / 217.91 ms │     no change │
│ QQuery 72 │ 8631.18 / 9039.29 ±214.51 / 9262.42 ms │ 8601.65 / 8993.52 ±200.59 / 9158.60 ms │     no change │
│ QQuery 73 │      187.58 / 195.62 ±7.08 / 208.47 ms │     186.49 / 194.93 ±10.84 / 216.02 ms │     no change │
│ QQuery 74 │      596.97 / 603.71 ±6.61 / 615.81 ms │      595.57 / 600.61 ±2.76 / 603.67 ms │     no change │
│ QQuery 75 │      526.78 / 533.70 ±5.03 / 539.59 ms │      514.71 / 526.44 ±8.76 / 540.12 ms │     no change │
│ QQuery 76 │      137.01 / 139.62 ±1.86 / 142.75 ms │      133.83 / 136.41 ±2.55 / 139.92 ms │     no change │
│ QQuery 77 │      280.69 / 286.88 ±4.91 / 292.71 ms │      275.77 / 281.72 ±3.50 / 285.57 ms │     no change │
│ QQuery 78 │     579.06 / 609.04 ±42.28 / 692.41 ms │      561.23 / 577.90 ±8.65 / 586.58 ms │ +1.05x faster │
│ QQuery 79 │      349.04 / 353.00 ±3.05 / 357.53 ms │      345.08 / 349.73 ±3.19 / 353.32 ms │     no change │
│ QQuery 80 │     460.02 / 475.51 ±12.75 / 489.17 ms │     457.15 / 476.57 ±12.75 / 493.20 ms │     no change │
│ QQuery 81 │         59.25 / 62.92 ±3.81 / 68.50 ms │         60.38 / 64.02 ±3.56 / 70.64 ms │     no change │
│ QQuery 82 │      333.53 / 336.36 ±1.95 / 339.05 ms │      329.63 / 333.62 ±3.29 / 338.15 ms │     no change │
│ QQuery 83 │         67.62 / 69.40 ±2.15 / 73.35 ms │         67.61 / 69.59 ±1.67 / 72.35 ms │     no change │
│ QQuery 84 │         69.61 / 70.93 ±1.15 / 72.51 ms │         68.58 / 70.97 ±2.35 / 75.45 ms │     no change │
│ QQuery 85 │      182.55 / 186.37 ±2.50 / 190.23 ms │      176.28 / 184.16 ±5.63 / 190.91 ms │     no change │
│ QQuery 86 │         46.72 / 46.97 ±0.24 / 47.30 ms │         46.13 / 47.54 ±0.73 / 48.19 ms │     no change │
│ QQuery 87 │      202.38 / 203.47 ±0.92 / 205.17 ms │      201.89 / 205.48 ±3.37 / 211.41 ms │     no change │
│ QQuery 88 │      320.11 / 326.13 ±3.68 / 330.33 ms │      315.90 / 321.41 ±4.44 / 329.25 ms │     no change │
│ QQuery 89 │      182.00 / 186.30 ±5.34 / 196.72 ms │      178.45 / 182.86 ±3.71 / 188.65 ms │     no change │
│ QQuery 90 │         42.47 / 43.62 ±0.64 / 44.41 ms │         42.55 / 43.87 ±1.05 / 45.55 ms │     no change │
│ QQuery 91 │         70.27 / 72.56 ±1.75 / 75.08 ms │         71.58 / 72.67 ±1.37 / 75.15 ms │     no change │
│ QQuery 92 │         84.50 / 85.34 ±0.68 / 86.37 ms │         82.88 / 84.70 ±1.97 / 87.41 ms │     no change │
│ QQuery 93 │      394.12 / 406.33 ±6.42 / 413.14 ms │      385.99 / 396.87 ±7.44 / 405.39 ms │     no change │
│ QQuery 94 │        96.52 / 99.49 ±1.85 / 101.94 ms │         93.33 / 95.99 ±2.23 / 99.10 ms │     no change │
│ QQuery 95 │      179.48 / 180.96 ±1.48 / 183.23 ms │      174.01 / 175.62 ±1.04 / 176.87 ms │     no change │
│ QQuery 96 │      131.81 / 133.80 ±1.51 / 135.80 ms │      132.59 / 133.95 ±0.99 / 134.92 ms │     no change │
│ QQuery 97 │      206.72 / 209.19 ±2.27 / 213.20 ms │      205.57 / 207.44 ±1.36 / 209.01 ms │     no change │
│ QQuery 98 │      191.78 / 195.71 ±2.51 / 198.31 ms │      189.08 / 190.64 ±1.33 / 192.79 ms │     no change │
│ QQuery 99 │      217.57 / 219.39 ±1.48 / 221.84 ms │      216.86 / 219.60 ±2.13 / 222.20 ms │     no change │
└───────────┴────────────────────────────────────────┴────────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                    ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                    │ 44200.74ms │
│ Total Time (simplify_smj_full_opt)   │ 43976.36ms │
│ Average Time (HEAD)                  │   446.47ms │
│ Average Time (simplify_smj_full_opt) │   444.21ms │
│ Queries Faster                       │          1 │
│ Queries Slower                       │          1 │
│ Queries with No Change               │         97 │
│ Queries with Failure                 │          0 │
└──────────────────────────────────────┴────────────┘

Resource Usage

tpcds — base (merge-base)

Metric	Value
Wall time	221.4s
Peak memory	28.4 GiB
Avg memory	6.7 GiB
CPU user	898.7s
CPU sys	90.5s
Disk read	0 B
Disk write	639.4 MiB

tpcds — branch

Metric	Value
Wall time	220.2s
Peak memory	27.8 GiB
Avg memory	6.4 GiB
CPU user	899.1s
CPU sys	91.4s
Disk read	0 B
Disk write	792.0 KiB

File an issue against this benchmark runner

Dandandan · 2026-03-28T05:32:04Z

Nice, this runs correctly with SMJ, but it doesn't impact those joins it seems.

Dandandan · 2026-03-28T05:32:33Z

run benchmark tpch

env:
   PREFER_HASH_JOIN: false

Dandandan · 2026-03-28T05:32:50Z

run benchmark tpch10

env:
   PREFER_HASH_JOIN: false

adriangbot · 2026-03-28T05:35:06Z

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4147233004-594-265h7 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing simplify_smj_full_opt (fa4e99d) to a910b03 (merge-base) diff using: tpch
Results will be posted here when complete

File an issue against this benchmark runner

adriangbot · 2026-03-28T05:35:07Z

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4147234569-595-gnq4p 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing simplify_smj_full_opt (fa4e99d) to a910b03 (merge-base) diff using: tpch10
Results will be posted here when complete

File an issue against this benchmark runner

adriangbot · 2026-03-28T05:43:49Z

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Details

Comparing HEAD and simplify_smj_full_opt
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃                              HEAD ┃             simplify_smj_full_opt ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │    45.16 / 45.88 ±0.80 / 47.40 ms │    45.57 / 46.06 ±0.73 / 47.51 ms │     no change │
│ QQuery 2  │    62.42 / 62.64 ±0.24 / 63.10 ms │    62.16 / 63.28 ±0.95 / 64.43 ms │     no change │
│ QQuery 3  │    65.22 / 68.77 ±2.41 / 72.18 ms │    65.96 / 67.98 ±2.17 / 72.05 ms │     no change │
│ QQuery 4  │    48.99 / 50.28 ±0.67 / 50.91 ms │    48.31 / 50.38 ±1.71 / 53.26 ms │     no change │
│ QQuery 5  │ 109.36 / 112.16 ±2.29 / 115.46 ms │ 106.92 / 110.77 ±2.87 / 115.49 ms │     no change │
│ QQuery 6  │    17.20 / 17.43 ±0.24 / 17.88 ms │    17.24 / 17.59 ±0.40 / 18.38 ms │     no change │
│ QQuery 7  │ 131.27 / 134.15 ±3.23 / 138.63 ms │ 134.58 / 136.98 ±2.02 / 140.69 ms │     no change │
│ QQuery 8  │ 122.42 / 126.76 ±2.50 / 129.65 ms │ 126.29 / 128.17 ±1.35 / 130.17 ms │     no change │
│ QQuery 9  │ 179.96 / 183.14 ±3.87 / 190.48 ms │ 177.55 / 183.43 ±5.32 / 191.75 ms │     no change │
│ QQuery 10 │    86.50 / 88.68 ±2.44 / 93.31 ms │    84.79 / 86.42 ±1.82 / 89.80 ms │     no change │
│ QQuery 11 │    45.72 / 46.50 ±0.51 / 47.13 ms │    45.62 / 46.15 ±0.42 / 46.75 ms │     no change │
│ QQuery 12 │    39.05 / 39.47 ±0.37 / 40.05 ms │    38.81 / 40.91 ±2.60 / 45.78 ms │     no change │
│ QQuery 13 │    49.36 / 50.00 ±0.70 / 51.37 ms │    49.03 / 49.68 ±0.73 / 50.95 ms │     no change │
│ QQuery 14 │    29.40 / 29.83 ±0.41 / 30.58 ms │    29.11 / 29.44 ±0.41 / 30.21 ms │     no change │
│ QQuery 15 │    30.89 / 31.70 ±0.85 / 33.30 ms │    30.49 / 31.66 ±1.04 / 33.57 ms │     no change │
│ QQuery 16 │    21.82 / 22.47 ±0.39 / 22.88 ms │    21.46 / 22.11 ±0.47 / 22.84 ms │     no change │
│ QQuery 17 │ 118.02 / 122.06 ±2.26 / 124.75 ms │ 116.62 / 119.05 ±2.16 / 122.39 ms │     no change │
│ QQuery 18 │ 144.92 / 147.07 ±1.52 / 149.44 ms │ 146.03 / 147.04 ±0.88 / 148.57 ms │     no change │
│ QQuery 19 │    39.41 / 40.76 ±1.12 / 42.71 ms │    38.93 / 40.16 ±1.13 / 42.19 ms │     no change │
│ QQuery 20 │    58.39 / 60.15 ±1.54 / 62.47 ms │    57.06 / 59.14 ±1.62 / 61.23 ms │     no change │
│ QQuery 21 │ 267.54 / 270.49 ±2.39 / 274.22 ms │ 262.90 / 273.23 ±6.02 / 281.81 ms │     no change │
│ QQuery 22 │    38.93 / 39.73 ±0.62 / 40.38 ms │    36.10 / 37.59 ±1.19 / 39.38 ms │ +1.06x faster │
└───────────┴───────────────────────────────────┴───────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                    ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                    │ 1790.14ms │
│ Total Time (simplify_smj_full_opt)   │ 1787.22ms │
│ Average Time (HEAD)                  │   81.37ms │
│ Average Time (simplify_smj_full_opt) │   81.24ms │
│ Queries Faster                       │         1 │
│ Queries Slower                       │         0 │
│ Queries with No Change               │        21 │
│ Queries with Failure                 │         0 │
└──────────────────────────────────────┴───────────┘

Resource Usage

tpch — base (merge-base)

Metric	Value
Wall time	9.3s
Peak memory	4.8 GiB
Avg memory	4.0 GiB
CPU user	73.6s
CPU sys	4.4s
Disk read	0 B
Disk write	136.0 KiB

tpch — branch

Metric	Value
Wall time	9.2s
Peak memory	4.8 GiB
Avg memory	4.0 GiB
CPU user	73.7s
CPU sys	4.4s
Disk read	0 B
Disk write	65.1 MiB

File an issue against this benchmark runner

adriangbot · 2026-03-28T05:47:00Z

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Details

Comparing HEAD and simplify_smj_full_opt
--------------------
Benchmark tpch_sf10.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃                                   HEAD ┃                  simplify_smj_full_opt ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │      372.58 / 374.17 ±0.97 / 375.04 ms │      371.59 / 372.29 ±0.39 / 372.62 ms │     no change │
│ QQuery 2  │      519.95 / 530.22 ±7.65 / 541.91 ms │      521.26 / 531.16 ±8.37 / 546.18 ms │     no change │
│ QQuery 3  │     686.11 / 709.99 ±23.92 / 751.20 ms │     594.45 / 682.54 ±50.44 / 744.85 ms │     no change │
│ QQuery 4  │     540.18 / 611.91 ±42.12 / 659.84 ms │     468.79 / 575.33 ±82.95 / 660.57 ms │ +1.06x faster │
│ QQuery 5  │  1128.05 / 1158.47 ±35.31 / 1220.24 ms │  1133.18 / 1157.66 ±20.99 / 1190.05 ms │     no change │
│ QQuery 6  │      135.94 / 137.87 ±2.85 / 143.50 ms │      135.49 / 139.35 ±4.95 / 148.99 ms │     no change │
│ QQuery 7  │  1565.54 / 1584.72 ±16.91 / 1616.22 ms │  1597.54 / 1626.15 ±21.06 / 1658.61 ms │     no change │
│ QQuery 8  │ 1645.91 / 1807.99 ±213.23 / 2212.27 ms │ 1477.54 / 1957.30 ±313.64 / 2225.01 ms │  1.08x slower │
│ QQuery 9  │  2079.09 / 2136.46 ±40.16 / 2192.58 ms │  2116.16 / 2174.53 ±36.27 / 2230.67 ms │     no change │
│ QQuery 10 │     524.17 / 540.09 ±10.37 / 552.77 ms │      540.02 / 548.74 ±6.45 / 559.41 ms │     no change │
│ QQuery 11 │      473.13 / 477.98 ±2.72 / 481.58 ms │      476.92 / 479.90 ±3.29 / 486.07 ms │     no change │
│ QQuery 12 │     310.27 / 319.43 ±11.13 / 340.32 ms │     314.45 / 335.97 ±24.68 / 379.51 ms │  1.05x slower │
│ QQuery 13 │      374.46 / 386.37 ±6.23 / 392.65 ms │     375.88 / 390.24 ±14.34 / 417.84 ms │     no change │
│ QQuery 14 │      200.46 / 203.07 ±1.65 / 204.97 ms │      200.75 / 204.06 ±2.83 / 207.31 ms │     no change │
│ QQuery 15 │      332.39 / 338.74 ±5.32 / 346.88 ms │      318.60 / 325.68 ±4.65 / 332.42 ms │     no change │
│ QQuery 16 │      125.08 / 130.71 ±3.62 / 136.38 ms │      125.96 / 127.36 ±1.49 / 130.18 ms │     no change │
│ QQuery 17 │ 1604.39 / 1735.71 ±149.87 / 1927.07 ms │ 1586.53 / 1660.55 ±139.47 / 1939.42 ms │     no change │
│ QQuery 18 │  1686.97 / 1707.54 ±11.29 / 1721.40 ms │  1682.94 / 1712.31 ±20.81 / 1737.98 ms │     no change │
│ QQuery 19 │     278.53 / 287.96 ±12.06 / 311.32 ms │      280.32 / 286.73 ±9.75 / 305.94 ms │     no change │
│ QQuery 20 │      467.80 / 473.38 ±3.82 / 479.04 ms │      459.11 / 474.38 ±9.01 / 487.32 ms │     no change │
│ QQuery 21 │ 3154.87 / 3480.40 ±181.03 / 3675.57 ms │ 3550.26 / 3704.38 ±103.20 / 3826.78 ms │  1.06x slower │
│ QQuery 22 │      191.37 / 197.79 ±8.42 / 214.44 ms │      197.07 / 200.91 ±5.60 / 211.66 ms │     no change │
└───────────┴────────────────────────────────────────┴────────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                    ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                    │ 19330.96ms │
│ Total Time (simplify_smj_full_opt)   │ 19667.53ms │
│ Average Time (HEAD)                  │   878.68ms │
│ Average Time (simplify_smj_full_opt) │   893.98ms │
│ Queries Faster                       │          1 │
│ Queries Slower                       │          3 │
│ Queries with No Change               │         18 │
│ Queries with Failure                 │          0 │
└──────────────────────────────────────┴────────────┘

Resource Usage

tpch10 — base (merge-base)

Metric	Value
Wall time	97.0s
Peak memory	11.8 GiB
Avg memory	8.6 GiB
CPU user	909.2s
CPU sys	75.7s
Disk read	0 B
Disk write	3.0 GiB

tpch10 — branch

Metric	Value
Wall time	98.7s
Peak memory	11.1 GiB
Avg memory	8.6 GiB
CPU user	916.0s
CPU sys	75.3s
Disk read	0 B
Disk write	84.0 KiB

File an issue against this benchmark runner

stuhood · 2026-03-29T02:29:43Z

No useful comments to add, just: thank you, love the direction!

# Conflicts: # datafusion/physical-plan/src/joins/sort_merge_join/metrics.rs

…#21200) ## Which issue does this PR close?  - Closes #. ## Rationale for this change  Our SMJ benchmark queries finish too quickly to demonstrate improvements that aren't massive. For example, I am working on an optimization that introduces `DynComparator` (part of apache#20910) and it's about a 10% improvement, but only when you actually make the queries run long enough. The new queries for apache#21184 are scaled enough to see improvements, but we need to scale the older queries. I am also continuing to see SMJ issues with Comet when running joins with billions (sometimes trillions) of rows. We can't do that for microbenchmarks, but we can at least start hitting millions of rows to look at more than a handful of batches. ## What changes are included in this PR?  Bring our SMJ queries into alignment with some of the newer ones (Q21-23) to demonstrate further performance wins. ## Are these changes tested?  I ran the benchmark. On my M3 Max, here's how long it takes: | Query | Join Type | Rows | Keys | Filter | Median (ms) | |-------|-----------|------|------|--------|-------------| | Q1 | INNER | 1M×1M | 1:1 | — | 16.3 | | Q2 | INNER | 1M×10M | 1:10 | — | 117.4 | | Q3 | INNER | 1M×1M | 1:100 | — | 74.2 | | Q4 | INNER | 1M×10M | 1:10 | 1% | 17.1 | | Q5 | INNER | 1M×1M | 1:100 | 10% | 18.4 | | Q6 | LEFT | 1M×10M | 1:10 | — | 129.3 | | Q7 | LEFT | 1M×10M | 1:10 | 50% | 150.2 | | Q8 | FULL | 1M×1M | 1:10 | — | 16.6 | | Q9 | FULL | 1M×10M | 1:10 | 10% | 153.5 | | Q10 | LEFT SEMI | 1M×10M | 1:10 | — | 53.1 | | Q11 | LEFT SEMI | 1M×10M | 1:10 | 1% | 15.5 | | Q12 | LEFT SEMI | 1M×10M | 1:10 | 50% | 65.0 | | Q13 | LEFT SEMI | 1M×10M | 1:10 | 90% | 105.7 | | Q14 | LEFT ANTI | 1M×10M | 1:10 | — | 54.3 | | Q15 | LEFT ANTI | 1M×10M | 1:10 | partial | 51.5 | | Q16 | LEFT ANTI | 1M×1M | 1:1 | — | 10.3 | | Q17 | INNER | 1M×50M | 1:50 | 5% | 75.9 | | Q18 | LEFT SEMI | 1M×50M | 1:50 | 2% | 50.2 | | Q19 | LEFT ANTI | 1M×50M | 1:50 | partial | 336.4 | | Q20 | INNER | 1M×10M | 1:100 | GROUP BY | 763.7 | | Q21 | INNER | 10M×10M | 1:1 | 50% | 186.1 | | Q22 | LEFT | 10M×10M | 1:1 | 50% | 10,193.8 | | Q23 | FULL | 10M×10M | 1:1 | 50% | 10,194.7 | Note that Q22 and Q23 will be about 20x faster when apache#21184 merges, so taking 10 seconds to run is just a short-term issue. ## Are there any user-facing changes?   No.

…ins to bitwise stream. Near-unique LEFT and FULL SMJ 20-50x faster (apache#21184) ## Which issue does this PR close? Partially addresses apache#20910. Fixes apache#21197. ## Rationale for this change Sort-merge join with a filter on outer joins (LEFT/RIGHT/FULL) runs `process_filtered_batches()` on every key transition in the Init state. With near-unique keys (1:1 cardinality), this means running the full deferred filtering pipeline (concat + `get_corrected_filter_mask` + `filter_record_batch_by_join_type`) once per row — making filtered LEFT/RIGHT/FULL **55x slower** than INNER for 10M unique keys. Additionally, mark join logic in `MaterializingSortMergeJoinStream` materializes full `(streamed, buffered)` pairs only to discard most of them via `get_corrected_filter_mask()`. Mark joins are structurally identical to semi joins (one output row per outer row with a boolean result) and belong in `BitwiseSortMergeJoinStream`, which avoids pair materialization entirely using a per-outer-batch bitset. ## What changes are included in this PR? Three areas of improvement, building on the specialized semi/anti stream from apache#20806: **1. Move mark joins to `BitwiseSortMergeJoinStream`** - Match on join type; `emit_outer_batch()` emits all rows with the match bitset as a boolean column (vs semi's filter / anti's invert-and-filter) - Route `LeftMark`/`RightMark` from `SortMergeJoinExec::execute()` to the bitwise stream - Remove all mark-specific logic from `MaterializingSortMergeJoinStream` (`mark_row_as_match`, `is_not_null` column generation, mark arms in filter correction) **2. Batch filter evaluation in `freeze_streamed()`** - Split `freeze_streamed()` into null-joined classification + `freeze_streamed_matched()` for batched materialization - Collect indices across chunks, materialize left/right columns once using tiered Arrow kernels (`slice` → `take` → `interleave`) - Single `RecordBatch` construction and single `expression.evaluate()` per freeze instead of per chunk - Vectorize `append_filter_metadata()` using builder `extend()` instead of per-element loop **3. Batch deferred filtering in Init state** (this is the big win for Q22 and Q23) - Gate `process_filtered_batches()` on accumulated rows >= `batch_size` instead of running on every Init entry - Accumulated data bounded to ~2×batch_size (one from `freeze_dequeuing_buffered`, one accumulating toward next freeze) — does not reintroduce unbounded buffering fixed by PR apache#20482 - `Exhausted` state flushes any remainder **Cleanup:** - Rename `SortMergeJoinStream` → `MaterializingSortMergeJoinStream` (materializes explicit row pairs for join output) and `SemiAntiMarkSortMergeJoinStream` → `BitwiseSortMergeJoinStream` (tracks matches via boolean bitset) - Consolidate `semi_anti_mark_sort_merge_join/` into `sort_merge_join/` as `bitwise_stream.rs` / `bitwise_tests.rs`; rename `stream.rs` → `materializing_stream.rs` and `tests.rs` → `materializing_tests.rs` - Consolidate `SpillManager` construction into `SortMergeJoinExec::execute()` (shared across both streams); move `peak_mem_used` gauge into `BitwiseSortMergeJoinStream::try_new` - `MaterializingSortMergeJoinStream` now handles only Inner/Left/Right/Full — all semi/anti/mark branching removed - `get_corrected_filter_mask()`: merge identical Left/Right/Full branches; add null-metadata passthrough for already-null-joined rows - `filter_record_batch_by_join_type()`: rewrite from `filter(true) + filter(false) + concat` to `zip()` for in-place null-joining — preserves row ordering and removes `create_null_joined_batch()` entirely; add early return for empty batches - `filter_record_batch_by_join_type()`: use `compute::filter()` directly on `BooleanArray` instead of wrapping in temporary `RecordBatch` ## Benchmarks `cargo run --release --bin dfbench -- smj` | Query | Join Type | Rows | Keys | Filter | Main (ms) | PR (ms) | Speedup | |-------|-----------|------|------|--------|-----------|---------|---------| | Q1 | INNER | 1M×1M | 1:1 | — | 16.3 | 14.4 | 1.1x | | Q2 | INNER | 1M×10M | 1:10 | — | 117.4 | 120.1 | 1.0x | | Q3 | INNER | 1M×1M | 1:100 | — | 74.2 | 66.6 | 1.1x | | Q4 | INNER | 1M×10M | 1:10 | 1% | 17.1 | 15.1 | 1.1x | | Q5 | INNER | 1M×1M | 1:100 | 10% | 18.4 | 14.4 | 1.3x | | Q6 | LEFT | 1M×10M | 1:10 | — | 129.3 | 122.7 | 1.1x | | Q7 | LEFT | 1M×10M | 1:10 | 50% | 150.2 | 142.2 | 1.1x | | Q8 | FULL | 1M×1M | 1:10 | — | 16.6 | 16.7 | 1.0x | | Q9 | FULL | 1M×10M | 1:10 | 10% | 153.5 | 136.2 | 1.1x | | Q10 | LEFT SEMI | 1M×10M | 1:10 | — | 53.1 | 53.1 | 1.0x | | Q11 | LEFT SEMI | 1M×10M | 1:10 | 1% | 15.5 | 14.7 | 1.1x | | Q12 | LEFT SEMI | 1M×10M | 1:10 | 50% | 65.0 | 67.3 | 1.0x | | Q13 | LEFT SEMI | 1M×10M | 1:10 | 90% | 105.7 | 109.8 | 1.0x | | Q14 | LEFT ANTI | 1M×10M | 1:10 | — | 54.3 | 53.9 | 1.0x | | Q15 | LEFT ANTI | 1M×10M | 1:10 | partial | 51.5 | 50.5 | 1.0x | | Q16 | LEFT ANTI | 1M×1M | 1:1 | — | 10.3 | 11.3 | 0.9x | | Q17 | INNER | 1M×50M | 1:50 | 5% | 75.9 | 79.0 | 1.0x | | Q18 | LEFT SEMI | 1M×50M | 1:50 | 2% | 50.2 | 49.0 | 1.0x | | Q19 | LEFT ANTI | 1M×50M | 1:50 | partial | 336.4 | 344.2 | 1.0x | | Q20 | INNER | 1M×10M | 1:100 | GROUP BY | 763.7 | 803.9 | 1.0x | | Q21 | INNER | 10M×10M | 1:1 | 50% | 186.1 | 187.8 | 1.0x | | Q22 | LEFT | 10M×10M | 1:1 | 50% | 10,193.8 | 185.8 | **54.9x** | | Q23 | FULL | 10M×10M | 1:1 | 50% | 10,194.7 | 233.6 | **43.6x** | | Q24 | LEFT MARK | 1M×10M | 1:10 | 1% | FAILS | 15.1 | — | | Q25 | LEFT MARK | 1M×10M | 1:10 | 50% | FAILS | 67.3 | — | | Q26 | LEFT MARK | 1M×10M | 1:10 | 90% | FAILS | 110.0 | — | General workload (Q1-Q20, various join types/cardinalities/selectivities): no regressions. ## Are these changes tested? In addition to existing unit and sqllogictests: - I ran 50 iterations of the fuzz tests (modified to only test against hash join as the baseline because nested loop join takes too long) `cargo test -p datafusion --features extended_tests --test fuzz -- join_fuzz` - One new sqllogictest for apache#21197 that fails on main - Four new unit tests: three for full join with filter that spills - One new fuzz test to exercise full join with filter that spills - New benchmark queries Q21-Q23: 10M×10M unique keys with 50% join filter for INNER/LEFT/FULL — exercises the degenerate case this PR fixes - New benchmark queries Q24-Q26 duplicated Q11-Q13 but for Mark joins, showing that they have the same performance as other joins (`LeftSemi`) that use this stream ## Are there any user-facing changes? No.

mbutrovich added 11 commits March 25, 2026 16:43

Remove dead code.

974d305

More cleanup.

3de5c0c

move mark logic

a570c6c

move mark logic

e4fcdd1

add benchmark, optimize remaining smj stream

e66a44b

clean up, debug_asserts

640dddc

add a new test

f71e161

scale benchmark

cd799a6

Batch deferred filtering for outer joins with unique keys

d922b9b

add comments

14d9653

Merge branch 'main' into simplify_smj_full_opt

5779054

github-actions Bot added the physical-plan Changes to the physical-plan crate label Mar 26, 2026

clippy fix.

1c1bec5

mbutrovich requested review from comphead and rluvaton March 26, 2026 18:26

mbutrovich marked this pull request as ready for review March 26, 2026 18:29

mbutrovich mentioned this pull request Mar 26, 2026

[EPIC] Benchmark improvements #21165

Open

rluvaton reviewed Mar 26, 2026

View reviewed changes

Comment thread datafusion/physical-plan/src/joins/sort_merge_join/filter.rs Outdated

rluvaton reviewed Mar 26, 2026

View reviewed changes

Comment thread datafusion/physical-plan/src/joins/sort_merge_join/filter.rs

mbutrovich added 4 commits March 26, 2026 15:19

add clarifying comment

9fc21a0

remove booleans is_semi is_mark and just use JoinType enum.

66632ef

clean up redundant comment next to already-verbose unreachable! macro.

60127a7

clearer debug_assert messages

481753a

comphead reviewed Mar 27, 2026

View reviewed changes

Comment thread datafusion/physical-plan/src/joins/sort_merge_join/stream.rs

mbutrovich added 2 commits March 27, 2026 14:47

address some PR feedback. Rename SMJ streams, collapse the bitwise (S…

3be0029

…emiAntiMark) into the same SMJ folder.

add renamed files.

0c39df3

mbutrovich requested a review from comphead March 27, 2026 18:58

mbutrovich commented Mar 27, 2026

View reviewed changes

mbutrovich added 2 commits March 27, 2026 15:18

Merge branch 'main' into simplify_smj_full_opt

e077499

combine test files.

fa4e99d

comphead approved these changes Mar 27, 2026

View reviewed changes

mbutrovich added 2 commits March 30, 2026 09:58

Merge branch 'main' into simplify_smj_full_opt

2ca0b09

# Conflicts: # datafusion/physical-plan/src/joins/sort_merge_join/metrics.rs

Merge branch 'main' into simplify_smj_full_opt

f1d40b9

mbutrovich added this pull request to the merge queue Mar 30, 2026

Merged via the queue into apache:main with commit 0be5982 Mar 30, 2026
34 checks passed

mbutrovich deleted the simplify_smj_full_opt branch March 30, 2026 15:40

Conversation

mbutrovich commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Benchmarks

Are these changes tested?

Are there any user-facing changes?

Uh oh!

mbutrovich commented Mar 26, 2026

Uh oh!

rluvaton commented Mar 26, 2026

Uh oh!

mbutrovich commented Mar 26, 2026

Uh oh!

adriangbot commented Mar 26, 2026

Uh oh!

adriangbot commented Mar 26, 2026

Uh oh!

mbutrovich commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Dandandan commented Mar 26, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mbutrovich Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

comphead left a comment

Choose a reason for hiding this comment

Uh oh!

Dandandan commented Mar 28, 2026

Uh oh!

adriangbot commented Mar 28, 2026

Uh oh!

adriangbot commented Mar 28, 2026

Uh oh!

adriangbot commented Mar 28, 2026

Uh oh!

adriangbot commented Mar 28, 2026

Uh oh!

Dandandan commented Mar 28, 2026

Uh oh!

Dandandan commented Mar 28, 2026

Uh oh!

Dandandan commented Mar 28, 2026

Uh oh!

adriangbot commented Mar 28, 2026

Uh oh!

adriangbot commented Mar 28, 2026

Uh oh!

adriangbot commented Mar 28, 2026

Uh oh!

adriangbot commented Mar 28, 2026

Uh oh!

stuhood commented Mar 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

mbutrovich commented Mar 26, 2026 •

edited

Loading

mbutrovich commented Mar 26, 2026 •

edited

Loading

mbutrovich Mar 27, 2026 •

edited

Loading