Add configurable UNION DISTINCT to FILTER rewrite optimization by xiedeyantu · Pull Request #21075 · apache/datafusion

xiedeyantu · 2026-03-20T10:23:59Z

Which issue does this PR close?
- Closes #21310 (Add configurable UNION DISTINCT support to FILTER rewrite optimization).
Rationale for this change

This PR adds a configurable optimizer rewrite for UNION DISTINCT queries. The goal is to allow the optimizer to collapse eligible union branches into a single filtered scan when the branches read from the same source and differ only by filter predicates.

This optimization can reduce duplicated work and avoid scanning the same input multiple times. Keeping it behind a configuration flag makes the behavior explicit and safe to enable only when desired.

What changes are included in this PR?
- Adds a new optimizer configuration option: datafusion.optimizer.enable_unions_to_filter, which is disabled by default.
- Enables the UnionsToFilter optimization rule in the logical optimizer pipeline.
- Adds documentation for the new configuration option, including plan-shape examples.
- Extends sqllogictest coverage in datafusion/sqllogictest/test_files/union.slt to cover both the disabled and enabled cases.
- Verifies that the rewrite only applies to eligible UNION DISTINCT queries.
Example rewrite

When the rule is enabled, a query such as:

```sql
SELECT id, name FROM t1 WHERE id = 1
UNION
SELECT id, name FROM t1 WHERE id = 2
```
may be rewritten into an equivalent plan that scans t1 once and applies a combined filter such as:

```sql
SELECT id, name FROM t1 WHERE id = 1 OR id = 2
```
This keeps the results unchanged, provided the plan still deduplicates the output as UNION DISTINCT requires, while avoiding repeated reads from the same source.
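
The equivalence can be checked on a toy table. The following is a minimal sketch, not part of the PR: it uses Python's built-in `sqlite3` as a stand-in engine rather than DataFusion, and the rows in `t1` are made up (including a deliberate duplicate). It suggests that the combined filter matches the UNION result only when the output is still deduplicated.

```python
import sqlite3

# Illustration only: SQLite stands in for DataFusion, and the table
# contents are invented. Note the duplicated (1, 'a') row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?)",
    [(1, "a"), (1, "a"), (2, "b"), (3, "c")],
)

# Original shape: two filtered reads of t1 combined with UNION (DISTINCT).
union_rows = conn.execute(
    "SELECT id, name FROM t1 WHERE id = 1 "
    "UNION "
    "SELECT id, name FROM t1 WHERE id = 2"
).fetchall()

# Rewritten shape: one read with the combined predicate, still deduplicated.
distinct_or_rows = conn.execute(
    "SELECT DISTINCT id, name FROM t1 WHERE id = 1 OR id = 2"
).fetchall()

# Same combined predicate without deduplication, for comparison.
plain_or_rows = conn.execute(
    "SELECT id, name FROM t1 WHERE id = 1 OR id = 2"
).fetchall()

print(sorted(union_rows))        # [(1, 'a'), (2, 'b')]
print(sorted(distinct_or_rows))  # [(1, 'a'), (2, 'b')]
print(sorted(plain_or_rows))     # [(1, 'a'), (1, 'a'), (2, 'b')]
```

The duplicate row survives in the last query, which is why a rewritten plan has to keep a distinct/aggregate step above the combined filter.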

Are these changes tested?

Yes.

The new behavior is covered by sqllogictest cases that validate both plan variants:
- the original UNION DISTINCT execution path when the option is disabled
- the rewritten single-scan plan when the option is enabled
Are there any user-facing changes?

Yes.

A new configuration option is introduced:
- datafusion.optimizer.enable_unions_to_filter
When enabled, some UNION DISTINCT queries may be optimized into a different plan shape. Query results remain the same, but the execution plan may change.

xiedeyantu · 2026-03-21T03:54:30Z

Hi @alamb , here's another PR related to plan optimization. Do we need it? I'd also like to know what aspects of optimization we accept.

xiedeyantu · 2026-03-21T13:55:42Z

Hi @Dandandan , I noticed that you’re very knowledgeable about SQL optimization. It would be great if you could help me review this!

alamb · 2026-03-22T13:26:15Z

@xiedeyantu -- can you please file a ticket explaining what usecase you are targeting with this optimization?

Your explanation in "Rationale for this change" mostly focuses on "what" is changed, not the "why".

In terms of use case, I think what is important is example SQL queries that show the pattern you are optimizing for.

Then, for this optimization, it would be great to have some benchmark numbers showing that the query of interest indeed gets faster with this optimization than without it.

xiedeyantu · 2026-03-22T13:57:19Z

datafusion/sqllogictest/test_files/union.slt

+
+query TT
+EXPLAIN SELECT id, name FROM t1 WHERE id = 1 UNION SELECT id, name FROM t1 WHERE id = 2
+----


@alamb Here is an example: There is a rule for eliminating UNION under specific conditions. It applies when the branches of the UNION come from the same table and only differ in their WHERE conditions. This rule allows us to avoid an extra table scan — we only need to perform a single combined conditional filter.
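
The eligibility condition described above can be sketched on a toy plan model. This is a hypothetical illustration only: `Branch`, `unions_to_filter`, and `to_sql` are invented names and do not reflect the actual rule's code or DataFusion's `LogicalPlan` API. It assumes each branch is a plain `SELECT ... FROM ... WHERE ...` with a deterministic predicate.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Branch:
    """Toy model of one UNION branch: SELECT columns FROM table WHERE predicate."""
    table: str
    columns: Tuple[str, ...]
    predicate: str


def unions_to_filter(branches: Sequence[Branch]) -> Optional[Branch]:
    """Merge UNION DISTINCT branches into one filtered scan, or return None.

    Branches are eligible only if they read the same table and project the
    same columns, i.e. they differ only in their WHERE predicates.
    """
    if len(branches) < 2:
        return None
    first = branches[0]
    for branch in branches[1:]:
        if branch.table != first.table or branch.columns != first.columns:
            return None
    # OR the predicates together; parentheses keep each one intact.
    combined = " OR ".join(f"({b.predicate})" for b in branches)
    return Branch(first.table, first.columns, combined)


def to_sql(branch: Branch) -> str:
    """Render the merged branch; DISTINCT keeps UNION DISTINCT semantics."""
    cols = ", ".join(branch.columns)
    return f"SELECT DISTINCT {cols} FROM {branch.table} WHERE {branch.predicate}"


same_table = [
    Branch("t1", ("id", "name"), "id = 1"),
    Branch("t1", ("id", "name"), "id = 2"),
]
merged = unions_to_filter(same_table)
print(to_sql(merged))
# SELECT DISTINCT id, name FROM t1 WHERE (id = 1) OR (id = 2)

different_tables = [
    Branch("t1", ("id", "name"), "id = 1"),
    Branch("t2", ("id", "name"), "id = 2"),
]
print(unions_to_filter(different_tables))  # None: rewrite does not apply
```

A real rule would compare plan nodes rather than strings and would likely need further checks that this sketch leaves out.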

asolimando · 2026-03-23

I haven't looked at the proposed implementation, but the rewrite can surely help with repeated costly union branches (especially when they come from possibly complex data sources powered via TableProvider).

It also seems particularly relevant until #8777 (broader scope, CTE materialization) gets addressed, as currently there is no other way to share repeated reads, AFAIK.

xiedeyantu · 2026-03-23

Thank you, @asolimando, for connecting me to such a valuable discussion. I think the idea in #8777 is excellent, but it seems we might still need to perform a union operation once more. It looks like a final conclusion hasn't been reached yet? However, this approach can support UNION ALL. It seems the two might complement each other?

xiedeyantu · 2026-03-29T05:29:45Z

I tested the execution times, and in reality, the difference between the two was not significant. My understanding is that the multiple branches within a UNION operation can be processed in parallel; therefore, one would not expect to see a substantial reduction in overall execution time (and if a significant improvement were observed, it would likely indicate an issue elsewhere). Consequently, I had to rely on EXPLAIN ANALYZE to inspect the execution plan, thereby demonstrating that—prior to optimization—the data required two separate scans, whereas after optimization, only a single scan was necessary.

The test SQL script is as follows:

# test_data.csv
id,category,amount,created_at
1,A,10,2026-03-29 00:00:01
2,B,20,2026-03-29 00:00:02
3,C,30,2026-03-29 00:00:03
......
9999,D,99990,2026-03-29 02:46:39
10000,E,100000,2026-03-29 02:46:40

# create table
CREATE EXTERNAL TABLE t (
  id INT,
  category STRING,
  amount INT,
  created_at TIMESTAMP
)
STORED AS CSV
LOCATION 'test_data.csv'
OPTIONS (
  has_header 'true'
);

set datafusion.optimizer.enable_unions_to_filter=false;
EXPLAIN ANALYZE 
SELECT category FROM t WHERE id > 5
UNION
SELECT category FROM t WHERE id < 10;
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type         | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | AggregateExec: mode=FinalPartitioned, gby=[category@0 as category], aggr=[], metrics=[output_rows=5, elapsed_compute=512.96µs, output_bytes=128.0 B, output_batches=2, spill_count=0, spilled_bytes=0.0 B, spilled_rows=0, peak_mem_used=3.49 K, aggregate_arguments_time=8ns, aggregation_time=8ns, emitting_time=5.51µs, time_calculating_group_ids=21.34µs]                                                                                                                                |
|                   |   RepartitionExec: partitioning=Hash([category@0], 8), input_partitions=16, metrics=[output_rows=15, elapsed_compute=129.16µs, output_bytes=256.0 KB, output_batches=2, spill_count=0, spilled_bytes=0.0 B, spilled_rows=0, fetch_time=311.11ms, repartition_time=52.80µs, send_time=27.66µs]                                                                                                                                                                                                 |
|                   |     AggregateExec: mode=Partial, gby=[category@0 as category], aggr=[], metrics=[output_rows=15, elapsed_compute=2.50ms, output_bytes=384.0 B, output_batches=3, spill_count=0, spilled_bytes=0.0 B, spilled_rows=0, skipped_aggregation_rows=0, peak_mem_used=167.2 K, aggregate_arguments_time=16ns, aggregation_time=16ns, emitting_time=16.47µs, time_calculating_group_ids=2.06ms, reduction_factor=0.15% (15/10.00 K)]                                                                  |
|                   |       UnionExec, metrics=[output_rows=10.00 K, elapsed_compute=298.29µs, output_bytes=383.9 KB, output_batches=3]                                                                                                                                                                                                                                                                                                                                                                             |
|                   |         FilterExec: id@0 > 5, projection=[category@1], metrics=[output_rows=9.99 K, elapsed_compute=418.21µs, output_bytes=255.9 KB, output_batches=2, selectivity=100% (9.99 K/10.00 K)]                                                                                                                                                                                                                                                                                                     |
|                   |           RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1, metrics=[output_rows=10.00 K, elapsed_compute=35.17µs, output_bytes=320.0 KB, output_batches=2, spill_count=0, spilled_bytes=0.0 B, spilled_rows=0, fetch_time=19.31ms, repartition_time=1ns, send_time=16.59µs]                                                                                                                                                                                              |
|                   |             DataSourceExec: file_groups={1 group: [[Users/jensen/test/test_data.csv]]}, projection=[id, category], file_type=csv, has_header=true, metrics=[output_rows=10.00 K, elapsed_compute=18.83ms, output_bytes=196.2 KB, output_batches=2, batches_split=0, file_open_errors=0, file_scan_errors=0, files_opened=1, files_processed=1, time_elapsed_opening=359.75µs, time_elapsed_processing=19.09ms, time_elapsed_scanning_total=18.89ms, time_elapsed_scanning_until_data=15.39ms] |
|                   |         FilterExec: id@0 < 10, projection=[category@1], metrics=[output_rows=9, elapsed_compute=310.84µs, output_bytes=128.0 KB, output_batches=1, selectivity=0.09% (9/10.00 K)]                                                                                                                                                                                                                                                                                                             |
|                   |           RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1, metrics=[output_rows=10.00 K, elapsed_compute=62.12µs, output_bytes=320.0 KB, output_batches=2, spill_count=0, spilled_bytes=0.0 B, spilled_rows=0, fetch_time=18.82ms, repartition_time=1ns, send_time=47.76µs]                                                                                                                                                                                              |
|                   |             DataSourceExec: file_groups={1 group: [[Users/jensen/test/test_data.csv]]}, projection=[id, category], file_type=csv, has_header=true, metrics=[output_rows=10.00 K, elapsed_compute=18.35ms, output_bytes=196.2 KB, output_batches=2, batches_split=0, file_open_errors=0, file_scan_errors=0, files_opened=1, files_processed=1, time_elapsed_opening=364.33µs, time_elapsed_processing=18.63ms, time_elapsed_scanning_total=18.48ms, time_elapsed_scanning_until_data=15.05ms] |
|                   |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row(s) fetched. 
Elapsed 0.038 seconds.

set datafusion.optimizer.enable_unions_to_filter=true;
EXPLAIN ANALYZE 
SELECT category FROM t WHERE id > 5
UNION
SELECT category FROM t WHERE id < 10;
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type         | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | AggregateExec: mode=FinalPartitioned, gby=[category@0 as category], aggr=[], metrics=[output_rows=5, elapsed_compute=374.54µs, output_bytes=128.0 B, output_batches=2, spill_count=0, spilled_bytes=0.0 B, spilled_rows=0, peak_mem_used=3.36 K, aggregate_arguments_time=8ns, aggregation_time=8ns, emitting_time=5.50µs, time_calculating_group_ids=19.13µs]                                                                                                                              |
|                   |   RepartitionExec: partitioning=Hash([category@0], 8), input_partitions=8, metrics=[output_rows=10, elapsed_compute=97.63µs, output_bytes=256.0 KB, output_batches=2, spill_count=0, spilled_bytes=0.0 B, spilled_rows=0, fetch_time=180.39ms, repartition_time=48.84µs, send_time=30.44µs]                                                                                                                                                                                                 |
|                   |     AggregateExec: mode=Partial, gby=[category@0 as category], aggr=[], metrics=[output_rows=10, elapsed_compute=2.00ms, output_bytes=256.0 B, output_batches=2, spill_count=0, spilled_bytes=0.0 B, spilled_rows=0, skipped_aggregation_rows=0, peak_mem_used=164.5 K, aggregate_arguments_time=8ns, aggregation_time=8ns, emitting_time=14.63µs, time_calculating_group_ids=1.81ms, reduction_factor=0.1% (10/10.00 K)]                                                                   |
|                   |       FilterExec: id@0 > 5 OR id@0 < 10, projection=[category@1], metrics=[output_rows=10.00 K, elapsed_compute=526.17µs, output_bytes=256.0 KB, output_batches=2, selectivity=100% (10.00 K/10.00 K)]                                                                                                                                                                                                                                                                                      |
|                   |         RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1, metrics=[output_rows=10.00 K, elapsed_compute=58.50µs, output_bytes=320.0 KB, output_batches=2, spill_count=0, spilled_bytes=0.0 B, spilled_rows=0, fetch_time=22.05ms, repartition_time=1ns, send_time=25.67µs]                                                                                                                                                                                              |
|                   |           DataSourceExec: file_groups={1 group: [[Users/jensen/test/test_data.csv]]}, projection=[id, category], file_type=csv, has_header=true, metrics=[output_rows=10.00 K, elapsed_compute=21.49ms, output_bytes=196.2 KB, output_batches=2, batches_split=0, file_open_errors=0, file_scan_errors=0, files_opened=1, files_processed=1, time_elapsed_opening=468.54µs, time_elapsed_processing=21.82ms, time_elapsed_scanning_total=21.57ms, time_elapsed_scanning_until_data=18.30ms] |
|                   |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row(s) fetched. 
Elapsed 0.040 seconds.
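
As a rough reading aid, the scan metrics from the two plans above can be totalled. This is a small sketch that only copies numbers from the two EXPLAIN ANALYZE outputs; it reflects a single run on a 10.00 K-row CSV, so it is not a benchmark.

```python
# Values copied from the two EXPLAIN ANALYZE outputs in this comment.
# DataSourceExec elapsed_compute in milliseconds, one entry per scan of the CSV.
scan_ms_disabled = [18.83, 18.35]  # enable_unions_to_filter=false: two scans
scan_ms_enabled = [21.49]          # enable_unions_to_filter=true: one scan

# Wall-clock time reported by the CLI, in seconds.
wall_s_disabled = 0.038
wall_s_enabled = 0.040

total_disabled = sum(scan_ms_disabled)
total_enabled = sum(scan_ms_enabled)

print(f"scans:        {len(scan_ms_disabled)} vs {len(scan_ms_enabled)}")
print(f"scan compute: {total_disabled:.2f} ms vs {total_enabled:.2f} ms")
print(f"wall clock:   {wall_s_disabled:.3f} s vs {wall_s_enabled:.3f} s")
```

In this run the summed scan compute drops from about 37.18 ms to 21.49 ms while the wall-clock time stays essentially flat, which is consistent with the two union branches having been scanned in parallel.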

xiedeyantu · 2026-03-29T05:33:31Z

@alamb I have refined the PR description and added a test scenario; could you please take a look and let me know if you find it valuable?

xiedeyantu · 2026-03-31T22:47:07Z

@alamb If you have a moment, could you please take another look? Thank you!

xiedeyantu · 2026-04-01T22:40:56Z

@comphead I'm not sure if you can help me review this PR?

comphead · 2026-04-02T00:17:50Z

I agree with @alamb about creating an initial ticket stating the problem. The PR description is nice, but it describes a solution, whereas a ticket is a problem statement, and other people could also participate in the problem discussion.

xiedeyantu · 2026-04-02T00:55:29Z

I agree with @alamb about creating an initial ticket stating the problem. The PR description is nice, but it describes a solution, whereas a ticket is a problem statement, and other people could also participate in the problem discussion.

@comphead Apologies—I may not have fully understood the process earlier; I thought simply describing the issue within the PR itself (including both the problem and the proposed solution) would suffice. I have now created a new issue #21310 for everyone to discuss. Please take a look and let me know if it looks appropriate.

comphead · 2026-04-02T15:10:34Z

Thanks @xiedeyantu, I'll take a look this week. It would be super useful for users, and also for catching regressions, to have internal microbenchmarks similar to datafusion/core/benches/push_down_filter.rs.

xiedeyantu marked this pull request as draft March 20, 2026 10:24

github-actions bot added the optimizer (Optimizer rules), sqllogictest (SQL Logic Tests (.slt)), and common (Related to common crate) labels Mar 20, 2026

xiedeyantu changed the title ~~Add configurable UNION DISTINCT to filter rewrite optimization~~ Add configurable UNION DISTINCT to FILTER rewrite optimization Mar 20, 2026

xiedeyantu force-pushed the union-filter branch from fef8ec7 to 80adf8b Compare March 20, 2026 12:38

github-actions bot added the documentation (Improvements or additions to documentation) label Mar 20, 2026

Add configurable UNION DISTINCT to FILTER rewrite optimization

3b11b5d

xiedeyantu force-pushed the union-filter branch from 4b113f8 to 3b11b5d Compare March 20, 2026 13:43

xiedeyantu marked this pull request as ready for review March 20, 2026 14:12

xiedeyantu added 2 commits April 3, 2026 23:49

projection should be safe

9ac875f

fix code style

064b8a4
