Add configurable UNION DISTINCT to FILTER rewrite optimization#21075
Add configurable UNION DISTINCT to FILTER rewrite optimization#21075xiedeyantu wants to merge 3 commits intoapache:mainfrom
Conversation
fef8ec7 to
80adf8b
Compare
4b113f8 to
3b11b5d
Compare
|
Hi @alamb , here's another PR related to plan optimization. Do we need it? I'd also like to know what aspects of optimization we accept. |
|
Hi @Dandandan , I noticed that you’re very knowledgeable about SQL optimization. It would be great if you could help me review this! |
|
@xiedeyantu -- can you please file a ticket explaining what usecase you are targeting with this optimization? Your explanation in
Mostly focuses on "what" is changed, not the "why" In terms of usecase I think what is important:
Then for this optimization it would be great to have some benchmark numbers showing that the query of interest iindeed gets faster with this optimization compared than without it |
|
|
||
| query TT | ||
| EXPLAIN SELECT id, name FROM t1 WHERE id = 1 UNION SELECT id, name FROM t1 WHERE id = 2 | ||
| ---- |
There was a problem hiding this comment.
@alamb Here is an example: There is a rule for eliminating UNION under specific conditions. It applies when the branches of the UNION come from the same table and only differ in their WHERE conditions. This rule allows us to avoid an extra table scan — we only need to perform a single combined conditional filter.
There was a problem hiding this comment.
I haven't looked at the proposed implementation but the rewrite can surely help in case of repeated costly union branches (especially when coming from possibly complex data sources powered via TableProvider).
It seems also particularly relevant until #8777 gets addressed (broader scope, CTE materialization), as currently there is no other way to mutualize repeated reads AFAIK.
There was a problem hiding this comment.
Thank you @asolimando , for connecting me to such a valuable discussion. I think the idea in #8777 is excellent, but it seems we might need to perform a union operation once more. It looks like a final conclusion hasn't been reached yet? However, this approach can support UNION ALL. It seems the two might complement each other?
|
I tested the execution times, and in reality, the difference between the two was not significant. My understanding is that the multiple branches within a The test SQL script is as follows: |
|
@alamb I have refined the PR description and added a test scenario; could you please take a look and let me know if you find it valuable? |
|
@alamb If you have a moment, could you please take another look? Thank you! |
|
@comphead I'm not sure if you can help me review this PR? |
|
I would agree with @alamb to create initial ticket stating the problem. PR description is nice but it is a solution whereas ticket is a problem statement and some people could also participate in problem discussion |
@comphead Apologies—I may not have fully understood the process earlier; I thought simply describing the issue within the PR itself (including both the problem and the proposed solution) would suffice. I have now created a new issue #21310 for everyone to discuss. Please take a look and let me know if it looks appropriate. |
|
Thanks @xiedeyantu I'll take a look this week, would be super useful for users and also for regression to have internal microbenchmarks, similar to |
Which issue does this PR close?
Rationale for this change
This PR adds a configurable optimizer rewrite for
UNION DISTINCTqueries. The goal is to allow the optimizer to collapse eligible union branches into a single filtered scan when the branches read from the same source and differ only by filter predicates.This optimization can reduce duplicated work and avoid scanning the same input multiple times. Keeping it behind a configuration flag makes the behavior explicit and safe to enable only when desired.
What changes are included in this PR?
datafusion.optimizer.enable_unions_to_filter, which is disabled by default.UnionsToFilteroptimization rule in the logical optimizer pipeline.datafusion/sqllogictest/test_files/union.sltto cover both the disabled and enabled cases.UNION DISTINCTqueries.Example rewrite
When the rule is enabled, a query such as:
SQL
may be rewritten into an equivalent plan that scans
t1once and applies a combined filter such as:SQL
This keeps the results unchanged while avoiding repeated reads from the same source.
Are these changes tested?
Yes.
The new behavior is covered by sqllogictest cases that validate both plan variants:
UNION DISTINCTexecution path when the option is disabledAre there any user-facing changes?
Yes.
A new configuration option is introduced:
datafusion.optimizer.enable_unions_to_filterWhen enabled, some
UNION DISTINCTqueries may be optimized into a different plan shape. Query results remain the same, but the execution plan may change.