HIVE-29424: CBO plans should use histogram statistics for range predicates with a CAST#6293
HIVE-29424: CBO plans should use histogram statistics for range predicates with a CAST#6293thomasrebele wants to merge 11 commits intoapache:masterfrom
Conversation
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java
Outdated
Show resolved
Hide resolved
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java
Outdated
Show resolved
Hide resolved
...c/test/org/apache/hadoop/hive/ql/optimizer/calcite/stats/TestFilterSelectivityEstimator.java
Outdated
Show resolved
Hide resolved
zabetak
left a comment
There was a problem hiding this comment.
Thanks for the PR @thomasrebele , the proposal is very promising.
One general question that came to mind while I was reviewing the PR is if the CAST removal is relevant only for range predicates and histograms or if it can have a positive impact on other expressions. For example, is there any benefit in attempting to remove a CAST from the following expressions:
IS NOT NULL(CAST($1):BIGINT)=(CAST($1):DOUBLE, 1)IN(CAST($1):TINYINT, 10, 20, 30)
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java
Outdated
Show resolved
Hide resolved
...c/test/org/apache/hadoop/hive/ql/optimizer/calcite/stats/TestFilterSelectivityEstimator.java
Outdated
Show resolved
Hide resolved
...c/test/org/apache/hadoop/hive/ql/optimizer/calcite/stats/TestFilterSelectivityEstimator.java
Outdated
Show resolved
Hide resolved
...c/test/org/apache/hadoop/hive/ql/optimizer/calcite/stats/TestFilterSelectivityEstimator.java
Outdated
Show resolved
Hide resolved
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java
Outdated
Show resolved
Hide resolved
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java
Outdated
Show resolved
Hide resolved
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java
Outdated
Show resolved
Hide resolved
|
|
||
| double min; | ||
| double max; | ||
| switch (type.toLowerCase()) { |
There was a problem hiding this comment.
This class is mostly using Calcite APIs so since we have the SqlTypeName readily available wouldn't be better to use that instead?
In addition there is org.apache.calcite.sql.type.SqlTypeName#getLimit which might be relevant and could potentially replace this switch statement.
There was a problem hiding this comment.
We can use SqlTypeName#getLimit for the integer types. The method throws an exception for FLOAT/DOUBLE, so we would still need the switch statement.
There was a problem hiding this comment.
Ok to use the switch then but let's base it on SqlTypeName.
If it makes sense to handle FLOAT/DOUBLE in SqlTypeName#getLimit then it would be a good idea to log a CALCITE JIRA ticket.
There was a problem hiding this comment.
I've refactored the switch and verified that the result of the getLimit call results in the same min/max values.
I don't know whether there's a limit for FLOAT/DOUBLE, so I've created CALCITE-7419 for the discussion.
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java
Outdated
Show resolved
Hide resolved
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java
Outdated
Show resolved
Hide resolved
thomasrebele
left a comment
There was a problem hiding this comment.
Thank you for your review, @zabetak! Removing the cast from other expressions might be beneficial for the selectivity estimation. I would consider these improvements as out-of-scope for this PR, though.
About the first example, IS NOT NULL(CAST($1):BIGINT), CALCITE-5769 improved RexSimplify to remove the cast from the expression. I assume that the filters that arrive at FilterSelectivityEstimator should remove the cast, if it is superfluous. Otherwise, it could converted to a range predicate for the purpose of selectivity estimation. I would leave this idea for other tickets.
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java
Outdated
Show resolved
Hide resolved
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java
Outdated
Show resolved
Hide resolved
...c/test/org/apache/hadoop/hive/ql/optimizer/calcite/stats/TestFilterSelectivityEstimator.java
Outdated
Show resolved
Hide resolved
...c/test/org/apache/hadoop/hive/ql/optimizer/calcite/stats/TestFilterSelectivityEstimator.java
Outdated
Show resolved
Hide resolved
...c/test/org/apache/hadoop/hive/ql/optimizer/calcite/stats/TestFilterSelectivityEstimator.java
Outdated
Show resolved
Hide resolved
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java
Outdated
Show resolved
Hide resolved
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java
Outdated
Show resolved
Hide resolved
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java
Outdated
Show resolved
Hide resolved
|
|
||
| double min; | ||
| double max; | ||
| switch (type.toLowerCase()) { |
There was a problem hiding this comment.
We can use SqlTypeName#getLimit for the integer types. The method throws an exception for FLOAT/DOUBLE, so we would still need the switch statement.
…cates with a CAST
f80c231 to
1e9fd2b
Compare
|
The CI fails because of |
The removeCastIfPossible was doing three things: 1) Checking if a cast can be removed based using column stats 2) Removing the cast if possible 3) Adjusting the boundaries in case of DECIMAL casts After the refactoring the three actions are decoupled and each is performed individually. This leads to smaller and more self-contained methods that are easier to follow.
No need to invent new APIs when equivalent exists and used in other places in Hive/Calcite.
|
Hey @thomasrebele , I was going over the PR and did some refactoring to help me understand better some parts of the code and hopefully and improve a bit readability. My refactoring work can be found in the https://github.com/zabetak/hive/tree/HIVE-29424-r1 branch. However, after replacing the |
...c/test/org/apache/hadoop/hive/ql/optimizer/calcite/stats/TestFilterSelectivityEstimator.java
Show resolved
Hide resolved
...c/test/org/apache/hadoop/hive/ql/optimizer/calcite/stats/TestFilterSelectivityEstimator.java
Outdated
Show resolved
Hide resolved
...c/test/org/apache/hadoop/hive/ql/optimizer/calcite/stats/TestFilterSelectivityEstimator.java
Outdated
Show resolved
Hide resolved
...c/test/org/apache/hadoop/hive/ql/optimizer/calcite/stats/TestFilterSelectivityEstimator.java
Show resolved
Hide resolved
| checkTimeFieldOnMidnightTimestamps(cast("f_timestamp", SqlTypeName.DATE)); | ||
| checkTimeFieldOnMidnightTimestamps(cast("f_timestamp", SqlTypeName.TIMESTAMP)); |
There was a problem hiding this comment.
Why checkTimeFieldOnIntraDayTimestamps are not relevant here?
There was a problem hiding this comment.
Indeed, they should be here, but not on testComputeRangePredicateSelectivityDateWithCast. Fixed.
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java
Show resolved
Hide resolved
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java
Show resolved
Hide resolved
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java
Show resolved
Hide resolved
...c/test/org/apache/hadoop/hive/ql/optimizer/calcite/stats/TestFilterSelectivityEstimator.java
Outdated
Show resolved
Hide resolved
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java
Outdated
Show resolved
Hide resolved
|
zabetak
left a comment
There was a problem hiding this comment.
Overall, I like the new changes. I just have a question about a potentially missing check for DECIMAL types and we are good to go.
| typeBoundaries = | ||
| getRangeOfDecimalType(expr.getType(), rangeBoundaries.lowerBoundType(), rangeBoundaries.upperBoundType()); | ||
| rangeBoundaries = adjustRangeToDecimalType(rangeBoundaries, expr.getType(), typeBoundaries); |
There was a problem hiding this comment.
Aren't we missing a conditional here so that it runs only for DECIMAL type? Do we have adequate test coverage?
| final Object inverseBoolValueObject = ((RexLiteral) operands.getFirst()).getValue(); | ||
| boolean inverseBool = Boolean.parseBoolean(inverseBoolValueObject.toString()); |
There was a problem hiding this comment.
| final Object inverseBoolValueObject = ((RexLiteral) operands.getFirst()).getValue(); | |
| boolean inverseBool = Boolean.parseBoolean(inverseBoolValueObject.toString()); | |
| boolean inverseBool = RexLiteral.booleanValue(operands.getFirst()); |
There was a problem hiding this comment.
Minor suggestion. If we need to commit something more on the PR we can fix this as well otherwise not worth addressing on its own.



See HIVE-29424.
What changes were proposed in this pull request?
This PR adapts FilterSelectivityEstimator so that histogram statistics are used for range predicates with a cast.
I added many test cases to some cover corner cases. To get the ground truth, I executed queries with the predicates, see the resulting q.out file.
Why are the changes needed?
This PR allows the CBO planner to use histogram statistics for range predicates that contain a CAST around the input column.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests were added.