HIVE-29424: CBO plans should use histogram statistics for range predicates with a CAST by thomasrebele · Pull Request #6293 · apache/hive

thomasrebele · 2026-02-04T00:01:44Z

What changes were proposed in this pull request?

This PR adapts FilterSelectivityEstimator so that histogram statistics are used for range predicates with a cast.
I added many test cases to cover some corner cases. To get the ground truth, I executed queries with the predicates; see the resulting q.out file.

Why are the changes needed?

This PR allows the CBO planner to use histogram statistics for range predicates that contain a CAST around the input column.
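
As a rough illustration of the idea (not the code in FilterSelectivityEstimator), the sketch below estimates the selectivity of `CAST(col AS T) BETWEEN lower AND upper` from column statistics. It assumes a sorted sample stands in for Hive's histogram, that the cast can be ignored when the observed column min/max fit into the target type, and that a fixed fallback is used otherwise. The class name, method names, and the fallback constant are all hypothetical.

```java
/** Illustrative only: a sorted sample stands in for real histogram statistics. */
final class CastRangeSelectivitySketch {
  // Hypothetical placeholder for whatever default guess a planner uses without a histogram.
  static final double FALLBACK_SELECTIVITY = 1.0 / 3.0;

  /** First index whose value is >= key (array must be sorted ascending). */
  static int firstAtLeast(double[] sorted, double key) {
    int lo = 0;
    int hi = sorted.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (sorted[mid] < key) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  /** First index whose value is > key (array must be sorted ascending). */
  static int firstGreaterThan(double[] sorted, double key) {
    int lo = 0;
    int hi = sorted.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (sorted[mid] <= key) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  /** Fraction of the sample inside the closed range [lower, upper]. */
  static double histogramSelectivity(double[] sorted, double lower, double upper) {
    if (sorted.length == 0 || lower > upper) {
      return 0.0;
    }
    int from = firstAtLeast(sorted, lower);
    int to = firstGreaterThan(sorted, upper);
    return (double) (to - from) / sorted.length;
  }

  /** Uses the histogram only if every sampled value fits into [typeMin, typeMax]. */
  static double estimate(double[] sorted, double typeMin, double typeMax, double lower, double upper) {
    if (sorted.length == 0) {
      return FALLBACK_SELECTIVITY;
    }
    boolean castKeepsValues = sorted[0] >= typeMin && sorted[sorted.length - 1] <= typeMax;
    return castKeepsValues ? histogramSelectivity(sorted, lower, upper) : FALLBACK_SELECTIVITY;
  }

  public static void main(String[] args) {
    double[] column = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    // CAST(col AS TINYINT) BETWEEN 3 AND 5, evaluated against the column's own values.
    System.out.println(estimate(column, Byte.MIN_VALUE, Byte.MAX_VALUE, 3, 5));
  }
}
```

Falling back when the values do not fit is a simplification made for this sketch; the PR itself may treat out-of-range values differently.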

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests were added.

ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java

...c/test/org/apache/hadoop/hive/ql/optimizer/calcite/stats/TestFilterSelectivityEstimator.java

zabetak

Thanks for the PR @thomasrebele , the proposal is very promising.

One general question that came to mind while I was reviewing the PR is if the CAST removal is relevant only for range predicates and histograms or if it can have a positive impact on other expressions. For example, is there any benefit in attempting to remove a CAST from the following expressions:

IS NOT NULL(CAST($1):BIGINT)
=(CAST($1):DOUBLE, 1)
IN(CAST($1):TINYINT, 10, 20, 30)


zabetak · 2026-02-12T13:25:50Z

ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java

+
+    double min;
+    double max;
+    switch (type.toLowerCase()) {


This class mostly uses Calcite APIs, so since we have the SqlTypeName readily available, wouldn't it be better to use that instead?

In addition there is org.apache.calcite.sql.type.SqlTypeName#getLimit which might be relevant and could potentially replace this switch statement.

We can use SqlTypeName#getLimit for the integer types. The method throws an exception for FLOAT/DOUBLE, so we would still need the switch statement.

Ok to use the switch then but let's base it on SqlTypeName.

If it makes sense to handle FLOAT/DOUBLE in SqlTypeName#getLimit then it would be a good idea to log a CALCITE JIRA ticket.

I've refactored the switch and verified that the result of the getLimit call results in the same min/max values.

I don't know whether there's a limit for FLOAT/DOUBLE, so I've created CALCITE-7419 for the discussion.
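
The refactored switch itself is not shown in this thread. As a minimal sketch of what a type-based min/max lookup could look like, the example below uses a local enum as a stand-in for the relevant SqlTypeName values, so that it runs without Calcite on the classpath. Using the largest finite magnitude for FLOAT/DOUBLE is an assumption made here; the thread leaves that question open (see CALCITE-7419).

```java
/** Illustrative only: NumericType is a hypothetical stand-in for Calcite's SqlTypeName. */
final class TypeRangeSketch {
  enum NumericType { TINYINT, SMALLINT, INTEGER, BIGINT, FLOAT, DOUBLE }

  /** Returns {min, max} of the values representable in the given type, as doubles. */
  static double[] limits(NumericType type) {
    double min;
    double max;
    switch (type) {
      case TINYINT:
        min = Byte.MIN_VALUE;
        max = Byte.MAX_VALUE;
        break;
      case SMALLINT:
        min = Short.MIN_VALUE;
        max = Short.MAX_VALUE;
        break;
      case INTEGER:
        min = Integer.MIN_VALUE;
        max = Integer.MAX_VALUE;
        break;
      case BIGINT:
        // Long limits are rounded to the nearest representable double.
        min = Long.MIN_VALUE;
        max = Long.MAX_VALUE;
        break;
      case FLOAT:
        // Assumption: the largest finite magnitude serves as the limit.
        min = -Float.MAX_VALUE;
        max = Float.MAX_VALUE;
        break;
      case DOUBLE:
        // Assumption: the largest finite magnitude serves as the limit.
        min = -Double.MAX_VALUE;
        max = Double.MAX_VALUE;
        break;
      default:
        throw new IllegalArgumentException("Unsupported type: " + type);
    }
    return new double[] {min, max};
  }

  public static void main(String[] args) {
    double[] range = limits(NumericType.SMALLINT);
    System.out.println(range[0] + " .. " + range[1]);
  }
}
```

Switching on an enum rather than on `type.toLowerCase()` lets the compiler catch misspelled type names.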

ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java

thomasrebele

Thank you for your review, @zabetak! Removing the cast from other expressions might be beneficial for the selectivity estimation. I would consider these improvements as out-of-scope for this PR, though.

About the first example, IS NOT NULL(CAST($1):BIGINT): CALCITE-5769 improved RexSimplify to remove the cast from the expression. I assume that the filters that arrive at FilterSelectivityEstimator should already have the cast removed if it is superfluous. Otherwise, the expression could be converted to a range predicate for the purpose of selectivity estimation. I would leave this idea for other tickets.


thomasrebele · 2026-02-17T11:23:24Z

ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java

+
+    double min;
+    double max;
+    switch (type.toLowerCase()) {


We can use SqlTypeName#getLimit for the integer types. The method throws an exception for FLOAT/DOUBLE, so we would still need the switch statement.

thomasrebele · 2026-02-20T09:50:18Z

The CI fails with "No thread with name metastore_task_thread_test_impl_3 found." in TestMetastoreLeaseLeader.testHouseKeepingThreads. I do not think that the failure is related to this PR.

removeCastIfPossible was doing three things:
1) Checking whether a cast can be removed, based on column stats
2) Removing the cast if possible
3) Adjusting the boundaries in case of DECIMAL casts
After the refactoring, the three actions are decoupled and each is performed individually. This leads to smaller, more self-contained methods that are easier to follow.

No need to invent new APIs when an equivalent exists and is used in other places in Hive/Calcite.

zabetak · 2026-02-23T08:06:44Z

Hey @thomasrebele, I was going over the PR and did some refactoring to help me better understand some parts of the code and hopefully improve readability a bit. My refactoring work can be found in the https://github.com/zabetak/hive/tree/HIVE-29424-r1 branch.

However, after replacing the FloatInterval with Guava's Range API in commit ef8dc6c, some tests in TestFilterSelectivityEstimator started failing because it appears that some ranges are invalid. Specifically, adjustTypeBoundaries creates a strange/invalid range (i.e., (100.049995..99.94999]) when rangeBoundaries is (100.0..Infinity] and the type is DECIMAL(3, 1). It is strange to have a range/interval with a lower bound (100.049995) greater than the upper bound (99.94999), so I wanted to check with you whether that behavior is expected/intentional.
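
One way to see how such endpoints can arise: DECIMAL(3, 1) holds at most 99.9, so under half-up rounding only raw values below 99.95 fit, while a predicate "greater than 100.0" needs raw values of at least 100.05. The sketch below reproduces that arithmetic with BigDecimal. It assumes half-up rounding and a half-step adjustment on both sides, which is a guess at the mechanism, not the code of adjustTypeBoundaries. All names are hypothetical.

```java
import java.math.BigDecimal;

/** Illustrative only: shows why a range predicate can become empty for a DECIMAL cast. */
final class DecimalRangeSketch {

  /** Largest value representable in DECIMAL(precision, scale), e.g. 99.9 for DECIMAL(3, 1). */
  static BigDecimal maxValue(int precision, int scale) {
    return BigDecimal.TEN.pow(precision).subtract(BigDecimal.ONE).movePointLeft(scale);
  }

  /** Half of the smallest step of the type, e.g. 0.05 for scale 1. */
  static BigDecimal halfStep(int scale) {
    return BigDecimal.valueOf(5).movePointLeft(scale + 1);
  }

  /** True if no raw value can satisfy CAST(x AS DECIMAL(precision, scale)) > literal. */
  static boolean isEmptyAfterAdjustment(BigDecimal literal, int precision, int scale) {
    // Raw values must reach this bound to round to something above the literal.
    BigDecimal lower = literal.add(halfStep(scale));
    // Raw values at or above this bound round past the largest representable value.
    BigDecimal upper = maxValue(precision, scale).add(halfStep(scale));
    return lower.compareTo(upper) >= 0;
  }

  public static void main(String[] args) {
    System.out.println(maxValue(3, 1));
    System.out.println(isEmptyAfterAdjustment(new BigDecimal("100.0"), 3, 1));
  }
}
```

Guava's Range factory methods reject a lower endpoint greater than the upper one with an IllegalArgumentException, so an empty result like this would need to be detected before a Range is built.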

...c/test/org/apache/hadoop/hive/ql/optimizer/calcite/stats/TestFilterSelectivityEstimator.java

zabetak · 2026-02-23T12:04:17Z

...c/test/org/apache/hadoop/hive/ql/optimizer/calcite/stats/TestFilterSelectivityEstimator.java

+    checkTimeFieldOnMidnightTimestamps(cast("f_timestamp", SqlTypeName.DATE));
+    checkTimeFieldOnMidnightTimestamps(cast("f_timestamp", SqlTypeName.TIMESTAMP));


Why are the checkTimeFieldOnIntraDayTimestamps checks not relevant here?

Indeed, they should be here, but not on testComputeRangePredicateSelectivityDateWithCast. Fixed.
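
What the two test helpers assert is not shown in this thread. The sketch below only illustrates why midnight and intra-day timestamps can behave differently under a cast to DATE: the cast drops the time of day, so a predicate on the cast value and the same predicate on the raw timestamp agree for midnight values but may disagree for intra-day ones. The class and method names are hypothetical, and java.time types stand in for Hive's TIMESTAMP and DATE.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;

/** Illustrative only: java.time types stand in for SQL TIMESTAMP and DATE values. */
final class TimestampCastSketch {

  /** Evaluates CAST(ts AS DATE) > bound by dropping the time of day first. */
  static boolean castToDateGreaterThan(LocalDateTime ts, LocalDate bound) {
    return ts.toLocalDate().isAfter(bound);
  }

  /** Evaluates ts > bound (taken at midnight) directly, i.e. with the cast naively dropped. */
  static boolean rawGreaterThan(LocalDateTime ts, LocalDate bound) {
    return ts.isAfter(bound.atStartOfDay());
  }

  public static void main(String[] args) {
    LocalDate bound = LocalDate.of(2020, 1, 1);
    LocalDateTime midnight = LocalDateTime.of(2020, 1, 2, 0, 0);
    LocalDateTime intraDay = LocalDateTime.of(2020, 1, 1, 12, 0);
    System.out.println(castToDateGreaterThan(midnight, bound) == rawGreaterThan(midnight, bound));
    System.out.println(castToDateGreaterThan(intraDay, bound) == rawGreaterThan(intraDay, bound));
  }
}
```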


sonarqubecloud · 2026-02-25T15:19:11Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

zabetak

Overall, I like the new changes. I just have a question about a potentially missing check for DECIMAL types and we are good to go.

zabetak · 2026-02-26T08:48:44Z

ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java

+        typeBoundaries =
+            getRangeOfDecimalType(expr.getType(), rangeBoundaries.lowerBoundType(), rangeBoundaries.upperBoundType());
+        rangeBoundaries = adjustRangeToDecimalType(rangeBoundaries, expr.getType(), typeBoundaries);


Aren't we missing a conditional here so that it runs only for DECIMAL type? Do we have adequate test coverage?

zabetak · 2026-02-26T09:11:45Z

ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java

+      final Object inverseBoolValueObject = ((RexLiteral) operands.getFirst()).getValue();
+      boolean inverseBool = Boolean.parseBoolean(inverseBoolValueObject.toString());


Suggested change

-      final Object inverseBoolValueObject = ((RexLiteral) operands.getFirst()).getValue();
-      boolean inverseBool = Boolean.parseBoolean(inverseBoolValueObject.toString());
+      boolean inverseBool = RexLiteral.booleanValue(operands.getFirst());

Minor suggestion. If we need to commit something more on the PR, we can fix this as well; otherwise it is not worth addressing on its own.

asf-ci-hive added tests pending tests passed and removed tests pending labels Feb 4, 2026

thomasrebele commented Feb 4, 2026

thomasrebele marked this pull request as ready for review February 4, 2026 08:53

zabetak reviewed Feb 12, 2026

thomasrebele commented Feb 17, 2026

HIVE-29424: CBO plans should use histogram statistics for range predi…

1e9fd2b

…cates with a CAST

thomasrebele force-pushed the tr/HIVE-29424 branch from f80c231 to 1e9fd2b Compare February 19, 2026 16:39

asf-ci-hive added tests pending tests unstable and removed tests passed tests pending labels Feb 19, 2026

zabetak added 3 commits February 20, 2026 15:08

Generalize StatsUtils#isWithin and use in FilterSelectivityEstimator

638376e

Replace FloatInterval with Guava's Range API

ef8dc6c

No need to invent new APIs when an equivalent exists and is used in other places in Hive/Calcite.

zabetak reviewed Feb 23, 2026

thomasrebele added 6 commits February 24, 2026 16:51

Avoid a MutableObject<float[]>

b93db3f

Fix tests

ccce3ba

Avoid mutating the arguments

c4b2e5a

Implement review comments

b311f22

Comments

1c6cccf

Compare boundaries directly

fbd7116

asf-ci-hive added tests pending tests unstable and removed tests unstable tests pending labels Feb 25, 2026

Fix test and Sonar Qube warnings

fc88104

asf-ci-hive added tests pending and removed tests unstable labels Feb 25, 2026

asf-ci-hive added tests passed and removed tests pending labels Feb 25, 2026

zabetak reviewed Feb 26, 2026

3 participants