Skip to content

HIVE-29424: CBO plans should use histogram statistics for range predicates with a CAST#6293

Open
thomasrebele wants to merge 11 commits intoapache:masterfrom
thomasrebele:tr/HIVE-29424
Open

HIVE-29424: CBO plans should use histogram statistics for range predicates with a CAST#6293
thomasrebele wants to merge 11 commits intoapache:masterfrom
thomasrebele:tr/HIVE-29424

Conversation

@thomasrebele
Copy link
Contributor

See HIVE-29424.

What changes were proposed in this pull request?

This PR adapts FilterSelectivityEstimator so that histogram statistics are used for range predicates with a cast.
I added many test cases to some cover corner cases. To get the ground truth, I executed queries with the predicates, see the resulting q.out file.

Why are the changes needed?

This PR allows the CBO planner to use histogram statistics for range predicates that contain a CAST around the input column.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests were added.

@thomasrebele thomasrebele marked this pull request as ready for review February 4, 2026 08:53
Copy link
Member

@zabetak zabetak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @thomasrebele , the proposal is very promising.

One general question that came to mind while I was reviewing the PR is if the CAST removal is relevant only for range predicates and histograms or if it can have a positive impact on other expressions. For example, is there any benefit in attempting to remove a CAST from the following expressions:

  • IS NOT NULL(CAST($1):BIGINT)
  • =(CAST($1):DOUBLE, 1)
  • IN(CAST($1):TINYINT, 10, 20, 30)


double min;
double max;
switch (type.toLowerCase()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is mostly using Calcite APIs so since we have the SqlTypeName readily available wouldn't be better to use that instead?

In addition there is org.apache.calcite.sql.type.SqlTypeName#getLimit which might be relevant and could potentially replace this switch statement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use SqlTypeName#getLimit for the integer types. The method throws an exception for FLOAT/DOUBLE, so we would still need the switch statement.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok to use the switch then but let's base it on SqlTypeName.

If it makes sense to handle FLOAT/DOUBLE in SqlTypeName#getLimit then it would be a good idea to log a CALCITE JIRA ticket.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've refactored the switch and verified that the result of the getLimit call results in the same min/max values.

I don't know whether there's a limit for FLOAT/DOUBLE, so I've created CALCITE-7419 for the discussion.

Copy link
Contributor Author

@thomasrebele thomasrebele left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your review, @zabetak! Removing the cast from other expressions might be beneficial for the selectivity estimation. I would consider these improvements as out-of-scope for this PR, though.

About the first example, IS NOT NULL(CAST($1):BIGINT), CALCITE-5769 improved RexSimplify to remove the cast from the expression. I assume that the filters that arrive at FilterSelectivityEstimator should remove the cast, if it is superfluous. Otherwise, it could converted to a range predicate for the purpose of selectivity estimation. I would leave this idea for other tickets.


double min;
double max;
switch (type.toLowerCase()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use SqlTypeName#getLimit for the integer types. The method throws an exception for FLOAT/DOUBLE, so we would still need the switch statement.

@thomasrebele
Copy link
Contributor Author

The CI fails because of No thread with name metastore_task_thread_test_impl_3 found. in TestMetastoreLeaseLeader.testHouseKeepingThreads. I do not think that the failure is related to this PR.

The removeCastIfPossible was doing three things:
1) Checking if a cast can be removed based using column stats
2) Removing the cast if possible
3) Adjusting the boundaries in case of DECIMAL casts

After the refactoring the three actions are decoupled and each is performed individually. This leads to smaller and more self-contained methods that are easier to follow.
No need to invent new APIs when equivalent exists and used in other places in Hive/Calcite.
@zabetak
Copy link
Member

zabetak commented Feb 23, 2026

Hey @thomasrebele , I was going over the PR and did some refactoring to help me understand better some parts of the code and hopefully and improve a bit readability. My refactoring work can be found in the https://github.com/zabetak/hive/tree/HIVE-29424-r1 branch.

However, after replacing the FloatInterval with Guava's Range API in commit ef8dc6c some tests in TestFilterSelectivityEstimator started failing cause it appears that some ranges are invalid. Specifically, the adjustTypeBoundaries creates a strange/invalid range (i.e., (100.049995..99.94999]) when rangeBoundaries is (100.0..Infinity] and type is DECIMAL(3, 1); it is strange to have a range/interval with a lower bound (100.049995) greater than the upper bound (99.94999) so wanted to check with you if that behavior is expected/intentional.

Comment on lines 730 to 731
checkTimeFieldOnMidnightTimestamps(cast("f_timestamp", SqlTypeName.DATE));
checkTimeFieldOnMidnightTimestamps(cast("f_timestamp", SqlTypeName.TIMESTAMP));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why checkTimeFieldOnIntraDayTimestamps are not relevant here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, they should be here, but not on testComputeRangePredicateSelectivityDateWithCast. Fixed.

@sonarqubecloud
Copy link

Copy link
Member

@zabetak zabetak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I like the new changes. I just have a question about a potentially missing check for DECIMAL types and we are good to go.

Comment on lines +453 to +455
typeBoundaries =
getRangeOfDecimalType(expr.getType(), rangeBoundaries.lowerBoundType(), rangeBoundaries.upperBoundType());
rangeBoundaries = adjustRangeToDecimalType(rangeBoundaries, expr.getType(), typeBoundaries);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't we missing a conditional here so that it runs only for DECIMAL type? Do we have adequate test coverage?

Comment on lines +441 to +442
final Object inverseBoolValueObject = ((RexLiteral) operands.getFirst()).getValue();
boolean inverseBool = Boolean.parseBoolean(inverseBoolValueObject.toString());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
final Object inverseBoolValueObject = ((RexLiteral) operands.getFirst()).getValue();
boolean inverseBool = Boolean.parseBoolean(inverseBoolValueObject.toString());
boolean inverseBool = RexLiteral.booleanValue(operands.getFirst());

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor suggestion. If we need to commit something more on the PR we can fix this as well otherwise not worth addressing on its own.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants