Skip to content

[CALCITE-7551] Project/Filter/Join transpose and merge rules should not duplicate non-deterministic expressions (e.g. RAND())#4970

Open
darpan-e6 wants to merge 3 commits into
apache:mainfrom
darpan-e6:CALCITE-7551
Open

[CALCITE-7551] Project/Filter/Join transpose and merge rules should not duplicate non-deterministic expressions (e.g. RAND())#4970
darpan-e6 wants to merge 3 commits into
apache:mainfrom
darpan-e6:CALCITE-7551

Conversation

@darpan-e6
Copy link
Copy Markdown
Contributor

@darpan-e6 darpan-e6 commented May 27, 2026

Jira Link

CALCITE-7551

Changes Proposed

Several optimizer and RelBuilder-level transformations inline a lower-projection expression into multiple positions of an upper expression without checking whether the inlined expression is non-deterministic. When the lower expression is non-deterministic (e.g. RAND()), this silently turns one evaluation into many, changing query semantics.

Reproduction

SELECT a, a + 1 AS b FROM (SELECT rand() AS a)

Before the fix, SqlToRelConverter produces:

LogicalProject(A=[RAND()], B=[+(RAND(), 1)])
  LogicalValues(tuples=[[{ 0 }]])

The two RAND() calls are evaluated independently, so B - A is no longer always 1. After the fix the inner projection is preserved and A, B share a single evaluation:

LogicalProject(A=[$0], B=[+($0, 1)])
  LogicalProject(A=[RAND()])
    LogicalValues(tuples=[[{ 0 }]])

Affected code paths and fixes

The guard only kicks in when the transform would actually duplicate a non-deterministic evaluation — not whenever a non-deterministic column merely exists. The precise condition depends on whether the lower project is consumed by the transform or re-emitted above the new node:

Component Mechanism Guard
RelOptUtil.pushPastProjectUnlessBloat pushShuttle(Project) substitutes every RexInputRef with the underlying projection. Only RexOver was guarded; determinism was not. The bottom project is consumed (inlined away). Count RexInputRef uses per bottom slot; return null only if a non-deterministic bottom expression is referenced more than once (a single reference yields a single inlined copy, which is safe). Covers ProjectMergeRule and FilterProjectTransposeRule's helper call, plus the RelBuilder bloat merge that SqlToRelConverter relies on.
FilterProjectTransposeRule Pushes the filter below the project, but re-emits the original project above the new filter. So even one reference splits a single evaluation in two (one inlined into the pushed-down filter, one in the re-emitted project). Skip only when the filter condition references a non-deterministic projected column (RelOptUtil.InputFinder.bits). Filters on deterministic columns are pushed even if the project computes an unrelated RAND().
JoinProjectTransposeRule Uses RexProgramBuilder.mergePrograms + mergedProgram.expandLocalRef, inlining projected expressions into the new join condition while re-emitting the project above. Skip a side only when the join condition references one of that side's non-deterministic projected columns (left at offset 0, right at nLeftFields).
SemiJoinProjectTransposeRule Same mergePrograms + expandLocalRef pattern as JoinProjectTransposeRule. Skip only when the semi-join condition references a non-deterministic projected column on the LHS.

The "consumed vs re-emitted" distinction is why the threshold differs: ProjectMergeRule blocks at > 1 references (the bottom project disappears, so one reference = one surviving copy), while the transpose rules block at >= 1 (the project survives on top, so a single referenced occurrence already yields two evaluations).

The fix uses RexUtil.isDeterministic(...), which correctly distinguishes operators that may return different values across call sites within a single statement (RAND overrides to false) from operators that are non-deterministic only across executions (CURRENT_TIMESTAMP, marked isDynamicFunction = true but still isDeterministic = true). The latter is safe to duplicate and is unaffected by this change.

Tests

Blocked cases (transform must not fire / not duplicate):

  • RelOptRulesTest.testProjectMergeShouldIgnoreNonDeterministic
  • RelOptRulesTest.testFilterProjectTransposeShouldIgnoreNonDeterministic
  • RelOptRulesTest.testJoinProjectTransposeShouldIgnoreNonDeterministic
  • RelOptRulesTest.testSemiJoinProjectTransposeShouldIgnoreNonDeterministic
  • SqlToRelConverterTest.testRandNotDuplicatedInProjectionMerge

Allowed cases (transform still fires when the non-deterministic column is unrelated to the condition):

  • RelOptRulesTest.testFilterProjectTransposeWithUnrelatedNonDeterministic
  • RelOptRulesTest.testJoinProjectTransposeWithUnrelatedNonDeterministic
  • RelOptRulesTest.testSemiJoinProjectTransposeWithUnrelatedNonDeterministic

Confirmed unaffected (verified by additional probing during investigation)

CalcMergeRule (RexProgram preserves CSE via local refs), ProjectReduceExpressionsRule, ProjectValuesMergeRule, ProjectFilterTransposeRule, ProjectJoinTransposeRule, ProjectCorrelateTransposeRule, AggregateProjectMergeRule, FilterSetOpTransposeRule (textual duplication only — each row flows through one branch of a UNION ALL, so per-row evaluation count is unchanged), ProjectToWindowRule.

…ot duplicate non-deterministic expressions (e.g. RAND())
return null;
}
}.visitEach(nodes);
for (int i = 0; i < refs.length; i++) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in Jira, this is a bit too conservative, since it will not distinguish CURRENT_TIMESTAMP from RAND. But fixing that may be in the scope of a separate PR - we need really two separate notions of nondeterminism.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. We should complete the fine-grained judgment of the deterministic function first before completing this PR.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand, CURRENT_TIMESTAMP is actually different than non-deterministic functions like RAND().

  1. Non-deterministic function: it may return different values for every evaluations.
    1.1 Returns false to isDeterministic().
    1.2 Returns false to isDynamicFunction().

  2. Dynamic Function: It will return same value at every call site within one statement; can differ across executions
    2.1 Returns true to isDeterministic().
    2.2 Returns true to isDynamicFunction().

And this path only blocks when we get isDeterministic() method response as false. Dynamic functions like CURRENT_TIMESTAMP can be duplicated and actually it is safe to duplicate them.

sql(sql).withConformance(SqlConformanceEnum.PRESTO).ok();
}

/** Test case of
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is SqlToRel connected to these rules?

Copy link
Copy Markdown
Contributor Author

@darpan-e6 darpan-e6 May 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because SqlToRelConverter builds its RelNode tree through RelBuilder, and RelBuilder eagerly merges adjacent projects at construction time using the same helper that the planner rules use.

Concretely, RelBuilder.project_ method uses the method RelOptUtil.pushPastProjectUnlessBloat(nodeList, project, config.bloat()) which we are fixing as part of this PR, so I thought of putting a test here as well.

// non-deterministic evaluation (e.g. RAND()) into two: one consumed by
// the new filter condition, and the original still produced by the
// project above. Refuse to transpose in that case.
if (!project.getProjects().stream().allMatch(RexUtil::isDeterministic)) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocks transpose (because RAND() exists in project)? eg:

SELECT * FROM (
    SELECT RAND() as r, col1 as b FROM emp
  ) 
  WHERE col1 > 0

darpan-e6 added 2 commits May 28, 2026 10:13
…ferences a non-deterministic projected column
…ondition references a non-deterministic projected column
@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants