[SPARK-56869][SQL] Speed up TreeNode transforms when rule doesn't match #55889

Closed

gengliangwang wants to merge 1 commit into apache:master from gengliangwang:SPARK-56869

Conversation

@gengliangwang (Member)

### What changes were proposed in this pull request?

Gate `CurrentOrigin.withOrigin(origin) { rule.applyOrElse(...) }` in four
`TreeNode` rule-driven transform methods behind `rule.isDefinedAt(...)`, so
the ThreadLocal wrap only runs when the rule actually fires:

- `TreeNode.transformDownWithPruning`
- `TreeNode.transformUpWithPruning`
- `TreeNode.transformUpWithBeforeAndAfterRuleOnChildren`
- `TreeNode.multiTransformDownWithPruning` (also drops a side-effecting
  default closure that became unnecessary)

### Why are the changes needed?

`CurrentOrigin.withOrigin` is only observable when the rule constructs new
nodes — they pick up `CurrentOrigin.get` in their `override val origin`
field. On nodes the rule doesn't match, the wrap is pure overhead: two
`ThreadLocal` writes plus a `try`/`finally` per node visit.
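
Each `withOrigin` call pays two ThreadLocal writes plus a `try`/`finally` even when the body changes nothing. A toy sketch (not Spark's actual `CurrentOrigin`; the `writes` counter is added purely for illustration) makes the per-call cost countable:

```scala
// Toy model of a CurrentOrigin-style thread-local holder.
// `writes` counts ThreadLocal sets so the per-call overhead is visible.
object OriginSketch {
  final case class Origin(startIndex: Int, stopIndex: Int)

  private val holder = new ThreadLocal[Origin] {
    override def initialValue(): Origin = Origin(0, 0)
  }
  var writes = 0

  def get: Origin = holder.get()

  def withOrigin[A](o: Origin)(body: => A): A = {
    val previous = get
    holder.set(o); writes += 1                      // write 1: install the origin
    try body
    finally { holder.set(previous); writes += 1 }   // write 2: restore the old one
  }
}
```

Even a body that constructs nothing pays both writes; gating on `isDefinedAt` was meant to skip exactly this cost on non-matching nodes.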

JFR profiling (60s sample, 1.1M iterations of `transformDown` over a
1024-leaf balanced `Add` tree with a non-matching rule) shows:

- 66% of CPU samples in `ThreadLocalMap.set` (line 486)
- 13% in `ThreadLocalMap.getEntryAfterMiss`
- 9% more in `ThreadLocalMap.set` (line 493)

Total: ~88% of transform CPU spent inside `CurrentOrigin.withOrigin` for
nodes the rule never matched.

Microbenchmark (best time per N iterations, JDK 17, Xeon 8175M @ 2.50GHz,
baseline = `upstream/master`):

| case                                       | baseline | optimized | speedup |
|--------------------------------------------|---------:|----------:|--------:|
| transformDown deep chain(5000) no-op       |    12 ms |      6 ms |    2.0x |
| transformDown deep chain(5000) rewrite leaf | 20109 ms |  15850 ms |   1.27x |
| transformDown wide(100) no-op              |     5 ms |      3 ms |    1.7x |
| transformDown balanced(1024) no-op         |     7 ms |      2 ms |    3.5x |
| transformDown balanced(4096) no-op         |    34 ms |     15 ms |    2.3x |
| transformUp deep chain(1000) no-op         |     3 ms |      1 ms |    3.0x |
| transformUp deep chain(5000) no-op         |    15 ms |      6 ms |    2.5x |
| transformUp balanced(1024) no-op           |     8 ms |      3 ms |    2.7x |
| transformUp balanced(4096) no-op           |    39 ms |     25 ms |    1.6x |

Rewrite-heavy cases are unchanged because `withOrigin` still runs when the
rule fires. Real Spark workloads (analyzer/optimizer batches running many
rules across many nodes, each rule matching a small subset) are dominated
by the no-match case, so the savings compound.

### Does this PR introduce _any_ user-facing change?

No. The change is internal to `TreeNode` rule machinery and preserves all
observable semantics:

- `CurrentOrigin` is still set before any rule body that constructs new
  nodes runs.
- `markRuleAsIneffective` / `isRuleIneffective` bookkeeping unchanged.
- `copyTagsFrom` ordering on rule-replacement unchanged.
- `fastEquals` short-circuit unchanged.
- Result identity (`this eq result` on a no-op transform) preserved.
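
The identity guarantee can be checked on a toy tree (a sketch, not Spark's `TreeNode`; `transformDown` here uses the same gated shape as this PR):

```scala
sealed trait Tree {
  def children: List[Tree]
  def withChildren(cs: List[Tree]): Tree

  // Gated form: apply the rule only where it is defined, then recurse;
  // reuse `this` (eq-identity) when nothing underneath changed.
  def transformDown(rule: PartialFunction[Tree, Tree]): Tree = {
    val afterRule = if (rule.isDefinedAt(this)) rule(this) else this
    val newChildren = afterRule.children.map(_.transformDown(rule))
    val unchanged = newChildren.lazyZip(afterRule.children).forall(_ eq _)
    if (unchanged) afterRule else afterRule.withChildren(newChildren)
  }
}
final case class Leaf(v: Int) extends Tree {
  def children: List[Tree] = Nil
  def withChildren(cs: List[Tree]): Tree = this
}
final case class Branch(l: Tree, r: Tree) extends Tree {
  def children: List[Tree] = List(l, r)
  def withChildren(cs: List[Tree]): Tree = Branch(cs(0), cs(1))
}
```

A no-op transform returns the original instance, while a matching rule rebuilds only the spine above the rewritten node.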

### How was this patch tested?

- `build/sbt 'catalyst/testOnly *TreeNodeSuite'` — 36/36 pass.
- `build/sbt 'catalyst/test'` — 9272/9272 pass across 352 suites (~7 min).

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Anthropic Claude Opus 4.7)

Co-authored-by: Isaac
@gengliangwang requested a review from cloud-fan on May 14, 2026 21:53

@peter-toth (Contributor) left a comment
Makes sense and looks like a nice improvement, but @gengliangwang can you please check the two failures and fix the lint errors?

@gengliangwang (Member, Author)

Closing this PR — found a regression on `postgreSQL/window_part2.sql` that revealed the optimization is fundamentally unsafe. Posting the analysis here for the next attempt.

### The regression

Query #28 expects `CAST_INVALID_INPUT` with a `queryContext` at the window spec (startIndex: 83, stopIndex: 163). With this PR's `transformUpWithPruning` change, the `queryContext` widens to the whole query (startIndex: 1, stopIndex: 163). Same exception, wrong source span.

### Root cause

The optimization replaces:

```scala
val afterRule = CurrentOrigin.withOrigin(origin) {
  rule.applyOrElse(this, identity[BaseType])
}
```

with:

```scala
val afterRule = if (rule.isDefinedAt(this)) {
  CurrentOrigin.withOrigin(origin) { rule.apply(this) }
} else {
  this
}
```

The intent: skip `withOrigin`'s ThreadLocal set/restore when the rule doesn't match. The flaw: `isDefinedAt` is called outside `withOrigin`.

For `AnsiCombinedTypeCoercionRule`, the rule is `Function.unlift { e => folds-through-sub-rules }`. Scala's `Unlifted.isDefinedAt(a)` is `f(a).isDefined` — it runs the underlying function, which constructs new nodes (`Cast(...)`, `s.copy(...)`). Each new `TreeNode` captures `CurrentOrigin.get` at construction.
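
This probe-runs-the-body behavior is easy to demonstrate in isolation (a toy sketch; `f` and the counter stand in for the coercion fold and its node construction):

```scala
// Function.unlift's isDefinedAt evaluates the underlying function, so
// any side effect in the body (here a counter; in Spark, node
// construction) fires during the probe itself.
var probes = 0
val f: Int => Option[Int] = { x =>
  probes += 1
  if (x > 0) Some(x * 2) else None
}
val pf: PartialFunction[Int, Int] = Function.unlift(f)
```

Every `pf.isDefinedAt(x)` call runs `f(x)` in full, so the side effects are not deferred until `apply`.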

`WindowFrameTypeCoercion` constructs `Cast(Literal('NaN'), IntegerType)` during this probe. With my change the `Cast` is constructed outside `withOrigin`, so its `origin` field captures the outer `Project`'s origin (the whole query) instead of the `WindowSpec`'s origin.

Spark eagerly evaluates window frame bound expressions during analysis (they must be foldable). `Cast.eval` on `'NaN'` → INT throws `SparkNumberFormatException` with context = `cast.origin.context` = (1, 163) — the wrong span. The exception escapes `isDefinedAt` carrying the wrong context. The user sees an error pointing at the whole query.
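
The origin-capture bug can be reproduced with the same toy shapes (a sketch under simplified assumptions; `Node`, `withOrigin`, and the string origins are all stand-ins, not Spark types):

```scala
object WrongOriginDemo {
  // Thread-local "current origin"; defaults to the outer query's span.
  private val current = new ThreadLocal[String] {
    override def initialValue(): String = "whole-query"
  }

  def withOrigin[A](o: String)(body: => A): A = {
    val previous = current.get()
    current.set(o)
    try body finally current.set(previous)
  }

  // A node captures the thread-local origin at construction time,
  // modeling TreeNode's `override val origin`.
  final case class Node(origin: String = current.get())

  var lastBuilt: Node = null
  // An unlift'd rule whose probe constructs a node as a side effect.
  val rule: PartialFunction[Int, Node] =
    Function.unlift { _ => lastBuilt = Node(); Some(lastBuilt) }
}
```

Probing inside `withOrigin` stamps the node with the intended span; probing outside (this PR's gated pattern) stamps it with the outer one, which is exactly the widened `queryContext` seen in the regression.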

### Trace evidence

```
[TRACE]   this.origin=(82,162)  CurrentOrigin=(0,162)
[THREW from isDefinedAt] SparkNumberFormatException: CAST_INVALID_INPUT
                         The value 'NaN' of the type "STRING" cannot be cast to "INT"
```

### Why Catalyst tests didn't catch this

The exception itself is thrown either way — only the `queryContext` attached to it differs. Catalyst's 9272 tests check behavior; `SQLQueryTestSuite`'s golden files check exact error-message positions.

### Why this optimization is fundamentally unsafe

Wherever a `PartialFunction.isDefinedAt` has side effects that construct TreeNodes — and any `unlift`'d PF runs its full body on every probe — the optimization captures the wrong `CurrentOrigin` for nodes constructed during the probe. The same theoretical risk applies to `transformDownWithPruning` (no failing test surfaced, but the structure is identical).

The fix patterns I considered:

- Wrap `isDefinedAt` in `withOrigin` too — defeats the optimization (equivalent to the original).
- Use a sentinel default with `applyOrElse` inside `withOrigin` — likewise equivalent to the original.
- Require PFs to opt in via a marker trait promising a pure `isDefinedAt` — invasive, and doesn't fix `CombinedTypeCoercionRule`.

None keeps the performance win while staying correct.

### What's worth keeping from the investigation

JFR profile on a representative `transformDown` (1024-leaf balanced `Add` tree, non-matching rule, 60s sample): ~88% of CPU in `ThreadLocal$ThreadLocalMap.set` / `getEntryAfterMiss` inside `CurrentOrigin.withOrigin`. There's real money on the floor. Capturing it likely requires one of:

1. A different `CurrentOrigin` representation that avoids the per-node ThreadLocal cost.
2. Moving `CurrentOrigin` propagation off the per-node hot path (set once per rule batch).
3. Refactoring `CombinedTypeCoercionRule` so its `isDefinedAt` is pure — making the isDefinedAt-then-apply pattern safe.

JIRA SPARK-56869 will stay open for the next attempt.
