[SPARK-56869][SQL] Speed up TreeNode transforms when rule doesn't match #55889

Closed

gengliangwang wants to merge 1 commit into apache:master from gengliangwang:SPARK-56869

Conversation

@gengliangwang (Member)

### What changes were proposed in this pull request?

Gate `CurrentOrigin.withOrigin(origin) { rule.applyOrElse(...) }` in four
`TreeNode` rule-driven transform methods behind `rule.isDefinedAt(...)`, so
the ThreadLocal wrap only runs when the rule actually fires:

- `TreeNode.transformDownWithPruning`
- `TreeNode.transformUpWithPruning`
- `TreeNode.transformUpWithBeforeAndAfterRuleOnChildren`
- `TreeNode.multiTransformDownWithPruning` (also drops a side-effecting
  default closure that became unnecessary)

### Why are the changes needed?

`CurrentOrigin.withOrigin` is only observable when the rule constructs new
nodes — they pick up `CurrentOrigin.get` in their `override val origin`
field. On nodes the rule doesn't match, the wrap is pure overhead: two
`ThreadLocal` writes plus a `try`/`finally` per node visit.
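
Each `withOrigin` call pays two ThreadLocal writes plus a `try`/`finally` even when the body changes nothing. A toy sketch (not Spark's actual `CurrentOrigin`; the `writes` counter is added purely for illustration) makes the per-call cost countable:

```scala
// Toy model of a CurrentOrigin-style thread-local holder.
// `writes` counts ThreadLocal sets so the per-call overhead is visible.
object OriginSketch {
  final case class Origin(startIndex: Int, stopIndex: Int)

  private val holder = new ThreadLocal[Origin] {
    override def initialValue(): Origin = Origin(0, 0)
  }
  var writes = 0

  def get: Origin = holder.get()

  def withOrigin[A](o: Origin)(body: => A): A = {
    val previous = get
    holder.set(o); writes += 1                      // write 1: install the origin
    try body
    finally { holder.set(previous); writes += 1 }   // write 2: restore the old one
  }
}
```

Even a body that constructs nothing pays both writes; gating on `isDefinedAt` was meant to skip exactly this cost on non-matching nodes.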

JFR profiling (60s sample, 1.1M iterations of `transformDown` over a
1024-leaf balanced `Add` tree with a non-matching rule) shows:

- 66% of CPU samples in `ThreadLocalMap.set` (line 486)
- 13% in `ThreadLocalMap.getEntryAfterMiss`
- 9% more in `ThreadLocalMap.set` (line 493)

Total: ~88% of transform CPU spent inside `CurrentOrigin.withOrigin` for
nodes the rule never matched.

Microbenchmark (best time per N iterations, JDK 17, Xeon 8175M @ 2.50GHz,
baseline = `upstream/master`):

| case                                       | baseline | optimized | speedup |
|--------------------------------------------|---------:|----------:|--------:|
| transformDown deep chain(5000) no-op       |    12 ms |      6 ms |    2.0x |
| transformDown deep chain(5000) rewrite leaf | 20109 ms |  15850 ms |   1.27x |
| transformDown wide(100) no-op              |     5 ms |      3 ms |    1.7x |
| transformDown balanced(1024) no-op         |     7 ms |      2 ms |    3.5x |
| transformDown balanced(4096) no-op         |    34 ms |     15 ms |    2.3x |
| transformUp deep chain(1000) no-op         |     3 ms |      1 ms |    3.0x |
| transformUp deep chain(5000) no-op         |    15 ms |      6 ms |    2.5x |
| transformUp balanced(1024) no-op           |     8 ms |      3 ms |    2.7x |
| transformUp balanced(4096) no-op           |    39 ms |     25 ms |    1.6x |

Rewrite-heavy cases are unchanged because `withOrigin` still runs when the
rule fires. Real Spark workloads (analyzer/optimizer batches running many
rules across many nodes, each rule matching a small subset) are dominated
by the no-match case, so the savings compound.

### Does this PR introduce _any_ user-facing change?

No. The change is internal to `TreeNode` rule machinery and preserves all
observable semantics:

- `CurrentOrigin` is still set before any rule body that constructs new
  nodes runs.
- `markRuleAsIneffective` / `isRuleIneffective` bookkeeping unchanged.
- `copyTagsFrom` ordering on rule-replacement unchanged.
- `fastEquals` short-circuit unchanged.
- Result identity (`this eq result` on a no-op transform) preserved.
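
The identity guarantee can be checked on a toy tree (a sketch, not Spark's `TreeNode`; `transformDown` here uses the same gated shape as this PR):

```scala
sealed trait Tree {
  def children: List[Tree]
  def withChildren(cs: List[Tree]): Tree

  // Gated form: apply the rule only where it is defined, then recurse;
  // reuse `this` (eq-identity) when nothing underneath changed.
  def transformDown(rule: PartialFunction[Tree, Tree]): Tree = {
    val afterRule = if (rule.isDefinedAt(this)) rule(this) else this
    val newChildren = afterRule.children.map(_.transformDown(rule))
    val unchanged = newChildren.lazyZip(afterRule.children).forall(_ eq _)
    if (unchanged) afterRule else afterRule.withChildren(newChildren)
  }
}
final case class Leaf(v: Int) extends Tree {
  def children: List[Tree] = Nil
  def withChildren(cs: List[Tree]): Tree = this
}
final case class Branch(l: Tree, r: Tree) extends Tree {
  def children: List[Tree] = List(l, r)
  def withChildren(cs: List[Tree]): Tree = Branch(cs(0), cs(1))
}
```

A no-op transform returns the original instance, while a matching rule rebuilds only the spine above the rewritten node.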

### How was this patch tested?

- `build/sbt 'catalyst/testOnly *TreeNodeSuite'` — 36/36 pass.
- `build/sbt 'catalyst/test'` — 9272/9272 pass across 352 suites (~7 min).

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Anthropic Claude Opus 4.7)

Co-authored-by: Isaac
@gengliangwang requested a review from cloud-fan on May 14, 2026 21:53

@peter-toth (Contributor) left a comment
Makes sense and looks like a nice improvement, but @gengliangwang can you please check the two failures and fix the lint errors?

@gengliangwang (Member, Author)

Closing this PR — found a regression on `postgreSQL/window_part2.sql` that revealed the optimization is fundamentally unsafe. Posting the analysis here for the next attempt.

### The regression

Query #28 expects `CAST_INVALID_INPUT` with a `queryContext` at the window spec (startIndex: 83, stopIndex: 163). With this PR's `transformUpWithPruning` change, the `queryContext` widens to the whole query (startIndex: 1, stopIndex: 163). Same exception, wrong source span.

### Root cause

The optimization replaces:

```scala
val afterRule = CurrentOrigin.withOrigin(origin) {
  rule.applyOrElse(this, identity[BaseType])
}
```

with:

```scala
val afterRule = if (rule.isDefinedAt(this)) {
  CurrentOrigin.withOrigin(origin) { rule.apply(this) }
} else {
  this
}
```

The intent: skip `withOrigin`'s ThreadLocal set/restore when the rule doesn't match. The flaw: `isDefinedAt` is called outside `withOrigin`.

For `AnsiCombinedTypeCoercionRule`, the rule is `Function.unlift { e => folds-through-sub-rules }`. Scala's `Unlifted.isDefinedAt(a)` is `f(a).isDefined` — it runs the underlying function, which constructs new nodes (`Cast(...)`, `s.copy(...)`). Each new `TreeNode` captures `CurrentOrigin.get` at construction.
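
This probe-runs-the-body behavior is easy to demonstrate in isolation (a toy sketch; `f` and the counter stand in for the coercion fold and its node construction):

```scala
// Function.unlift's isDefinedAt evaluates the underlying function, so
// any side effect in the body (here a counter; in Spark, node
// construction) fires during the probe itself.
var probes = 0
val f: Int => Option[Int] = { x =>
  probes += 1
  if (x > 0) Some(x * 2) else None
}
val pf: PartialFunction[Int, Int] = Function.unlift(f)
```

Every `pf.isDefinedAt(x)` call runs `f(x)` in full, so the side effects are not deferred until `apply`.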

`WindowFrameTypeCoercion` constructs `Cast(Literal('NaN'), IntegerType)` during this probe. With my change the `Cast` is constructed outside `withOrigin`, so its `origin` field captures the outer `Project`'s origin (the whole query) instead of the `WindowSpec`'s origin.

Spark eagerly evaluates window frame bound expressions during analysis (they must be foldable). `Cast.eval` on `'NaN'` → INT throws `SparkNumberFormatException` with context = `cast.origin.context` = (1, 163) — the wrong span. The exception escapes `isDefinedAt` carrying the wrong context. The user sees an error pointing at the whole query.
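
The origin-capture bug can be reproduced with the same toy shapes (a sketch under simplified assumptions; `Node`, `withOrigin`, and the string origins are all stand-ins, not Spark types):

```scala
object WrongOriginDemo {
  // Thread-local "current origin"; defaults to the outer query's span.
  private val current = new ThreadLocal[String] {
    override def initialValue(): String = "whole-query"
  }

  def withOrigin[A](o: String)(body: => A): A = {
    val previous = current.get()
    current.set(o)
    try body finally current.set(previous)
  }

  // A node captures the thread-local origin at construction time,
  // modeling TreeNode's `override val origin`.
  final case class Node(origin: String = current.get())

  var lastBuilt: Node = null
  // An unlift'd rule whose probe constructs a node as a side effect.
  val rule: PartialFunction[Int, Node] =
    Function.unlift { _ => lastBuilt = Node(); Some(lastBuilt) }
}
```

Probing inside `withOrigin` stamps the node with the intended span; probing outside (this PR's gated pattern) stamps it with the outer one, which is exactly the widened `queryContext` seen in the regression.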

### Trace evidence

```
[TRACE]   this.origin=(82,162)  CurrentOrigin=(0,162)
[THREW from isDefinedAt] SparkNumberFormatException: CAST_INVALID_INPUT
                         The value 'NaN' of the type "STRING" cannot be cast to "INT"
```

### Why Catalyst tests didn't catch this

The exception itself is thrown either way — only the `queryContext` attached to it differs. Catalyst's 9272 tests check behavior; `SQLQueryTestSuite`'s golden files check exact error-message positions.

### Why this optimization is fundamentally unsafe

Wherever a `PartialFunction.isDefinedAt` has side effects that construct TreeNodes — and any `unlift`'d PF runs its full body on every probe — the optimization captures the wrong `CurrentOrigin` for nodes constructed during the probe. The same theoretical risk applies to `transformDownWithPruning` (no failing test surfaced, but the structure is identical).

The fix patterns I considered:

- Wrap `isDefinedAt` in `withOrigin` too — defeats the optimization (equivalent to the original).
- Use a sentinel default with `applyOrElse` inside `withOrigin` — likewise equivalent to the original.
- Require PFs to opt in via a marker trait promising a pure `isDefinedAt` — invasive, and doesn't fix `CombinedTypeCoercionRule`.

None keeps the performance win while staying correct.

### What's worth keeping from the investigation

JFR profile on a representative `transformDown` (1024-leaf balanced `Add` tree, non-matching rule, 60s sample): ~88% of CPU in `ThreadLocal$ThreadLocalMap.set` / `getEntryAfterMiss` inside `CurrentOrigin.withOrigin`. There's real money on the floor. Capturing it likely requires one of:

1. A different `CurrentOrigin` representation that avoids the per-node ThreadLocal cost.
2. Moving `CurrentOrigin` propagation off the per-node hot path (set once per rule batch).
3. Refactoring `CombinedTypeCoercionRule` so its `isDefinedAt` is pure — making the isDefinedAt-then-apply pattern safe.

JIRA SPARK-56869 will stay open for the next attempt.
