[SPARK-56869][SQL] Speed up TreeNode transforms when rule doesn't match#55889
[SPARK-56869][SQL] Speed up TreeNode transforms when rule doesn't match#55889gengliangwang wants to merge 1 commit into
Conversation
### What changes were proposed in this pull request?
Gate `CurrentOrigin.withOrigin(origin) { rule.applyOrElse(...) }` in four
`TreeNode` rule-driven transform methods behind `rule.isDefinedAt(...)`, so
the ThreadLocal wrap only runs when the rule actually fires:
- `TreeNode.transformDownWithPruning`
- `TreeNode.transformUpWithPruning`
- `TreeNode.transformUpWithBeforeAndAfterRuleOnChildren`
- `TreeNode.multiTransformDownWithPruning` (also drops a side-effecting
default closure that became unnecessary)
### Why are the changes needed?
`CurrentOrigin.withOrigin` is only observable when the rule constructs new
nodes — they pick up `CurrentOrigin.get` in their `override val origin`
field. On nodes the rule doesn't match, the wrap is pure overhead: two
`ThreadLocal` writes plus a `try`/`finally` per node visit.
JFR profiling (60s sample, 1.1M iterations of `transformDown` over a
1024-leaf balanced `Add` tree with a non-matching rule) shows:
- 66% of CPU samples in `ThreadLocalMap.set` (line 486)
- 13% in `ThreadLocalMap.getEntryAfterMiss`
- 9% more in `ThreadLocalMap.set` (line 493)
Total: ~88% of transform CPU spent inside `CurrentOrigin.withOrigin` for
nodes the rule never matched.
Microbenchmark (best time per N iterations, JDK 17, Xeon 8175M @ 2.50GHz,
baseline = `upstream/master`):
| case | baseline | optimized | speedup |
|--------------------------------------------|---------:|----------:|--------:|
| transformDown deep chain(5000) no-op | 12 ms | 6 ms | 2.0x |
| transformDown deep chain(5000) rewrite leaf| 20109 ms| 15850 ms | 1.27x |
| transformDown wide(100) no-op | 5 ms | 3 ms | 1.7x |
| transformDown balanced(1024) no-op | 7 ms | 2 ms | 3.5x |
| transformDown balanced(4096) no-op | 34 ms | 15 ms | 2.3x |
| transformUp deep chain(1000) no-op | 3 ms | 1 ms | 3.0x |
| transformUp deep chain(5000) no-op | 15 ms | 6 ms | 2.5x |
| transformUp balanced(1024) no-op | 8 ms | 3 ms | 2.7x |
| transformUp balanced(4096) no-op | 39 ms | 25 ms | 1.6x |
Rewrite-heavy cases are unchanged because `withOrigin` still runs when the
rule fires. Real Spark workloads (analyzer/optimizer batches running many
rules across many nodes, each rule matching a small subset) are dominated
by the no-match case, so the savings compound.
### Does this PR introduce _any_ user-facing change?
No. The change is internal to `TreeNode` rule machinery and preserves all
observable semantics:
- `CurrentOrigin` is still set before any rule body that constructs new
nodes runs.
- `markRuleAsIneffective` / `isRuleIneffective` bookkeeping unchanged.
- `copyTagsFrom` ordering on rule-replacement unchanged.
- `fastEquals` short-circuit unchanged.
- Result identity (`this eq result` on a no-op transform) preserved.
### How was this patch tested?
- `build/sbt 'catalyst/testOnly *TreeNodeSuite'` — 36/36 pass.
- `build/sbt 'catalyst/test'` — 9272/9272 pass across 352 suites (~7 min).
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Anthropic Claude Opus 4.7)
Co-authored-by: Isaac
peter-toth
left a comment
There was a problem hiding this comment.
Makes sense and looks like a nice improvement, but @gengliangwang can you please check the 2 failures and fix lint errors.
|
Closing this PR — found a regression on The regressionQuery #28 expects Root causeThe optimization replaces: val afterRule = CurrentOrigin.withOrigin(origin) {
rule.applyOrElse(this, identity[BaseType])
}with: val afterRule = if (rule.isDefinedAt(this)) {
CurrentOrigin.withOrigin(origin) { rule.apply(this) }
} else { this }The intent: skip For
Spark eagerly evaluates window frame bound expressions during analysis (they must be foldable). Trace evidenceWhy catalyst tests didn't catch thisThe exception itself is thrown either way — only the Why this optimization is fundamentally unsafeWherever a The fix patterns I considered all either:
None gains the perf and stays correct. What's worth keeping from the investigationJFR profile on a representative
JIRA SPARK-56869 will stay open for the next attempt. |
What changes were proposed in this pull request?
Gate
CurrentOrigin.withOrigin(origin) { rule.applyOrElse(...) }in fourTreeNoderule-driven transform methods behindrule.isDefinedAt(...), so the ThreadLocal wrap only runs when the rule actually fires:TreeNode.transformDownWithPruningTreeNode.transformUpWithPruningTreeNode.transformUpWithBeforeAndAfterRuleOnChildrenTreeNode.multiTransformDownWithPruning(also drops a side-effecting default closure that became unnecessary)Why are the changes needed?
CurrentOrigin.withOriginis only observable when the rule constructs new nodes — they pick upCurrentOrigin.getin theiroverride val originfield. On nodes the rule doesn't match, the wrap is pure overhead: twoThreadLocalwrites plus atry/finallyper node visit.JFR profiling (60s sample, 1.1M iterations of
transformDownover a 1024-leaf balancedAddtree with a non-matching rule) shows:ThreadLocalMap.set(line 486)ThreadLocalMap.getEntryAfterMissThreadLocalMap.set(line 493)Total: ~88% of transform CPU spent inside
CurrentOrigin.withOriginfor nodes the rule never matched.Microbenchmark (best time per N iterations, JDK 17, Xeon 8175M @ 2.50GHz, baseline =
upstream/master):transformDowndeep chain(5000) no-optransformDowndeep chain(5000) rewrite leaftransformDownwide(100) no-optransformDownbalanced(1024) no-optransformDownbalanced(4096) no-optransformUpdeep chain(1000) no-optransformUpdeep chain(5000) no-optransformUpbalanced(1024) no-optransformUpbalanced(4096) no-opRewrite-heavy cases are unchanged because
withOriginstill runs when the rule fires. Real Spark workloads (analyzer / optimizer batches running many rules across many nodes, where each rule matches a small subset) are dominated by the no-match case, so these savings should compound.Does this PR introduce any user-facing change?
No. The change is internal to
TreeNoderule machinery and preserves all observable semantics:CurrentOriginis still set before any rule body that constructs new nodes runs.markRuleAsIneffective/isRuleIneffectivebookkeeping unchanged.copyTagsFromordering on rule-replacement unchanged.fastEqualsshort-circuit unchanged.this eq resulton a no-op transform) preserved.How was this patch tested?
build/sbt 'catalyst/testOnly *TreeNodeSuite'— 36 / 36 pass.build/sbt 'catalyst/test'— 9272 / 9272 pass across 352 suites (~7 min).Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Anthropic Claude Opus 4.7)