[SPARK-46367][SQL][FOLLOWUP] Assert same arity in projectKeyedPartitionings#55876
[SPARK-46367][SQL][FOLLOWUP] Assert same arity in projectKeyedPartitionings#55876cloud-fan wants to merge 2 commits into
Conversation
a828e56 to
642ae0e
Compare
|
cc @peter-toth |
|
@cloud-fan , I don't think this is valid. There simply can't be mixed-arity and actualy, it doesn't make sense to have/keep lower arity KPs in a collection if there are higher arity ones in it because the expressions of the lower must be the subset of the expressions of the higher. If we know more details of the keys of partitions, like KP([a, b], [(1, 1), (2, 2)]) so we know that a = 1 and b = 1 in the first; and a = 2 and b = 2 in the second partition, why would we have/keep the less granular KP([a], [(1), (2)]) or KP([b], [(1), (2)]) in the collection?Maybe the wording of the comment is not the best but the " partitionKeys must match" means that arity of KPs in a collection must be the same.
|
…onings ### What changes were proposed in this pull request? Follow-up to apache#55519. `PartitioningPreservingUnaryExecNode.projectKeyedPartitionings` assumes all input KPs share the same `partitionKeys`, which implies the same expression arity. This invariant is asserted by `GroupPartitionsExec` and is established by every upstream constructor of `PartitioningCollection` that feeds this method (a join's `PartitioningCollection(left.outputPartitioning, right.outputPartitioning)` combines KPs that `EnsureRequirements` has aligned to the same join keys). If the invariant is ever violated upstream, indexing `kp.expressions(i)` for `i >= kp.expressions.length` throws an opaque `IndexOutOfBoundsException` that points at this method rather than at the producer. Add an `assert` that surfaces the real cause with a clear message. The invariant is unchanged; this just turns silent misuse into a debuggable failure, so a producer-side bug can be fixed at its source. ### Why are the changes needed? Improve error message when an upstream node violates the same-arity invariant. No behavior change on valid plans. ### Does this PR introduce _any_ user-facing change? No (assertion-only; planner internal). ### How was this patch tested? New unit test `SPARK-46367: mixed-arity KeyedPartitionings in input fail with a clear assertion` in `ProjectedOrderingAndPartitioningSuite` that constructs a mixed-arity `PartitioningCollection` child and verifies the assert fires with the expected message instead of throwing `IndexOutOfBoundsException`. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Opus 4.7
642ae0e to
b5fd4be
Compare
|
@peter-toth you're right — the invariant should stay strict. I've reworked this PR to just add an |
| // that have been aligned by [[EnsureRequirements]] to the same join keys). If the invariant | ||
| // is ever violated upstream, fail early with a clear message instead of throwing an opaque | ||
| // `IndexOutOfBoundsException` from `kp.expressions(i)` below. | ||
| assert(kps.forall(_.expressions.length == numPositions), |
There was a problem hiding this comment.
Maybe you could use kps.tail.forall(...).
Please note that in GroupPartitionsExec we use .partitionKeys comparison, which is even more stricter, but more costly.
There was a problem hiding this comment.
I wonder if we should move these invariant checks into PartitioningCollection.
Maybe expressions.length = + partitionKeys eq checks on KPs for maximum efficiency.
dongjoon-hyun
left a comment
There was a problem hiding this comment.
+1, LGTM. Thank you, @cloud-fan and @peter-toth .
Use `kps.tail.forall(...)` since `numPositions` is taken from `kps.head`.
|
The parquet test failure is unrelated, thanks for review, merging to master/4.x! |
…onings ### What changes were proposed in this pull request? Follow-up to #55519. `PartitioningPreservingUnaryExecNode.projectKeyedPartitionings` assumes all input KPs share the same `partitionKeys`, which implies the same expression arity. This invariant is asserted by `GroupPartitionsExec` and is established by every upstream constructor of `PartitioningCollection` that feeds this method (a join's `PartitioningCollection(left.outputPartitioning, right.outputPartitioning)` combines KPs that `EnsureRequirements` has aligned to the same join keys). If the invariant is ever violated upstream, indexing `kp.expressions(i)` for `i >= kp.expressions.length` throws an opaque `IndexOutOfBoundsException` that points at this method rather than at the producer. This PR adds an `assert` that surfaces the real cause with a clear message. The invariant is unchanged; this just turns silent misuse into a debuggable failure, so a producer-side bug can be fixed at its source. ### Why are the changes needed? Improve error message when an upstream node violates the same-arity invariant. No behavior change on valid plans. ### Does this PR introduce _any_ user-facing change? No (assertion-only; planner internal). ### How was this patch tested? New unit test `SPARK-46367: mixed-arity KeyedPartitionings in input fail with a clear assertion` in `ProjectedOrderingAndPartitioningSuite` that constructs a mixed-arity `PartitioningCollection` child and verifies the assert fires with the expected message instead of throwing `IndexOutOfBoundsException`. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Opus 4.7 Closes #55876 from cloud-fan/SPARK-46367-followup. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 2530561) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
Follow-up to #55519.
PartitioningPreservingUnaryExecNode.projectKeyedPartitioningsassumes all input KPs share the samepartitionKeys, which implies the same expression arity. This invariant is asserted byGroupPartitionsExecand is established by every upstream constructor ofPartitioningCollectionthat feeds this method (a join'sPartitioningCollection(left.outputPartitioning, right.outputPartitioning)combines KPs thatEnsureRequirementshas aligned to the same join keys).If the invariant is ever violated upstream, indexing
kp.expressions(i)fori >= kp.expressions.lengththrows an opaqueIndexOutOfBoundsExceptionthat points at this method rather than at the producer.This PR adds an
assertthat surfaces the real cause with a clear message. The invariant is unchanged; this just turns silent misuse into a debuggable failure, so a producer-side bug can be fixed at its source.Why are the changes needed?
Improve error message when an upstream node violates the same-arity invariant. No behavior change on valid plans.
Does this PR introduce any user-facing change?
No (assertion-only; planner internal).
How was this patch tested?
New unit test
SPARK-46367: mixed-arity KeyedPartitionings in input fail with a clear assertioninProjectedOrderingAndPartitioningSuitethat constructs a mixed-arityPartitioningCollectionchild and verifies the assert fires with the expected message instead of throwingIndexOutOfBoundsException.Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Opus 4.7