Make NDV merge order-invariant with multi-input overlap estimation#21313
Open
kosiew wants to merge 4 commits intoapache:mainfrom
Open
Make NDV merge order-invariant with multi-input overlap estimation#21313kosiew wants to merge 4 commits intoapache:mainfrom
kosiew wants to merge 4 commits intoapache:mainfrom
Conversation
Aggregate easy fields first and recompute distinct_count from original per-input column stats. Utilize a new multi-way overlap helper for merging instead of pairwise-smeared NDV/min/max. Added regression coverage for three-input counterexample and permutation-based order invariance.
Replace the index-heavy merge loop with zip-based iteration. Simplify the NDV recomputation path using map(...).unwrap_or(...). Extract a private max_input_ndv helper for the fallback case. Deduplicate scalar sort/dedup logic in estimate_ndv_with_overlap_many. Add small private test helpers to reduce repeated one-column Int32 NDV fixture setup.
…imate_ndv_with_overlap_many function
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Rationale for this change
The existing NDV merge implementation (
estimate_ndv_with_overlap) is not associative, meaning that merging statistics in different orders can produce different results. This happens because intermediate merges "smear" min/max/NDV values, which then influence subsequent computations.This lack of order invariance leads to inconsistent NDV estimates depending on merge order, especially when combining more than two inputs.
To address this, this PR introduces a multi-input NDV estimation approach that computes overlap using the original (unsmeared) statistics across all inputs simultaneously. This ensures stable and deterministic results regardless of merge order.
What changes are included in this PR?
Refactored
Statistics::try_merge_iterto:Replaced pairwise NDV merging with a new multi-input algorithm:
estimate_ndv_with_overlap_manyto compute NDV across all inputs simultaneously.Ensured order invariance:
Added helper:
max_input_ndvas a fallback when estimation is not possible.Code cleanup:
zipinstead of index-based access.Are these changes tested?
Yes.
New and updated tests include:
Reusable helpers for test setup (
int32_test_schema,int32_ndv_stats).Validation of NDV behavior for:
New test ensuring order invariance across permutations, verifying that all permutations of inputs produce identical NDV results.
These tests ensure correctness, stability, and regression protection for the new algorithm.
Are there any user-facing changes?
No direct user-facing API changes.
However, NDV estimates are now:
This improves query planning consistency without requiring user intervention.
LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.