Skip to content

Make NDV merge order-invariant with multi-input overlap estimation#21313

Open
kosiew wants to merge 4 commits intoapache:mainfrom
kosiew:ndv-nonassociativity-20966
Open

Make NDV merge order-invariant with multi-input overlap estimation#21313
kosiew wants to merge 4 commits intoapache:mainfrom
kosiew:ndv-nonassociativity-20966

Conversation

@kosiew
Copy link
Copy Markdown
Contributor

@kosiew kosiew commented Apr 2, 2026

Which issue does this PR close?


Rationale for this change

The existing NDV merge implementation (estimate_ndv_with_overlap) is not associative, meaning that merging statistics in different orders can produce different results. This happens because intermediate merges "smear" min/max/NDV values, which then influence subsequent computations.

This lack of order invariance leads to inconsistent NDV estimates depending on merge order, especially when combining more than two inputs.

To address this, this PR introduces a multi-input NDV estimation approach that computes overlap using the original (unsmeared) statistics across all inputs simultaneously. This ensures stable and deterministic results regardless of merge order.


What changes are included in this PR?

  • Refactored Statistics::try_merge_iter to:

    • Perform a single pass for mergeable statistics (null counts, min/max, sum).
    • Defer NDV computation until after all inputs are collected.
  • Replaced pairwise NDV merging with a new multi-input algorithm:

    • Introduced estimate_ndv_with_overlap_many to compute NDV across all inputs simultaneously.
    • Uses segment-based partitioning of value ranges and selects the maximum density contribution per segment.
    • Handles constant (point) values separately to avoid over-counting.
  • Ensured order invariance:

    • NDV is now computed from original inputs instead of intermediate merged states.
  • Added helper:

    • max_input_ndv as a fallback when estimation is not possible.
  • Code cleanup:

    • Simplified iteration using zip instead of index-based access.
    • Improved clarity and separation of concerns in merge logic.

Are these changes tested?

Yes.

New and updated tests include:

  • Reusable helpers for test setup (int32_test_schema, int32_ndv_stats).

  • Validation of NDV behavior for:

    • Identical ranges (full overlap).
    • Partial overlaps.
    • Constant columns (same and different values).
  • New test ensuring order invariance across permutations, verifying that all permutations of inputs produce identical NDV results.

These tests ensure correctness, stability, and regression protection for the new algorithm.


Are there any user-facing changes?

No direct user-facing API changes.

However, NDV estimates are now:

  • More stable
  • Deterministic across merge orders
  • Potentially slightly different (more accurate) compared to previous behavior

This improves query planning consistency without requiring user intervention.


LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

kosiew added 3 commits April 2, 2026 10:26
Aggregate easy fields first and recompute distinct_count from
original per-input column stats. Utilize a new multi-way overlap
helper for merging instead of pairwise-smeared NDV/min/max.
Added regression coverage for three-input counterexample and
permutation-based order invariance.
Replace the index-heavy merge loop with zip-based iteration.
Simplify the NDV recomputation path using map(...).unwrap_or(...).
Extract a private max_input_ndv helper for the fallback case.
Deduplicate scalar sort/dedup logic in
estimate_ndv_with_overlap_many. Add small private test helpers to
reduce repeated one-column Int32 NDV fixture setup.
@github-actions github-actions bot added the common Related to common crate label Apr 2, 2026
@kosiew kosiew marked this pull request as ready for review April 2, 2026 04:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve NDV merge to be associative across multiple inputs

1 participant