
ORCA: order-preserving LINT mapping for string statistics#1742

Draft
yjhjstz wants to merge 8 commits into apache:main from yjhjstz:orca/string-stats-ordering

Conversation


@yjhjstz yjhjstz commented May 13, 2026

Previously, string statistics were mapped to LINT values via hashtext, which does not preserve lexicographic order. As a result, ORCA's range/comparison estimates on string columns could be wildly inaccurate. This change introduces an order-preserving mapping so that a < b on strings implies LINT(a) < LINT(b), restoring meaningful selectivity estimates for inequality predicates on text columns.

Replace the hash-based LINT mapping used by ORCA's varchar/text/bpchar statistics with an order-preserving 7-byte locale sort-key prefix, so that single-column MCV and histogram estimates respect the column's collation order instead of collapsing to hash values.
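For illustration, a minimal sketch of such an order-preserving mapping (the function name, the fixed 64-byte buffer, and the use of `strxfrm` as the sort-key source are assumptions for this sketch, not the actual patch):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>

// Sketch: map a string to a signed 64-bit value whose numeric order
// matches the locale's collation order on the first 7 sort-key bytes.
// Strings sharing a 7-byte sort-key prefix collapse to the same value,
// which is tolerable for histogram bucket boundaries.
int64_t StringToOrderedLint(const std::string &s)
{
    // strxfrm produces bytes whose memcmp order equals the locale's
    // collation order of the inputs (identity copy in the "C" locale).
    // A real implementation would size this buffer with a probing call.
    char sortkey[64] = {};
    size_t n = std::strxfrm(sortkey, s.c_str(), sizeof(sortkey));
    size_t len = std::min<size_t>(n, 7);

    // Pack up to 7 bytes big-endian into the low 56 bits; the top byte
    // stays zero, so the result is non-negative and byte order maps
    // directly to numeric order.
    uint64_t v = 0;
    for (size_t i = 0; i < 7; ++i)
    {
        v <<= 8;
        if (i < len)
            v |= static_cast<unsigned char>(sortkey[i]);
    }
    return static_cast<int64_t>(v);
}
```

With this shape, `a < b` under the collation implies `StringToOrderedLint(a) <= StringToOrderedLint(b)`, with equality only when the 7-byte prefixes collide.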

Fixes #ISSUE_Number

What does this PR do?

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel


yjhjstz force-pushed the orca/string-stats-ordering branch from 26ec795 to 6890283 on May 14, 2026 at 01:21
Replace the hash-based LINT mapping used by ORCA's varchar/text/bpchar
statistics with an order-preserving 7-byte locale sort-key prefix, so
that single-column MCV and histogram estimates respect the column's
collation order instead of collapsing to hash values.
yjhjstz force-pushed the orca/string-stats-ordering branch from 6890283 to 56cb1fc on May 14, 2026 at 02:06
yjhjstz added 7 commits May 15, 2026 04:29
ORCA previously parsed pg_statistic_ext metadata for dependencies and
ndistinct only; MCV-typed statistics were registered by kind but never
loaded or used, so AND/inequality predicates on correlated columns kept
multiplying per-column selectivities and produced order-of-magnitude
underestimates.

Wire MCV through the MD layer and into filter selectivity:

- New CMDMVMCVList / CMDMVMCVItem MD types carry per-item per-column
  IDatum values, NULL flags, and frequency / base_frequency.
- IMDExtStats grows GetMCVList(); CDXLExtStats owns an optional list.
- gpdb wrappers expose statext_mcv_load, pg_statistic_ext.stxkeys /
  stxrelid, and get_atttypetypmodcoll so the loader can build IDatums
  using the column's actual collation (raw-byte LINT mapping diverges
  from the predicate side otherwise, breaking equality matches).
- RetrieveExtStats decodes MCVList items via the existing
  CreateIDatumFromGpdbDatum path.
- New CMCVStatsProcessor mirrors mcv_clauselist_selectivity +
  mcv_combine_selectivities: pick the best MCV stat by attno overlap,
  build a per-item match bitmap by evaluating each Point predicate with
  IDatum comparators (handling IS NULL / IS NOT NULL via the const's
  IsNull bit), aggregate mcv_sel/basesel/totalsel, compute simple_sel
  from per-column histograms, and append a single combined scale factor
  while marking the consumed clauses SetEstimated.
- CFilterStatsProcessor calls the MCV processor before the dependency
  path so MCV consumes correlated clauses first; drop the
  scale_factors-must-be-empty assert that this ordering now violates.
- Refresh stats_ext_optimizer.out (and the pax_storage mirror) to
  reflect estimates that now match (or far better approximate) the
  actual row counts.

DXL round-trip for MCV is intentionally skipped in this pass; live
planner derivation is unaffected, only minidump replay loses MCV info.
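For reference, the aggregation step mirrored from mcv_combine_selectivities can be sketched as follows (the `MCVItem` struct and field names are illustrative stand-ins, not ORCA's actual types):

```cpp
#include <algorithm>
#include <vector>

// Illustrative MCV item: observed joint frequency, independence-
// assumption base frequency, and whether the predicate bitmap kept it.
struct MCVItem
{
    double freq;
    double base_freq;
    bool matches;
};

// simple_sel is the estimate obtained from per-column histograms alone.
double CombineMCVSelectivity(const std::vector<MCVItem> &items,
                             double simple_sel)
{
    double mcv_sel = 0.0, mcv_basesel = 0.0, mcv_totalsel = 0.0;
    for (const MCVItem &it : items)
    {
        mcv_totalsel += it.freq;
        if (it.matches)
        {
            mcv_sel += it.freq;
            mcv_basesel += it.base_freq;
        }
    }
    // Rows not covered by the MCV list: subtract the part of the simple
    // estimate the matched items already account for, clamped to the
    // non-MCV residue of the table.
    double other_sel = simple_sel - mcv_basesel;
    other_sel = std::max(0.0, std::min(other_sel, 1.0 - mcv_totalsel));
    return mcv_sel + other_sel;
}
```

The clamp is what makes full-coverage MCV lists (mcv_totalsel near 1) insensitive to errors in simple_sel, while partial coverage still leans on the per-column histograms for the uncovered remainder.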
CMCVStatsProcessor::ApplyMCVStatsToFilter already builds a filtered
CHistogram per consumed clause (for the simple_sel computation) and
then discards it. Install it back into result_histograms via
CStatisticsUtils::AddHistogram(..., fReplaceOld=true) instead, matching
what CFilterStatsProcessor::MakeHistHashMapConjFilter's main per-clause
loop does at L406-408.

No stats_ext test changes its output -- the disjunction path that
prompted the investigation combines children via scale_factors, not
per-column histograms, so this doesn't affect it. The fix matters for
downstream derivations that re-read the per-column shape (nested
filter, equi-join NDV, GROUP BY ndistinct) on a column whose
selectivity has already been applied via MCV: without the replace,
those consumers would see the pre-filter histogram alongside the
correct row count, which can pull join cardinality off by the
unfiltered NDV.

The dependency processor has the same latent issue; not addressed
here, but the pattern is now established for that follow-up.
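To see why a stale per-column shape matters downstream, consider the textbook equi-join cardinality estimate (a simplified stand-in for ORCA's costing, used here only for illustration):

```cpp
#include <algorithm>

// Textbook equi-join estimate (not ORCA's exact formula):
// |R JOIN S| ~= |R| * |S| / max(ndv(R.a), ndv(S.b)).
double EquiJoinCard(double rows_l, double ndv_l,
                    double rows_r, double ndv_r)
{
    return rows_l * rows_r / std::max(ndv_l, ndv_r);
}
```

Suppose an MCV filter leaves 100 rows whose join column really has 5 distinct values, joining a 10,000-row table with NDV 20. The stale pre-filter NDV of 500 yields 100 * 10000 / 500 = 2,000 rows, while the filtered NDV yields 100 * 10000 / 20 = 50,000 rows, a 25x swing caused solely by the unreplaced histogram.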
…simple_sel

Two correctness fixes in CMCVStatsProcessor::ApplyMCVStatsToFilter that
the initial MCV implementation had:

1. Multiple clauses on the same column (e.g. "a > 0 AND a < 10 AND b = 0")
   were being silently dropped after the first one. The bitmap step
   handles them correctly under AND semantics, but the candidate
   collector was deduping by attno, which left the surplus clauses
   without SetEstimated, so the main per-clause loop would compute their
   selectivity again and multiply it into the final scale factor --
   double counting. Drop the dedup; the distinct-attno gate is preserved
   via the bitset's Size() check.

2. NULL-const predicates (IS NULL / IS NOT NULL) were being skipped in
   the simple_sel computation. mcv_basesel does include matched-NULL-
   item base frequencies, so other_sel = simple_sel - mcv_basesel saw a
   non-zero residual; full-coverage MCV (mcv_totalsel ≈ 1) clamped it
   harmlessly, but partial coverage produced an over-estimate. Compute
   per-clause selectivity from CHistogram::GetNullFreq() for these
   clauses instead of skipping.

While here, restructure step 7 to group matched clauses by colid and
apply them sequentially to a working histogram, mirroring the joint-
filter pattern in MakeHistHashMapConjFilter's main loop. This is what
makes fix 1 above correct -- two clauses on the same column become a
histogram intersection rather than a (wrong) product of independent
selectivities.

Verified:
- WHERE a >= 0 AND a < 10 AND b = 0 -> ORCA estimates 50, actual 50
  (previously double-counted -> 17 or similar)
- WHERE a IS NULL AND b = 0 -> ORCA estimates 100, actual 100
- stats_ext regression still passes with the existing baseline.
CStatsPredPoint is used for both = / <> / inequality predicates and
the SQL three-valued-logic predicates IS [NOT] DISTINCT FROM, so the
"EsptPoint" candidate filter in ApplyMCVStatsToFilter accepted both.
The MCV bitmap step downstream cannot express IDF/INDF semantics
correctly:

  * IS NOT DISTINCT FROM NULL was mis-classified as IS NOT NULL: the
    bitmap branches on (cmp_type == EstatscmptEq) when the const is
    NULL, so INDF with a NULL const sets want_null=false and matches
    non-NULL items.

  * IS DISTINCT FROM <non-null> killed every MCV item: the non-NULL
    const path knocks out NULL items first (wrong -- IDF should match
    NULLs), and EvaluateCmpOnIDatum's default branch returns false
    for IDF/INDF, knocking out the rest. The clauses are then
    SetEstimated, so the main per-clause loop does not get to
    correct it, producing ~0 row estimates that derail join planning.

Skip EstatscmptIDF / EstatscmptINDF at candidate collection so the
downstream per-clause histogram path handles them. While here,
tighten EvaluateCmpOnIDatum's default branch to GPOS_ASSERT so any
future widening of the candidate set catches a missing case loudly
in debug builds while still falling through safely in release.
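For reference, the SQL-standard semantics that the MCV bitmap step cannot express, restated as plain C++ (this is the standard definition, not ORCA code):

```cpp
#include <optional>

// IS DISTINCT FROM: NULL-safe inequality. Unlike <>, it never yields
// unknown: NULL is not distinct from NULL, and NULL is distinct from
// any non-NULL value. Hence IDF <non-null> must MATCH NULL items, and
// INDF NULL behaves like IS NULL, not IS NOT NULL.
bool IsDistinctFrom(const std::optional<int> &a, const std::optional<int> &b)
{
    if (!a.has_value() || !b.has_value())
        return a.has_value() != b.has_value();  // exactly one NULL -> distinct
    return *a != *b;
}
```

Both mis-classifications in the bitmap step violate the first branch above: INDF with a NULL const must keep only NULL items, and IDF with a non-NULL const must keep them.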
Replace the legacy Greenplum-style header on the four new files
introduced by the multivariate MCV support commit with the standard
Apache 2.0 license header.
GetExtStatsKinds() catches any catalog error and returns nullptr when
the pg_statistic_ext row no longer exists (e.g. after DROP STATISTICS).
Previously the code fell through to GetExtStatsName(), which also
returns nullptr on error, and then passed that nullptr directly to
CDXLUtils::CreateDynamicStringFromCharArray() — whose GPOS_ASSERT fires
in debug builds (ORCA falls back to the PG planner) but in release
builds the assert is a no-op, leading to a null-pointer dereference and
SIGSEGV.

Fix with two guards:
1. When GetExtStatsKinds() returns nullptr, return a minimal empty
   CDXLExtStats placeholder immediately so the MD cache stores a valid
   (inert) object and downstream code never sees an uninitialized entry.
2. After the kinds check passes, still null-check the result of
   GetExtStatsName() and substitute an empty string, making the function
   safe even in partial-failure scenarios.
ReleaseSysCache(htup) was called before NameStr(staForm->stxname) was
read, returning a pointer into the already-released tuple buffer.
Copy the name with pstrdup() first, then release the cache entry.
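The fix, reduced to a generic copy-before-release sketch (the function and `free`/`std::string` are stand-ins for the PostgreSQL `ReleaseSysCache`/`pstrdup` APIs, not the actual code):

```cpp
#include <cstdlib>
#include <cstring>
#include <string>

// Copy data out of a managed buffer *before* releasing it; returning a
// pointer into the buffer and releasing first is a use-after-free.
// Here free() stands in for ReleaseSysCache(htup) and the std::string
// copy for pstrdup(NameStr(staForm->stxname)).
std::string CopyThenRelease(char *buffer)
{
    std::string name(buffer);  // deep copy while the buffer is live
    std::free(buffer);         // now safe to release the buffer
    return name;               // owns its own storage
}
```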