
ORCA: order-preserving LINT mapping for string statistics#1742

Draft
yjhjstz wants to merge 8 commits into apache:main from yjhjstz:orca/string-stats-ordering

Conversation


@yjhjstz yjhjstz commented May 13, 2026

Previously, string statistics were mapped to LINT values via hashtext, which does not preserve lexicographic order. As a result, ORCA's range/comparison estimates on string columns could be wildly inaccurate. This change introduces an order-preserving mapping so that a < b on strings implies LINT(a) < LINT(b), restoring meaningful selectivity estimates for inequality predicates on text columns.

Replace the hash-based LINT mapping used by ORCA's varchar/text/bpchar statistics with an order-preserving 7-byte locale sort-key prefix, so that single-column MCV and histogram estimates respect the column's collation order instead of collapsing to hash values.
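For illustration, a minimal sketch of such an order-preserving mapping (the function name, the fixed 64-byte buffer, and the use of `strxfrm` as the sort-key source are assumptions for this sketch, not the actual patch):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>

// Sketch: map a string to a signed 64-bit value whose numeric order
// matches the locale's collation order on the first 7 sort-key bytes.
// Strings sharing a 7-byte sort-key prefix collapse to the same value,
// which is tolerable for histogram bucket boundaries.
int64_t StringToOrderedLint(const std::string &s)
{
    // strxfrm produces bytes whose memcmp order equals the locale's
    // collation order of the inputs (identity copy in the "C" locale).
    // A real implementation would size this buffer with a probing call.
    char sortkey[64] = {};
    size_t n = std::strxfrm(sortkey, s.c_str(), sizeof(sortkey));
    size_t len = std::min<size_t>(n, 7);

    // Pack up to 7 bytes big-endian into the low 56 bits; the top byte
    // stays zero, so the result is non-negative and byte order maps
    // directly to numeric order.
    uint64_t v = 0;
    for (size_t i = 0; i < 7; ++i)
    {
        v <<= 8;
        if (i < len)
            v |= static_cast<unsigned char>(sortkey[i]);
    }
    return static_cast<int64_t>(v);
}
```

With this shape, `a < b` under the collation implies `StringToOrderedLint(a) <= StringToOrderedLint(b)`, with equality only when the 7-byte prefixes collide.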

Fixes #ISSUE_Number

What does this PR do?

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel


yjhjstz force-pushed the orca/string-stats-ordering branch from 26ec795 to 6890283 on May 14, 2026 at 01:21
Replace the hash-based LINT mapping used by ORCA's varchar/text/bpchar
statistics with an order-preserving 7-byte locale sort-key prefix, so
that single-column MCV and histogram estimates respect the column's
collation order instead of collapsing to hash values.
yjhjstz force-pushed the orca/string-stats-ordering branch from 6890283 to 56cb1fc on May 14, 2026 at 02:06
yjhjstz added 7 commits May 15, 2026 04:29
ORCA previously parsed pg_statistic_ext metadata for dependencies and
ndistinct only; MCV-typed statistics were registered by kind but never
loaded or used, so AND/inequality predicates on correlated columns kept
multiplying per-column selectivities and produced order-of-magnitude
underestimates.

Wire MCV through the MD layer and into filter selectivity:

- New CMDMVMCVList / CMDMVMCVItem MD types carry per-item per-column
  IDatum values, NULL flags, and frequency / base_frequency.
- IMDExtStats grows GetMCVList(); CDXLExtStats owns an optional list.
- gpdb wrappers expose statext_mcv_load, pg_statistic_ext.stxkeys /
  stxrelid, and get_atttypetypmodcoll so the loader can build IDatums
  using the column's actual collation (raw-byte LINT mapping diverges
  from the predicate side otherwise, breaking equality matches).
- RetrieveExtStats decodes MCVList items via the existing
  CreateIDatumFromGpdbDatum path.
- New CMCVStatsProcessor mirrors mcv_clauselist_selectivity +
  mcv_combine_selectivities: pick the best MCV stat by attno overlap,
  build a per-item match bitmap by evaluating each Point predicate with
  IDatum comparators (handling IS NULL / IS NOT NULL via the const's
  IsNull bit), aggregate mcv_sel/basesel/totalsel, compute simple_sel
  from per-column histograms, and append a single combined scale factor
  while marking the consumed clauses SetEstimated.
- CFilterStatsProcessor calls the MCV processor before the dependency
  path so MCV consumes correlated clauses first; drop the
  scale_factors-must-be-empty assert that this ordering now violates.
- Refresh stats_ext_optimizer.out (and the pax_storage mirror) to
  reflect estimates that now match (or far better approximate) the
  actual row counts.

DXL round-trip for MCV is intentionally skipped in this pass; live
planner derivation is unaffected, only minidump replay loses MCV info.
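For reference, the aggregation step mirrored from mcv_combine_selectivities can be sketched as follows (the `MCVItem` struct and field names are illustrative stand-ins, not ORCA's actual types):

```cpp
#include <algorithm>
#include <vector>

// Illustrative MCV item: observed joint frequency, independence-
// assumption base frequency, and whether the predicate bitmap kept it.
struct MCVItem
{
    double freq;
    double base_freq;
    bool matches;
};

// simple_sel is the estimate obtained from per-column histograms alone.
double CombineMCVSelectivity(const std::vector<MCVItem> &items,
                             double simple_sel)
{
    double mcv_sel = 0.0, mcv_basesel = 0.0, mcv_totalsel = 0.0;
    for (const MCVItem &it : items)
    {
        mcv_totalsel += it.freq;
        if (it.matches)
        {
            mcv_sel += it.freq;
            mcv_basesel += it.base_freq;
        }
    }
    // Rows not covered by the MCV list: subtract the part of the simple
    // estimate the matched items already account for, clamped to the
    // non-MCV residue of the table.
    double other_sel = simple_sel - mcv_basesel;
    other_sel = std::max(0.0, std::min(other_sel, 1.0 - mcv_totalsel));
    return mcv_sel + other_sel;
}
```

The clamp is what makes full-coverage MCV lists (mcv_totalsel near 1) insensitive to errors in simple_sel, while partial coverage still leans on the per-column histograms for the uncovered remainder.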
CMCVStatsProcessor::ApplyMCVStatsToFilter already builds a filtered
CHistogram per consumed clause (for the simple_sel computation) and
then discards it. Install it back into result_histograms via
CStatisticsUtils::AddHistogram(..., fReplaceOld=true) instead, matching
what CFilterStatsProcessor::MakeHistHashMapConjFilter's main per-clause
loop does at L406-408.

No stats_ext test changes its output -- the disjunction path that
prompted the investigation combines children via scale_factors, not
per-column histograms, so this doesn't affect it. The fix matters for
downstream derivations that re-read the per-column shape (nested
filter, equi-join NDV, GROUP BY ndistinct) on a column whose
selectivity has already been applied via MCV: without the replace,
those consumers would see the pre-filter histogram alongside the
correct row count, which can pull join cardinality off by the
unfiltered NDV.

The dependency processor has the same latent issue; not addressed
here, but the pattern is now established for that follow-up.
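To see why a stale per-column shape matters downstream, consider the textbook equi-join cardinality estimate (a simplified stand-in for ORCA's costing, used here only for illustration):

```cpp
#include <algorithm>

// Textbook equi-join estimate (not ORCA's exact formula):
// |R JOIN S| ~= |R| * |S| / max(ndv(R.a), ndv(S.b)).
double EquiJoinCard(double rows_l, double ndv_l,
                    double rows_r, double ndv_r)
{
    return rows_l * rows_r / std::max(ndv_l, ndv_r);
}
```

Suppose an MCV filter leaves 100 rows whose join column really has 5 distinct values, joining a 10,000-row table with NDV 20. The stale pre-filter NDV of 500 yields 100 * 10000 / 500 = 2,000 rows, while the filtered NDV yields 100 * 10000 / 20 = 50,000 rows, a 25x swing caused solely by the unreplaced histogram.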
…simple_sel

Two correctness fixes in CMCVStatsProcessor::ApplyMCVStatsToFilter that
the initial MCV implementation had:

1. Multiple clauses on the same column (e.g. "a > 0 AND a < 10 AND b = 0")
   were being silently dropped after the first one. The bitmap step
   handles them correctly under AND semantics, but the candidate
   collector was deduping by attno, which left the surplus clauses
   without SetEstimated, so the main per-clause loop would compute their
   selectivity again and multiply it into the final scale factor --
   double counting. Drop the dedup; the distinct-attno gate is preserved
   via the bitset's Size() check.

2. NULL-const predicates (IS NULL / IS NOT NULL) were being skipped in
   the simple_sel computation. mcv_basesel does include matched-NULL-
   item base frequencies, so other_sel = simple_sel - mcv_basesel saw a
   non-zero residual; full-coverage MCV (mcv_totalsel ≈ 1) clamped it
   harmlessly, but partial coverage produced an over-estimate. Compute
   per-clause selectivity from CHistogram::GetNullFreq() for these
   clauses instead of skipping.

While here, restructure step 7 to group matched clauses by colid and
apply them sequentially to a working histogram, mirroring the joint-
filter pattern in MakeHistHashMapConjFilter's main loop. This is what
makes fix 1 above correct -- two clauses on the same column become a
histogram intersection rather than a (wrong) product of independent
selectivities.

Verified:
- WHERE a >= 0 AND a < 10 AND b = 0 -> ORCA estimates 50, actual 50
  (previously double-counted -> 17 or similar)
- WHERE a IS NULL AND b = 0 -> ORCA estimates 100, actual 100
- stats_ext regression still passes with the existing baseline.
CStatsPredPoint is used for both = / <> / inequality predicates and
the SQL three-valued-logic predicates IS [NOT] DISTINCT FROM, so the
"EsptPoint" candidate filter in ApplyMCVStatsToFilter accepted both.
The MCV bitmap step downstream cannot express IDF/INDF semantics
correctly:

  * IS NOT DISTINCT FROM NULL was mis-classified as IS NOT NULL: the
    bitmap branches on (cmp_type == EstatscmptEq) when the const is
    NULL, so INDF with a NULL const sets want_null=false and matches
    non-NULL items.

  * IS DISTINCT FROM <non-null> killed every MCV item: the non-NULL
    const path knocks out NULL items first (wrong -- IDF should match
    NULLs), and EvaluateCmpOnIDatum's default branch returns false
    for IDF/INDF, knocking out the rest. The clauses are then
    SetEstimated, so the main per-clause loop does not get to
    correct it, producing ~0 row estimates that derail join planning.

Skip EstatscmptIDF / EstatscmptINDF at candidate collection so the
downstream per-clause histogram path handles them. While here,
tighten EvaluateCmpOnIDatum's default branch to GPOS_ASSERT so any
future widening of the candidate set catches a missing case loudly
in debug builds while still falling through safely in release.
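For reference, the SQL-standard semantics that the MCV bitmap step cannot express, restated as plain C++ (this is the standard definition, not ORCA code):

```cpp
#include <optional>

// IS DISTINCT FROM: NULL-safe inequality. Unlike <>, it never yields
// unknown: NULL is not distinct from NULL, and NULL is distinct from
// any non-NULL value. Hence IDF <non-null> must MATCH NULL items, and
// INDF NULL behaves like IS NULL, not IS NOT NULL.
bool IsDistinctFrom(const std::optional<int> &a, const std::optional<int> &b)
{
    if (!a.has_value() || !b.has_value())
        return a.has_value() != b.has_value();  // exactly one NULL -> distinct
    return *a != *b;
}
```

Both mis-classifications in the bitmap step violate the first branch above: INDF with a NULL const must keep only NULL items, and IDF with a non-NULL const must keep them.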
Replace the legacy Greenplum-style header on the four new files
introduced by the multivariate MCV support commit with the standard
Apache 2.0 license header.
GetExtStatsKinds() catches any catalog error and returns nullptr when
the pg_statistic_ext row no longer exists (e.g. after DROP STATISTICS).
Previously the code fell through to GetExtStatsName(), which also
returns nullptr on error, and then passed that nullptr directly to
CDXLUtils::CreateDynamicStringFromCharArray() — whose GPOS_ASSERT fires
in debug builds (ORCA falls back to the PG planner) but in release
builds the assert is a no-op, leading to a null-pointer dereference and
SIGSEGV.

Fix with two guards:
1. When GetExtStatsKinds() returns nullptr, return a minimal empty
   CDXLExtStats placeholder immediately so the MD cache stores a valid
   (inert) object and downstream code never sees an uninitialized entry.
2. After the kinds check passes, still null-check the result of
   GetExtStatsName() and substitute an empty string, making the function
   safe even in partial-failure scenarios.
ReleaseSysCache(htup) was called before NameStr(staForm->stxname) was
read, returning a pointer into the already-released tuple buffer.
Copy the name with pstrdup() first, then release the cache entry.
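The fix, reduced to a generic copy-before-release sketch (the function and `free`/`std::string` are stand-ins for the PostgreSQL `ReleaseSysCache`/`pstrdup` APIs, not the actual code):

```cpp
#include <cstdlib>
#include <cstring>
#include <string>

// Copy data out of a managed buffer *before* releasing it; returning a
// pointer into the buffer and releasing first is a use-after-free.
// Here free() stands in for ReleaseSysCache(htup) and the std::string
// copy for pstrdup(NameStr(staForm->stxname)).
std::string CopyThenRelease(char *buffer)
{
    std::string name(buffer);  // deep copy while the buffer is live
    std::free(buffer);         // now safe to release the buffer
    return name;               // owns its own storage
}
```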