ORCA: order-preserving LINT mapping for string statistics #1742
Draft
yjhjstz wants to merge 8 commits
Replace the hash-based LINT mapping used by ORCA's varchar/text/bpchar statistics with an order-preserving 7-byte locale sort-key prefix, so that single-column MCV and histogram estimates respect the column's collation order instead of collapsing to hash values.
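The PR text names the mapping but not its shape, so the following is a minimal sketch, assuming a `strxfrm`-style locale sort key (the real code would use the column's collation, e.g. via the server's collation-aware key routines): pack the first 7 key bytes big-endian into the low 56 bits of a signed 64-bit LINT, leaving the top byte zero, so signed integer order equals byte-wise sort-key order. `SortKeyPrefixToLint` is a hypothetical name.

```cpp
#include <cassert>
#include <clocale>
#include <cstdint>
#include <cstring>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical sketch: map a string to a signed 64-bit LINT whose ordering
// follows the locale's collation order, via a 7-byte sort-key prefix.
int64_t SortKeyPrefixToLint(const std::string &s)
{
	// strxfrm produces a key whose memcmp order equals strcoll order;
	// calling it with a null buffer and size 0 returns the needed length.
	size_t keylen = std::strxfrm(nullptr, s.c_str(), 0);
	std::vector<char> key(keylen + 1);
	std::strxfrm(key.data(), s.c_str(), keylen + 1);

	// Pack the first 7 key bytes big-endian into the low 56 bits; the top
	// byte stays zero, so the value is non-negative and signed comparison
	// agrees with unsigned byte-wise key comparison. Short keys pad with 0,
	// matching the "shorter prefix sorts first" semantics of the key.
	int64_t lint = 0;
	for (size_t i = 0; i < 7; i++)
	{
		unsigned char b = (i < keylen) ? (unsigned char) key[i] : 0;
		lint = (lint << 8) | b;
	}
	return lint;
}

int main()
{
	std::setlocale(LC_COLLATE, "");
	assert(SortKeyPrefixToLint("apple") < SortKeyPrefixToLint("banana"));
	std::cout << SortKeyPrefixToLint("abc") << "\n";
	return 0;
}
```

One inherent caveat of any fixed-width prefix: strings sharing the same 7-byte key prefix collapse to the same LINT, which is a tie in histogram space rather than an ordering error, so it is acceptable for statistics.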
ORCA previously parsed pg_statistic_ext metadata for dependencies and ndistinct only; MCV-typed statistics were registered by kind but never loaded or used, so AND/inequality predicates on correlated columns kept multiplying per-column selectivities and produced order-of-magnitude underestimates. Wire MCV through the MD layer and into filter selectivity:

- New CMDMVMCVList / CMDMVMCVItem MD types carry per-item per-column IDatum values, NULL flags, and frequency / base_frequency.
- IMDExtStats grows GetMCVList(); CDXLExtStats owns an optional list.
- gpdb wrappers expose statext_mcv_load, pg_statistic_ext.stxkeys / stxrelid, and get_atttypetypmodcoll so the loader can build IDatums using the column's actual collation (otherwise the raw-byte LINT mapping diverges from the predicate side, breaking equality matches).
- RetrieveExtStats decodes MCVList items via the existing CreateIDatumFromGpdbDatum path.
- New CMCVStatsProcessor mirrors mcv_clauselist_selectivity + mcv_combine_selectivities: pick the best MCV stat by attno overlap, build a per-item match bitmap by evaluating each Point predicate with IDatum comparators (handling IS NULL / IS NOT NULL via the const's IsNull bit), aggregate mcv_sel/basesel/totalsel, compute simple_sel from per-column histograms, and append a single combined scale factor while marking the consumed clauses SetEstimated (a sketch of the combination step follows below).
- CFilterStatsProcessor calls the MCV processor before the dependency path so MCV consumes correlated clauses first; drop the scale_factors-must-be-empty assert that this ordering now violates.
- Refresh stats_ext_optimizer.out (and the pax_storage mirror) to reflect estimates that now match (or far better approximate) the actual row counts.

DXL round-trip for MCV is intentionally skipped in this pass; live planner derivation is unaffected, only minidump replay loses MCV info.
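The combination step the processor mirrors is PostgreSQL's mcv_combine_selectivities; a minimal self-contained sketch of that arithmetic (function and parameter names assumed, CLAMP_PROBABILITY rendered as std::clamp):

```cpp
#include <algorithm>
#include <iostream>

using Selectivity = double;

// Sketch of the mcv_combine_selectivities arithmetic described above.
Selectivity
CombineMCVSelectivities(Selectivity simple_sel,	  // per-column histogram estimate
						Selectivity mcv_sel,	  // frequency sum of matched MCV items
						Selectivity mcv_basesel,  // base_frequency sum of matched items
						Selectivity mcv_totalsel) // frequency sum of all MCV items
{
	// Residual estimate for rows outside the MCV list: the independence
	// estimate minus what the matched items would contribute under
	// independence, clamped to a valid probability...
	Selectivity other_sel = std::clamp(simple_sel - mcv_basesel, 0.0, 1.0);

	// ...and capped by the row mass the MCV list does not cover.
	other_sel = std::min(other_sel, 1.0 - mcv_totalsel);

	// Exact MCV contribution plus the residual.
	return mcv_sel + other_sel;
}

int main()
{
	// Example: MCV covers 90% of rows, matched items hold 40% (vs. 5%
	// under independence); per-column histograms alone said 6%.
	std::cout << CombineMCVSelectivities(0.06, 0.40, 0.05, 0.90) << "\n";  // 0.41
}
```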
CMCVStatsProcessor::ApplyMCVStatsToFilter already builds a filtered CHistogram per consumed clause (for the simple_sel computation) and then discards it. Install it back into result_histograms via CStatisticsUtils::AddHistogram(..., fReplaceOld=true) instead, matching what CFilterStatsProcessor::MakeHistHashMapConjFilter's main per-clause loop does at L406-408. No stats_ext test changes its output -- the disjunction path that prompted the investigation combines children via scale_factors, not per-column histograms, so this doesn't affect it. The fix matters for downstream derivations that re-read the per-column shape (nested filter, equi-join NDV, GROUP BY ndistinct) on a column whose selectivity has already been applied via MCV: without the replace, those consumers would see the pre-filter histogram alongside the correct row count, which can pull join cardinality off by the unfiltered NDV. The dependency processor has the same latent issue; not addressed here, but the pattern is now established for that follow-up.
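To make the downstream effect concrete, a toy calculation (numbers assumed, purely illustrative) of how a stale per-column NDV skews the textbook equi-join estimate even when the filtered row count itself is right:

```cpp
#include <algorithm>
#include <iostream>

int main()
{
	double rows_left = 50;		 // filter output cardinality (correct, via MCV)
	double ndv_stale = 1000;	 // NDV read from the pre-filter histogram
	double ndv_replaced = 50;	 // NDV after installing the filtered histogram
	double rows_right = 10000, ndv_right = 20;

	// Textbook equi-join cardinality: |L| * |R| / max(NDV(L.c), NDV(R.c)).
	auto join_card = [&](double ndv_left) {
		return rows_left * rows_right / std::max(ndv_left, ndv_right);
	};

	std::cout << "with stale histogram:	   " << join_card(ndv_stale) << "\n";	 // 500
	std::cout << "with replaced histogram: " << join_card(ndv_replaced) << "\n"; // 10000
}
```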
…simple_sel

Two correctness fixes in CMCVStatsProcessor::ApplyMCVStatsToFilter that the initial MCV implementation had:

1. Multiple clauses on the same column (e.g. "a > 0 AND a < 10 AND b = 0") were silently dropped after the first one. The bitmap step handles them correctly under AND semantics, but the candidate collector was deduping by attno, which left the surplus clauses without SetEstimated, so the main per-clause loop would compute their selectivity again and multiply it into the final scale factor -- double counting. Drop the dedup; the distinct-attno gate is preserved via the bitset's Size() check.

2. NULL-const predicates (IS NULL / IS NOT NULL) were being skipped in the simple_sel computation. mcv_basesel does include matched-NULL-item base frequencies, so other_sel = simple_sel - mcv_basesel saw a non-zero residual; full-coverage MCV (mcv_totalsel ≈ 1) clamped it harmlessly, but partial coverage produced an over-estimate. Compute per-clause selectivity from CHistogram::GetNullFreq() for these clauses instead of skipping them.

While here, restructure step 7 to group matched clauses by colid and apply them sequentially to a working histogram, mirroring the joint-filter pattern in MakeHistHashMapConjFilter's main loop. This is what makes fix 1 correct -- two clauses on the same column become a histogram intersection rather than a (wrong) product of independent selectivities (see the toy model below).

Verified:
- WHERE a >= 0 AND a < 10 AND b = 0 -> ORCA estimates 50, actual 50 (previously double-counted -> 17 or similar)
- WHERE a IS NULL AND b = 0 -> ORCA estimates 100, actual 100
- stats_ext regression still passes with the existing baseline.
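The toy model (assumed, not ORCA code): with a uniform histogram, two overlapping range clauses on the same column must intersect, not multiply, or the overlap is counted twice:

```cpp
#include <algorithm>
#include <iostream>

struct Hist { double lo, hi; };	 // uniform histogram over [lo, hi)

// Selectivity of "col < x" against a histogram, clamped to [0, 1].
double SelLess(const Hist &h, double x)
{
	return std::clamp((x - h.lo) / (h.hi - h.lo), 0.0, 1.0);
}

// Intersect the histogram with "col < x" (the working-histogram path).
Hist FilterLess(const Hist &h, double x)
{
	return {h.lo, std::clamp(x, h.lo, h.hi)};
}

int main()
{
	Hist a{0, 100};	 // column a, uniform on [0, 100)

	// WHERE a < 20 AND a < 10 -- true selectivity is 0.1.
	double indep = SelLess(a, 20) * SelLess(a, 10);	 // 0.2 * 0.1 = 0.02 (wrong)

	Hist w = FilterLess(a, 20);						 // working histogram [0, 20)
	double seq = SelLess(a, 20) * SelLess(w, 10);	 // 0.2 * 0.5 = 0.1 (right)

	std::cout << "independent product: " << indep << "\n"
			  << "sequential filter:	" << seq << "\n";
}
```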
CStatsPredPoint is used for both = / <> / inequality predicates and
the SQL three-valued-logic predicates IS [NOT] DISTINCT FROM, so the
"EsptPoint" candidate filter in ApplyMCVStatsToFilter accepted both.
The MCV bitmap step downstream cannot express IDF/INDF semantics
correctly:
* IS NOT DISTINCT FROM NULL was mis-classified as IS NOT NULL: the
bitmap branches on (cmp_type == EstatscmptEq) when the const is
NULL, so INDF with a NULL const sets want_null=false and matches
non-NULL items.
* IS DISTINCT FROM <non-null> killed every MCV item: the non-NULL
const path knocks out NULL items first (wrong -- IDF should match
NULLs), and EvaluateCmpOnIDatum's default branch returns false
for IDF/INDF, knocking out the rest. The clauses are then
SetEstimated, so the main per-clause loop does not get to
correct it, producing ~0 row estimates that derail join planning.
Skip EstatscmptIDF / EstatscmptINDF at candidate collection so the
downstream per-clause histogram path handles them. While here,
tighten EvaluateCmpOnIDatum's default branch to GPOS_ASSERT so any
future widening of the candidate set catches a missing case loudly
in debug builds while still falling through safely in release.
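A minimal sketch of the candidate gate this commit describes; the enum is a simplified stand-in for CStatsPred's comparison types (only the members named above come from the source, the rest fill out the illustration):

```cpp
#include <cassert>

// Simplified stand-in for CStatsPred's comparison-type enum.
enum EStatsCmpType
{
	EstatscmptEq,
	EstatscmptNEq,
	EstatscmptL,
	EstatscmptLEq,
	EstatscmptG,
	EstatscmptGEq,
	EstatscmptIDF,	 // IS NOT DISTINCT FROM
	EstatscmptINDF,	 // IS DISTINCT FROM
};

// Candidate gate: accept a point predicate for the MCV bitmap step only if
// the bitmap can express its semantics; IDF/INDF have NULL-aware
// three-valued semantics the bitmap cannot, so they fall through to the
// per-clause histogram path.
bool IsMCVBitmapCandidate(EStatsCmpType cmp)
{
	switch (cmp)
	{
		case EstatscmptIDF:
		case EstatscmptINDF:
			return false;
		default:
			return true;
	}
}

int main()
{
	assert(IsMCVBitmapCandidate(EstatscmptEq));
	assert(!IsMCVBitmapCandidate(EstatscmptIDF));
	assert(!IsMCVBitmapCandidate(EstatscmptINDF));
}
```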
Replace the legacy Greenplum-style header on the four new files introduced by the multivariate MCV support commit with the standard Apache 2.0 license header.
GetExtStatsKinds() catches any catalog error and returns nullptr when the pg_statistic_ext row no longer exists (e.g. after DROP STATISTICS). Previously the code fell through to GetExtStatsName(), which also returns nullptr on error, and then passed that nullptr directly to CDXLUtils::CreateDynamicStringFromCharArray(), whose GPOS_ASSERT fires in debug builds (ORCA falls back to the PG planner); in release builds the assert is a no-op, leading to a null-pointer dereference and SIGSEGV. Fix with two guards:

1. When GetExtStatsKinds() returns nullptr, return a minimal empty CDXLExtStats placeholder immediately so the MD cache stores a valid (inert) object and downstream code never sees an uninitialized entry.
2. After the kinds check passes, still null-check the result of GetExtStatsName() and substitute an empty string, making the function safe even in partial-failure scenarios.
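A toy mirror of the two-guard pattern (all names and types here are stand-ins, not the real MD-provider signatures): the retrieval can lose the catalog row at two points, and each point gets a safe fallback instead of letting a nullptr escape into the cache:

```cpp
#include <iostream>
#include <memory>
#include <string>

struct ExtStats
{
	std::string name;
	std::string kinds;	// empty kinds == inert placeholder
};

// Stand-ins for the two catalog lookups that can fail independently.
const char *LookupKinds(bool row_exists) { return row_exists ? "mcv" : nullptr; }
const char *LookupName(bool name_readable) { return name_readable ? "stx_ab" : nullptr; }

std::unique_ptr<ExtStats> RetrieveExtStats(bool row_exists, bool name_readable)
{
	const char *kinds = LookupKinds(row_exists);
	if (kinds == nullptr)
	{
		// Guard 1: the row is gone (e.g. concurrent DROP STATISTICS);
		// cache a valid but empty object, never a null entry.
		return std::make_unique<ExtStats>();
	}

	const char *name = LookupName(name_readable);
	// Guard 2: tolerate partial failure by substituting an empty name
	// rather than handing nullptr to a string constructor.
	return std::make_unique<ExtStats>(ExtStats{name ? name : "", kinds});
}

int main()
{
	std::cout << RetrieveExtStats(false, false)->kinds << "\n";	 // "" (placeholder)
	std::cout << RetrieveExtStats(true, false)->name << "\n";	 // "" (substituted)
	std::cout << RetrieveExtStats(true, true)->name << "\n";	 // "stx_ab"
}
```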
NameStr(staForm->stxname) was read after ReleaseSysCache(htup) had already been called, so it returned a pointer into the released tuple buffer. Copy the name with pstrdup() first, then release the cache entry.
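For concreteness, a hedged sketch of the corrected ordering using the standard PostgreSQL syscache idiom; the wrapper name and variables are assumptions from the commit message, and the snippet depends on server headers rather than being standalone:

```cpp
// Assumed wrapper shape; requires PostgreSQL server headers.
char *
GetExtStatsNameCopy(Oid statOid)
{
	HeapTuple htup = SearchSysCache1(STATEXTOID, ObjectIdGetDatum(statOid));
	if (!HeapTupleIsValid(htup))
		return NULL;

	Form_pg_statistic_ext staForm = (Form_pg_statistic_ext) GETSTRUCT(htup);

	// Copy the name while the tuple is still pinned...
	char *name = pstrdup(NameStr(staForm->stxname));

	// ...then release the cache entry; `name` stays valid afterwards,
	// whereas reading NameStr(staForm->stxname) past this point would
	// dereference memory the cache may have reclaimed.
	ReleaseSysCache(htup);

	return name;
}
```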
Fixes #ISSUE_Number

What does this PR do?

Previously, string statistics were mapped to LINT values via hashtext, which does not preserve lexicographic order. As a result, ORCA's range/comparison estimates on string columns could be wildly inaccurate. This change introduces an order-preserving mapping so that a < b on strings implies LINT(a) < LINT(b), restoring meaningful selectivity estimates for inequality predicates on text columns.
Type of Change
Breaking Changes
Test Plan
- make installcheck
- make -C src/test installcheck-cbdb-parallel

Impact
Performance:
User-facing changes:
Dependencies:
Checklist
Additional Context
CI Skip Instructions