[Enhancement](variant) Support mixed subcolumn/doc-value in doc mode #61539
Open
csun5285 wants to merge 3 commits into apache:master
TPC-H: Total hot run time: 27003 ms
TPC-H: Total hot run time: 26947 ms
TPC-DS: Total hot run time: 168184 ms
…nd variant_max_subcolumns_count in doc mode
1. BE: Support mixed-format compaction for doc-mode variant columns.
- Add SubcolumnToDocCompactIterator to convert subcolumn-only
segments into doc-value format during compaction read.
- Add SUBCOLUMN_TO_DOC ReadKind in variant_column_reader to handle
segments that have subcolumn data but need to be read as doc-value
buckets (e.g. when max_subcolumns_count > actual path count).
- Detect actual segment storage format (has_doc_value_segments) in
aggregate_variant_extended_info to decide whether to output
doc-value bucket columns or subcolumn columns in compaction schema.
- Support ColumnVariant::insert_from/insert_range_from between
doc-value mode and subcolumn mode by auto-converting formats.
2. BE: Refactor parse_json_to_variant to support max_subcolumns_count.
- Split into parse_json_to_result() + insert_parse_result_into_variant()
so parsed results can be inspected before insertion.
- Add max_subcolumns_count to ParseConfig to control doc-mode
threshold at parse time.
3. FE: Allow variant_max_subcolumns_count in doc mode.
- Remove validation that prevented variant_max_subcolumns_count
and variant_enable_doc_mode from being set together.
- Output variant_max_subcolumns_count in VariantType.toSql() when
doc mode is enabled (both nereids and fe-type).
- Randomize defaultVariantMaxSubcolumnsCount in fuzzy test mode
when doc mode is enabled.
4. FE: Fix variant_sparse_hash_shard_count backward compatibility.
- Output Math.max(1, variantSparseHashShardCount) in toSql() to
handle old data where this parameter defaults to 0.
5. Testing: Fix doc-mode test stability.
- Add explicit `set default_variant_max_subcolumns_count = 0` in
doc-mode test suites to prevent fuzzy randomization interference.
- Update .out files for variant_max_subcolumns_count and
variant_sparse_hash_shard_count output changes.
- Reset enable_inverted_index_v1_for_variant in var_index test.
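The subcolumn-to-doc-value conversion described in item 1 can be pictured with a minimal sketch. This is not the Doris implementation: `SubcolumnLayout`, `DocValueRow`, and `subcolumns_to_doc_value` are hypothetical names, values are modeled as plain strings, and bucketing is ignored. It only illustrates turning per-path columns into per-row (path, value) lists with paths in sorted order.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical simplified model, not the Doris types.
// Subcolumn layout: one column of optional values per JSON path.
using SubcolumnLayout =
        std::map<std::string, std::vector<std::optional<std::string>>>;
// Doc-value layout: for each row, the (path, value) pairs present in that row.
using DocValueRow = std::vector<std::pair<std::string, std::string>>;

// Convert column-oriented subcolumns into row-oriented doc-value entries.
std::vector<DocValueRow> subcolumns_to_doc_value(const SubcolumnLayout& subcolumns,
                                                 std::size_t num_rows) {
    std::vector<DocValueRow> rows(num_rows);
    // std::map iterates in sorted key order, so every row ends up path-sorted.
    for (const auto& [path, values] : subcolumns) {
        assert(values.size() == num_rows);  // all subcolumns must be aligned
        for (std::size_t i = 0; i < num_rows; ++i) {
            if (values[i].has_value()) {
                rows[i].emplace_back(path, *values[i]);
            }
        }
    }
    return rows;
}
```

Rows that lack a path simply omit it, which is why a doc-value row can be shorter than the number of subcolumns.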
[fix](variant) Fix Rowset::tablet_schema() in BE_TEST mode to respect set_tablet_schema updates
Fix the cascaded compaction bug where col_uid=-1 caused NOT_FOUND errors.
In BE_TEST mode, Rowset::tablet_schema() returned the stale _schema snapshot
instead of _rowset_meta->tablet_schema(), ignoring updates from
set_tablet_schema(copy_without_variant_extracted_columns()).
Also add comprehensive segment write/read and compaction UT cases:
- Doc-value and subcolumn mode write/read verification
- Multi-segment rowset read
- Mixed JSON shapes (empty, nested, sparse, null/bool/string)
- Single row minimal JSON
- Large JSON (5000 keys)
- Cross-rowset disjoint keys (2000+2000 > 2048 threshold) read and compact
- Cascaded compaction re-verification
format
[feature](variant) Add doc mode sparse column check and regression test
1. Add sparse column invariant check in VariantColumnWriterImpl::finalize()
- In doc mode, all data should be encoded as doc-value buckets or
materialized subcolumns. Sparse column data indicates an upstream bug.
- Returns InternalError if sparse entries are found during doc mode write.
2. Add regression test test_doc_mode_mixed_key_count.groovy
- Tests doc mode behavior with different key counts vs threshold (count=5)
- 8 test cases: keys < count, keys > count, keys = count (boundary),
mixed batches (lt->gt, gt->lt), multi-insert compaction,
alternating above/below threshold
- Each case verifies data correctness before and after compaction
3. Fix ParseConfig::enable_flatten_nested -> deprecated_enable_flatten_nested
- Rename in variant_mixed_compaction_test.cpp (12 occurrences)
- Rename in variant_util_test.cpp (21 occurrences)
- Rename in variant_max_subcolumns_cross_read_test.cpp (1 occurrence)
- Required after rebase onto upstream/master
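The invariant check from item 1 can be sketched as follows. The `Status` and `ColumnState` types and the function name are hypothetical stand-ins; the real check lives in VariantColumnWriterImpl::finalize() and returns an InternalError, whose exact message is not given in the commit.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the BE status type.
struct Status {
    bool ok;
    std::string message;
};

// Hypothetical summary of what the writer knows at finalize time.
struct ColumnState {
    bool doc_mode;
    std::size_t sparse_entry_count;
};

// In doc mode, data should live in doc-value buckets or materialized
// subcolumns, so any sparse entry is reported as an internal error.
Status check_doc_mode_has_no_sparse(const ColumnState& column) {
    if (column.doc_mode && column.sparse_entry_count > 0) {
        return {false, "doc mode column contains " +
                               std::to_string(column.sparse_entry_count) +
                               " sparse entries"};
    }
    return {true, ""};
}
```

Failing the write instead of silently storing the sparse data surfaces the upstream bug at the point where it first becomes observable.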
[fix](variant) Fix doc mode sparse bug and _prev_positions UAF in insert_range_from
1. Fix doc mode INSERT INTO SELECT: when subcolumns exceed max_subcolumns_count
across rowsets, sparse data was not properly converted to doc_value format.
- Add convert_subcolumns_and_sparse_to_doc_value() to handle both subcolumns
and sparse data conversion with correct path ordering.
- Fix insert_range_from scenarios: src(S+sparse)->dst(D) and src(D)->dst(S+sparse).
- Fix variant_util parse layer: when doc mode is enabled and variant has sparse
data, convert all data to doc_value column.
2. Fix heap-use-after-free (ASAN): convert_subcolumns_to_doc_value() and
convert_subcolumns_and_sparse_to_doc_value() destroy old SubcolumnsTree nodes
via 'subcolumns = Subcolumns()', but _prev_positions cache still holds raw
pointers to destroyed Subcolumn objects. Add _prev_positions.clear() after
tree replacement to invalidate stale cache entries.
3. Add UT: convert_subcolumns_to_doc_value_cache_invalidation reproduces the
ASAN UAF when _prev_positions.clear() is missing.
4. Add regression test: test_doc_mode_downgrade_insert_into with 9 test cases
covering all src/dst structure combinations, mixed types, multi-round INSERT,
JSON completeness, and compaction.
5. Remove debug LOG(INFO) statements.
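The use-after-free in item 2 follows a common pattern: a lookup cache of raw pointers outlives the container that owns the pointees. A minimal sketch, assuming a simplified `VariantSketch` class with hypothetical members (the real code uses SubcolumnsTree and _prev_positions), shows why the cache has to be cleared whenever the tree is replaced.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical simplified subcolumn; the real one holds column data.
struct Subcolumn {
    int num_rows = 0;
};

class VariantSketch {
public:
    // Return the subcolumn for a path, creating it on first use.
    // The raw pointer is cached so repeated lookups skip the owning map.
    Subcolumn* get_or_add(const std::string& path) {
        auto cached = prev_positions_.find(path);
        if (cached != prev_positions_.end()) {
            return cached->second;
        }
        auto& slot = subcolumns_[path];
        if (!slot) {
            slot = std::make_unique<Subcolumn>();
        }
        prev_positions_[path] = slot.get();
        return slot.get();
    }

    // Replacing the tree destroys every Subcolumn it owned, so the cache
    // must be cleared too; otherwise get_or_add would hand out dangling
    // pointers on the next lookup.
    void convert_to_doc_value() {
        subcolumns_ = Subcolumns();
        prev_positions_.clear();
    }

    std::size_t cached_entries() const { return prev_positions_.size(); }
    std::size_t subcolumn_count() const { return subcolumns_.size(); }

private:
    using Subcolumns = std::unordered_map<std::string, std::unique_ptr<Subcolumn>>;
    Subcolumns subcolumns_;
    std::unordered_map<std::string, Subcolumn*> prev_positions_;
};
```

Without the `prev_positions_.clear()` call, a lookup after conversion would return a pointer to a destroyed object, which is the class of bug ASAN reports as heap-use-after-free.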
Fix variant column inverted index being lost during compaction when using field_pattern indexes.
In the all_downgraded branch of get_extended_compaction_schema, typed columns were not correctly registered, causing field_pattern indexes to be dropped from the compaction schema.
Changes:
- variant_util.cpp: ensure get_compaction_typed_columns is called in the all_downgraded branch to correctly propagate field_pattern indexes.
- variant_mixed_compaction_test.cpp: refactor create_rowset to use the correct vertical compaction API (add_columns/flush_columns/final_flush), add 3 new test cases verifying inverted index generation across all_downgraded, all_doc_value, and mixed mode compaction scenarios, using IndexFileReader for physical-level index verification.
- Add regression test output for test_doc_mode_mixed_key_count.
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)