[Enhancement](variant) Support mixed subcolumn/doc-value in doc mode#61539

Open
csun5285 wants to merge 3 commits into apache:master from csun5285:feature/variant-new-feature

Conversation

@csun5285
Contributor

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Mar 20, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@csun5285
Contributor Author

run buildall

@csun5285
Contributor Author

run buildall

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.63% (1796/2284)
Line Coverage 64.38% (32272/50130)
Region Coverage 65.27% (16162/24760)
Branch Coverage 55.71% (8611/15456)

@doris-robot

TPC-H: Total hot run time: 27003 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit a9bc670d05a1bbfb01b2e3995df970dca1c7938c, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17632	4451	4348	4348
q2	q3	10727	844	521	521
q4	4716	361	252	252
q5	8107	1211	1002	1002
q6	246	176	148	148
q7	830	861	665	665
q8	10696	1496	1373	1373
q9	6325	4768	4725	4725
q10	6443	1931	1659	1659
q11	480	271	248	248
q12	765	581	467	467
q13	18061	2974	2161	2161
q14	231	244	217	217
q15	q16	784	744	674	674
q17	753	866	440	440
q18	5904	5439	5294	5294
q19	1219	1000	629	629
q20	544	505	374	374
q21	4756	1967	1527	1527
q22	389	363	279	279
Total cold run time: 99608 ms
Total hot run time: 27003 ms
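
The totals above can be re-derived from the per-query rows. The following is a minimal sketch, assuming each row lists one cold run followed by two hot runs and that the last column is the smaller of the two hot runs; that reading is inferred from the numbers, not stated by the benchmark script. `QueryRun`, `Totals`, `best_hot_ms`, and `summarize` are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// One row of the benchmark table: query name, then three timings in ms.
// Treating the first timing as the cold run and the next two as hot runs
// is an assumption inferred from the reported totals.
struct QueryRun {
    std::string name;
    long cold_ms;
    long hot1_ms;
    long hot2_ms;
};

struct Totals {
    long cold_ms = 0;
    long hot_ms = 0;
};

// Best hot time for one query: the smaller of the two hot runs.
inline long best_hot_ms(const QueryRun& run) {
    return std::min(run.hot1_ms, run.hot2_ms);
}

// Sum the cold runs and the best hot runs over all queries.
inline Totals summarize(const std::vector<QueryRun>& runs) {
    Totals totals;
    for (const QueryRun& run : runs) {
        assert(run.cold_ms >= 0 && run.hot1_ms >= 0 && run.hot2_ms >= 0);
        totals.cold_ms += run.cold_ms;
        totals.hot_ms += best_hot_ms(run);
    }
    return totals;
}
```

Summing the 20 timed rows of Round 1 this way gives 99608 ms cold and 27003 ms hot, matching the reported totals.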

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4893	4575	4576	4575
q2	q3	3923	4350	3813	3813
q4	863	1203	789	789
q5	4075	4418	4342	4342
q6	188	178	141	141
q7	1754	1649	1570	1570
q8	2556	2749	2614	2614
q9	7600	7399	7327	7327
q10	3833	4030	3658	3658
q11	513	447	495	447
q12	529	583	437	437
q13	2738	3116	2277	2277
q14	342	459	374	374
q15	q16	740	804	739	739
q17	1209	1402	1414	1402
q18	7062	6751	6733	6733
q19	923	884	1003	884
q20	2110	2171	1957	1957
q21	4044	3614	3384	3384
q22	477	431	401	401
Total cold run time: 50372 ms
Total hot run time: 47864 ms

csun5285 force-pushed the feature/variant-new-feature branch from a9bc670 to 98c1b07 on March 23, 2026 08:19
@csun5285
Contributor Author

run buildall

@doris-robot

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.63% (1796/2284)
Line Coverage 64.38% (32275/50130)
Region Coverage 65.28% (16163/24760)
Branch Coverage 55.69% (8608/15456)

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 14.29% (1/7) 🎉
Increment coverage report
Complete coverage report

@csun5285
Contributor Author

run buildall

@doris-robot

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.63% (1796/2284)
Line Coverage 64.46% (32312/50130)
Region Coverage 65.32% (16172/24760)
Branch Coverage 55.80% (8624/15456)

@doris-robot

TPC-H: Total hot run time: 26947 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit c68a52ce40c518f8f090ece5fe3dcd70d701ddaf, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17643	4689	4370	4370
q2	q3	10582	770	520	520
q4	4684	361	254	254
q5	7573	1210	1000	1000
q6	176	170	144	144
q7	764	836	671	671
q8	9298	1493	1333	1333
q9	4971	4811	4707	4707
q10	6305	1918	1686	1686
q11	464	240	242	240
q12	746	580	458	458
q13	18065	2921	2173	2173
q14	233	236	213	213
q15	q16	748	744	675	675
q17	738	788	510	510
q18	6233	5441	5256	5256
q19	1127	991	616	616
q20	538	478	378	378
q21	4453	1822	1437	1437
q22	494	358	306	306
Total cold run time: 95835 ms
Total hot run time: 26947 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4712	4529	4592	4529
q2	q3	3957	4415	3878	3878
q4	883	1207	777	777
q5	4031	4415	4307	4307
q6	186	177	139	139
q7	1842	1700	1561	1561
q8	2514	2746	2565	2565
q9	7641	7426	7558	7426
q10	3761	3990	3595	3595
q11	510	422	408	408
q12	496	649	436	436
q13	2740	3288	2318	2318
q14	329	311	278	278
q15	q16	757	808	728	728
q17	1184	1375	1338	1338
q18	7327	6700	6679	6679
q19	944	912	884	884
q20	2109	2203	2095	2095
q21	4010	3499	3366	3366
q22	500	439	379	379
Total cold run time: 50433 ms
Total hot run time: 47686 ms

@doris-robot

TPC-DS: Total hot run time: 168184 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit c68a52ce40c518f8f090ece5fe3dcd70d701ddaf, data reload: false

query5	4364	648	496	496
query6	330	224	202	202
query7	4215	468	265	265
query8	343	241	225	225
query9	8718	2685	2719	2685
query10	514	413	341	341
query11	7049	5072	4858	4858
query12	181	125	119	119
query13	1260	450	341	341
query14	5726	3775	3418	3418
query14_1	2836	2797	2834	2797
query15	217	207	179	179
query16	958	472	436	436
query17	875	703	593	593
query18	2418	476	338	338
query19	212	213	178	178
query20	132	125	123	123
query21	209	130	109	109
query22	13287	13963	14987	13963
query23	16476	15867	15549	15549
query23_1	15610	16095	15649	15649
query24	7604	1635	1238	1238
query24_1	1239	1226	1255	1226
query25	568	489	435	435
query26	1241	273	158	158
query27	2771	491	300	300
query28	4499	1859	1850	1850
query29	873	580	473	473
query30	291	223	188	188
query31	1010	954	876	876
query32	84	71	68	68
query33	502	321	283	283
query34	900	868	520	520
query35	644	679	607	607
query36	1085	1078	918	918
query37	132	93	86	86
query38	3033	2937	2870	2870
query39	868	832	808	808
query39_1	797	800	789	789
query40	234	154	133	133
query41	64	59	57	57
query42	258	256	256	256
query43	259	252	214	214
query44	
query45	205	190	186	186
query46	871	973	605	605
query47	2092	2117	2067	2067
query48	317	320	226	226
query49	634	456	377	377
query50	677	281	215	215
query51	4071	4061	3955	3955
query52	259	268	257	257
query53	297	339	284	284
query54	306	279	259	259
query55	99	81	89	81
query56	305	313	359	313
query57	1910	1697	1709	1697
query58	286	277	275	275
query59	2813	2955	2755	2755
query60	339	338	322	322
query61	160	156	181	156
query62	626	580	552	552
query63	313	285	271	271
query64	5022	1285	1009	1009
query65	
query66	1465	451	349	349
query67	24213	24294	24245	24245
query68	
query69	407	321	289	289
query70	966	980	930	930
query71	353	311	301	301
query72	2853	2896	2644	2644
query73	556	549	326	326
query74	9654	9542	9302	9302
query75	2858	2774	2492	2492
query76	2298	1043	688	688
query77	379	392	325	325
query78	10869	11140	10443	10443
query79	1104	785	586	586
query80	892	656	559	559
query81	516	260	232	232
query82	1349	154	124	124
query83	351	279	254	254
query84	300	126	103	103
query85	964	512	454	454
query86	391	335	288	288
query87	3183	3112	3034	3034
query88	3636	2698	2686	2686
query89	430	375	349	349
query90	1877	186	176	176
query91	170	164	145	145
query92	80	77	72	72
query93	933	843	490	490
query94	508	280	299	280
query95	575	341	381	341
query96	661	538	244	244
query97	2457	2462	2396	2396
query98	227	217	214	214
query99	973	978	937	937
Total cold run time: 248876 ms
Total hot run time: 168184 ms

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 14.29% (1/7) 🎉
Increment coverage report
Complete coverage report

…nd variant_max_subcolumns_count in doc mode

1. BE: Support mixed-format compaction for doc-mode variant columns.
   - Add SubcolumnToDocCompactIterator to convert subcolumn-only
     segments into doc-value format during compaction read.
   - Add SUBCOLUMN_TO_DOC ReadKind in variant_column_reader to handle
     segments that have subcolumn data but need to be read as doc-value
     buckets (e.g. when max_subcolumns_count > actual path count).
   - Detect actual segment storage format (has_doc_value_segments) in
     aggregate_variant_extended_info to decide whether to output
     doc-value bucket columns or subcolumn columns in compaction schema.
   - Support ColumnVariant::insert_from/insert_range_from between
     doc-value mode and subcolumn mode by auto-converting formats.

2. BE: Refactor parse_json_to_variant to support max_subcolumns_count.
   - Split into parse_json_to_result() + insert_parse_result_into_variant()
     so parsed results can be inspected before insertion.
   - Add max_subcolumns_count to ParseConfig to control doc-mode
     threshold at parse time.

3. FE: Allow variant_max_subcolumns_count in doc mode.
   - Remove validation that prevented variant_max_subcolumns_count
     and variant_enable_doc_mode from being set together.
   - Output variant_max_subcolumns_count in VariantType.toSql() when
     doc mode is enabled (both nereids and fe-type).
   - Randomize defaultVariantMaxSubcolumnsCount in fuzzy test mode
     when doc mode is enabled.

4. FE: Fix variant_sparse_hash_shard_count backward compatibility.
   - Output Math.max(1, variantSparseHashShardCount) in toSql() to
     handle old data where this parameter defaults to 0.

5. Testing: Fix doc-mode test stability.
   - Add explicit `set default_variant_max_subcolumns_count = 0` in
     doc-mode test suites to prevent fuzzy randomization interference.
   - Update .out files for variant_max_subcolumns_count and
     variant_sparse_hash_shard_count output changes.
   - Reset enable_inverted_index_v1_for_variant in var_index test.
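
To illustrate the subcolumn-to-doc-value conversion described in item 1, here is a minimal sketch of the idea only. It is not the actual SubcolumnToDocCompactIterator: `Cell`, `Subcolumns`, `DocRow`, and `subcolumns_to_doc_rows` are hypothetical stand-ins, values are plain strings, and the bucketing of doc-value columns is ignored. It assumes each row's paths should come out in sorted order.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in: one cell of a subcolumn, which may be absent for a row.
struct Cell {
    bool present;
    std::string value;
};

// Columnar layout: one subcolumn per JSON path, one cell per row.
using Subcolumns = std::map<std::string, std::vector<Cell>>;
// Row-wise layout: the (path, value) pairs present in one row.
using DocRow = std::vector<std::pair<std::string, std::string>>;

// Pivot subcolumns into per-row documents, skipping absent cells.
inline std::vector<DocRow> subcolumns_to_doc_rows(const Subcolumns& subcolumns,
                                                  std::size_t num_rows) {
    std::vector<DocRow> rows(num_rows);
    // std::map iterates in key order, so every row receives its paths sorted.
    for (const auto& entry : subcolumns) {
        const std::string& path = entry.first;
        const std::vector<Cell>& cells = entry.second;
        assert(cells.size() == num_rows);  // every subcolumn covers every row
        for (std::size_t row = 0; row < num_rows; ++row) {
            if (cells[row].present) {
                rows[row].emplace_back(path, cells[row].value);
            }
        }
    }
    return rows;
}
```

With subcolumns `a` = [1, 3] and `b` = [2, absent], row 0 becomes {a: 1, b: 2} and row 1 becomes {a: 3}.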

[fix](variant) Fix Rowset::tablet_schema() in BE_TEST mode to respect set_tablet_schema updates

Fix the cascaded compaction bug where col_uid=-1 caused NOT_FOUND errors.
In BE_TEST mode, Rowset::tablet_schema() returned the stale _schema snapshot
instead of _rowset_meta->tablet_schema(), ignoring updates from
set_tablet_schema(copy_without_variant_extracted_columns()).
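
The stale-snapshot pattern behind this bug can be sketched as follows. This is a simplified illustration under assumed names (`Schema`, `Meta`, `RowsetSketch` are hypothetical, not the Doris classes): an accessor that returns a schema captured at construction never sees later updates made through the metadata object, while one that reads through the metadata does.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical, simplified stand-ins for the real schema and rowset-meta classes.
struct Schema {
    std::string name;
};
using SchemaPtr = std::shared_ptr<const Schema>;

class Meta {
public:
    explicit Meta(SchemaPtr schema) : _schema(std::move(schema)) {
        assert(_schema != nullptr);
    }
    void set_schema(SchemaPtr schema) { _schema = std::move(schema); }
    const SchemaPtr& schema() const { return _schema; }

private:
    SchemaPtr _schema;
};

class RowsetSketch {
public:
    // _meta is declared before _snapshot, so it is initialized first.
    explicit RowsetSketch(std::shared_ptr<Meta> meta)
            : _meta(std::move(meta)), _snapshot(_meta->schema()) {}

    // Stale pattern: returns the schema captured at construction time.
    const SchemaPtr& snapshot_schema() const { return _snapshot; }

    // Fixed pattern: always reads through the metadata, so updates are visible.
    const SchemaPtr& current_schema() const { return _meta->schema(); }

private:
    std::shared_ptr<Meta> _meta;
    SchemaPtr _snapshot;
};
```

After `set_schema` is called on the shared `Meta`, `snapshot_schema()` still returns the old schema while `current_schema()` returns the new one.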

Also add comprehensive segment write/read and compaction UT cases:
- Doc-value and subcolumn mode write/read verification
- Multi-segment rowset read
- Mixed JSON shapes (empty, nested, sparse, null/bool/string)
- Single row minimal JSON
- Large JSON (5000 keys)
- Cross-rowset disjoint keys (2000+2000 > 2048 threshold) read and compact
- Cascaded compaction re-verification

format

[feature](variant) Add doc mode sparse column check and regression test

1. Add sparse column invariant check in VariantColumnWriterImpl::finalize()
   - In doc mode, all data should be encoded as doc-value buckets or
     materialized subcolumns. Sparse column data indicates an upstream bug.
   - Returns InternalError if sparse entries are found during doc mode write.

2. Add regression test test_doc_mode_mixed_key_count.groovy
   - Tests doc mode behavior with different key counts vs threshold (count=5)
   - 8 test cases: keys < count, keys > count, keys = count (boundary),
     mixed batches (lt->gt, gt->lt), multi-insert compaction,
     alternating above/below threshold
   - Each case verifies data correctness before and after compaction

3. Fix ParseConfig::enable_flatten_nested -> deprecated_enable_flatten_nested
   - Rename in variant_mixed_compaction_test.cpp (12 occurrences)
   - Rename in variant_util_test.cpp (21 occurrences)
   - Rename in variant_max_subcolumns_cross_read_test.cpp (1 occurrence)
   - Required after rebase onto upstream/master
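
The invariant in item 1 amounts to a guard of roughly this shape. This is a sketch under assumptions, not the code in VariantColumnWriterImpl::finalize(): `CheckResult` is a hypothetical stand-in for the real status type, and the function name and message text are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a status object: ok flag plus an error message.
struct CheckResult {
    bool ok;
    std::string message;
};

// In doc mode every value should live in doc-value buckets or materialized
// subcolumns, so any sparse entry is reported as an internal error.
inline CheckResult check_no_sparse_in_doc_mode(bool doc_mode_enabled,
                                               std::size_t num_rows,
                                               std::size_t sparse_entry_count) {
    // An empty column cannot hold sparse entries.
    assert(num_rows > 0 || sparse_entry_count == 0);
    if (doc_mode_enabled && sparse_entry_count > 0) {
        return CheckResult{false, "doc mode write found " +
                                          std::to_string(sparse_entry_count) +
                                          " unexpected sparse entries"};
    }
    return CheckResult{true, std::string()};
}
```

Outside doc mode the check passes regardless of the sparse entry count, since sparse columns are a normal storage form there.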

[fix](variant) Fix doc mode sparse bug and _prev_positions UAF in insert_range_from

1. Fix doc mode INSERT INTO SELECT: when subcolumns exceed max_subcolumns_count
   across rowsets, sparse data was not properly converted to doc_value format.
   - Add convert_subcolumns_and_sparse_to_doc_value() to handle both subcolumns
     and sparse data conversion with correct path ordering.
   - Fix insert_range_from scenarios: src(S+sparse)->dst(D) and src(D)->dst(S+sparse).
   - Fix variant_util parse layer: when doc mode is enabled and variant has sparse
     data, convert all data to doc_value column.

2. Fix heap-use-after-free (ASAN): convert_subcolumns_to_doc_value() and
   convert_subcolumns_and_sparse_to_doc_value() destroy old SubcolumnsTree nodes
   via 'subcolumns = Subcolumns()', but _prev_positions cache still holds raw
   pointers to destroyed Subcolumn objects. Add _prev_positions.clear() after
   tree replacement to invalidate stale cache entries.

3. Add UT: convert_subcolumns_to_doc_value_cache_invalidation reproduces the
   ASAN UAF when _prev_positions.clear() is missing.

4. Add regression test: test_doc_mode_downgrade_insert_into with 9 test cases
   covering all src/dst structure combinations, mixed types, multi-round INSERT,
   JSON completeness, and compaction.

5. Remove debug LOG(INFO) statements.
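
The use-after-free in item 2 follows a common pattern: a lookup cache of raw pointers outlives the objects it points to. A minimal sketch, assuming a much simpler tree than SubcolumnsTree (`Node` and `TreeWithCache` are hypothetical; only the `_prev_positions` name is borrowed from the commit message):

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <unordered_map>

struct Node {
    int value;
};

class TreeWithCache {
public:
    void add(const std::string& path, int value) {
        _prev_positions.erase(path);  // the old node for this path is about to die
        _nodes[path] = std::unique_ptr<Node>(new Node{value});
    }

    // Look up a node and remember its raw pointer for later calls.
    const Node* find(const std::string& path) {
        auto cached = _prev_positions.find(path);
        if (cached != _prev_positions.end()) {
            return cached->second;
        }
        auto it = _nodes.find(path);
        if (it == _nodes.end()) {
            return nullptr;
        }
        assert(it->second != nullptr);
        _prev_positions[path] = it->second.get();
        return it->second.get();
    }

    // Replace the whole tree, as 'subcolumns = Subcolumns()' does in the commit.
    void reset_tree() {
        _nodes.clear();           // destroys every Node
        _prev_positions.clear();  // without this, find() would return dangling pointers
    }

    std::size_t cached_count() const { return _prev_positions.size(); }

private:
    std::map<std::string, std::unique_ptr<Node>> _nodes;
    std::unordered_map<std::string, const Node*> _prev_positions;
};
```

Dropping the `_prev_positions.clear()` line makes a later `find()` for a previously cached path return a pointer into freed memory, which is the kind of access ASAN reports as heap-use-after-free.
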

Fix variant column inverted index being lost during compaction when using
field_pattern indexes. In the all_downgraded branch of
get_extended_compaction_schema, typed columns were not correctly registered,
causing field_pattern indexes to be dropped from the compaction schema.

Changes:
- variant_util.cpp: ensure get_compaction_typed_columns is called in the
  all_downgraded branch to correctly propagate field_pattern indexes.
- variant_mixed_compaction_test.cpp: refactor create_rowset to use the
  correct vertical compaction API (add_columns/flush_columns/final_flush),
  add 3 new test cases verifying inverted index generation across
  all_downgraded, all_doc_value, and mixed mode compaction scenarios,
  using IndexFileReader for physical-level index verification.
- Add regression test output for test_doc_mode_mixed_key_count.

csun5285 force-pushed the feature/variant-new-feature branch from c68a52c to d55cb08 on March 24, 2026 11:10
4 participants