
[feature](filecache) persist table/partition context for cache meta for external table#61518

Open
freemandealer wants to merge 1 commit into apache:master from freemandealer:observe-lakehouse

Conversation

@freemandealer
Contributor

MOTIVATION
While internal table/partition info can be inferred from the tabletId, external table scans have no such mapping: file cache entries cannot be traced back to the original table/partition in a consistent way. This change propagates file cache context from FE to BE, persists it in RocksDB using a deduplicated dictionary design, and exposes it through FILE_CACHE_INFO.

FE DESIGN

  • Extend thrift scan descriptors with optional file cache context fields:
    • TFileRangeDesc.table_name / partition_name
    • TPaloScanRange.partition_name
    • TOlapScanNode.partition_name
  • Centralize external table context filling in FileScanNode:
    • use table.getNameWithFullQualifiers() as the stable table identifier
    • build partition_name from partition path key/value pairs in a shared helper
    • reuse the helper from FileQueryScanNode, FileGroupInfo, and NereidsFileGroupInfo
  • Propagate internal table context from OlapScanNode:
    • set table_name on both scan node thrift and per-tablet scan ranges
    • set partition_name when the scan targets a single partition
    • keep partition_name empty for multi-partition scans
  • Extend the FILE_CACHE_INFO schema with TABLE_NAME and PARTITION_NAME columns
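The shared helper that builds partition_name from partition path key/value pairs can be sketched as follows. This is a minimal C++ illustration of the idea (the actual helper lives in FE Java code); the function name and the Hive-style `k1=v1/k2=v2` separators are assumptions, not the real implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: join partition path key/value pairs into a
// Hive-style partition string "k1=v1/k2=v2".
std::string build_partition_name(const std::vector<std::string>& keys,
                                 const std::vector<std::string>& values) {
    std::string out;
    // Use the smaller of the two sizes defensively; mismatched inputs
    // are silently truncated (a reviewer notes this trade-off below).
    size_t n = std::min(keys.size(), values.size());
    for (size_t i = 0; i < n; ++i) {
        if (i > 0) out += '/';
        out += keys[i] + '=' + values[i];
    }
    return out;
}
```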

BE DESIGN

  • Extend io::IOContext and file cache context objects with table_name and partition_name, so the information is available on all file-cache-aware read paths
  • Propagate context through:
    • FileScanner for external table scans
    • OlapScanner / TabletReader / RowsetReaderContext for internal table scans
    • the parallel internal scan path, so context is preserved even when scanners are split by tablet or segment
  • Make FileScanner derive partition_name from columns_from_path when FE does not send an explicit partition string, and clear stale range context when a later range omits table metadata
  • Persist block metadata in RocksDB by storing only a context_id in CacheBlockMetaPb instead of duplicating table/partition strings per block
  • Add a dedicated RocksDB column family as a dictionary:
    • (table_name, partition_name) -> context_id
    • context_id -> (table_name, partition_name)
  • Resolve context_id back to strings only when serving FILE_CACHE_INFO, so the normal data read path stays lightweight
  • Update non-scan IOContext aggregate initializers to explicitly keep empty table/partition context, preserving existing behavior while keeping the build clean under strict initializer warnings
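The deduplicated dictionary described above can be modeled with a minimal in-memory sketch, where two std::map instances stand in for the dedicated RocksDB column family. Class and method names here are illustrative, not the actual CacheBlockMetaStore API.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Sketch of the two-way context dictionary. In the PR this lives in a
// dedicated RocksDB column family; std::map stands in here.
class ContextDict {
public:
    // (table_name, partition_name) -> context_id, allocating on first use.
    uint64_t get_or_create(const std::string& table, const std::string& partition) {
        auto key = std::make_pair(table, partition);
        auto it = _forward.find(key);
        if (it != _forward.end()) return it->second;
        uint64_t id = _next_id++;  // ids are never reused
        _forward[key] = id;
        _reverse[id] = key;
        return id;
    }

    // context_id -> (table_name, partition_name). Only the FILE_CACHE_INFO
    // path resolves ids back to strings; blocks store just the 8-byte id.
    std::pair<std::string, std::string> get(uint64_t id) const {
        auto it = _reverse.find(id);
        if (it == _reverse.end()) return {std::string(), std::string()};
        return it->second;
    }

private:
    uint64_t _next_id = 1;  // 0 is assumed reserved for "no context"
    std::map<std::pair<std::string, std::string>, uint64_t> _forward;
    std::map<uint64_t, std::pair<std::string, std::string>> _reverse;
};
```

Storing one context_id per block instead of two strings is what keeps the per-block metadata small when many blocks share a table/partition.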

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Signed-off-by: zhengyu <zhangzhengyu@selectdb.com>
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@morningman morningman self-assigned this Mar 20, 2026
Contributor

@morningman morningman left a comment


Since the table/partition context is purely for observability (i.e., enriching FILE_CACHE_INFO output), it should be treated as a best-effort operation. If RocksDB reads or writes to the context dictionary fail, they should never block or fail the normal data read/write path.
Please confirm this.

Contributor


Could table_name be put at the scan node level instead of the file range level? There will be many splits (file ranges), so carrying it on every range duplicates the string.

status = _db->Write(rocksdb::WriteOptions(), &batch);
if (!status.ok()) {
    LOG(WARNING) << "Failed to create context id in rocksdb: " << status.ToString();
    _next_context_id.store(context_id, std::memory_order_release);
Contributor


_next_context_id Rollback Risk.

Problem: If thread A does fetch_add and gets id=5, then thread B does fetch_add and gets id=6 and successfully writes, and then thread A's write fails and executes store(5), _next_context_id is rolled back to 5. The next allocation will yield 5 again, causing an id collision.

Suggestion: Do not roll back _next_context_id. Wasting a single id is perfectly acceptable.
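The no-rollback suggestion can be sketched with a small standalone model. The class and method names are illustrative (not the actual CacheBlockMetaStore members), and the RocksDB write is simulated by a boolean.

```cpp
#include <cstdint>
#include <mutex>

// Sketch of the suggested fix: allocate ids under the mutex and never
// roll the counter back on a failed write. A skipped id is harmless
// because ids only need to be unique, not dense.
class ContextIdAllocator {
public:
    // write_ok simulates the RocksDB write result; returns 0 on failure.
    uint64_t allocate(bool write_ok) {
        std::lock_guard<std::mutex> lock(_mutex);
        uint64_t id = _next_id++;
        if (!write_ok) {
            // Deliberately do not decrement _next_id: id is simply wasted,
            // and no later allocation can collide with it.
            return 0;
        }
        return id;
    }

private:
    std::mutex _mutex;
    uint64_t _next_id = 1;  // 0 assumed reserved for "no context"
};
```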

@@ -414,6 +398,16 @@ Status OlapScanner::_init_tablet_reader_params(

_tablet_reader_params.profile = _local_state->custom_profile();
_tablet_reader_params.runtime_state = _state;
{
    auto* olap_scan_local_state = (pipeline::OlapScanLocalState*)_local_state;
Contributor


Suggested change
auto* olap_scan_local_state = (pipeline::OlapScanLocalState*)_local_state;
auto* olap_scan_local_state = static_cast<pipeline::OlapScanLocalState*>(_local_state);

@@ -884,6 +850,12 @@ void FSFileCacheStorage::load_cache_info_into_memory_from_db(BlockFileCache* mgr
ctx.expiration_time = meta_value.ttl;
ctx.tablet_id = meta_key.tablet_id; // TODO(zhengyu): zero if loaded from v2, we can use this to decide whether the block is loaded from v2 or v3
if (meta_value.context_id != 0) {
    if (auto context = _meta_store->get_context(meta_value.context_id); context) {
Contributor


In load_cache_info_into_memory_from_db, each block calls _meta_store->get_context(xxx), resulting in many redundant lookups for the same context_id.

I suggest using a map to cache it:

std::unordered_map<uint64_t, std::pair<std::string, std::string>> context_cache;
// ...
if (meta_value.context_id != 0) {
    auto it = context_cache.find(meta_value.context_id);
    if (it != context_cache.end()) {
        ctx.table_name = it->second.first;
        ctx.partition_name = it->second.second;
    } else if (auto context = _meta_store->get_context(meta_value.context_id); context) {
        ctx.table_name = context->first;
        ctx.partition_name = context->second;
        context_cache.emplace(meta_value.context_id, *context);
    }
}

@morningman
Contributor

/review

Contributor

@github-actions github-actions bot left a comment


Code Review Summary

This PR propagates table/partition context from FE to BE for file cache metadata, persists it in RocksDB using a deduplicated dictionary design, and exposes it through FILE_CACHE_INFO. The design is generally sound — the context_id dictionary approach avoids duplicating strings per cache block. However, I found several issues ranging from a potential null pointer dereference to race-condition fragility and code quality concerns.

Critical Findings

  1. Inconsistent null guard in finalize() (fs_file_cache_storage.cpp:206): The new code guards get_or_create_context_id with if (_meta_store) but then calls _meta_store->put() unconditionally on the next line. While _meta_store is always initialized in init() today, this inconsistency is a code smell and would cause a null dereference if the guard was added for a reason.

  2. Fragile _next_context_id rollback (cache_block_meta_store.cpp:480): On write failure, _next_context_id.store(context_id, ...) rolls back the counter. This is only safe because the entire allocate+write sequence is under _context_mutex. However, the atomic is accessed outside the mutex (in _load_next_context_id), making the design fragile for future changes. If a concurrent fetch_add happened between the increment and rollback, the store would overwrite the other thread's allocation.

  3. C-style cast (olap_scanner.cpp:402): Uses (pipeline::OlapScanLocalState*)_local_state instead of static_cast. The same file already uses static_cast in other places (lines 111, 662). C-style casts bypass type safety and should be avoided in new code.

  4. std::string in hot-path structs: Adding std::string table_name and partition_name to KeyMeta, IOContext, and CacheContext adds heap allocation overhead on every file cache operation. The dictionary design in BlockMeta (using context_id) avoids this for persistence, but the in-memory path still copies strings throughout the read pipeline.

Non-blocking Observations

  • FE partition_name inconsistency: TOlapScanNode.partition_name is empty when scanning multiple partitions, while TPaloScanRange.partition_name is set per-tablet. The BE fallback in _init_tablet_reader_params (lines 401-409) uses the scan node value only when the per-range value is empty, so this works correctly but the interaction is subtle.
  • buildPartitionName silent truncation: When partitionKeys and partitionValues have mismatched sizes, Math.min() silently truncates. This could produce incomplete partition names.
  • Schema ordering is consistent: FE SchemaTable.java and BE schema_file_cache_info_scanner.cpp column orderings match correctly.
  • Tests are present: Unit tests cover the context dictionary round-trip, persistence across reopen, serialization, and clear. FE tests cover buildPartitionName and fillTablePartitionContext.

}
BlockMeta meta(key.meta.type, size, key.meta.expiration_time, context_id);
_meta_store->put(mkey, meta);

Contributor


Bug (inconsistent null guard): Line 201 guards get_or_create_context_id with if (_meta_store), but this line calls _meta_store->put() unconditionally. If _meta_store can be null (which the guard implies), this is a null pointer dereference. Either remove the guard on line 201 (since _meta_store is always initialized in init()) or also guard this call:

if (_meta_store) {
    _meta_store->put(mkey, meta);
}

_key_ranges(std::move(params.key_ranges)),
_tablet_reader_params({.tablet = std::move(params.tablet),
                       .tablet_schema {},
                       .aggregation = params.aggregation,
Contributor


Style: This uses a C-style cast (pipeline::OlapScanLocalState*)_local_state. The same file uses static_cast in other places (e.g., line 111). Please use static_cast<pipeline::OlapScanLocalState*>(_local_state) for type safety.

uint64_t value = 0;
for (size_t i = 0; i < sizeof(uint64_t); ++i) {
    value = (value << 8) | static_cast<uint64_t>(data[i]);
}
Contributor


Fragile rollback: On write failure, this rolls back _next_context_id with store(). This is only correct because the entire allocate+write path is serialized under _context_mutex. However, _next_context_id is a std::atomic which implies it might be accessed concurrently. If a second thread called fetch_add between line 470 and this rollback, the store would overwrite the second thread's allocation.

Consider either:

  1. Documenting that _next_context_id is only modified under _context_mutex, or
  2. Using compare_exchange_strong for a safer rollback: _next_context_id.compare_exchange_strong(expected, context_id) where expected = context_id + 1.
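Option 2 can be sketched with a minimal standalone model. The RocksDB write is simulated by a boolean, and allocate_context_id / next_context_id are illustrative names standing in for the actual members.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative global standing in for the store's _next_context_id.
std::atomic<uint64_t> next_context_id {1};

// Sketch of the compare_exchange_strong rollback: on a failed write, roll
// the counter back only if no other thread has advanced it since our
// fetch_add; otherwise the id is simply wasted.
uint64_t allocate_context_id(bool write_succeeds) {
    uint64_t context_id = next_context_id.fetch_add(1);
    if (write_succeeds) {
        return context_id;
    }
    uint64_t expected = context_id + 1;
    // Succeeds (and reclaims the id) only if the counter is still
    // context_id + 1, i.e. no concurrent allocation happened.
    next_context_id.compare_exchange_strong(expected, context_id);
    return 0;  // 0 signals allocation failure in this sketch
}
```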



3 participants