
[feature](filecache) persist table/partition context for cache meta for external table#61518

Open
freemandealer wants to merge 1 commit into apache:master from freemandealer:observe-lakehouse

Conversation

@freemandealer
Contributor

MOTIVATION
While internal table/partition info can be inferred from the tabletId, external table scans have no such mapping: file cache entries cannot be traced back to the original table/partition in a consistent way. This change propagates file cache context from FE to BE, persists it in RocksDB using a deduplicated dictionary design, and exposes it through FILE_CACHE_INFO.

FE DESIGN

  • Extend thrift scan descriptors with optional file cache context fields:
    • TFileRangeDesc.table_name / partition_name
    • TPaloScanRange.partition_name
    • TOlapScanNode.partition_name
  • Centralize external table context filling in FileScanNode:
    • use table.getNameWithFullQualifiers() as the stable table identifier
    • build partition_name from partition path key/value pairs in a shared helper
    • reuse the helper from FileQueryScanNode, FileGroupInfo, and NereidsFileGroupInfo
  • Propagate internal table context from OlapScanNode:
    • set table_name on both scan node thrift and per-tablet scan ranges
    • set partition_name when the scan targets a single partition
    • keep partition_name empty for multi-partition scans
  • Extend the FILE_CACHE_INFO schema with TABLE_NAME and PARTITION_NAME columns
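The shared helper that builds partition_name from partition path key/value pairs can be sketched as follows. This is a minimal C++ illustration of the idea (the actual helper lives in FE Java code); the function name and the Hive-style `k1=v1/k2=v2` separators are assumptions, not the real implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: join partition path key/value pairs into a
// Hive-style partition string "k1=v1/k2=v2".
std::string build_partition_name(const std::vector<std::string>& keys,
                                 const std::vector<std::string>& values) {
    std::string out;
    // Use the smaller of the two sizes defensively; mismatched inputs
    // are silently truncated (a reviewer notes this trade-off below).
    size_t n = std::min(keys.size(), values.size());
    for (size_t i = 0; i < n; ++i) {
        if (i > 0) out += '/';
        out += keys[i] + '=' + values[i];
    }
    return out;
}
```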

BE DESIGN

  • Extend io::IOContext and file cache context objects with table_name and partition_name, so the information is available on all file-cache-aware read paths
  • Propagate context through:
    • FileScanner for external table scans
    • OlapScanner / TabletReader / RowsetReaderContext for internal table scans
    • the parallel internal scan path, so context is preserved even when scanners are split by tablet or segment
  • Make FileScanner derive partition_name from columns_from_path when FE does not send an explicit partition string, and clear stale range context when a later range omits table metadata
  • Persist block metadata in RocksDB by storing only a context_id in CacheBlockMetaPb instead of duplicating table/partition strings per block
  • Add a dedicated RocksDB column family as a dictionary:
    • (table_name, partition_name) -> context_id
    • context_id -> (table_name, partition_name)
  • Resolve context_id back to strings only when serving FILE_CACHE_INFO, so the normal data read path stays lightweight
  • Update non-scan IOContext aggregate initializers to explicitly keep empty table/partition context, preserving existing behavior while keeping the build clean under strict initializer warnings
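The deduplicated dictionary described above can be modeled with a minimal in-memory sketch, where two std::map instances stand in for the dedicated RocksDB column family. Class and method names here are illustrative, not the actual CacheBlockMetaStore API.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Sketch of the two-way context dictionary. In the PR this lives in a
// dedicated RocksDB column family; std::map stands in here.
class ContextDict {
public:
    // (table_name, partition_name) -> context_id, allocating on first use.
    uint64_t get_or_create(const std::string& table, const std::string& partition) {
        auto key = std::make_pair(table, partition);
        auto it = _forward.find(key);
        if (it != _forward.end()) return it->second;
        uint64_t id = _next_id++;  // ids are never reused
        _forward[key] = id;
        _reverse[id] = key;
        return id;
    }

    // context_id -> (table_name, partition_name). Only the FILE_CACHE_INFO
    // path resolves ids back to strings; blocks store just the 8-byte id.
    std::pair<std::string, std::string> get(uint64_t id) const {
        auto it = _reverse.find(id);
        if (it == _reverse.end()) return {std::string(), std::string()};
        return it->second;
    }

private:
    uint64_t _next_id = 1;  // 0 is assumed reserved for "no context"
    std::map<std::pair<std::string, std::string>, uint64_t> _forward;
    std::map<uint64_t, std::pair<std::string, std::string>> _reverse;
};
```

Storing one context_id per block instead of two strings is what keeps the per-block metadata small when many blocks share a table/partition.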

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Signed-off-by: zhengyu <zhangzhengyu@selectdb.com>
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@morningman morningman self-assigned this Mar 20, 2026
Contributor

@morningman morningman left a comment


Since the table/partition context is purely for observability (i.e., enriching FILE_CACHE_INFO output), it should be treated as a best-effort operation. If RocksDB reads or writes to the context dictionary fail, they should never block or fail the normal data read/write path.
Please confirm this.

Contributor


Could table_name be put at the scan node level instead of the file range level? There will be many splits (file ranges), so carrying it on every range duplicates the string.

status = _db->Write(rocksdb::WriteOptions(), &batch);
if (!status.ok()) {
    LOG(WARNING) << "Failed to create context id in rocksdb: " << status.ToString();
    _next_context_id.store(context_id, std::memory_order_release);
Contributor


_next_context_id Rollback Risk.

Problem: If thread A does fetch_add and gets id=5, then thread B does fetch_add and gets id=6 and successfully writes, and then thread A's write fails and executes store(5), _next_context_id is rolled back to 5. The next allocation will yield 5 again, causing an id collision.

Suggestion: Do not roll back _next_context_id. Wasting a single id is perfectly acceptable.
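The no-rollback suggestion can be sketched with a small standalone model. The class and method names are illustrative (not the actual CacheBlockMetaStore members), and the RocksDB write is simulated by a boolean.

```cpp
#include <cstdint>
#include <mutex>

// Sketch of the suggested fix: allocate ids under the mutex and never
// roll the counter back on a failed write. A skipped id is harmless
// because ids only need to be unique, not dense.
class ContextIdAllocator {
public:
    // write_ok simulates the RocksDB write result; returns 0 on failure.
    uint64_t allocate(bool write_ok) {
        std::lock_guard<std::mutex> lock(_mutex);
        uint64_t id = _next_id++;
        if (!write_ok) {
            // Deliberately do not decrement _next_id: id is simply wasted,
            // and no later allocation can collide with it.
            return 0;
        }
        return id;
    }

private:
    std::mutex _mutex;
    uint64_t _next_id = 1;  // 0 assumed reserved for "no context"
};
```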

@@ -414,6 +398,16 @@ Status OlapScanner::_init_tablet_reader_params(

_tablet_reader_params.profile = _local_state->custom_profile();
_tablet_reader_params.runtime_state = _state;
{
    auto* olap_scan_local_state = (pipeline::OlapScanLocalState*)_local_state;
Contributor


Suggested change
auto* olap_scan_local_state = (pipeline::OlapScanLocalState*)_local_state;
auto* olap_scan_local_state = static_cast<pipeline::OlapScanLocalState*>(_local_state);

@@ -884,6 +850,12 @@ void FSFileCacheStorage::load_cache_info_into_memory_from_db(BlockFileCache* mgr
ctx.expiration_time = meta_value.ttl;
ctx.tablet_id = meta_key.tablet_id; // TODO(zhengyu): zero if loaded from v2, we can use this to decide whether the block is loaded from v2 or v3
if (meta_value.context_id != 0) {
    if (auto context = _meta_store->get_context(meta_value.context_id); context) {
Contributor


In load_cache_info_into_memory_from_db, each block calls _meta_store->get_context(xxx), resulting in many redundant lookups for the same context_id.

I suggest using a map to cache it:

std::unordered_map<uint64_t, std::pair<std::string, std::string>> context_cache;
// ...
if (meta_value.context_id != 0) {
    auto it = context_cache.find(meta_value.context_id);
    if (it != context_cache.end()) {
        ctx.table_name = it->second.first;
        ctx.partition_name = it->second.second;
    } else if (auto context = _meta_store->get_context(meta_value.context_id); context) {
        ctx.table_name = context->first;
        ctx.partition_name = context->second;
        context_cache.emplace(meta_value.context_id, *context);
    }
}

@morningman
Contributor

/review

Contributor

@github-actions github-actions bot left a comment


Code Review Summary

This PR propagates table/partition context from FE to BE for file cache metadata, persists it in RocksDB using a deduplicated dictionary design, and exposes it through FILE_CACHE_INFO. The design is generally sound — the context_id dictionary approach avoids duplicating strings per cache block. However, I found several issues ranging from a potential null pointer dereference to race-condition fragility and code quality concerns.

Critical Findings

  1. Inconsistent null guard in finalize() (fs_file_cache_storage.cpp:206): The new code guards get_or_create_context_id with if (_meta_store) but then calls _meta_store->put() unconditionally on the next line. While _meta_store is always initialized in init() today, this inconsistency is a code smell and would cause a null dereference if the guard was added for a reason.

  2. Fragile _next_context_id rollback (cache_block_meta_store.cpp:480): On write failure, _next_context_id.store(context_id, ...) rolls back the counter. This is only safe because the entire allocate+write sequence is under _context_mutex. However, the atomic is accessed outside the mutex (in _load_next_context_id), making the design fragile for future changes. If a concurrent fetch_add happened between the increment and rollback, the store would overwrite the other thread's allocation.

  3. C-style cast (olap_scanner.cpp:402): Uses (pipeline::OlapScanLocalState*)_local_state instead of static_cast. The same file already uses static_cast in other places (lines 111, 662). C-style casts bypass type safety and should be avoided in new code.

  4. std::string in hot-path structs: Adding std::string table_name and partition_name to KeyMeta, IOContext, and CacheContext adds heap allocation overhead on every file cache operation. The dictionary design in BlockMeta (using context_id) avoids this for persistence, but the in-memory path still copies strings throughout the read pipeline.

Non-blocking Observations

  • FE partition_name inconsistency: TOlapScanNode.partition_name is empty when scanning multiple partitions, while TPaloScanRange.partition_name is set per-tablet. The BE fallback in _init_tablet_reader_params (lines 401-409) uses the scan node value only when the per-range value is empty, so this works correctly but the interaction is subtle.
  • buildPartitionName silent truncation: When partitionKeys and partitionValues have mismatched sizes, Math.min() silently truncates. This could produce incomplete partition names.
  • Schema ordering is consistent: FE SchemaTable.java and BE schema_file_cache_info_scanner.cpp column orderings match correctly.
  • Tests are present: Unit tests cover the context dictionary round-trip, persistence across reopen, serialization, and clear. FE tests cover buildPartitionName and fillTablePartitionContext.

}
BlockMeta meta(key.meta.type, size, key.meta.expiration_time, context_id);
_meta_store->put(mkey, meta);

Contributor


Bug (inconsistent null guard): Line 201 guards get_or_create_context_id with if (_meta_store), but this line calls _meta_store->put() unconditionally. If _meta_store can be null (which the guard implies), this is a null pointer dereference. Either remove the guard on line 201 (since _meta_store is always initialized in init()) or also guard this call:

if (_meta_store) {
    _meta_store->put(mkey, meta);
}

_key_ranges(std::move(params.key_ranges)),
_tablet_reader_params({.tablet = std::move(params.tablet),
                       .tablet_schema {},
                       .aggregation = params.aggregation,
Contributor


Style: This uses a C-style cast (pipeline::OlapScanLocalState*)_local_state. The same file uses static_cast in other places (e.g., line 111). Please use static_cast<pipeline::OlapScanLocalState*>(_local_state) for type safety.

uint64_t value = 0;
for (size_t i = 0; i < sizeof(uint64_t); ++i) {
    value = (value << 8) | static_cast<uint64_t>(data[i]);
}
Contributor


Fragile rollback: On write failure, this rolls back _next_context_id with store(). This is only correct because the entire allocate+write path is serialized under _context_mutex. However, _next_context_id is a std::atomic which implies it might be accessed concurrently. If a second thread called fetch_add between line 470 and this rollback, the store would overwrite the second thread's allocation.

Consider either:

  1. Documenting that _next_context_id is only modified under _context_mutex, or
  2. Using compare_exchange_strong for a safer rollback: _next_context_id.compare_exchange_strong(expected, context_id) where expected = context_id + 1.
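Option 2 can be sketched with a minimal standalone model. The RocksDB write is simulated by a boolean, and allocate_context_id / next_context_id are illustrative names standing in for the actual members.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative global standing in for the store's _next_context_id.
std::atomic<uint64_t> next_context_id {1};

// Sketch of the compare_exchange_strong rollback: on a failed write, roll
// the counter back only if no other thread has advanced it since our
// fetch_add; otherwise the id is simply wasted.
uint64_t allocate_context_id(bool write_succeeds) {
    uint64_t context_id = next_context_id.fetch_add(1);
    if (write_succeeds) {
        return context_id;
    }
    uint64_t expected = context_id + 1;
    // Succeeds (and reclaims the id) only if the counter is still
    // context_id + 1, i.e. no concurrent allocation happened.
    next_context_id.compare_exchange_strong(expected, context_id);
    return 0;  // 0 signals allocation failure in this sketch
}
```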



3 participants