[feature](filecache) persist table/partition context for cache meta for external table#61518
[feature](filecache) persist table/partition context for cache meta for external table#61518freemandealer wants to merge 1 commit intoapache:masterfrom
Conversation
…or external table
MOTIVATION
While internal table/partition info can be infered from tabletId, external table scans, where file cache entries cannot be traced back to the original table/partition in a consistent way. This change propagates file cache context from FE to BE, persists it in RocksDB using a deduplicated dictionary design, and exposes it through FILE_CACHE_INFO.
FE DESIGN
- Extend thrift scan descriptors with optional file cache context fields:
- TFileRangeDesc.table_name / partition_name
- TPaloScanRange.partition_name
- TOlapScanNode.partition_name
- Centralize external table context filling in FileScanNode:
- use table.getNameWithFullQualifiers() as the stable table identifier
- build partition_name from partition path key/value pairs in a shared helper
- reuse the helper from FileQueryScanNode, FileGroupInfo, and NereidsFileGroupInfo
- Propagate internal table context from OlapScanNode:
- set table_name on both scan node thrift and per-tablet scan ranges
- set partition_name when the scan targets a single partition
- keep partition_name empty for multi-partition scans
- Extend the FILE_CACHE_INFO schema with TABLE_NAME and PARTITION_NAME columns
BE DESIGN
- Extend io::IOContext and file cache context objects with table_name and
partition_name, so the information is available on all file-cache-aware read
paths
- Propagate context through:
- FileScanner for external table scans
- OlapScanner / TabletReader / RowsetReaderContext for internal table scans
- the parallel internal scan path, so context is preserved even when scanners
are split by tablet or segment
- Make FileScanner derive partition_name from columns_from_path when FE does not
send an explicit partition string, and clear stale range context when a later
range omits table metadata
- Persist block metadata in RocksDB by storing only a context_id in
CacheBlockMetaPb instead of duplicating table/partition strings per block
- Add a dedicated RocksDB column family as a dictionary:
- (table_name, partition_name) -> context_id
- context_id -> (table_name, partition_name)
- Resolve context_id back to strings only when serving FILE_CACHE_INFO, so the
normal data read path stays lightweight
- Update non-scan IOContext aggregate initializers to explicitly keep empty
table/partition context, preserving existing behavior while keeping the build
clean under strict initializer warnings
Signed-off-by: zhengyu <zhangzhengyu@selectdb.com>
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
morningman
left a comment
There was a problem hiding this comment.
Since the table/partition context is purely for observability (i.e., enriching FILE_CACHE_INFO output), it should be treated as a best-effort operation. If RocksDB reads or writes to the context dictionary fail, they should never block or fail the normal data read/write path.
Please confirm it
There was a problem hiding this comment.
table_name can be put in scan node level? Not file range level.
Because these will be so many split(file range)
| status = _db->Write(rocksdb::WriteOptions(), &batch); | ||
| if (!status.ok()) { | ||
| LOG(WARNING) << "Failed to create context id in rocksdb: " << status.ToString(); | ||
| _next_context_id.store(context_id, std::memory_order_release); |
There was a problem hiding this comment.
_next_context_id Rollback Risk.
Problem: If thread A does fetch_add and gets id=5, then thread B does fetch_add and gets id=6 and successfully writes, and then thread A's write fails and executes store(5), _next_context_id is rolled back to 5. The next allocation will yield 5 again, causing an id collision.
Suggestion: Do not roll back _next_context_id. Wasting a single id is perfectly acceptable.
| @@ -414,6 +398,16 @@ Status OlapScanner::_init_tablet_reader_params( | |||
|
|
|||
| _tablet_reader_params.profile = _local_state->custom_profile(); | |||
| _tablet_reader_params.runtime_state = _state; | |||
| { | |||
| auto* olap_scan_local_state = (pipeline::OlapScanLocalState*)_local_state; | |||
There was a problem hiding this comment.
| auto* olap_scan_local_state = (pipeline::OlapScanLocalState*)_local_state; | |
| auto* olap_scan_local_state = static_cast<pipeline::OlapScanLocalState*>(_local_state); |
| @@ -884,6 +850,12 @@ void FSFileCacheStorage::load_cache_info_into_memory_from_db(BlockFileCache* mgr | |||
| ctx.expiration_time = meta_value.ttl; | |||
| ctx.tablet_id = | |||
| meta_key.tablet_id; //TODO(zhengyu): zero if loaded from v2, we can use this to decide whether the block is loaded from v2 or v3 | |||
| if (meta_value.context_id != 0) { | |||
| if (auto context = _meta_store->get_context(meta_value.context_id); context) { | |||
There was a problem hiding this comment.
In load_cache_info_into_memory_from_db, each block calls _meta_store->get_context(xxx), resulting in many redundant lookup for the same context_id.
I suggest we can use a map to cache it:
std::unordered_map<uint64_t, std::pair<std::string, std::string>> context_cache;
// ...
if (meta_value.context_id != 0) {
auto it = context_cache.find(meta_value.context_id);
if (it != context_cache.end()) {
ctx.table_name = it->second.first;
ctx.partition_name = it->second.second;
} else if (auto context = _meta_store->get_context(meta_value.context_id); context) {
ctx.table_name = context->first;
ctx.partition_name = context->second;
context_cache.emplace(meta_value.context_id, *context);
}
}```
|
/review |
There was a problem hiding this comment.
Code Review Summary
This PR propagates table/partition context from FE to BE for file cache metadata, persists it in RocksDB using a deduplicated dictionary design, and exposes it through FILE_CACHE_INFO. The design is generally sound — the context_id dictionary approach avoids duplicating strings per cache block. However, I found several issues ranging from a potential null pointer dereference to race-condition fragility and code quality concerns.
Critical Findings
-
Inconsistent null guard in
finalize()(fs_file_cache_storage.cpp:206): The new code guardsget_or_create_context_idwithif (_meta_store)but then calls_meta_store->put()unconditionally on the next line. While_meta_storeis always initialized ininit()today, this inconsistency is a code smell and would cause a null dereference if the guard was added for a reason. -
Fragile
_next_context_idrollback (cache_block_meta_store.cpp:480): On write failure,_next_context_id.store(context_id, ...)rolls back the counter. This is only safe because the entire allocate+write sequence is under_context_mutex. However, the atomic is accessed outside the mutex (in_load_next_context_id), making the design fragile for future changes. If a concurrentfetch_addhappened between the increment and rollback, thestorewould overwrite the other thread's allocation. -
C-style cast (
olap_scanner.cpp:402): Uses(pipeline::OlapScanLocalState*)_local_stateinstead ofstatic_cast. The same file already usesstatic_castin other places (lines 111, 662). C-style casts bypass type safety and should be avoided in new code. -
std::stringin hot-path structs: Addingstd::string table_nameandpartition_nametoKeyMeta,IOContext, andCacheContextadds heap allocation overhead on every file cache operation. The dictionary design inBlockMeta(usingcontext_id) avoids this for persistence, but the in-memory path still copies strings throughout the read pipeline.
Non-blocking Observations
- FE partition_name inconsistency:
TOlapScanNode.partition_nameis empty when scanning multiple partitions, whileTPaloScanRange.partition_nameis set per-tablet. The BE fallback in_init_tablet_reader_params(lines 401-409) uses the scan node value only when the per-range value is empty, so this works correctly but the interaction is subtle. buildPartitionNamesilent truncation: WhenpartitionKeysandpartitionValueshave mismatched sizes,Math.min()silently truncates. This could produce incomplete partition names.- Schema ordering is consistent: FE
SchemaTable.javaand BEschema_file_cache_info_scanner.cppcolumn orderings match correctly. - Tests are present: Unit tests cover the context dictionary round-trip, persistence across reopen, serialization, and clear. FE tests cover
buildPartitionNameandfillTablePartitionContext.
| } | ||
| BlockMeta meta(key.meta.type, size, key.meta.expiration_time, context_id); | ||
| _meta_store->put(mkey, meta); | ||
|
|
There was a problem hiding this comment.
Bug (inconsistent null guard): Line 201 guards get_or_create_context_id with if (_meta_store), but this line calls _meta_store->put() unconditionally. If _meta_store can be null (which the guard implies), this is a null pointer dereference. Either remove the guard on line 201 (since _meta_store is always initialized in init()) or also guard this call:
if (_meta_store) {
_meta_store->put(mkey, meta);
}| _key_ranges(std::move(params.key_ranges)), | ||
| _tablet_reader_params({.tablet = std::move(params.tablet), | ||
| .tablet_schema {}, | ||
| .aggregation = params.aggregation, |
There was a problem hiding this comment.
Style: This uses a C-style cast (pipeline::OlapScanLocalState*)_local_state. The same file uses static_cast in other places (e.g., line 111). Please use static_cast<pipeline::OlapScanLocalState*>(_local_state) for type safety.
| uint64_t value = 0; | ||
| for (size_t i = 0; i < sizeof(uint64_t); ++i) { | ||
| value = (value << 8) | static_cast<uint64_t>(data[i]); | ||
| } |
There was a problem hiding this comment.
Fragile rollback: On write failure, this rolls back _next_context_id with store(). This is only correct because the entire allocate+write path is serialized under _context_mutex. However, _next_context_id is a std::atomic which implies it might be accessed concurrently. If a second thread called fetch_add between line 470 and this rollback, the store would overwrite the second thread's allocation.
Consider either:
- Documenting that
_next_context_idis only modified under_context_mutex, or - Using
compare_exchange_strongfor a safer rollback:_next_context_id.compare_exchange_strong(expected, context_id)whereexpected = context_id + 1.
MOTIVATION
While internal table/partition info can be infered from tabletId, external table scans, where file cache entries cannot be traced back to the original table/partition in a consistent way. This change propagates file cache context from FE to BE, persists it in RocksDB using a deduplicated dictionary design, and exposes it through FILE_CACHE_INFO.
FE DESIGN
BE DESIGN
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)