[Cosmos] Add cold-start metadata cache cross-region hedging #47509
Draft
NaluTripician wants to merge 2 commits into
Self-review (Seon thorough) fixes:
- Gate hedging on an explicit metadataCachePopulation request flag set only by the container-properties refresh, cold partition-key read, and routing-map fetch callsites, so public container reads and hybrid-search PK reads are no longer hedged.
- Size the hedge thread pool to max(cpu_count, 2*budget) so each in-flight hedge always has its primary+hedge threads.
- Remove the dead record_failure no-op (metadata cache health is account-global, not per-partition).
- Clarify the threshold rationale and the async budget non-blocking check.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Description
Python port of .NET PR Azure/azure-cosmos-dotnet-v3#5923 — cold-start metadata cache cross-region hedging.
When enabled, the SDK hedges the first-time population (and refresh) of the container (Collection) and partition-key-range metadata caches across regions: the primary request is dispatched immediately and, if it has not produced an acceptable response within a fixed SDK-derived threshold (1.5s), a single hedge request is dispatched to a second region. The first acceptable winner is returned. This prevents a slow/unavailable region from stalling client init and the first request.
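The dispatch pattern described above can be illustrated with a small thread-based sketch. This is not the PR's handler: `hedged_read`, `fake_read`, and the `is_acceptable` callback are hypothetical names, and the choice to dispatch the hedge immediately when the primary finishes early with an unacceptable result is an assumption.

```python
import concurrent.futures as cf
import time
from typing import Any, Callable

HEDGE_THRESHOLD_SECONDS = 1.5  # fixed threshold quoted in the description


def hedged_read(
    read: Callable[[str], Any],            # hypothetical: reads metadata from one region
    primary_region: str,
    hedge_region: str,
    is_acceptable: Callable[[Any], bool],  # hypothetical acceptance predicate
    threshold: float = HEDGE_THRESHOLD_SECONDS,
) -> Any:
    """Return the first acceptable response from the primary or a single hedge."""
    pool = cf.ThreadPoolExecutor(max_workers=2)
    last_error: Exception = RuntimeError("no acceptable response from any region")
    hedged = False
    try:
        # Dispatch the primary immediately and give it `threshold` seconds.
        done, pending = cf.wait({pool.submit(read, primary_region)}, timeout=threshold)
        while True:
            for future in done:
                try:
                    response = future.result()
                except Exception as exc:  # a failed attempt is never a winner
                    last_error = exc
                    continue
                if is_acceptable(response):
                    return response
            if not hedged:
                # No acceptable primary response yet: dispatch exactly one hedge.
                pending.add(pool.submit(read, hedge_region))
                hedged = True
            if not pending:
                raise last_error
            done, pending = cf.wait(pending, return_when=cf.FIRST_COMPLETED)
    finally:
        # Do not block the caller on the slower attempt.
        pool.shutdown(wait=False)


def fake_read(region: str) -> dict:
    """Stand-in for a metadata read; the region named 'slow' stalls."""
    if region == "slow":
        time.sleep(1.0)
    return {"served_by": region}


if __name__ == "__main__":
    print(hedged_read(fake_read, "slow", "fast", lambda r: True, threshold=0.1))
```

In this sketch the slower attempt is left to finish in the background; a real handler would also need cancellation and a concurrency budget.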
Customer-facing API
New tri-state client keyword
enable_metadata_hedging_for_cold_start (sync + async CosmosClient):
- None (default): follows the account's PPAF (Per-Partition Automatic Failover) state.
- True: hedge even when PPAF is disabled.
- False: disable regardless of PPAF.
The threshold and concurrency budget are SDK-derived defaults and are not customer-configurable.
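The tri-state semantics can be summarised as a small resolution function. This is a sketch only: `resolve_opt_in` and its `ppaf_enabled` parameter are hypothetical names, not the signature of the PR's `resolve_metadata_hedging_opt_in`.

```python
from typing import Optional


def resolve_opt_in(user_setting: Optional[bool], ppaf_enabled: bool) -> bool:
    """Resolve the tri-state keyword to an effective on/off decision.

    user_setting is the value passed as enable_metadata_hedging_for_cold_start;
    ppaf_enabled is assumed to be the account's PPAF state.
    """
    if user_setting is None:
        # Default: follow the account's PPAF state.
        return ppaf_enabled
    # An explicit True or False always overrides PPAF.
    return user_setting


# Based on the description, a caller would opt in on the client, for example:
#   CosmosClient(url, credential, enable_metadata_hedging_for_cold_start=True)
```

Comparing against `None` explicitly, rather than relying on truthiness, keeps an explicit `False` distinct from "not set".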
Scope (important)
Hedging is strictly limited to internal cold-start metadata-cache population reads. It is gated on an explicit internal metadataCachePopulation request flag that is set by exactly three callsites:
- the container-properties refresh (_refresh_container_properties_cache),
- the cold partition-key read (_get_partition_key_definition),
- the routing-map fetch (_fetch_routing_map).
Public container.read(), hybrid-search partition-key-range reads, and GetDatabaseAccount are not hedged.
Implementation
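The scope gate described above amounts to a flag check on the request options. A minimal sketch, assuming the options are a plain dict keyed by `metadataCachePopulation`; the function name and the `hedging_enabled` parameter are illustrative and do not reproduce the PR's `_is_metadata_hedging_applicable`:

```python
from typing import Any, Mapping


def is_hedging_applicable(request_options: Mapping[str, Any], hedging_enabled: bool) -> bool:
    """Hedge only when the feature is on AND the request is explicitly flagged.

    Requests that never set the flag (for example a public container read)
    fall through to the normal, non-hedged path.
    """
    return hedging_enabled and request_options.get("metadataCachePopulation") is True


# An internal cache-population callsite would set the flag explicitly:
cache_refresh_options = {"metadataCachePopulation": True}
# A public read carries no such flag:
public_read_options: dict = {}
```

Requiring an explicit opt-in flag, rather than inferring eligibility from the resource type, is what keeps public reads of the same resources out of the hedged path.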
- azure/cosmos/_metadata_hedging.py and azure/cosmos/aio/_metadata_hedging.py: bounded primary + single-hedge handlers reusing the existing AvailabilityStrategyHandlerMixin for endpoint resolution and excluded-region routing. Includes:
  - a hedge thread pool sized to max(cpu_count, 2 × budget) so each in-flight hedge always has its two threads (sync),
  - is_regional_failure) + hedge 401/403 auth-reject overlay (a hedge auth failure can never win),
- _availability_strategy_config.py: MetadataCrossRegionHedgingStrategy (fixed threshold), resolve_metadata_hedging_opt_in, constants.
- _request_object.py: metadataCachePopulation options flag + is_metadata_cache_population.
- SynchronizedRequest / AsynchronousRequest gain _is_metadata_hedging_applicable (requires the cache-population flag) and route eligible reads through the metadata handler; the option is threaded through both clients and connections.
- tests/test_metadata_hedging.py + tests/test_metadata_hedging_async.py (25 unit tests, no live service).
Self-review (skeptic lens)
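Two of the rules touched by the self-review commit, the thread-pool sizing and the hedge auth-reject overlay, are simple enough to sketch. The function names are hypothetical, and treating `budget` as the maximum number of concurrently in-flight hedged operations is an assumption.

```python
import os


def hedge_pool_size(budget: int) -> int:
    """Size the pool to max(cpu_count, 2 * budget).

    Each in-flight hedged operation may need two threads (primary + hedge),
    so 2 * budget is the floor that avoids starving a hedge of its threads.
    """
    cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(cpu_count, 2 * budget)


def hedge_auth_rejected(status_code: int, is_hedge: bool) -> bool:
    """Return True when a response must be discarded by the auth overlay.

    Only hedge responses are affected: a 401/403 from the hedge region can
    never be the winner. This says nothing about how a primary 401/403 is
    handled, which the description does not specify.
    """
    return is_hedge and status_code in (401, 403)
```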
Three independent reviewers were run on the diff. Key outcomes folded into the latest commit:
- Removed the dead record_failure no-op (metadata cache health is account-global, not per-partition).
- should_cancel_request is checked every retry), PK-range continuation pages are safe cross-region (pkranges metadata is account-global), and the async budget check is not a race.
Accepted design choices
- None → follow PPAF), now narrowly scoped to cold-start cache reads, consistent with the existing data-plane PPAF hedging default.
Validation
pylint clean on the new modules.