Skip to content

PKRange Cache Warm Up#47066

Open
dibahlfi wants to merge 3 commits into
mainfrom
users/dibahl/caching-foreground-bufg
Open

PKRange Cache Warm Up#47066
dibahlfi wants to merge 3 commits into
mainfrom
users/dibahl/caching-foreground-bufg

Conversation

@dibahlfi
Copy link
Copy Markdown
Member

@dibahlfi dibahlfi commented May 22, 2026

This PR fixes a bug where if a customer set a short deadline on a request and that deadline ran out while the address-list lookup was still in progress, the lookup was thrown away. The customer's next call would start the same lookup again from scratch, hit the same deadline, and fail the same way. A short deadline on a slow network could keep a customer stuck in this loop indefinitely.

This PR fixes the lookup so it survives the customer's deadline. Two improvements ship together, each addressing a different way the deadline could reach the lookup:

Improvement 1 (applies to the async client). The lookup now keeps running in the background even if the customer's wait runs out. The customer still sees their timeout, but the work the SDK started isn't lost — it finishes a moment later and the result is saved. The customer's retry finds the answer already there and proceeds immediately, with no extra round-trip to the service.

Improvement 2 (applies to both sync and async clients). A few SDK methods (most visibly read_feed_ranges) pass the customer's timeout all the way down to the internal lookup. That meant the customer's "2-second budget for this call" was also bounding the internal address-list lookup, which could itself time out before the customer's actual work even started. The SDK now keeps the customer's deadline scoped to the work the customer actually asked about; the internal lookup runs under the SDK's own retry rules.

Scope (intentional)
This change only affects the internal address-list cache for container partitions. Other internal caches the SDK keeps (such as container-properties) are not changed: they are populated by a single small request that doesn't have the same failure pattern, so changing them would add risk without a customer-visible benefit.

@dibahlfi dibahlfi requested a review from a team as a code owner May 22, 2026 00:53
Copilot AI review requested due to automatic review settings May 22, 2026 00:53
@dibahlfi
Copy link
Copy Markdown
Member Author

@sdkReviewAgent-2

@dibahlfi
Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves Azure Cosmos DB async routing-map (PKRange) cache warm-up reliability under aggressive caller timeouts/cancellation by introducing a shared “single-flight” in-flight fetch task per (event loop, collection), so cache publication can complete even if the originating awaiter is cancelled.

Changes:

  • Add shared in-flight fetch task tracking in async PartitionKeyRangeCache and await via asyncio.shield to decouple cache publication from caller cancellation.
  • Extend async/sync routing-map provider tests to cover timeout-kwarg forwarding, cancellation survival, single-flight behavior, and in-flight cleanup.
  • Update shared-cache lifecycle tests to reset/validate the new shared in-flight state.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
sdk/cosmos/azure-cosmos/tests/routing/test_shared_pk_range_cache_async.py Clears new shared in-flight state between tests; adds lifecycle test for releasing while a fetch is in flight.
sdk/cosmos/azure-cosmos/tests/routing/test_routing_map_provider.py Adds sync coverage ensuring tight timeout= kwargs are forwarded and caching still populates on success; expands concurrency test commentary.
sdk/cosmos/azure-cosmos/tests/routing/test_routing_map_provider_async.py Adds async coverage for cancellation survival, single-flight coalescing, in-flight cleanup on success/failure, and concurrency invariants.
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py Implements shared in-flight fetch-and-publish tasks and shields awaiters so cache warm-up can complete despite caller cancellation.

Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py Outdated
Comment thread sdk/cosmos/azure-cosmos/tests/routing/test_shared_pk_range_cache_async.py Outdated
Comment thread sdk/cosmos/azure-cosmos/tests/routing/test_routing_map_provider_async.py Outdated
@xinlian12
Copy link
Copy Markdown
Member

Review complete (48:46)

Posted 4 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@dibahlfi
Copy link
Copy Markdown
Member Author

@sdkReviewAgent-2

@dibahlfi
Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12
Copy link
Copy Markdown
Member

Review complete (42:20)

Posted 3 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@dibahlfi
Copy link
Copy Markdown
Member Author

@sdkReviewAgent-2

@dibahlfi
Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

# that swapped ``.clear()`` for ``= {}`` would leave the cache
# empty here because the in-flight task would have written into
# the now-orphan old dict.
self.assertEqual(len(c1._collection_routing_map_by_item), 1)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Suggestion — Testing: Dict-identity assertion doesn't actually pin the invariant it claims to

The step 4 comment says a regression from .clear() to = {} in clear_cache "would leave the cache empty here" — but this assertion wouldn't catch that regression. Here's why:

If clear_cache were changed to _shared_routing_map_cache[self._endpoint] = {}:

  • c1._collection_routing_map_by_item still references the old dict (bound in __init__)
  • The in-flight task also references the old dict (via self._collection_routing_map_by_item)
  • The task writes into the old dict → len(c1._collection_routing_map_by_item) == 1test passes
  • But a new PartitionKeyRangeCache for the same endpoint would get the new empty dict from _shared_routing_map_cache[ep] and see no cached data ❌

Adding one assertion would truly pin the invariant:

from azure.cosmos._routing.aio.routing_map_provider import _shared_routing_map_cache
self.assertIs(
    c1._collection_routing_map_by_item,
    _shared_routing_map_cache[ep],
    "clear_cache must use in-place .clear(), not reassignment, "
    "so the provider's dict reference stays in sync with the module registry",
)

This assertIs check would fail under = {} because the module registry holds the new dict while the provider holds the old one.

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

@xinlian12
Copy link
Copy Markdown
Member

Review complete (44:47)

Posted 1 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

3 participants