PKRange Cache Warm Up by dibahlfi · Pull Request #47066 · Azure/azure-sdk-for-python

dibahlfi · 2026-05-22T00:53:32Z

This PR fixes a bug where if a customer set a short deadline on a request and that deadline ran out while the address-list lookup was still in progress, the lookup was thrown away. The customer's next call would start the same lookup again from scratch, hit the same deadline, and fail the same way. A short deadline on a slow network could keep a customer stuck in this loop indefinitely.

This PR fixes the lookup so it survives the customer's deadline. Two improvements ship together, each addressing a different way the deadline could reach the lookup:

Improvement 1 (applies to the async client). The lookup now keeps running in the background even if the customer's wait runs out. The customer still sees their timeout, but the work the SDK started isn't lost — it finishes a moment later and the result is saved. The customer's retry finds the answer already there and proceeds immediately, with no extra round-trip to the service.

Improvement 2 (applies to both sync and async clients). A few SDK methods (most visibly read_feed_ranges) pass the customer's timeout all the way down to the internal lookup. That meant the customer's "2-second budget for this call" was also bounding the internal address-list lookup, which could itself time out before the customer's actual work even started. The SDK now keeps the customer's deadline scoped to the work the customer actually asked about; the internal lookup runs under the SDK's own retry rules.

Scope (intentional)
This change only affects the internal address-list cache for container partitions. Other internal caches the SDK keeps (such as container-properties) are not changed: they are populated by a single small request that doesn't have the same failure pattern, so changing them would add risk without a customer-visible benefit.

dibahlfi · 2026-05-22T00:53:54Z

@sdkReviewAgent-2

dibahlfi · 2026-05-22T00:54:14Z

/azp run python - cosmos - tests

azure-pipelines · 2026-05-22T00:54:29Z

Azure Pipelines successfully started running 1 pipeline(s).

Copilot

Pull request overview

Improves Azure Cosmos DB async routing-map (PKRange) cache warm-up reliability under aggressive caller timeouts/cancellation by introducing a shared “single-flight” in-flight fetch task per (event loop, collection), so cache publication can complete even if the originating awaiter is cancelled.

Changes:

Add shared in-flight fetch task tracking in async PartitionKeyRangeCache and await via asyncio.shield to decouple cache publication from caller cancellation.
Extend async/sync routing-map provider tests to cover timeout-kwarg forwarding, cancellation survival, single-flight behavior, and in-flight cleanup.
Update shared-cache lifecycle tests to reset/validate the new shared in-flight state.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File	Description
`sdk/cosmos/azure-cosmos/tests/routing/test_shared_pk_range_cache_async.py`	Clears new shared in-flight state between tests; adds lifecycle test for releasing while a fetch is in flight.
`sdk/cosmos/azure-cosmos/tests/routing/test_routing_map_provider.py`	Adds sync coverage ensuring tight `timeout=` kwargs are forwarded and caching still populates on success; expands concurrency test commentary.
`sdk/cosmos/azure-cosmos/tests/routing/test_routing_map_provider_async.py`	Adds async coverage for cancellation survival, single-flight coalescing, in-flight cleanup on success/failure, and concurrency invariants.
`sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py`	Implements shared in-flight fetch-and-publish tasks and shields awaiters so cache warm-up can complete despite caller cancellation.

xinlian12 · 2026-05-22T01:43:56Z

✅ Review complete (48:46)

Posted 4 inline comment(s).

_{Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage}

dibahlfi · 2026-05-22T03:23:29Z

@sdkReviewAgent-2

dibahlfi · 2026-05-22T03:23:44Z

/azp run python - cosmos - tests

azure-pipelines · 2026-05-22T03:23:59Z

Azure Pipelines successfully started running 1 pipeline(s).

xinlian12 · 2026-05-22T04:06:32Z

✅ Review complete (42:20)

Posted 3 inline comment(s).

_{Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage}

dibahlfi · 2026-05-22T05:20:50Z

@sdkReviewAgent-2

dibahlfi · 2026-05-22T06:06:49Z

/azp run python - cosmos - tests

azure-pipelines · 2026-05-22T06:07:05Z

Azure Pipelines successfully started running 1 pipeline(s).

xinlian12 · 2026-05-22T15:05:48Z

+        # that swapped ``.clear()`` for ``= {}`` would leave the cache
+        # empty here because the in-flight task would have written into
+        # the now-orphan old dict.
+        self.assertEqual(len(c1._collection_routing_map_by_item), 1)


🟢 Suggestion — Testing: Dict-identity assertion doesn't actually pin the invariant it claims to

The step 4 comment says a regression from .clear() to = {} in clear_cache "would leave the cache empty here" — but this assertion wouldn't catch that regression. Here's why:

If clear_cache were changed to _shared_routing_map_cache[self._endpoint] = {}:

c1._collection_routing_map_by_item still references the old dict (bound in __init__)

The in-flight task also references the old dict (via self._collection_routing_map_by_item)

The task writes into the old dict → len(c1._collection_routing_map_by_item) == 1 → test passes ✅

But a new PartitionKeyRangeCache for the same endpoint would get the new empty dict from _shared_routing_map_cache[ep] and see no cached data ❌

Adding one assertion would truly pin the invariant:

from azure.cosmos._routing.aio.routing_map_provider import _shared_routing_map_cache self.assertIs( c1._collection_routing_map_by_item, _shared_routing_map_cache[ep], "clear_cache must use in-place .clear(), not reassignment, " "so the provider's dict reference stays in sync with the module registry", )

This assertIs check would fail under = {} because the module registry holds the new dict while the provider holds the old one.

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

xinlian12 · 2026-05-22T15:06:06Z

✅ Review complete (44:47)

Posted 1 inline comment(s).

_{Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage}

PKRange Cache fix

904b3e1

dibahlfi requested a review from a team as a code owner May 22, 2026 00:53

Copilot AI review requested due to automatic review settings May 22, 2026 00:53

Copilot started reviewing on behalf of dibahlfi May 22, 2026 00:53 View session

github-actions Bot added the Cosmos label May 22, 2026

github-project-automation Bot added this to CosmosDB Python Eco-System May 22, 2026

Copilot AI reviewed May 22, 2026

View reviewed changes

Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py Outdated

Comment thread sdk/cosmos/azure-cosmos/tests/routing/test_shared_pk_range_cache_async.py Outdated