PKRange Cache Warm Up#47066
Conversation
|
@sdkReviewAgent-2 |
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Pull request overview
Improves Azure Cosmos DB async routing-map (PKRange) cache warm-up reliability under aggressive caller timeouts/cancellation by introducing a shared “single-flight” in-flight fetch task per (event loop, collection), so cache publication can complete even if the originating awaiter is cancelled.
Changes:
- Add shared in-flight fetch task tracking in async
PartitionKeyRangeCacheand await viaasyncio.shieldto decouple cache publication from caller cancellation. - Extend async/sync routing-map provider tests to cover timeout-kwarg forwarding, cancellation survival, single-flight behavior, and in-flight cleanup.
- Update shared-cache lifecycle tests to reset/validate the new shared in-flight state.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
sdk/cosmos/azure-cosmos/tests/routing/test_shared_pk_range_cache_async.py |
Clears new shared in-flight state between tests; adds lifecycle test for releasing while a fetch is in flight. |
sdk/cosmos/azure-cosmos/tests/routing/test_routing_map_provider.py |
Adds sync coverage ensuring tight timeout= kwargs are forwarded and caching still populates on success; expands concurrency test commentary. |
sdk/cosmos/azure-cosmos/tests/routing/test_routing_map_provider_async.py |
Adds async coverage for cancellation survival, single-flight coalescing, in-flight cleanup on success/failure, and concurrency invariants. |
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py |
Implements shared in-flight fetch-and-publish tasks and shields awaiters so cache warm-up can complete despite caller cancellation. |
|
✅ Review complete (48:46) Posted 4 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
|
@sdkReviewAgent-2 |
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
✅ Review complete (42:20) Posted 3 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
|
@sdkReviewAgent-2 |
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
| # that swapped ``.clear()`` for ``= {}`` would leave the cache | ||
| # empty here because the in-flight task would have written into | ||
| # the now-orphan old dict. | ||
| self.assertEqual(len(c1._collection_routing_map_by_item), 1) |
There was a problem hiding this comment.
🟢 Suggestion — Testing: Dict-identity assertion doesn't actually pin the invariant it claims to
The step 4 comment says a regression from .clear() to = {} in clear_cache "would leave the cache empty here" — but this assertion wouldn't catch that regression. Here's why:
If clear_cache were changed to _shared_routing_map_cache[self._endpoint] = {}:
c1._collection_routing_map_by_itemstill references the old dict (bound in__init__)- The in-flight task also references the old dict (via
self._collection_routing_map_by_item) - The task writes into the old dict →
len(c1._collection_routing_map_by_item) == 1→ test passes ✅ - But a new
PartitionKeyRangeCachefor the same endpoint would get the new empty dict from_shared_routing_map_cache[ep]and see no cached data ❌
Adding one assertion would truly pin the invariant:
from azure.cosmos._routing.aio.routing_map_provider import _shared_routing_map_cache
self.assertIs(
c1._collection_routing_map_by_item,
_shared_routing_map_cache[ep],
"clear_cache must use in-place .clear(), not reassignment, "
"so the provider's dict reference stays in sync with the module registry",
)This assertIs check would fail under = {} because the module registry holds the new dict while the provider holds the old one.
|
✅ Review complete (44:47) Posted 1 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
This PR fixes a bug where if a customer set a short deadline on a request and that deadline ran out while the address-list lookup was still in progress, the lookup was thrown away. The customer's next call would start the same lookup again from scratch, hit the same deadline, and fail the same way. A short deadline on a slow network could keep a customer stuck in this loop indefinitely.
This PR fixes the lookup so it survives the customer's deadline. Two improvements ship together, each addressing a different way the deadline could reach the lookup:
Improvement 1 (applies to the async client). The lookup now keeps running in the background even if the customer's wait runs out. The customer still sees their timeout, but the work the SDK started isn't lost — it finishes a moment later and the result is saved. The customer's retry finds the answer already there and proceeds immediately, with no extra round-trip to the service.
Improvement 2 (applies to both sync and async clients). A few SDK methods (most visibly read_feed_ranges) pass the customer's timeout all the way down to the internal lookup. That meant the customer's "2-second budget for this call" was also bounding the internal address-list lookup, which could itself time out before the customer's actual work even started. The SDK now keeps the customer's deadline scoped to the work the customer actually asked about; the internal lookup runs under the SDK's own retry rules.
Scope (intentional)
This change only affects the internal address-list cache for container partitions. Other internal caches the SDK keeps (such as container-properties) are not changed: they are populated by a single small request that doesn't have the same failure pattern, so changing them would add risk without a customer-visible benefit.