Untyped ValueError/AssertionError bug #47091
@sdkReviewAgent-2
/azp run python - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
Pull request overview
This PR addresses a Cosmos routing-map construction failure in which transient inconsistencies in paginated /pkranges snapshots could surface as a bare ValueError("Ranges overlap") or be swallowed into empty results. It converts overlap into a typed sentinel with bounded retry/backoff, and surfaces a transient CosmosHttpResponseError(503) once the retry budget is exhausted.
Changes:
- Introduces _OverlapDetected as a dedicated overlap sentinel and improves overlap diagnostics in CollectionRoutingMap.is_complete_set_of_range.
- Adds a shared overlap retry/backoff policy helper (with jitter) and integrates it into both sync and async routing-map providers.
- Adds regression + end-to-end tests covering transient overlap recovery, persistent overlap → 503, and incremental-merge overlap fallback behavior.
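The retry behavior described in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `OverlapDetected` and `TransientServiceError` are hypothetical stand-ins for `_OverlapDetected` and `CosmosHttpResponseError(503)`, and the helper names, attempt budget, and delay values are assumptions chosen for the example.

```python
import random
import time


class OverlapDetected(Exception):
    """Hypothetical stand-in for the PR's _OverlapDetected sentinel."""


class TransientServiceError(Exception):
    """Hypothetical stand-in for CosmosHttpResponseError with status 503."""

    def __init__(self, message, status_code=503):
        super().__init__(message)
        self.status_code = status_code


def backoff_delay(attempt, base=0.1, cap=2.0, rng=random.random):
    """Exponential backoff with full jitter.

    Returns a delay in [0, min(cap, base * 2**attempt)] seconds.
    The base and cap values are assumptions, not the PR's settings.
    """
    return rng() * min(cap, base * (2 ** attempt))


def load_with_overlap_retry(build_map, max_attempts=3, sleep=time.sleep):
    """Call build_map() until it succeeds or the retry budget is spent."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(max_attempts):
        try:
            return build_map()
        except OverlapDetected as exc:
            last_error = exc
            # Sleep only if another attempt remains in the budget.
            if attempt + 1 < max_attempts:
                sleep(backoff_delay(attempt))
    # Budget exhausted: surface a typed, retryable error, not a bare ValueError.
    raise TransientServiceError(
        f"routing map still overlapping after {max_attempts} attempts"
    ) from last_error
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; an async variant would presumably await `asyncio.sleep` at the same point.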
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/azure/cosmos/_routing/collection_routing_map.py | Adds overlap sentinel, dedups full-load ranges by id, converts overlap ValueError into _OverlapDetected, and improves overlap error messages. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_routing/_routing_map_provider_common.py | Centralizes overlap retry budget/backoff policy (with jitter) and converts incremental-merge overlap ValueError into _IncrementalMergeFailed. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_routing/routing_map_provider.py | Sync provider catches _OverlapDetected, applies retry/backoff, and surfaces 503 on budget exhaustion. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py | Async provider mirrors the same _OverlapDetected retry/backoff behavior via await asyncio.sleep. |
| sdk/cosmos/azure-cosmos/tests/test_routing_map_provider_unit.py | Adds sync-focused unit/e2e tests for overlap retry/backoff and persistent-overlap 503 surfacing. |
| sdk/cosmos/azure-cosmos/tests/test_routing_map_provider_unit_async.py | Adds async equivalents plus an incremental-overlap regression test for _IncrementalMergeFailed conversion. |
| sdk/cosmos/azure-cosmos/tests/routing/test_collection_routing_map.py | Adds routing-map builder regression tests for the three observed overlap modes and improved overlap diagnostics. |
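To make the routing-map changes in the table more concrete, here is a rough sketch of deduplicating ranges by id and raising a typed exception on overlap. It is an illustration under stated assumptions, not the SDK's implementation: the function names are hypothetical, ranges are plain dicts with `id`, `minInclusive`, and `maxExclusive` keys, the key space is assumed to run from `""` to `"FF"`, boundaries are compared as plain strings, and keeping the first occurrence of a duplicate id is an arbitrary choice.

```python
class OverlapDetected(Exception):
    """Hypothetical stand-in for the PR's _OverlapDetected sentinel."""


def dedup_by_id(ranges):
    """Drop repeated ranges, keeping the first occurrence of each id.

    Which duplicate to keep is an assumption made for this sketch.
    """
    seen = {}
    for r in ranges:
        seen.setdefault(r["id"], r)
    return list(seen.values())


def check_complete(ranges, min_key="", max_key="FF"):
    """Return True if the ranges tile [min_key, max_key) with no gaps.

    Returns False when the set is incomplete and raises OverlapDetected
    when two neighbouring ranges overlap. String comparison of the
    boundaries is a simplification for this example.
    """
    ordered = sorted(ranges, key=lambda r: r["minInclusive"])
    if not ordered:
        return False
    if ordered[0]["minInclusive"] != min_key:
        return False
    if ordered[-1]["maxExclusive"] != max_key:
        return False
    for prev, cur in zip(ordered, ordered[1:]):
        if prev["maxExclusive"] > cur["minInclusive"]:
            # Name both ranges so the error is diagnosable.
            raise OverlapDetected(
                f"range {prev['id']} [{prev['minInclusive']!r}, "
                f"{prev['maxExclusive']!r}) overlaps range {cur['id']} "
                f"[{cur['minInclusive']!r}, {cur['maxExclusive']!r})"
            )
        if prev["maxExclusive"] < cur["minInclusive"]:
            return False  # gap between neighbours
    return True
```

In this sketch a page that repeats an already-seen range would be reported as an overlap unless `dedup_by_id` runs first, which is one plausible reason to deduplicate before the completeness check.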

✅ Review complete (52:47) Posted 4 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage
/azp run python - cosmos - tests
@sdkReviewAgent-2
Azure Pipelines successfully started running 1 pipeline(s).
✅ Review complete (51:24) Posted 2 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage
/azp run python - cosmos - tests
@sdkReviewAgent-2
Azure Pipelines successfully started running 1 pipeline(s).
This PR fixes two related crash paths in the partition key range cache that can happen when the gateway returns a transiently inconsistent paginated /pkranges snapshot:
Both are caused by the same class of transient metadata inconsistency, but with different shapes:
Why this change is needed
We’ve seen this in a real high-scale scenario (very high partition count, long-running async scan).
The SDK previously handled these inconsistencies asymmetrically:
So transient metadata blips could fail user workloads instead of being treated as retryable.
What changed
Reviewer-relevant behavior change
Before:
After:
Scope
This PR is intentionally scoped to PK-range cache resilience for transient /pkranges inconsistency handling; it does not change public API surface.