Skip to content

Untyped ValueError/AssertionError bug#47091

Open
dibahlfi wants to merge 3 commits into
mainfrom
users/dibahl/pkrangevalue-error-fix
Open

Untyped ValueError/AssertionError bug#47091
dibahlfi wants to merge 3 commits into
mainfrom
users/dibahl/pkrangevalue-error-fix

Conversation

@dibahlfi
Copy link
Copy Markdown
Member

@dibahlfi dibahlfi commented May 22, 2026

This PR fixes two related crash paths in the partition key range cache that can happen when the gateway returns a transiently inconsistent paginated /pkranges snapshot:

  1. ValueError("Ranges overlap ...")
  2. AssertionError("code bug: returned overlapping ranges ... is empty")

Both are caused by the same class of transient metadata inconsistency, but with different shapes:

  • overlap: two ranges claim the same key space
  • gap: no range claims part of key space

Why this change is needed

We’ve seen this in a real high-scale scenario (very high partition count, long-running async scan).
The SDK previously handled these inconsistencies asymmetrically:

  • overlap could escape as a hard ValueError
  • gap could flow through as an incomplete map and later crash in SmartRoutingMapProvider with the empty-overlap assertion

So transient metadata blips could fail user workloads instead of being treated as retryable.

What changed

  • Unified handling for full-load transient snapshot inconsistencies (overlap + gap):
  • detect inconsistency
  • retry with bounded jittered backoff
  • if still inconsistent after retry budget, raise typed CosmosHttpResponseError(status_code=503) so existing retry policy can absorb it
  • Kept incremental-path safety:
  • overlap from incremental merge is converted into incremental-merge failure/fallback behavior (with guarded conversion so unrelated ValueErrors still surface normally)
  • Added dedup-by-range-id before validation to tolerate duplicate IDs across paginated snapshots from slightly skewed gateway views.
  • Applied consistently in sync and async routing map providers.

Reviewer-relevant behavior change
Before:

  • transient overlap/gap could surface as internal exceptions (ValueError / assertion)

After:

  • transient inconsistency is retried internally
  • persistent inconsistency surfaces as retryable 503 (instead of non-retryable internal crash)

Scope
This PR is intentionally scoped to PK-range cache resilience for transient /pkranges inconsistency handling; it does not change public API surface.

@dibahlfi dibahlfi requested a review from a team as a code owner May 22, 2026 22:14
Copilot AI review requested due to automatic review settings May 22, 2026 22:14
@dibahlfi
Copy link
Copy Markdown
Member Author

@sdkReviewAgent-2

@dibahlfi
Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a Cosmos routing-map construction failure where transient, paginated /pkranges snapshot inconsistencies could surface as a bare ValueError("Ranges overlap") (or be swallowed into empty results), by converting overlap into a typed sentinel + bounded retry/backoff and ultimately surfacing a transient CosmosHttpResponseError(503) when the retry budget is exhausted.

Changes:

  • Introduces _OverlapDetected as a dedicated overlap sentinel and improves overlap diagnostics in CollectionRoutingMap.is_complete_set_of_range.
  • Adds a shared overlap retry/backoff policy helper (with jitter) and integrates it into both sync and async routing-map providers.
  • Adds regression + end-to-end tests covering transient overlap recovery, persistent overlap → 503, and incremental-merge overlap fallback behavior.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/collection_routing_map.py Adds overlap sentinel, dedups full-load ranges by id, converts overlap ValueError into _OverlapDetected, and improves overlap error messages.
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/_routing_map_provider_common.py Centralizes overlap retry budget/backoff policy (with jitter) and converts incremental-merge overlap ValueError into _IncrementalMergeFailed.
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/routing_map_provider.py Sync provider catches _OverlapDetected, applies retry/backoff, and surfaces 503 on budget exhaustion.
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py Async provider mirrors the same _OverlapDetected retry/backoff behavior via await asyncio.sleep.
sdk/cosmos/azure-cosmos/tests/test_routing_map_provider_unit.py Adds sync-focused unit/e2e tests for overlap retry/backoff and persistent-overlap 503 surfacing.
sdk/cosmos/azure-cosmos/tests/test_routing_map_provider_unit_async.py Adds async equivalents plus an incremental-overlap regression test for _IncrementalMergeFailed conversion.
sdk/cosmos/azure-cosmos/tests/routing/test_collection_routing_map.py Adds routing-map builder regression tests for the three observed overlap modes and improved overlap diagnostics.

Comment thread sdk/cosmos/azure-cosmos/tests/test_routing_map_provider_unit.py
Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/_routing_map_provider_common.py Outdated
Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/_routing_map_provider_common.py Outdated
Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/routing_map_provider.py Outdated
Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/routing_map_provider.py Outdated
@xinlian12
Copy link
Copy Markdown
Member

Review complete (52:47)

Posted 4 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@dibahlfi dibahlfi changed the title fixing ValueError bug Untyped ValueError/AssertionError bug May 23, 2026
@dibahlfi
Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@dibahlfi
Copy link
Copy Markdown
Member Author

@sdkReviewAgent-2

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12
Copy link
Copy Markdown
Member

Review complete (51:24)

Posted 2 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@dibahlfi
Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@dibahlfi
Copy link
Copy Markdown
Member Author

@sdkReviewAgent-2

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

3 participants