feat(spanner): add shared endpoint cooldowns for location-aware rerouting #12832

Open
rahul2393 wants to merge 3 commits into main from endpoint-cooldown-re

@rahul2393
Contributor

Summary

This PR improves Java Spanner's location-aware routing behavior when a routed replica returns RESOURCE_EXHAUSTED.

Instead of immediately sending retries or subsequent requests back to the same replica, the client now keeps a shared cooldown for overloaded endpoints and reroutes traffic to another eligible endpoint when possible. The PR also reduces hot-path contention in the location-aware range cache by removing per-group synchronization from the read path.

What changed

Shared endpoint cooldowns

  • Added EndpointOverloadCooldownTracker to track short-lived cooldowns for routed endpoints that return RESOURCE_EXHAUSTED.
  • Kept cooldown state at the KeyAwareChannel level so it is shared across requests instead of recreated per call.
  • Continued to use request-scoped excluded endpoints for same-logical-request rerouting, while also honoring channel-level cooldowns for later requests.
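
As a rough illustration of the cooldown idea above, a tracker can map each endpoint to an expiry time and treat the endpoint as unavailable until that time passes. This is only a sketch: `CooldownTrackerSketch`, its methods, and the injectable clock are hypothetical names chosen for illustration and are not the API of `EndpointOverloadCooldownTracker` in this PR.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical sketch of a shared cooldown tracker; not the PR's actual class. */
final class CooldownTrackerSketch {
  // Endpoint address -> clock reading (in nanoseconds) at which the cooldown ends.
  private final ConcurrentHashMap<String, Long> cooldownUntilNanos = new ConcurrentHashMap<>();
  private final long cooldownNanos;
  private final LongSupplier nanoClock;

  CooldownTrackerSketch(long cooldownNanos, LongSupplier nanoClock) {
    this.cooldownNanos = cooldownNanos;
    this.nanoClock = nanoClock;
  }

  /** Starts (or restarts) the cooldown for an endpoint that reported overload. */
  void recordOverload(String endpoint) {
    cooldownUntilNanos.put(endpoint, nanoClock.getAsLong() + cooldownNanos);
  }

  /** Returns true while the endpoint's cooldown has not yet expired. */
  boolean isCoolingDown(String endpoint) {
    Long until = cooldownUntilNanos.get(endpoint);
    if (until == null) {
      return false;
    }
    // Subtraction keeps the comparison safe if the clock value wraps around.
    if (nanoClock.getAsLong() - until < 0) {
      return true;
    }
    // Expired: drop the entry, but only if no newer cooldown replaced it.
    cooldownUntilNanos.remove(endpoint, until);
    return false;
  }
}
```

Because one instance would live at the channel level, every request sees the same cooldown state. In production the clock could be `System::nanoTime`; passing it in makes expiry easy to test.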

Routing behavior

  • Updated KeyAwareChannel to:
    • record routed overloaded endpoints in the cooldown tracker
    • combine request-scoped exclusions with channel-level cooldown checks
    • expose test hooks for cooldown/exclusion assertions
  • Updated GapicSpannerRpc so streaming reads can retry on RESOURCE_EXHAUSTED, which allows bypass traffic to move to another replica.
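
The combination of request-scoped exclusions and channel-level cooldown checks described above could look roughly like the following. `EndpointSelectorSketch` and `pick` are hypothetical names, endpoints are modeled as plain strings, and returning an empty result when nothing is eligible is an assumption; the PR description does not say what KeyAwareChannel does in that case.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical sketch of endpoint selection; not KeyAwareChannel's actual logic. */
final class EndpointSelectorSketch {
  private EndpointSelectorSketch() {}

  /**
   * Returns the first candidate, in preference order, that is neither excluded for this
   * request nor currently cooling down at the channel level.
   */
  static Optional<String> pick(
      List<String> candidatesInPreferenceOrder,
      Set<String> requestExcluded,
      Predicate<String> isCoolingDown) {
    for (String endpoint : candidatesInPreferenceOrder) {
      if (requestExcluded.contains(endpoint)) {
        continue; // Already failed for this logical request.
      }
      if (isCoolingDown.test(endpoint)) {
        continue; // Recently overloaded according to shared channel state.
      }
      return Optional.of(endpoint);
    }
    // Assumption: the caller decides how to fall back when nothing is eligible.
    return Optional.empty();
  }
}
```

Keeping the two checks separate mirrors the split in the description: the exclusion set belongs to one logical request, while the cooldown predicate reflects state shared by all requests on the channel.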

Hot-path optimization

  • Refactored KeyRangeCache group state to immutable snapshots.
  • Removed synchronized group selection from the location-aware read path.
  • Kept routing semantics unchanged:
    • same leader fast path
    • same directed-read filtering
    • same skipped-tablet reporting
    • same background endpoint recreation behavior
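
For readers unfamiliar with the snapshot pattern mentioned above, here is a minimal sketch of how immutable snapshots can replace a synchronized read. The class and field names (`GroupStateSketch`, `Snapshot`, `leader`, `replicas`) are hypothetical and much simpler than the PR's KeyRangeCache state; the sketch only shows the general technique of publishing an immutable object through an `AtomicReference`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch of lock-free reads over immutable group state. */
final class GroupStateSketch {
  /** Immutable view of one group's routing state; never modified after construction. */
  static final class Snapshot {
    final String leader;
    final List<String> replicas;

    Snapshot(String leader, List<String> replicas) {
      this.leader = leader;
      // Defensive copy so later changes to the caller's list cannot leak in.
      this.replicas = Collections.unmodifiableList(new ArrayList<>(replicas));
    }
  }

  private final AtomicReference<Snapshot> current;

  GroupStateSketch(Snapshot initial) {
    this.current = new AtomicReference<>(initial);
  }

  /** Read path: a single volatile read, with no lock taken. */
  Snapshot read() {
    return current.get();
  }

  /** Update path: publish a whole new snapshot instead of mutating the old one. */
  void update(Snapshot next) {
    current.set(next);
  }
}
```

A reader that obtained a snapshot keeps a consistent view even if an update lands mid-request, which is why this pattern can remove locking from the read path without changing what the reader observes.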

Test coverage

  • Added a shared-backend replica harness for end-to-end location-aware routing scenarios.
  • Added coverage for:
    • rerouting single-use reads/queries after RESOURCE_EXHAUSTED
    • next-request cooldown behavior
    • request-scoped exclusion consumption
    • endpoint cooldown visibility from KeyAwareChannel

Why

Before this change, location-aware bypass traffic could immediately route back to the same overloaded replica after a routed RESOURCE_EXHAUSTED, especially for later requests. That made rerouting less effective under overload. Separately, per-group synchronization on the read path added avoidable hot-path contention in the range cache.

This PR makes rerouting behavior closer to the intended client-side policy:

  • retries of the same logical request avoid the failed endpoint once
  • later requests avoid recently overloaded endpoints for a short cooldown window
  • routing state is shared at the client/channel level
  • read-path locking is reduced without weakening routing correctness
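
One possible reading of "avoid the failed endpoint once" is a one-shot exclusion: the exclusion is consumed the first time it causes an endpoint to be skipped. The sketch below illustrates that interpretation only; `OneShotExclusionsSketch` and its methods are hypothetical, and the PR's actual consumption semantics may differ.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of request-scoped exclusions that are consumed on first use. */
final class OneShotExclusionsSketch {
  private final Set<String> excluded = ConcurrentHashMap.newKeySet();

  /** Marks an endpoint to be skipped by the next routing decision for this request. */
  void exclude(String endpoint) {
    excluded.add(endpoint);
  }

  /**
   * Returns true if the endpoint was excluded and removes the exclusion, so a later
   * routing decision for the same request may consider the endpoint again.
   */
  boolean consumeIfExcluded(String endpoint) {
    return excluded.remove(endpoint);
  }
}
```

Under this reading, longer-lived avoidance is left to the channel-level cooldown rather than to the request-scoped state.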

@rahul2393 requested review from a team as code owners on April 17, 2026 16:30
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces an EndpointOverloadCooldownTracker to manage endpoint cooldowns after RESOURCE_EXHAUSTED failures, which is integrated into KeyAwareChannel to influence endpoint selection and avoid overloaded replicas. The KeyRangeCache was refactored to use immutable TabletSnapshot and GroupSnapshot for thread-safe access to tablet metadata, improving routing logic. The GapicSpannerRpc now includes RESOURCE_EXHAUSTED in its streaming read retryable codes to enable this rerouting. New test infrastructure (SharedBackendReplicaHarness) and updated tests (LocationAwareSharedBackendReplicaHarnessTest, KeyAwareChannelTest) validate this behavior, including handling of resume tokens in MockSpannerServiceImpl. A review comment points out that adding RESOURCE_EXHAUSTED to the default retryable codes for all streaming reads is a significant change that might lead to increased latency and unnecessary retries for non-routed traffic, suggesting a need to consider scoping this change or accepting its broader implications.

Comment on lines +435 to +439
    ImmutableSet.<Code>builder()
        .addAll(
            options.getSpannerStubSettings().streamingReadSettings().getRetryableCodes())
        .add(Code.RESOURCE_EXHAUSTED)
        .build();
Contributor

high

Adding RESOURCE_EXHAUSTED to the default retryable codes for all streaming reads is a significant change that affects all users of the library, not just those using location-aware routing. While this enables the rerouting logic for overloaded replicas, it also means that any RESOURCE_EXHAUSTED error (including those due to administrative quotas) will now be retried by the GAX layer. This might lead to increased latency and unnecessary retries for permanent quota issues. Consider if this can be scoped only to location-aware calls or if the implications for non-routed traffic are acceptable.

References
  1. Avoid introducing breaking changes to public APIs, even if they have not been part of a public release.
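
To make the scoping suggestion concrete, one option is to add RESOURCE_EXHAUSTED to the retryable set only when location-aware routing is active. The sketch below is illustrative and makes several assumptions: `RetryScopeSketch`, the local `Code` enum (a stand-in for the real status code type), and the `locationAwareRouting` flag are all hypothetical, and how GapicSpannerRpc would actually learn whether a call is routed is not shown in this PR excerpt.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch of scoping the extra retryable code to routed calls. */
final class RetryScopeSketch {
  private RetryScopeSketch() {}

  /** Stand-in for the real status code enum; only the values this sketch needs. */
  enum Code {
    UNAVAILABLE,
    RESOURCE_EXHAUSTED,
    INVALID_ARGUMENT
  }

  /**
   * Returns the configured default codes, plus RESOURCE_EXHAUSTED only when the call
   * uses location-aware routing. Non-routed traffic keeps its existing retry behavior.
   */
  static Set<Code> retryableCodes(Set<Code> defaults, boolean locationAwareRouting) {
    // EnumSet.copyOf rejects an empty non-EnumSet collection, so handle that case.
    EnumSet<Code> codes =
        defaults.isEmpty() ? EnumSet.noneOf(Code.class) : EnumSet.copyOf(defaults);
    if (locationAwareRouting) {
      codes.add(Code.RESOURCE_EXHAUSTED);
    }
    return Collections.unmodifiableSet(codes);
  }
}
```

With this shape, a permanent quota error on non-routed traffic would still surface immediately instead of being retried, which is the concern raised in the comment.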
