feat(spanner): add shared endpoint cooldowns for location-aware rerouting by rahul2393 · Pull Request #12832 · googleapis/google-cloud-java

rahul2393 · 2026-04-17T16:30:21Z

Summary

This PR improves Java Spanner's location-aware routing behavior when a routed replica returns RESOURCE_EXHAUSTED.

Instead of immediately sending retries or subsequent requests back to the same replica, the client now keeps a shared cooldown for overloaded endpoints and reroutes traffic to another eligible endpoint when possible. The PR also reduces hot-path contention in the location-aware range cache by removing per-group synchronization from the read path.

What changed

Shared endpoint cooldowns

Added EndpointOverloadCooldownTracker to track short-lived cooldowns for routed endpoints that return RESOURCE_EXHAUSTED.
Kept cooldown state at the KeyAwareChannel level so it is shared across requests instead of recreated per call.
Continued to use request-scoped excluded endpoints for same-logical-request rerouting, while also honoring channel-level cooldowns for later requests.
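A minimal sketch of what a shared cooldown tracker along these lines might look like. This is not the PR's EndpointOverloadCooldownTracker: the class name, the String endpoint key, the fixed cooldown length, and the injected clock are all assumptions made for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.LongSupplier;

/** Hypothetical sketch only; names and policy are assumptions, not the PR's code. */
final class CooldownTrackerSketch {
  // Endpoint address -> nanosecond deadline until which the endpoint is avoided.
  private final ConcurrentMap<String, Long> cooldownUntilNanos = new ConcurrentHashMap<>();
  private final long cooldownNanos;
  private final LongSupplier nanoClock; // injected so tests can control time

  CooldownTrackerSketch(long cooldownNanos, LongSupplier nanoClock) {
    this.cooldownNanos = cooldownNanos;
    this.nanoClock = nanoClock;
  }

  /** Called when a routed endpoint answers RESOURCE_EXHAUSTED. */
  void recordOverload(String endpoint) {
    cooldownUntilNanos.put(endpoint, nanoClock.getAsLong() + cooldownNanos);
  }

  /** True while the endpoint's cooldown window has not yet elapsed. */
  boolean isCoolingDown(String endpoint) {
    Long until = cooldownUntilNanos.get(endpoint);
    if (until == null) {
      return false;
    }
    if (until - nanoClock.getAsLong() > 0) {
      return true;
    }
    // Expired: drop the entry, but only if no newer overload replaced it.
    cooldownUntilNanos.remove(endpoint, until);
    return false;
  }
}
```

Holding one such instance per channel, rather than creating one per call, is what would let later requests see an overload recorded by an earlier one.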

Routing behavior

Updated KeyAwareChannel to:
- record routed overloaded endpoints in the cooldown tracker
- combine request-scoped exclusions with channel-level cooldown checks
- expose test hooks for cooldown/exclusion assertions
Updated GapicSpannerRpc so streaming reads can retry on RESOURCE_EXHAUSTED, which allows bypass traffic to move to another replica.
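The combination of request-scoped exclusions with channel-level cooldown checks can be pictured roughly as follows. This is an illustrative sketch, not KeyAwareChannel's actual selection logic: the method name, the ordered candidate list, and the empty-result fallback are assumptions.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical sketch only; not the selection code from the PR. */
final class EndpointSelectionSketch {
  private EndpointSelectionSketch() {}

  /**
   * Returns the first candidate that is neither excluded for this logical request
   * nor currently cooling down at the channel level.
   */
  static Optional<String> pick(
      List<String> candidatesInPreferenceOrder,
      Set<String> excludedForRequest,
      Predicate<String> coolingDown) {
    for (String endpoint : candidatesInPreferenceOrder) {
      if (excludedForRequest.contains(endpoint)) {
        continue; // this request already failed on that endpoint
      }
      if (coolingDown.test(endpoint)) {
        continue; // a recent request saw RESOURCE_EXHAUSTED there
      }
      return Optional.of(endpoint);
    }
    // Assumption: the caller falls back to its default routing when nothing is eligible.
    return Optional.empty();
  }
}
```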

Hot-path optimization

Refactored KeyRangeCache group state to immutable snapshots.
Removed synchronized group selection from the location-aware read path.
Kept routing semantics unchanged:
- same leader fast path
- same directed-read filtering
- same skipped-tablet reporting
- same background endpoint recreation behavior
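The general technique of replacing synchronized group state with immutable snapshots could look like the sketch below. It is not the PR's KeyRangeCache code: the class names and the leader/replicas fields are placeholders chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch only; field and class names are assumptions. */
final class GroupStateSketch {

  /** Immutable view of one group's routing state. */
  static final class Snapshot {
    final String leader;
    final List<String> replicas;

    Snapshot(String leader, List<String> replicas) {
      this.leader = leader;
      // Defensive copy so later changes to the caller's list cannot leak in.
      this.replicas = Collections.unmodifiableList(new ArrayList<>(replicas));
    }
  }

  private final AtomicReference<Snapshot> current;

  GroupStateSketch(Snapshot initial) {
    this.current = new AtomicReference<>(initial);
  }

  /** Read path: a single volatile read, no lock. */
  Snapshot read() {
    return current.get();
  }

  /** Write path: build a new snapshot and publish it atomically. */
  void update(String leader, List<String> replicas) {
    current.set(new Snapshot(leader, replicas));
  }
}
```

A reader that obtained a snapshot before an update keeps a consistent view of the old state, which is why the read path needs no per-group lock in this style.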

Test coverage

Added a shared-backend replica harness for end-to-end location-aware routing scenarios.
Added coverage for:
- rerouting single-use reads/queries after RESOURCE_EXHAUSTED
- next-request cooldown behavior
- request-scoped exclusion consumption
- endpoint cooldown visibility from KeyAwareChannel

Why

Before this change, location-aware bypass traffic could immediately route back to the same overloaded replica after a routed RESOURCE_EXHAUSTED, especially for later requests. That made rerouting less effective under overload and added avoidable hot-path contention in the cache.

This PR makes rerouting behavior closer to the intended client-side policy:

- same logical request retries avoid the failed endpoint once
- later requests avoid recently overloaded endpoints for a short cooldown window
- routing state is shared at the client/channel level
- read-path locking is reduced without weakening routing correctness

gemini-code-assist

Code Review

This pull request introduces an EndpointOverloadCooldownTracker to manage endpoint cooldowns after RESOURCE_EXHAUSTED failures, which is integrated into KeyAwareChannel to influence endpoint selection and avoid overloaded replicas. The KeyRangeCache was refactored to use immutable TabletSnapshot and GroupSnapshot for thread-safe access to tablet metadata, improving routing logic. The GapicSpannerRpc now includes RESOURCE_EXHAUSTED in its streaming read retryable codes to enable this rerouting. New test infrastructure (SharedBackendReplicaHarness) and updated tests (LocationAwareSharedBackendReplicaHarnessTest, KeyAwareChannelTest) validate this behavior, including handling of resume tokens in MockSpannerServiceImpl. A review comment points out that adding RESOURCE_EXHAUSTED to the default retryable codes for all streaming reads is a significant change that might lead to increased latency and unnecessary retries for non-routed traffic, suggesting a need to consider scoping this change or accepting its broader implications.

gemini-code-assist · 2026-04-17T16:34:58Z

+            ImmutableSet.<Code>builder()
+                .addAll(
+                    options.getSpannerStubSettings().streamingReadSettings().getRetryableCodes())
+                .add(Code.RESOURCE_EXHAUSTED)
+                .build();


Adding RESOURCE_EXHAUSTED to the default retryable codes for all streaming reads is a significant change that affects all users of the library, not just those using location-aware routing. While this enables the rerouting logic for overloaded replicas, it also means that any RESOURCE_EXHAUSTED error (including those due to administrative quotas) will now be retried by the GAX layer. This might lead to increased latency and unnecessary retries for permanent quota issues. Consider if this can be scoped only to location-aware calls or if the implications for non-routed traffic are acceptable.

References

Avoid introducing breaking changes to public APIs, even if they have not been part of a public release.
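One way the scoping suggested in the review comment could be expressed is sketched below. It is purely illustrative: the local Code enum stands in for the status code type used in the diff, and the locationAwareRouting flag is a hypothetical input, not an option confirmed by the PR.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch only; not the change made in GapicSpannerRpc. */
final class RetryableCodesSketch {
  private RetryableCodesSketch() {}

  /** Stand-in for the real status code enum, kept local so the sketch is self-contained. */
  enum Code {
    UNAVAILABLE,
    RESOURCE_EXHAUSTED,
    ABORTED
  }

  /**
   * Adds RESOURCE_EXHAUSTED to the streaming-read retryable codes only when
   * location-aware routing is active, leaving other traffic on the defaults.
   */
  static Set<Code> streamingReadRetryableCodes(Set<Code> defaults, boolean locationAwareRouting) {
    EnumSet<Code> codes = EnumSet.noneOf(Code.class);
    codes.addAll(defaults);
    if (locationAwareRouting) {
      codes.add(Code.RESOURCE_EXHAUSTED);
    }
    return Collections.unmodifiableSet(codes);
  }
}
```

With this shape, non-routed calls would keep the library's existing retry behavior for quota-related RESOURCE_EXHAUSTED errors.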

feat(spanner): add shared endpoint cooldowns for location-aware rerou…

ea55529

…ting

rahul2393 requested review from a team as code owners April 17, 2026 16:30

gemini-code-assist bot reviewed Apr 17, 2026

rahul2393 and others added 2 commits April 18, 2026 03:48

retry unavailable errors on different replica

2598291

chore: generate libraries at Fri Apr 17 22:22:01 UTC 2026

5eb78c9

rahul2393 wants to merge 3 commits into main from endpoint-cooldown-re
