feat(spanner): add shared endpoint cooldowns for location-aware rerouting #12845

Open
rahul2393 wants to merge 3 commits into googleapis:main from
rahul2393:endpoint-cooldown-re

@rahul2393
Contributor

Summary

This PR improves Java Spanner's location-aware routing behavior when a routed replica returns RESOURCE_EXHAUSTED.

Instead of immediately sending retries or subsequent requests back to the same replica, the client now keeps a shared cooldown for overloaded endpoints and reroutes traffic to another eligible endpoint when possible. The PR also reduces hot-path contention in the location-aware range cache by removing per-group synchronization from the read path.

What changed

Shared endpoint cooldowns

  • Added EndpointOverloadCooldownTracker to track short-lived cooldowns for routed endpoints that return RESOURCE_EXHAUSTED.
  • Kept cooldown state at the KeyAwareChannel level so it is shared across requests instead of recreated per call.
  • Continued to use request-scoped excluded endpoints for same-logical-request rerouting, while also honoring channel-level cooldowns for later requests.
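
As a rough illustration of a channel-level cooldown, the following is a minimal sketch and not the PR's EndpointOverloadCooldownTracker. The class name, constructor parameters, the jitter range, and the rule that an expired cooldown restarts the backoff sequence are all assumptions made for the example.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.LongSupplier;

/** Hypothetical sketch of a shared cooldown tracker; names and policy are assumptions. */
final class CooldownTrackerSketch {
  /** Immutable per-endpoint state: consecutive overloads and the cooldown deadline. */
  private static final class Entry {
    final int failures;
    final long expiresAtNanos;

    Entry(int failures, long expiresAtNanos) {
      this.failures = failures;
      this.expiresAtNanos = expiresAtNanos;
    }
  }

  private final ConcurrentHashMap<String, Entry> entries = new ConcurrentHashMap<>();
  private final long baseNanos;
  private final long maxNanos;
  private final LongSupplier clock; // injectable so tests can control time

  CooldownTrackerSketch(long baseNanos, long maxNanos, LongSupplier clock) {
    this.baseNanos = baseNanos;
    this.maxNanos = maxNanos;
    this.clock = clock;
  }

  /** Records an overload response from the endpoint and starts or extends its cooldown. */
  void recordOverload(String endpoint) {
    long now = clock.getAsLong();
    entries.compute(endpoint, (key, old) -> {
      // Assumption: an already expired cooldown restarts the backoff sequence.
      int failures = (old == null || now - old.expiresAtNanos >= 0) ? 1 : old.failures + 1;
      // Exponential growth, capped at maxNanos (shift is bounded to avoid overflow).
      long ceiling = Math.min(maxNanos, baseNanos << Math.min(failures - 1, 20));
      // Jitter: pick a delay uniformly in [ceiling / 2, ceiling].
      long delay = ceiling / 2 + ThreadLocalRandom.current().nextLong(ceiling / 2 + 1);
      return new Entry(failures, now + delay);
    });
  }

  /** Returns true while the endpoint's cooldown has not yet expired. */
  boolean isCoolingDown(String endpoint) {
    Entry entry = entries.get(endpoint);
    if (entry == null) {
      return false;
    }
    if (clock.getAsLong() - entry.expiresAtNanos >= 0) {
      entries.remove(endpoint, entry); // lazy cleanup of the expired entry
      return false;
    }
    return true;
  }
}
```

Because one instance would live alongside the channel rather than the call, every request sharing that channel sees the same cooldown state.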

Routing behavior

  • Updated KeyAwareChannel to:
    • record routed overloaded endpoints in the cooldown tracker
    • combine request-scoped exclusions with channel-level cooldown checks
    • expose test hooks for cooldown/exclusion assertions
  • Updated GapicSpannerRpc so streaming reads can retry on
    RESOURCE_EXHAUSTED, which allows bypass traffic to move to another replica.
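
One way to picture how request-scoped exclusions and channel-level cooldowns combine is the sketch below. It is not KeyAwareChannel's code: the class, method, and parameter names are hypothetical, endpoints are plain strings, and the one-time consumption of request-scoped exclusions is not modeled.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical sketch of endpoint selection; not the PR's KeyAwareChannel logic. */
final class EndpointSelectionSketch {
  /**
   * Returns the first candidate, in preference order, that is neither excluded for the
   * current logical request nor in a shared cooldown. An empty result means the caller
   * has no eligible routed endpoint and must fall back to whatever default it has.
   */
  static Optional<String> select(
      List<String> candidatesInPreferenceOrder,
      Set<String> requestExcluded,
      Predicate<String> coolingDown) {
    for (String endpoint : candidatesInPreferenceOrder) {
      if (requestExcluded.contains(endpoint)) {
        continue; // this logical request already failed on the endpoint
      }
      if (coolingDown.test(endpoint)) {
        continue; // an earlier request recently saw an overload here
      }
      return Optional.of(endpoint);
    }
    return Optional.empty();
  }
}
```

For example, with candidates r1, r2, r3, a request-scoped exclusion on r1, and a cooldown on r2, the sketch selects r3.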

Hot-path optimization

  • Refactored KeyRangeCache group state to immutable snapshots.
  • Removed synchronized group selection from the location-aware read path.
  • Kept routing semantics unchanged:
    • same leader fast path
    • same directed-read filtering
    • same skipped-tablet reporting
    • same background endpoint recreation behavior
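
The general immutable-snapshot pattern behind this kind of refactor can be sketched as follows. This is not the PR's KeyRangeCache code: the class names are hypothetical, tablets are reduced to strings, and the use of AtomicReference is an assumption about how a snapshot might be published.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch of lock-free reads over immutable snapshots. */
final class SnapshotGroupSketch {
  /** Immutable view of a group's state; never modified after construction. */
  static final class Snapshot {
    final int leaderIndex;
    final List<String> tablets;

    Snapshot(int leaderIndex, List<String> tablets) {
      this.leaderIndex = leaderIndex;
      // Defensive copy so later changes to the caller's list cannot leak in.
      this.tablets = Collections.unmodifiableList(new ArrayList<>(tablets));
    }
  }

  private final AtomicReference<Snapshot> current =
      new AtomicReference<>(new Snapshot(-1, Collections.emptyList()));

  /** Writer path: build a complete new snapshot, then publish it with one atomic store. */
  void update(int leaderIndex, List<String> tablets) {
    current.set(new Snapshot(leaderIndex, tablets));
  }

  /** Reader path: a single volatile read and no lock; all fields come from one snapshot. */
  String leaderOrNull() {
    Snapshot snapshot = current.get();
    boolean hasLeader = snapshot.leaderIndex >= 0 && snapshot.leaderIndex < snapshot.tablets.size();
    return hasLeader ? snapshot.tablets.get(snapshot.leaderIndex) : null;
  }
}
```

Readers never observe a half-applied update because they take one reference and read only from it, which is what lets a synchronized block be dropped from the read path in a design like this.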

Test coverage

  • Added a shared-backend replica harness for end-to-end location-aware routing scenarios.
  • Added coverage for:
    • rerouting single-use reads/queries after RESOURCE_EXHAUSTED
    • next-request cooldown behavior
    • request-scoped exclusion consumption
    • endpoint cooldown visibility from KeyAwareChannel

Why

Before this change, location-aware bypass traffic could immediately route back to the same overloaded replica after a routed RESOURCE_EXHAUSTED, especially for later requests. That made rerouting less effective under overload. Separately, per-group synchronization in the range cache added avoidable contention on the read hot path.

This PR makes rerouting behavior closer to the intended client-side policy:

  • retries of the same logical request avoid the failed endpoint once
  • later requests avoid recently overloaded endpoints for a short cooldown window
  • routing state is shared at the client/channel level
  • read-path locking is reduced without weakening routing correctness

rahul2393 requested review from a team as code owners on April 17, 2026, 22:34
Contributor

gemini-code-assist bot left a comment

Code Review

This pull request introduces an endpoint cooldown mechanism to handle RESOURCE_EXHAUSTED errors and refactors the KeyRangeCache to use immutable snapshots, replacing per-group locking to improve read performance. The new EndpointOverloadCooldownTracker manages short-lived cooldowns with exponential backoff and jitter, while KeyAwareChannel is updated to exclude endpoints on both RESOURCE_EXHAUSTED and UNAVAILABLE status codes. Feedback is provided to optimize the GroupSnapshot constructor by removing a redundant list copy.

Comment on lines +573 to +577
    private GroupSnapshot(ByteString generation, int leaderIndex, List<TabletSnapshot> tablets) {
      this.generation = generation;
      this.leaderIndex = leaderIndex;
      this.tablets = Collections.unmodifiableList(new ArrayList<>(tablets));
    }
Contributor

medium

The GroupSnapshot constructor performs a redundant copy of the tablets list. Since the only caller (CachedGroup.update) already creates a new ArrayList, we can wrap it directly in an unmodifiable list to avoid unnecessary allocations.

Suggested change
-   private GroupSnapshot(ByteString generation, int leaderIndex, List<TabletSnapshot> tablets) {
-     this.generation = generation;
-     this.leaderIndex = leaderIndex;
-     this.tablets = Collections.unmodifiableList(new ArrayList<>(tablets));
-   }
+   private GroupSnapshot(ByteString generation, int leaderIndex, List<TabletSnapshot> tablets) {
+     this.generation = generation;
+     this.leaderIndex = leaderIndex;
+     this.tablets = Collections.unmodifiableList(tablets);
+   }
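
The trade-off in this suggestion is that Collections.unmodifiableList returns a read-only view of the list it is given, not a copy. Dropping the copy is therefore safe only while no caller keeps and later mutates the list it passed in, which the comment above says holds for the single current caller. The sketch below illustrates the general behavior with hypothetical helper names; it is not code from the PR.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helpers contrasting a read-only view with a defensive copy. */
final class UnmodifiableViewSketch {
  /** Wraps the caller's list directly: no allocation for a copy, but changes to source show through. */
  static List<String> wrapOnly(List<String> source) {
    return Collections.unmodifiableList(source);
  }

  /** Copies first: one extra allocation, but the result is isolated from source. */
  static List<String> copyThenWrap(List<String> source) {
    return Collections.unmodifiableList(new ArrayList<>(source));
  }

  public static void main(String[] args) {
    List<String> source = new ArrayList<>();
    source.add("t0");
    List<String> view = wrapOnly(source);
    List<String> copy = copyThenWrap(source);
    source.add("t1"); // mutate the original list after wrapping
    System.out.println("view size: " + view.size());
    System.out.println("copy size: " + copy.size());
  }
}
```

After the mutation the view reports two elements while the copy still reports one, so the cheaper form relies on the caller handing over ownership of the list.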
