feat(spanner): add shared endpoint cooldowns for location-aware rerouting #12845
rahul2393 wants to merge 3 commits into googleapis:main
Code Review
This pull request introduces an endpoint cooldown mechanism to handle RESOURCE_EXHAUSTED errors and refactors the KeyRangeCache to use immutable snapshots, replacing per-group locking to improve read performance. The new EndpointOverloadCooldownTracker manages short-lived cooldowns with exponential backoff and jitter, while KeyAwareChannel is updated to exclude endpoints on both RESOURCE_EXHAUSTED and UNAVAILABLE status codes. Feedback is provided to optimize the GroupSnapshot constructor by removing a redundant list copy.
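The cooldown idea described above (a short-lived cooldown that grows exponentially with repeated overloads and is randomized with jitter) can be sketched as follows. This is a minimal illustration, not the PR's EndpointOverloadCooldownTracker: the class name CooldownSketch, its method names, the string endpoint keys, and the 100 ms / 5 s bounds are all assumptions made for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch of a shared per-endpoint cooldown tracker. */
final class CooldownSketch {
  // Assumed tuning values, chosen only for illustration.
  static final long BASE_MILLIS = 100;
  static final long MAX_MILLIS = 5_000;

  /** Immutable record of consecutive overloads and the cooldown deadline. */
  private static final class Entry {
    final int failures;
    final long untilMillis;

    Entry(int failures, long untilMillis) {
      this.failures = failures;
      this.untilMillis = untilMillis;
    }
  }

  private final Map<String, Entry> entries = new ConcurrentHashMap<>();

  /** Upper bound of the cooldown after the given number of consecutive overloads (>= 1). */
  static long backoffCapMillis(int failures) {
    int shift = Math.min(failures - 1, 20); // bound the shift so the value cannot overflow
    return Math.min(MAX_MILLIS, BASE_MILLIS << shift);
  }

  /** Called when an endpoint reports overload; extends its cooldown. */
  void recordOverload(String endpoint, long nowMillis) {
    entries.compute(endpoint, (key, old) -> {
      int failures = (old == null) ? 1 : old.failures + 1;
      long cap = backoffCapMillis(failures);
      // Jitter: pick a duration uniformly in [cap / 2, cap].
      long delay = cap / 2 + ThreadLocalRandom.current().nextLong(cap / 2 + 1);
      return new Entry(failures, nowMillis + delay);
    });
  }

  /** True while the endpoint should be skipped by routing. */
  boolean isCoolingDown(String endpoint, long nowMillis) {
    Entry entry = entries.get(endpoint);
    return entry != null && nowMillis < entry.untilMillis;
  }

  /** Called after a successful call; clears the failure history. */
  void recordSuccess(String endpoint) {
    entries.remove(endpoint);
  }
}
```

Because one tracker instance would be shared by all requests on a channel, the sketch keeps its state in a ConcurrentHashMap and updates each entry atomically with compute. Passing the clock value in as a parameter is a choice made here to keep the example deterministic to test.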
```java
private GroupSnapshot(ByteString generation, int leaderIndex, List<TabletSnapshot> tablets) {
  this.generation = generation;
  this.leaderIndex = leaderIndex;
  this.tablets = Collections.unmodifiableList(new ArrayList<>(tablets));
}
```
The GroupSnapshot constructor performs a redundant copy of the tablets list. Since the only caller (CachedGroup.update) already creates a new ArrayList, we can wrap it directly in an unmodifiable list to avoid unnecessary allocations.
```diff
 private GroupSnapshot(ByteString generation, int leaderIndex, List<TabletSnapshot> tablets) {
   this.generation = generation;
   this.leaderIndex = leaderIndex;
-  this.tablets = Collections.unmodifiableList(new ArrayList<>(tablets));
+  this.tablets = Collections.unmodifiableList(tablets);
 }
```
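One point worth keeping in mind with this suggestion: Collections.unmodifiableList returns a read-only view, not a copy, so dropping the inner copy is safe only while the caller never touches the list again after handing it over (which is what the comment says about CachedGroup.update). The difference can be shown with a small standalone sketch; the class and method names below are hypothetical and not part of the PR.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper contrasting a read-only view with a defensive copy. */
final class UnmodifiableViewSketch {
  /** Wraps without copying: later changes to source show through the view. */
  static List<String> wrapOnly(List<String> source) {
    return Collections.unmodifiableList(source);
  }

  /** Copies first: the result is detached from later changes to source. */
  static List<String> copyThenWrap(List<String> source) {
    return Collections.unmodifiableList(new ArrayList<>(source));
  }
}
```

If the source list is mutated after wrapping, the list returned by wrapOnly reflects the change while the list returned by copyThenWrap does not. Both reject direct modification with UnsupportedOperationException.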
Summary
This PR improves Java Spanner's location-aware routing behavior when a routed replica returns RESOURCE_EXHAUSTED. Instead of immediately sending retries or subsequent requests back to the same replica, the client now keeps a shared cooldown for overloaded endpoints and reroutes traffic to another eligible endpoint when possible. The PR also reduces hot-path contention in the location-aware range cache by removing per-group synchronization from the read path.
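The contention point can be illustrated with a generic immutable-snapshot pattern: writers build a complete replacement object and publish it through a volatile field, so readers need only one volatile read and no lock. This is a sketch of the general technique under assumed names (SnapshotGroupSketch, string tablet addresses), not the actual KeyRangeCache code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical group holder that publishes immutable snapshots. */
final class SnapshotGroupSketch {
  /** Immutable state; a new instance is created on every update. */
  static final class Snapshot {
    final int leaderIndex;
    final List<String> tablets;

    Snapshot(int leaderIndex, List<String> tablets) {
      this.leaderIndex = leaderIndex;
      // Defensive copy, because this sketch does not control its callers.
      this.tablets = Collections.unmodifiableList(new ArrayList<>(tablets));
    }
  }

  // volatile makes a newly published snapshot visible to all reader threads.
  private volatile Snapshot current = new Snapshot(-1, Collections.emptyList());

  /** Write path: writers serialize among themselves and swap in a whole new snapshot. */
  synchronized void update(int leaderIndex, List<String> tablets) {
    current = new Snapshot(leaderIndex, tablets);
  }

  /** Read path: one volatile read, then only immutable data is touched. */
  String leaderOrNull() {
    Snapshot snapshot = current; // read once so both fields come from the same snapshot
    if (snapshot.leaderIndex < 0 || snapshot.leaderIndex >= snapshot.tablets.size()) {
      return null;
    }
    return snapshot.tablets.get(snapshot.leaderIndex);
  }
}
```

Readers always see a consistent pair of leader index and tablet list because both come from the same snapshot object. The trade-off is an allocation per update, which is usually acceptable when reads far outnumber updates.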
What changed
Shared endpoint cooldowns
- EndpointOverloadCooldownTracker to track short-lived cooldowns for routed endpoints that return RESOURCE_EXHAUSTED.
- KeyAwareChannel level so it is shared across requests instead of recreated per call.
Routing behavior
- KeyAwareChannel to:
- GapicSpannerRpc so streaming reads can retry on RESOURCE_EXHAUSTED, which allows bypass traffic to move to another replica.
Hot-path optimization
- KeyRangeCache group state to immutable snapshots.
- synchronized group selection from the location-aware read path.
Test coverage
- RESOURCE_EXHAUSTED
- KeyAwareChannel
Why
Before this change, location-aware bypass traffic could immediately route back to the same overloaded replica after a routed RESOURCE_EXHAUSTED, especially for later requests. That made rerouting less effective under overload and added avoidable hot-path contention in the cache.
This PR makes rerouting behavior closer to the intended client-side policy: