feat(spanner): add shared endpoint cooldowns for location-aware rerouting #12832
Code Review
This pull request introduces an EndpointOverloadCooldownTracker to manage endpoint cooldowns after RESOURCE_EXHAUSTED failures, which is integrated into KeyAwareChannel to influence endpoint selection and avoid overloaded replicas. The KeyRangeCache was refactored to use immutable TabletSnapshot and GroupSnapshot for thread-safe access to tablet metadata, improving routing logic. The GapicSpannerRpc now includes RESOURCE_EXHAUSTED in its streaming read retryable codes to enable this rerouting. New test infrastructure (SharedBackendReplicaHarness) and updated tests (LocationAwareSharedBackendReplicaHarnessTest, KeyAwareChannelTest) validate this behavior, including handling of resume tokens in MockSpannerServiceImpl. A review comment points out that adding RESOURCE_EXHAUSTED to the default retryable codes for all streaming reads is a significant change that might lead to increased latency and unnecessary retries for non-routed traffic, suggesting a need to consider scoping this change or accepting its broader implications.
    ImmutableSet.<Code>builder()
        .addAll(
            options.getSpannerStubSettings().streamingReadSettings().getRetryableCodes())
        .add(Code.RESOURCE_EXHAUSTED)
        .build();
Adding RESOURCE_EXHAUSTED to the default retryable codes for all streaming reads is a significant change that affects all users of the library, not just those using location-aware routing. While this enables the rerouting logic for overloaded replicas, it also means that any RESOURCE_EXHAUSTED error (including those due to administrative quotas) will now be retried by the GAX layer. This might lead to increased latency and unnecessary retries for permanent quota issues. Consider if this can be scoped only to location-aware calls or if the implications for non-routed traffic are acceptable.
References
- Avoid introducing breaking changes to public APIs, even if they have not been part of a public release.
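The scoping option raised in the comment above could look roughly like the following. This is a minimal sketch, not code from the PR: the class `RetryCodeScopingSketch`, its nested `Code` enum (a stand-in for the real status-code type), and the `locationAwareRouting` flag are all hypothetical names introduced for illustration.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch: add RESOURCE_EXHAUSTED only when location-aware routing is on. */
final class RetryCodeScopingSketch {
  /** Stand-in for the status-code enum; not the GAX type used in the PR. */
  enum Code { UNAVAILABLE, DEADLINE_EXCEEDED, RESOURCE_EXHAUSTED }

  /**
   * Returns the retryable codes for streaming reads. The configured codes are always
   * kept; RESOURCE_EXHAUSTED is added only for location-aware (routed) traffic.
   */
  static Set<Code> streamingReadRetryableCodes(
      Set<Code> configured, boolean locationAwareRouting) {
    // noneOf + addAll also works when the configured set is empty.
    EnumSet<Code> codes = EnumSet.noneOf(Code.class);
    codes.addAll(configured);
    if (locationAwareRouting) {
      codes.add(Code.RESOURCE_EXHAUSTED);
    }
    return codes;
  }
}
```

With this shape, non-routed traffic keeps its existing retry behavior, so a permanent quota error is not retried unless routing is enabled.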
Summary
This PR improves Java Spanner's location-aware routing behavior when a routed replica returns RESOURCE_EXHAUSTED. Instead of immediately sending retries or subsequent requests back to the same replica, the client now keeps a shared cooldown for overloaded endpoints and reroutes traffic to another eligible endpoint when possible. The PR also reduces hot-path contention in the location-aware range cache by removing per-group synchronization from the read path.
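The shared-cooldown idea described above can be sketched as follows. This is an illustrative approximation under stated assumptions, not the PR's EndpointOverloadCooldownTracker: the class name `CooldownTrackerSketch`, its methods, the string endpoint keys, the fixed cooldown length, and the injected millisecond clock are all hypothetical.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.LongSupplier;

/** Hypothetical sketch of a cooldown tracker shared by all requests on a channel. */
final class CooldownTrackerSketch {
  // Endpoint -> time (ms) at which its cooldown ends.
  private final ConcurrentMap<String, Long> cooldownUntilMillis = new ConcurrentHashMap<>();
  private final long cooldownMillis;
  private final LongSupplier clockMillis; // injected so tests can control time

  CooldownTrackerSketch(long cooldownMillis, LongSupplier clockMillis) {
    this.cooldownMillis = cooldownMillis;
    this.clockMillis = clockMillis;
  }

  /** Records an overload signal (for example RESOURCE_EXHAUSTED) for an endpoint. */
  void recordOverload(String endpoint) {
    long deadline = clockMillis.getAsLong() + cooldownMillis;
    // If concurrent failures race, keep the later deadline.
    cooldownUntilMillis.merge(endpoint, deadline, Long::max);
  }

  /** Returns true while the endpoint's cooldown has not yet expired. */
  boolean isCoolingDown(String endpoint) {
    Long deadline = cooldownUntilMillis.get(endpoint);
    if (deadline == null) {
      return false;
    }
    if (clockMillis.getAsLong() < deadline) {
      return true;
    }
    // Expire lazily; the two-argument remove avoids dropping a newer deadline.
    cooldownUntilMillis.remove(endpoint, deadline);
    return false;
  }

  /** Picks the first candidate that is not cooling down, or empty if all are. */
  Optional<String> pickEndpoint(List<String> candidates) {
    for (String candidate : candidates) {
      if (!isCoolingDown(candidate)) {
        return Optional.of(candidate);
      }
    }
    return Optional.empty();
  }
}
```

Because one tracker instance is shared, a failure seen by one request steers later requests away from the same endpoint until the cooldown ends. When every candidate is cooling down the sketch returns empty; what the real client does in that case is not covered by the description above.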
What changed
Shared endpoint cooldowns
- … EndpointOverloadCooldownTracker to track short-lived cooldowns for routed endpoints that return RESOURCE_EXHAUSTED.
- … KeyAwareChannel level so it is shared across requests instead of recreated per call.
Routing behavior
- … KeyAwareChannel to: …
- … GapicSpannerRpc so streaming reads can retry on RESOURCE_EXHAUSTED, which allows bypass traffic to move to another replica.
Hot-path optimization
- … KeyRangeCache group state to immutable snapshots.
- … synchronized group selection from the location-aware read path.
Test coverage
- … RESOURCE_EXHAUSTED …
- … KeyAwareChannel …
Why
Before this change, location-aware bypass traffic could immediately route back to the same overloaded replica after a routed RESOURCE_EXHAUSTED, especially for later requests. That made rerouting less effective under overload and added avoidable hot-path contention in the cache.
This PR makes rerouting behavior closer to the intended client-side policy: