-
What version of gRPC are you using?
google.golang.org/grpc v1.75.1
-
What version of Go are you using (go version)?
go version go1.24.7 linux/amd64
-
What operating system (Linux, Windows, …) and version?
Linux (Kubernetes containerized environment, Alpine/Debian base images)
-
What did you do?
I'm implementing Blue-Green deployments using gRPC xDS over ADS (Aggregated Discovery Service) with weighted clusters. When the RDS route configuration contains a weight=0 cluster for complete traffic isolation (100:0 or 0:100 splits), new gRPC client pods fail to start.
Configuration:
// RDS RouteConfiguration sent by control plane
weighted_clusters {
  clusters {
    name: "peakbench-dev/peakbench-blue"
    weight { value: 100 }
  }
  clusters {
    name: "peakbench-dev/peakbench-green"
    weight { value: 0 } // Weight=0 for complete isolation
  }
}
Client behavior:
- Client receives RDS with both clusters (blue: 100, green: 0)
- Client parses RDS and extracts cluster names for CDS subscription
- Client requests CDS resources
-
What did you expect to see?
According to the Envoy API documentation for WeightedCluster.ClusterWeight (https://javadoc.io/static/io.envoyproxy.controlplane/api/0.1.24/io/envoyproxy/envoy/api/v2/route/WeightedCluster.ClusterWeight.html), the cluster weight is "An integer between 0 and total_weight", so weight=0 is explicitly allowed.
Expected behavior:
- Client should include both clusters in route.WeightedClusters map (including weight=0)
- Client should request CDS for: ["peakbench-dev/peakbench-blue", "peakbench-dev/peakbench-green"]
- ADS consistency validation should pass
- weighted_target load balancer should handle weight=0 correctly (0% traffic to that cluster)
- Client connection should succeed with READY state
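To illustrate the "0% traffic" expectation above, here is a minimal sketch of cumulative-weight selection. It is not gRPC's weighted_target implementation; the `target` type and `pick` function are hypothetical names introduced only for this example. It shows that a weight=0 entry can stay in the target list without ever being selected.

```go
package main

import "fmt"

// target is a hypothetical stand-in for one weighted child (not a gRPC type).
type target struct {
	Name   string
	Weight uint32
}

// pick maps n (reduced modulo the total weight) onto a target by walking
// the cumulative weights. It returns "" when the total weight is zero.
func pick(targets []target, n uint64) string {
	var total uint64
	for _, t := range targets {
		total += uint64(t.Weight)
	}
	if total == 0 {
		return ""
	}
	n %= total
	for _, t := range targets {
		w := uint64(t.Weight)
		if n < w {
			return t.Name
		}
		n -= w
	}
	return ""
}

func main() {
	targets := []target{
		{Name: "peakbench-dev/peakbench-green", Weight: 0},
		{Name: "peakbench-dev/peakbench-blue", Weight: 100},
	}
	counts := map[string]int{}
	for i := uint64(0); i < 1000; i++ {
		counts[pick(targets, i)]++
	}
	// The weight=0 target is present in the list but never picked.
	fmt.Println("blue:", counts["peakbench-dev/peakbench-blue"])
	fmt.Println("green:", counts["peakbench-dev/peakbench-green"])
}
```

With this kind of selection, keeping a zero-weight entry costs nothing on the pick path, which is the behavior the expectation above relies on.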
-
What did you see instead?
Actual behavior:
From client debug logs:
I0129 01:06:50 [OnStreamResponse] RouteConfiguration:
clusters:{name:"peakbench-dev/peakbench-green" weight:{}} ← weight=0
clusters:{name:"peakbench-dev/peakbench-blue" weight:{value:100}}
I0129 01:06:50 [grpc-debug] CDS Request: resourceNames=[peakbench-dev/peakbench-blue] ← Only blue!
W0129 01:06:50 [go-control-plane] ADS mode: not responding to request
type.googleapis.com/envoy.config.cluster.v3.Cluster[peakbench-dev/peakbench-blue]:
"peakbench-dev/peakbench-green" not listed
Result: Client connection stuck in TRANSIENT_FAILURE state indefinitely.
-
Root cause analysis:
In xds/internal/xdsclient/xdsresource/unmarshal_rds.go lines 321-323:
for _, c := range wcs.Clusters {
	w := c.GetWeight().GetValue()
	if w == 0 {
		continue // ← PROBLEM: Skips weight=0 clusters
	}
	totalWeight += uint64(w)
	// ...
	route.WeightedClusters[c.GetName()] = wc // Never added for weight=0
}
Because weight=0 clusters are skipped:
- They're not added to route.WeightedClusters map
- Client doesn't know to request them from CDS
- go-control-plane's ADS superset validation fails (cluster in snapshot but not in client request)
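The chain of effects above can be reproduced with a small self-contained sketch. It uses plain Go types rather than the real grpc-go or go-control-plane APIs; `clusterWeight`, `cdsNames`, and `missingFrom` are hypothetical names, and the superset check is only an approximation of the behavior seen in the control-plane log.

```go
package main

import (
	"fmt"
	"sort"
)

// clusterWeight is a hypothetical stand-in for one weighted_clusters entry.
type clusterWeight struct {
	Name   string
	Weight uint32
}

// cdsNames returns the sorted cluster names a client would subscribe to.
// With skipZero set, it mimics the weight=0 skip described above.
func cdsNames(clusters []clusterWeight, skipZero bool) []string {
	names := make([]string, 0, len(clusters))
	for _, c := range clusters {
		if skipZero && c.Weight == 0 {
			continue
		}
		names = append(names, c.Name)
	}
	sort.Strings(names)
	return names
}

// missingFrom lists the names in snapshot that the request does not cover.
// A non-empty result models the "not listed" condition in the log.
func missingFrom(snapshot, request []string) []string {
	seen := make(map[string]bool, len(request))
	for _, n := range request {
		seen[n] = true
	}
	var missing []string
	for _, n := range snapshot {
		if !seen[n] {
			missing = append(missing, n)
		}
	}
	return missing
}

func main() {
	rds := []clusterWeight{
		{Name: "peakbench-dev/peakbench-blue", Weight: 100},
		{Name: "peakbench-dev/peakbench-green", Weight: 0},
	}
	// Assumption: the control-plane snapshot holds both clusters.
	snapshot := cdsNames(rds, false)

	fmt.Println("skip weight=0, missing:", missingFrom(snapshot, cdsNames(rds, true)))
	fmt.Println("keep weight=0, missing:", missingFrom(snapshot, cdsNames(rds, false)))
}
```

In the skip case the request omits the green cluster, so a control plane that requires the request to cover its snapshot would withhold the response; in the keep case nothing is missing.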
Impact:
-
❌ Cannot cold-start gRPC clients with 100:0 or 0:100 traffic configurations
-
✅ Running clients can transition TO/FROM 100:0 smoothly (because they already have the clusters cached)
-
✅ HTTP/Envoy clients work fine at 100:0 (Envoy correctly handles weight=0)
-
Why this is critical:
Without this fix, we face a dilemma between cold start and smooth rollback:
Choice A: Server-side skip weight=0 in RDS (single-cluster approach)
100:0 → RDS: [blue:100] (only blue cluster)
0:100 → RDS: [green:100] (only green cluster)
- ✅ Cold start works (no ADS validation error)
- ❌ Transitions cause data plane disruption:
  - When rolling back from 0:100 → 100:0, RDS changes from [green] to [blue]
  - weighted_target balancer removes green sub-balancer and adds blue sub-balancer
  - Causes connection draining and reconnection = data plane impact during emergency rollback
  - Fast rollback becomes impossible, which defeats the purpose of Blue-Green deployment
Choice B: Server-side include weight=0 in RDS (dual-cluster approach)
100:0 → RDS: [blue:100, green:0] (both clusters)
0:100 → RDS: [blue:0, green:100] (both clusters)
- ✅ Transitions are smooth (weighted_target keeps all sub-balancers, only weights change)
- ✅ Fast rollback works - critical for production incidents
- ❌ Cold start fails (gRPC client skips weight=0 in unmarshal, ADS validation fails)
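The difference between the two choices comes down to whether the set of child clusters changes during a rollback. The sketch below compares the cluster sets of two consecutive route configurations; `diffChildren` is a hypothetical helper written for this illustration and does not correspond to any grpc-go function.

```go
package main

import (
	"fmt"
	"sort"
)

// diffChildren compares the cluster names of two weight maps and returns
// the names that would have to be added and removed, both sorted.
func diffChildren(oldCfg, newCfg map[string]uint32) (added, removed []string) {
	for name := range newCfg {
		if _, ok := oldCfg[name]; !ok {
			added = append(added, name)
		}
	}
	for name := range oldCfg {
		if _, ok := newCfg[name]; !ok {
			removed = append(removed, name)
		}
	}
	sort.Strings(added)
	sort.Strings(removed)
	return added, removed
}

func main() {
	// Choice A: rollback 0:100 → 100:0 with single-cluster RDS.
	addedA, removedA := diffChildren(
		map[string]uint32{"green": 100},
		map[string]uint32{"blue": 100},
	)
	fmt.Println("Choice A added:", addedA, "removed:", removedA)

	// Choice B: same rollback with both clusters always present.
	addedB, removedB := diffChildren(
		map[string]uint32{"blue": 0, "green": 100},
		map[string]uint32{"blue": 100, "green": 0},
	)
	fmt.Println("Choice B added:", addedB, "removed:", removedB)
}
```

Under Choice A one child is added and one removed, which matches the sub-balancer churn described above. Under Choice B the child set is unchanged and only the weights differ.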
Why we need both to work:
-
Smooth rollback (dual-cluster) is critical for production reliability
-
Cold start (weight=0 support) is required for pod restarts, scaling, and new deployments
-
The bug forces us to choose between operational safety and deployment capability
-
Workaround applied:
We patched the vendored gRPC code to remove the if w == 0 { continue } check:
// PATCH: Include weight=0 clusters for ADS consistency
for _, c := range wcs.Clusters {
	w := c.GetWeight().GetValue()
	// Don't skip weight=0 - it's valid per xDS spec
	totalWeight += uint64(w)
	// ...
	route.WeightedClusters[c.GetName()] = wc // Added even for weight=0
}
// PATCH: Validate cluster count instead of totalWeight
if len(route.WeightedClusters) == 0 {
	return nil, nil, fmt.Errorf("route %+v, action %+v, has no cluster in WeightedCluster action", r, a)
}
After the patch:
- ✅ Client requests CDS for both clusters (including weight=0)
- ✅ ADS validation passes
- ✅ weighted_target correctly routes 0% traffic to weight=0 cluster
- ✅ Cold start succeeds at 100:0 and 0:100
- ✅ Fast rollback works without data plane impact