xDS: RDS unmarshal skips weight=0 clusters causing ADS validation failures

- What version of gRPC are you using?
**google.golang.org/grpc v1.75.1**

- What version of Go are you using (go version)?
**go version go1.24.7 linux/amd64**

- What operating system (Linux, Windows, …) and version?
Linux (Kubernetes containerized environment, Alpine/Debian base images)

- What did you do?
I'm implementing Blue-Green deployments using gRPC xDS over ADS (Aggregated Discovery Service) with weighted clusters. When the control plane sends an RDS response that contains a weight=0 cluster for complete traffic isolation (100:0 or 0:100 splits), new gRPC client pods fail to start.
  Configuration (RDS RouteConfiguration sent by the control plane):
```
  weighted_clusters {
    clusters {
      name: "peakbench-dev/peakbench-blue"
      weight { value: 100 }
    }
    clusters {
      name: "peakbench-dev/peakbench-green"
      weight { value: 0 }  # weight=0 for complete isolation
    }
  }
```
  Client behavior:
  1. Client receives RDS with both clusters (blue: 100, green: 0)
  2. Client parses RDS and extracts cluster names for CDS subscription
  3. Client requests CDS resources

- What did you expect to see?

  According to the Envoy API documentation for WeightedCluster.ClusterWeight (https://javadoc.io/static/io.envoyproxy.controlplane/api/0.1.24/io/envoyproxy/envoy/api/v2/route/WeightedCluster.ClusterWeight.html), the cluster weight is "An integer between 0 and total_weight", so weight=0 is explicitly allowed.

  Expected behavior:
  1. Client should include both clusters in route.WeightedClusters map (including weight=0)
  2. Client should request CDS for: ["peakbench-dev/peakbench-blue", "peakbench-dev/peakbench-green"]
  3. ADS consistency validation should pass
  4. weighted_target load balancer should handle weight=0 correctly (0% traffic to that cluster)
  5. Client connection should succeed with READY state

- What did you see instead?

  Actual behavior:

  From client debug logs:
  I0129 01:06:50 [OnStreamResponse] RouteConfiguration:
    clusters:{name:"peakbench-dev/peakbench-green" weight:{}}           ← weight=0
    clusters:{name:"peakbench-dev/peakbench-blue" weight:{value:100}}

  I0129 01:06:50 [grpc-debug] CDS Request: resourceNames=[peakbench-dev/peakbench-blue]  ← Only blue!

  W0129 01:06:50 [go-control-plane] ADS mode: not responding to request
    type.googleapis.com/envoy.config.cluster.v3.Cluster[peakbench-dev/peakbench-blue]:
    "peakbench-dev/peakbench-green" not listed

  Result: Client connection stuck in TRANSIENT_FAILURE state indefinitely.
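
  The "not listed" warning above is consistent with a superset check on the control plane side: in ADS mode the server only responds when every resource in its snapshot is named in the request. The following is a minimal sketch of that kind of check, not go-control-plane's actual code; `checkSuperset` and its slice-based signature are hypothetical names chosen for illustration.

```go
package main

import "fmt"

// checkSuperset is a simplified, hypothetical stand-in for an ADS consistency
// check: every resource in the snapshot must be named in the client request.
func checkSuperset(requested, snapshot []string) error {
	names := make(map[string]bool, len(requested))
	for _, n := range requested {
		names[n] = true
	}
	for _, res := range snapshot {
		if !names[res] {
			return fmt.Errorf("%q not listed", res)
		}
	}
	return nil
}

func main() {
	snapshot := []string{
		"peakbench-dev/peakbench-blue",
		"peakbench-dev/peakbench-green",
	}

	// Request as observed in the client logs: only the non-zero-weight cluster.
	fmt.Println(checkSuperset([]string{"peakbench-dev/peakbench-blue"}, snapshot))

	// Request as expected: both clusters.
	fmt.Println(checkSuperset(snapshot, snapshot))
}
```

  Under this model the first call returns an error naming the green cluster and the second returns nil, which matches the difference between the observed and the expected CDS request.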

- Root cause analysis:

  In xds/internal/xdsclient/xdsresource/unmarshal_rds.go lines 321-323:
```
  for _, c := range wcs.Clusters {
      w := c.GetWeight().GetValue()
      if w == 0 {
          continue  // ← PROBLEM: Skips weight=0 clusters
      }
      totalWeight += uint64(w)
      // ...
      route.WeightedClusters[c.GetName()] = wc  // Never added for weight=0
  }
```
  Because weight=0 clusters are skipped:
  1. They're not added to route.WeightedClusters map
  2. Client doesn't know to request them from CDS
  3. go-control-plane's ADS superset validation fails (cluster in snapshot but not in client request)
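
  The effect of the skip can be reproduced without gRPC. The sketch below is self-contained and uses a hypothetical `clusterWeight` struct and `cdsNames` helper rather than the real xdsresource types; `skipZero=true` models the current unmarshal behaviour and `skipZero=false` models the behaviour we expect.

```go
package main

import "fmt"

// clusterWeight mirrors only the two ClusterWeight fields that matter here.
// It is a hypothetical stand-in, not the generated proto type.
type clusterWeight struct {
	Name   string
	Weight uint32
}

// cdsNames returns the cluster names a client would end up requesting via CDS.
// With skipZero set, weight=0 clusters are dropped, as in the loop quoted above.
func cdsNames(clusters []clusterWeight, skipZero bool) []string {
	var names []string
	for _, c := range clusters {
		if skipZero && c.Weight == 0 {
			continue
		}
		names = append(names, c.Name)
	}
	return names
}

func main() {
	rds := []clusterWeight{
		{Name: "peakbench-dev/peakbench-green", Weight: 0},
		{Name: "peakbench-dev/peakbench-blue", Weight: 100},
	}
	fmt.Println("current: ", cdsNames(rds, true))
	fmt.Println("expected:", cdsNames(rds, false))
}
```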

  Impact:
  - ❌ Cannot cold-start gRPC clients with 100:0 or 0:100 traffic configurations
  - ✅ Running clients can transition TO/FROM 100:0 smoothly (because they already have the clusters cached)
  - ✅ HTTP/Envoy clients work fine at 100:0 (Envoy correctly handles weight=0)

- Impact and why this is critical:

  Without this fix, we face a dilemma between cold start and smooth rollback:

  Choice A: The control plane omits weight=0 clusters from RDS (single-cluster approach)
  100:0 → RDS: [blue:100]         (only blue cluster)
  0:100 → RDS: [green:100]        (only green cluster)
  - ✅ Cold start works (no ADS validation error)
  - ❌ Transitions cause data plane disruption:
    - When rolling back from 0:100 → 100:0, RDS changes from [green] to [blue]
    - weighted_target balancer removes green sub-balancer and adds blue sub-balancer
    - This causes connection draining and reconnection, i.e. data plane impact during an emergency rollback
    - Fast rollback becomes impossible, which defeats the purpose of Blue-Green deployment

  Choice B: The control plane includes weight=0 clusters in RDS (dual-cluster approach)
  100:0 → RDS: [blue:100, green:0]   (both clusters)
  0:100 → RDS: [blue:0, green:100]   (both clusters)
  - ✅ Transitions are smooth (weighted_target keeps all sub-balancers, only weights change)
  - ✅ Fast rollback works - critical for production incidents
  - ❌ Cold start fails (gRPC client skips weight=0 in unmarshal, ADS validation fails)
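
  The difference between the two choices during a rollback can be sketched as a diff over the set of child names in the weighted config. This is only a model of the behaviour described above, not the weighted_target implementation; `childDiff` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
)

// childDiff reports which children would have to be added and removed when a
// weighted config changes from old to next. Weight-only changes produce no diff.
func childDiff(old, next map[string]uint32) (added, removed []string) {
	for name := range next {
		if _, ok := old[name]; !ok {
			added = append(added, name)
		}
	}
	for name := range old {
		if _, ok := next[name]; !ok {
			removed = append(removed, name)
		}
	}
	sort.Strings(added)
	sort.Strings(removed)
	return added, removed
}

func main() {
	// Choice A: rollback 0:100 -> 100:0 with single-cluster RDS.
	a, r := childDiff(
		map[string]uint32{"green": 100},
		map[string]uint32{"blue": 100},
	)
	fmt.Println("Choice A: added", a, "removed", r)

	// Choice B: the same rollback with dual-cluster RDS.
	a, r = childDiff(
		map[string]uint32{"blue": 0, "green": 100},
		map[string]uint32{"blue": 100, "green": 0},
	)
	fmt.Println("Choice B: added", a, "removed", r)
}
```

  In this model Choice A replaces the whole child set on rollback, while Choice B leaves the set unchanged and only updates weights.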

**Why we need both to work:**
  - Smooth rollback (dual-cluster) is critical for production reliability
  - Cold start (weight=0 support) is required for pod restarts, scaling, and new deployments
  - The bug forces us to choose between operational safety and deployment capability

- Workaround applied:

  We patched the vendored code to remove the `if w == 0 { continue }` check:
```
  // PATCH: Include weight=0 clusters for ADS consistency
  for _, c := range wcs.Clusters {
      w := c.GetWeight().GetValue()
      // Don't skip weight=0 - it's valid per xDS spec
      totalWeight += uint64(w)
      // ...
      route.WeightedClusters[c.GetName()] = wc  // Added even for weight=0
  }

  // PATCH: Validate cluster count instead of totalWeight
  if len(route.WeightedClusters) == 0 {
      return nil, nil, fmt.Errorf("route %+v, action %+v, has no cluster in WeightedCluster action", r, a)
  }
```
  After the patch:
  - ✅ Client requests CDS for both clusters (including weight=0)
  - ✅ ADS validation passes
  - ✅ weighted_target correctly routes 0% traffic to weight=0 cluster
  - ✅ Cold start succeeds at 100:0 and 0:100
  - ✅ Fast rollback works without data plane impact
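
  To illustrate why keeping a weight=0 entry is harmless for traffic distribution, here is a minimal sketch of a cumulative-weight pick. It is not gRPC's picker; `target` and `pick` are hypothetical names, and the only point is that an entry with weight 0 never widens the cumulative range and so is never selected.

```go
package main

import "fmt"

// target is a hypothetical name/weight pair used only for this illustration.
type target struct {
	Name   string
	Weight uint32
}

// pick maps n onto a target using cumulative weights.
// It returns "" when the total weight is zero.
func pick(targets []target, n uint64) string {
	var total uint64
	for _, t := range targets {
		total += uint64(t.Weight)
	}
	if total == 0 {
		return ""
	}
	n %= total
	var cum uint64
	for _, t := range targets {
		cum += uint64(t.Weight)
		if n < cum {
			return t.Name
		}
	}
	return ""
}

func main() {
	targets := []target{
		{Name: "peakbench-dev/peakbench-green", Weight: 0},
		{Name: "peakbench-dev/peakbench-blue", Weight: 100},
	}
	counts := map[string]int{}
	for n := uint64(0); n < 100; n++ {
		counts[pick(targets, n)]++
	}
	fmt.Println(counts)
}
```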


xDS: RDS unmarshal skips weight=0 clusters causing ADS validation failures #8865

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

xDS: RDS unmarshal skips weight=0 clusters causing ADS validation failures #8865

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions