xDS: RDS unmarshal skips weight=0 clusters causing ADS validation failures #8865

@flyingyang

  • What version of gRPC are you using?
    google.golang.org/grpc v1.75.1

  • What version of Go are you using (go version)?
    go version go1.24.7 linux/amd64

  • What operating system (Linux, Windows, …) and version?
    Linux (Kubernetes containerized environment, Alpine/Debian base images)

  • What did you do?
    I'm implementing Blue-Green deployments using gRPC xDS over ADS (Aggregated Discovery Service) with weighted clusters. When the RDS response contains a weight=0 cluster for complete traffic isolation (a 100:0 or 0:100 split), new gRPC client pods fail to start.

  Configuration:
  // RDS RouteConfiguration sent by control plane
  weighted_clusters {
    clusters {
      name: "peakbench-dev/peakbench-blue"
      weight { value: 100 }
    }
    clusters {
      name: "peakbench-dev/peakbench-green"
      weight { value: 0 }  // Weight=0 for complete isolation
    }
  }

Client behavior:

  1. Client receives RDS with both clusters (blue: 100, green: 0)
  2. Client parses RDS and extracts cluster names for CDS subscription
  3. Client requests CDS resources
  • What did you expect to see?

    According to the Envoy API documentation (https://javadoc.io/static/io.envoyproxy.controlplane/api/0.1.24/io/envoyproxy/envoy/api/v2/route/WeightedCluster.ClusterWeight.html), the cluster weight is "An integer between 0 and total_weight", so weight=0 is explicitly allowed.

    Expected behavior:

    1. Client should include both clusters in route.WeightedClusters map (including weight=0)
    2. Client should request CDS for: ["peakbench-dev/peakbench-blue", "peakbench-dev/peakbench-green"]
    3. ADS consistency validation should pass
    4. weighted_target load balancer should handle weight=0 correctly (0% traffic to that cluster)
    5. Client connection should succeed with READY state
  • What did you see instead?

    Actual behavior:

    From client debug logs:
    I0129 01:06:50 [OnStreamResponse] RouteConfiguration:
    clusters:{name:"peakbench-dev/peakbench-green" weight:{}} ← weight=0
    clusters:{name:"peakbench-dev/peakbench-blue" weight:{value:100}}

    I0129 01:06:50 [grpc-debug] CDS Request: resourceNames=[peakbench-dev/peakbench-blue] ← Only blue!

    W0129 01:06:50 [go-control-plane] ADS mode: not responding to request
    type.googleapis.com/envoy.config.cluster.v3.Cluster[peakbench-dev/peakbench-blue]:
    "peakbench-dev/peakbench-green" not listed

    Result: Client connection stuck in TRANSIENT_FAILURE state indefinitely.
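
The "not listed" warning above can be read as a superset check: in ADS mode the control plane answers only when the request names every cluster in the snapshot. Below is a minimal sketch of that idea, assuming a hypothetical helper named checkSuperset; it is not go-control-plane's actual code, only an illustration of why a request for blue alone is left unanswered.

```go
package main

import (
	"fmt"
	"sort"
)

// checkSuperset is a hypothetical stand-in for an ADS consistency check.
// It returns an error naming the first snapshot resource (in sorted order)
// that the client did not list in its request.
func checkSuperset(requested, snapshot []string) error {
	listed := make(map[string]bool, len(requested))
	for _, name := range requested {
		listed[name] = true
	}
	// Sort a copy so the reported name is deterministic.
	sorted := append([]string(nil), snapshot...)
	sort.Strings(sorted)
	for _, name := range sorted {
		if !listed[name] {
			return fmt.Errorf("%q not listed", name)
		}
	}
	return nil
}
```

With the request from the log (blue only) and a snapshot holding both clusters, checkSuperset reports green as not listed; with both names requested it returns nil.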

  • Root cause analysis:

    In xds/internal/xdsclient/xdsresource/unmarshal_rds.go lines 321-323:

  for _, c := range wcs.Clusters {
      w := c.GetWeight().GetValue()
      if w == 0 {
          continue  // ← PROBLEM: Skips weight=0 clusters
      }
      totalWeight += uint64(w)
      // ...
      route.WeightedClusters[c.GetName()] = wc  // Never added for weight=0
  }

Because weight=0 clusters are skipped:

  1. They're not added to route.WeightedClusters map
  2. Client doesn't know to request them from CDS
  3. go-control-plane's ADS superset validation fails (cluster in snapshot but not in client request)
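
The effect of the skip on the CDS subscription set can be sketched in isolation. The types and the cdsNames function below are hypothetical and only mirror the shape of the loop quoted above; they are not the gRPC-Go implementation.

```go
package main

// clusterWeight is a hypothetical, simplified view of one entry in
// weighted_clusters.
type clusterWeight struct {
	Name   string
	Weight uint32
}

// cdsNames returns the cluster names a client would subscribe to via CDS.
// With skipZero=true it mimics the reported behavior (weight=0 entries are
// dropped); with skipZero=false it mimics the patched behavior.
func cdsNames(clusters []clusterWeight, skipZero bool) []string {
	var names []string
	for _, c := range clusters {
		if skipZero && c.Weight == 0 {
			continue // the zero-weight cluster never reaches the CDS request
		}
		names = append(names, c.Name)
	}
	return names
}
```

For the 100:0 configuration from this report, skipZero=true yields only the blue cluster, while skipZero=false yields both blue and green.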

Impact:

  • ❌ Cannot cold-start gRPC clients with 100:0 or 0:100 traffic configurations

  • ✅ Running clients can transition TO/FROM 100:0 smoothly (because they already have the clusters cached)

  • ✅ HTTP/Envoy clients work fine at 100:0 (Envoy correctly handles weight=0)

  • Impact and why this is critical:

    Without this fix, we face a dilemma between cold start and smooth rollback:

    Choice A: Server-side skip weight=0 in RDS (single-cluster approach)
    100:0 → RDS: [blue:100] (only blue cluster)
    0:100 → RDS: [green:100] (only green cluster)

    • ✅ Cold start works (no ADS validation error)
    • ❌ Transitions cause data plane disruption:
      • When rolling back from 0:100 → 100:0, RDS changes from [green] to [blue]
      • The weighted_target balancer removes the green sub-balancer and adds the blue sub-balancer
      • This causes connection draining and reconnection, which means data plane impact during an emergency rollback
      • Fast rollback becomes impossible, which defeats the purpose of Blue-Green deployment

    Choice B: Server-side include weight=0 in RDS (dual-cluster approach)
    100:0 → RDS: [blue:100, green:0] (both clusters)
    0:100 → RDS: [blue:0, green:100] (both clusters)

    • ✅ Transitions are smooth (weighted_target keeps all sub-balancers, only weights change)
    • ✅ Fast rollback works - critical for production incidents
    • ❌ Cold start fails (gRPC client skips weight=0 in unmarshal, ADS validation fails)
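
The difference between the two choices comes down to whether the set of child names changes during a transition. A rough sketch, assuming a hypothetical childDiff helper that compares two name-to-weight configs (this is an illustration, not weighted_target's real update logic):

```go
package main

import "sort"

// childDiff reports which child names are added or removed when moving from
// oldCfg to newCfg (cluster name -> weight). Hypothetical helper: a child
// that appears in both configs is assumed to be kept, with only its weight
// updated.
func childDiff(oldCfg, newCfg map[string]uint32) (added, removed []string) {
	for name := range newCfg {
		if _, ok := oldCfg[name]; !ok {
			added = append(added, name)
		}
	}
	for name := range oldCfg {
		if _, ok := newCfg[name]; !ok {
			removed = append(removed, name)
		}
	}
	sort.Strings(added)
	sort.Strings(removed)
	return added, removed
}
```

Under Choice A, rolling back from {green: 100} to {blue: 100} adds blue and removes green. Under Choice B, moving from {blue: 0, green: 100} to {blue: 100, green: 0} adds and removes nothing, so only weights change.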

Why we need both to work:

  • Smooth rollback (dual-cluster) is critical for production reliability

  • Cold start (weight=0 support) is required for pod restarts, scaling, and new deployments

  • The bug forces us to choose between operational safety and deployment capability

  • Workaround applied:

    We patched the vendored gRPC code to remove the if w == 0 { continue } check:

  // PATCH: Include weight=0 clusters for ADS consistency
  for _, c := range wcs.Clusters {
      w := c.GetWeight().GetValue()
      // Don't skip weight=0 - it's valid per xDS spec
      totalWeight += uint64(w)
      // ...
      route.WeightedClusters[c.GetName()] = wc  // Added even for weight=0
  }

  // PATCH: Validate cluster count instead of totalWeight
  if len(route.WeightedClusters) == 0 {
      return nil, nil, fmt.Errorf("route %+v, action %+v, has no cluster in WeightedCluster action", r, a)
  }

After the patch:

  • ✅ Client requests CDS for both clusters (including weight=0)
  • ✅ ADS validation passes
  • ✅ weighted_target correctly routes 0% traffic to weight=0 cluster
  • ✅ Cold start succeeds at 100:0 and 0:100
  • ✅ Fast rollback works without data plane impact
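
To illustrate why keeping a weight=0 entry does not send it traffic, here is a minimal cumulative-weight picker. The names weightedCluster, totalWeight, and pick are assumptions made for this sketch; gRPC's weighted_target uses its own weighted round-robin implementation, which this does not reproduce.

```go
package main

// weightedCluster is a hypothetical name/weight pair.
type weightedCluster struct {
	Name   string
	Weight uint32
}

// totalWeight sums all weights, including zero-weight entries.
func totalWeight(clusters []weightedCluster) uint64 {
	var total uint64
	for _, c := range clusters {
		total += uint64(c.Weight)
	}
	return total
}

// pick maps r (expected in [0, totalWeight)) to a cluster by walking the
// cumulative weights. A weight=0 cluster owns an empty interval, so no
// value of r can select it. It returns "" if r is out of range.
func pick(clusters []weightedCluster, r uint64) string {
	var acc uint64
	for _, c := range clusters {
		acc += uint64(c.Weight)
		if r < acc {
			return c.Name
		}
	}
	return ""
}
```

For [blue: 100, green: 0], every r from 0 to 99 selects blue; for [blue: 0, green: 100], every r selects green. In both cases the zero-weight cluster stays in the config without receiving picks.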
