
Conversation

@krinkinmu
Contributor

I brought this topic up a while back in the WG meeting and started a discussion in Slack about it, but we didn't really reach any conclusions yet.

Background: in an ambient multi-cluster deployment we control service discoverability by marking services as global (or not). If a service is marked as global, the way the API is documented, we expect it to be discoverable from other clusters; if it is not marked as global, the service is not discoverable from other clusters.

At the moment we only support uniform configurations, i.e., the same service is either marked as global everywhere or nowhere. It's debatable whether we want/need to support non-uniform configurations, so for now we don't commit to a decision.

Services may have waypoints configured. When a service has a waypoint, ztunnel, when deciding where to route a request, picks one of the waypoint endpoints instead of the actual service endpoints.
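For intuition, here is a minimal Rust-flavored sketch of that selection step. The types and names are illustrative only, not ztunnel's actual internals: the point is just that when a waypoint is attached, the candidate endpoints come from the waypoint rather than from the service itself.

```rust
// Illustrative sketch only; these are not ztunnel's real types.
struct Endpoint {
    cluster: String,
    address: String,
}

struct Service {
    endpoints: Vec<Endpoint>,
    // A service may have a waypoint, which is itself reached
    // through its own set of endpoints.
    waypoint: Option<Box<Service>>,
}

// With a waypoint configured, ztunnel proxies to one of the waypoint's
// endpoints and the waypoint picks the final backend; without one,
// ztunnel picks a service endpoint directly.
fn candidate_endpoints(svc: &Service) -> &[Endpoint] {
    match &svc.waypoint {
        Some(wp) => &wp.endpoints,
        None => &svc.endpoints,
    }
}
```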

The problem: waypoints can be marked global or local independently of the services they serve. For example, the same waypoint can be shared between a local and a global service and, presumably, be marked as global in that case.

When a service is local, but the waypoint that serves it is global, we may end up in the following situation:

  1. ztunnel handles an outgoing request for the local service with a global waypoint
  2. because the service has a waypoint, ztunnel needs to pick a waypoint endpoint to proxy the request to
    • normally, we'd pick a waypoint endpoint that is closest to the ztunnel node, because waypoints are configured PreferClose by default
    • in the unlikely case where we don't have any healthy local waypoint endpoint, ztunnel may route to an endpoint in a remote cluster (remember that the waypoint itself is global in this scenario)
  3. the waypoint in the remote cluster receives the request, applies L7 policies, and picks a backend in the same cluster as the waypoint:
    • in general, it's fine for the waypoint to pick a backend in its own cluster
    • in this case, however, it means that a request for a local service traveled to another cluster, which violates the service scope.

All in all, we might send a request for a local service to a remote cluster instead of failing it.

NOTE: In the case of a multi-network multi-cluster deployment, the request will fail at the E/W gateway, which is better than what we have in a single-network multi-cluster deployment, but worse than just failing it in ztunnel.

Currently ztunnel does not actually have any information about the scope of a service. Pilot naturally knows the scope, but it does not distribute it to ztunnels in any way.

Proposed solution:
I propose a WDS change so that when we push service metadata to ztunnel, we include the service scope in that metadata.
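Roughly, the service metadata ztunnel receives over WDS would carry one extra field. A hedged Rust sketch of the shape of that data; the field and enum names here are placeholders, not the actual proto or ztunnel API:

```rust
// Placeholder names; the real change would be a new field in the WDS
// Service message, mirrored into ztunnel's internal service type.
#[derive(Clone, Copy, PartialEq, Eq)]
enum ServiceScope {
    // Discoverable only within its own cluster.
    Local,
    // Discoverable from other clusters.
    Global,
}

struct ServiceMetadata {
    hostname: String,
    // New: the scope pilot already knows, now pushed to ztunnel.
    scope: ServiceScope,
}
```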

With the service scope available, when ztunnel selects a waypoint endpoint it can discard endpoints that don't match the scope of the service, thus avoiding proxying traffic to a remote cluster for a local service.

In the currently supported cases, where the service is either global everywhere or local everywhere, it's enough to report whether the service is global or local. When the service is local, we can only allow waypoint endpoints in the same cluster and should discard all others, even if that means failing the request.

When the service is global, there are no restrictions, and we can route the request to any cluster.
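Put together, the selection rule is a one-line filter. A minimal sketch, again with hypothetical types rather than ztunnel's actual code: local services keep only same-cluster waypoint endpoints, global services keep everything, and an empty result means failing the request rather than leaking across clusters.

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
enum ServiceScope {
    Local,
    Global,
}

struct Endpoint {
    cluster: String,
}

// Hypothetical filter: a Local service only admits waypoint endpoints in
// the caller's own cluster; a Global service admits any endpoint. An empty
// result should fail the request instead of falling back to a remote cluster.
fn allowed_waypoint_endpoints<'a>(
    scope: ServiceScope,
    local_cluster: &str,
    candidates: &'a [Endpoint],
) -> Vec<&'a Endpoint> {
    candidates
        .iter()
        .filter(|ep| scope == ServiceScope::Global || ep.cluster == local_cluster)
        .collect()
}
```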

NOTE: If at some point in the future we decide to support non-uniform configurations, we'd have to extend this API to somehow provide the list of allowed or prohibited clusters for services that are not configured uniformly. For now, though, we don't commit to supporting such a use case, and we haven't heard from any users who would want it.

Other considerations:

Istio also supports waypoints for workloads, even though that's not the default. Workloads don't have a scope the way services do, so currently ztunnel can pick any waypoint endpoint when targeting a workload, including a waypoint on another network, which is not correct.

To address this problem for workloads, we don't actually need to make any changes to WDS. When we target a workload instead of a service, it should be enough to limit waypoint endpoints to those on the same network.
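A sketch of that workload-side rule, with illustrative types only: since there is no scope to consult, filter waypoint endpoints by the source's network.

```rust
struct Endpoint {
    network: String,
}

// Hypothetical: when the target is a workload rather than a service, only
// waypoint endpoints on the same network as the source are acceptable.
fn same_network_endpoints<'a>(
    source_network: &str,
    candidates: &'a [Endpoint],
) -> Vec<&'a Endpoint> {
    candidates
        .iter()
        .filter(|ep| ep.network == source_network)
        .collect()
}
```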

Related to istio/istio#57710

NOTE: I also have an implementation of the ztunnel and pilot changes, but I cannot put ztunnel and pilot changes in one PR, so I opted to separate the discussion of the WDS changes from the actual ztunnel implementation. If anybody would find it helpful to see those changes, I can provide them as well.

CC a few people involved in the discussion of this issue at different times:

@keithmattix @therealmitchconnors @ilrudie @howardjohn @jaellio @Stevenjin8 @grnmeira

@krinkinmu krinkinmu requested a review from a team as a code owner January 9, 2026 17:27
@istio-policy-bot

😊 Welcome @krinkinmu! This is either your first contribution to the Istio ztunnel repo, or it's been
a while since you've been here.

You can learn more about the Istio working groups, Code of Conduct, and contribution guidelines
by referring to Contributing to Istio.

Thanks for contributing!

Courtesy of your friendly welcome wagon.

@istio-testing istio-testing added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. needs-rebase Indicates a PR needs to be rebased before being merged labels Jan 9, 2026
@istio-testing
Contributor

PR needs rebase.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
