Skip to content

feat(xds): Add filter state reference-counted resource management#12879

Open
sauravzg wants to merge 1 commit into
grpc:masterfrom
sauravzg:sauravz/filter-state-retention
Open

feat(xds): Add filter state reference-counted resource management#12879
sauravzg wants to merge 1 commit into
grpc:masterfrom
sauravzg:sauravz/filter-state-retention

Conversation

@sauravzg

Copy link
Copy Markdown
Contributor
  • Introduce SharedResourceManager to manage reference-counted shared resources across xDS filters and resolvers.
  • Update XdsNameResolver to integrate with SharedResourceManager and support filter state retention.
  • Add RefCountedRoute and RefCountedRouteInterceptor to ensure resources are reliably retained during call interception and released across all RPC lifecycle termination paths (normal completion, cancellation, and exceptions).
  • Update Filter interfaces and existing implementations (FaultFilter, GcpAuthenticationFilter) to accommodate filter state sharing.

@sauravzg sauravzg requested a review from kannanjgithub June 23, 2026 04:14
* {@link #release(Object)} API designed for xDS filter state cleanup tasks.
*/
@Internal
public final class SharedResourceManager<K, V extends SharedResourceManager.ResourceCloseable> {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, one of the things that I was in thoughts about was , if we should make this class simpler by making this entire map class synchronized (like ReferenceCountingMap in io..xds....security) instead of trying to deal with corner cases around atomic consistency which are difficult to understand and document.

Another thing worth considering is that potentially all operations on the map happen in syncContext since this map is only used when calling buildInterceptor which happens in syncContext , the release also either happens during similar control plane updates so synccontext.
It's possible for the cleanup to happen after an RPC end or close , but that's also covered since we do it inside synccontext using a RouteWrapper.
So, we may entirely not need this class to be thread safe.

So, I want opinions on where we should lean towards. IIUC, the ClusterRefState has semantics where it can be get from any thread and mutations from sync context.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving the synchronization down to the individual resource level and using a lock-free design for the hot path is superior. In a coarse-grained design where we do it at the SharedResourceManager (map) level, every single RPC across different ext-proc addresses would have to acquire a single global lock on SharedResourceManager during acquire and release.

It is not true that all operations on the map happen in syncContext. buildClientInterceptor is called when processing xDS updates on the Control plane in a syncContext. Not the SharedResourceManager.acquire and SharedResourceManager.release. acquire happen on the application's RPC thread and release on the transport thread that calls onClose when the rpc completes.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The individual resource is already synchronized to some degree, but could be potential leaks and stuff if acquire isn't synchronized(two races on map insertion), races between acquire and close , where acquire happens after closure. Solving these with atomics is very error prone.

I think all operations on the map happen in syncContext. the release happens via releasing the Route wrapper, which ensures releases happen via sync context.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The release can happen via the route wrapper on the control plane (syncContext) thread, but only if there are zero active RPCs at the moment the route is closed. If there are active RPCs, the cleanup tasks are deferred. The moment the last active RPC completes, its gRPC transport thread (onClose callback) is the thread that actually executes the route decrement and triggers the deferred cleanup tasks. Because SharedResourceManager is a global singleton across the channel, two different routes (Route A and Route B) can concurrently process RPCs that talk to the same ext_proc sidecar address. When Route A is draining and its last RPC completes, it calls release(address) on the transport thread. At the exact same time, Route B might be starting a new RPC and calling acquire(address) on an application thread.

So the need for synchronization during release is absolutely there. It is also there for acquire as I previously wrote that the acquire happens from the rpc application thread.

About simplifying the complex CAS (Compare And Swap) code, we could use synchronized for SharedResource retain and acquire. In SharedResourceManager in addition to the code being complex you are acquiring a lock on closeLock for any key whose SharedResource hasn't been created yet, so if there is a huge number of concurrent calls all for different keys for the first time there is unnecessary lock overhead. ConcurrentHashMap.compute() locks only the specific hash bucket of the key being queried. By using compute and computeIfPresent in acquire and release respectively, we can completely eliminate the global closeLock.

@sauravzg sauravzg Jun 30, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had a bit of discussion last week,

  • it should be possible for all accesses to the map to be made in sync context since map accesses only happen on control plan. There's a catch about closure which can happen on rpc end to release the route, but the route can hold a reference to sync context to ensure the closure happens in sync contxt. This would in theory eliminate the requirement of thread safety from the map itself.
  • The resource however needs to be synchronized, since it'll get accessed from data plane rpcs. We brough forward dicussion about CAS in acquire but not in release, which was there because we validated count >0 in acquire , but did not have this validation in release . Updated release to have >0 validation and changed to CAS.
  • Regardlng elimination of closeLock, this may not trivially possible without moving channel creation in the compute blocks, which may end up holding the lock longer depending on what the channel creation does.

Regardless, I believe the choice of how the map should behave would dictate a lot of implementation for the map

  • if we choose to go non thread safe : We could remove all complexities (regular map and ints)
  • if we choose to be threadsafe (or partially threadsafe like clusterRefState) : We could either choose between current implementation, using compute instead of closelocks or going full synchronized like ReferenceCountingMap.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using ConcurrentHashMap' for the SharedResourceManager and computeinacquirecauses the lock to be held for the duration of the channel creation only for that key bucket. Other key acquisitions from concurrent threads don't contend for the same lock. More than one key maps to the same bucket only in the case of hash key collisions which only happens after the number of keys have become substantially large. With the currentcloseLockapproach if there are n concurrentacquirecalls made for the same key, n resources will get created and n-1 of them will get destroyed, which is wasteful. This does not happen with theConcurrentHashMap.computeapproach foracquire`.

Fully synchronized SharedResourceManager will be bad for concurrency. It is a better and easier to understand option for SharedResource .

- Introduce `SharedResourceManager` to manage reference-counted shared
  resources across xDS filters and resolvers.
- Update `XdsNameResolver` to integrate with `SharedResourceManager` and
  support filter state retention.
- Add `RefCountedRoute` and `RefCountedRouteInterceptor` to ensure
  resources are reliably retained during call interception and released
  across all RPC lifecycle termination paths (normal completion,
  cancellation, and exceptions).
- Update `Filter` interfaces and existing implementations
  (`FaultFilter`, `GcpAuthenticationFilter`) to accommodate filter state
  sharing.
@sauravzg sauravzg force-pushed the sauravz/filter-state-retention branch from 4911d1c to 8624505 Compare June 30, 2026 11:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants