fix: race condition on get when add received after informer cache read#3237
fix: race condition on get when add received after informer cache read#3237csviri wants to merge 8 commits intooperator-framework:mainfrom
Conversation
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
|
Putting here related log messages: |
There was a problem hiding this comment.
Pull request overview
Addresses a rare race condition when reading resources from the temporary cache vs informer cache, which can lead to unexpected cache misses during reconciliation (and rare test failures).
Changes:
- Adjusts
ManagedInformerEventSource#get(ResourceID)to prefer returning the temporary-cache resource when the informer cache doesn’t contain the resource. - Adds additional debug logging around the informer’s last sync resource version and cache hit/miss state.
You can also share your feedback on Copilot code review. Take the survey.
| log.debug( | ||
| "Latest sync version: {}", | ||
| manager().lastSyncResourceVersion(resourceID.getNamespace().orElse(null))); |
| // order of getting those resource from cache matters | ||
| Optional<R> resource = temporaryResourceCache.getResourceFromCache(resourceID); | ||
| var res = cache.get(resourceID); | ||
| if (log.isDebugEnabled()) { | ||
| log.debug( | ||
| "Latest sync version: {}", | ||
| manager().lastSyncResourceVersion(resourceID.getNamespace().orElse(null))); | ||
| } |
| "Resource found in temp cache: {}, or older, in temporary cache. Found in informer cache" | ||
| + " {}, for Resource ID: {}", | ||
| resource.isPresent(), | ||
| res.isPresent(), | ||
| resourceID); |
|
@shawkins at the end I changed to use the cached resource for comparison since that allows also easier debugging |
Do you really want snapshot isolation here? Naively I would have thought that all logic should be tolerant to eventual consistencty. If you want repeatable reads that seems more like a concern that would be per reconciliation. |
That is not the goal really, it is just much easier to follow the logic this way and fixes the race condition |
There was a problem hiding this comment.
Pull request overview
Fixes an extremely rare race condition when reading a resource from ManagedInformerEventSource#get(ResourceID) by preferring the temporary cache entry when it is newer than (or missing from) the informer cache, avoiding reliance on a namespace-level “last sync” version that can change due to unrelated events.
Changes:
- Adjust cache selection logic in
get(ResourceID)to compare the temporary cache entry directly against the informer cache entry’sresourceVersion. - Update debug logging around cache hits/misses to include temp-cache presence.
Comments suppressed due to low confidence (1)
operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/event/source/informer/ManagedInformerEventSource.java:215
- The debug log message is hard to parse ("Resource found in temp cache: {}, or older, in temporary cache") and the first placeholder is a boolean, not the resource/age. Consider rewording to something unambiguous like separate flags for temp/informer presence and (if relevant) whether the temp entry was newer.
resource.isPresent(),
res.isPresent(),
resourceID);
return res;
}
You can also share your feedback on Copilot code review. Take the survey.
...io/javaoperatorsdk/operator/processing/event/source/informer/ManagedInformerEventSource.java
Outdated
Show resolved
Hide resolved
...io/javaoperatorsdk/operator/processing/event/source/informer/ManagedInformerEventSource.java
Show resolved
Hide resolved
...io/javaoperatorsdk/operator/processing/event/source/informer/ManagedInformerEventSource.java
Show resolved
Hide resolved
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
|
@shawkins feel free to come up with an alternative approach |
There will still be race conditions. The change here opens the logic up more to returning stale or ghost entries - it really just depends upon whatever lag exists between what the informer cache has seen and the events processed by the temporary resource cache. |
So the primary goal, is to fix the issue that this test fails: I don't see how ghost resources cleanup interference with this, could you please describe a scenario for race condition? @shawkins |
I'm not saying that it interfers, I mean that the existing check prevents ManagedInformerEventSource from return knowingly stale / ghost entries wrt to the current state of the cache. The change relaxes that.
May need more investigation. If the informer cache has moved ahead, and no entry exists for cachedCM, yet you expect that the api server state should have that ConfigMap, then there's a problem. |
Yes, that is true. With this change ghost resources are just an eventual consistency problem, while it fixes a correctness problem. Also note that ghost resources are quite special case, that requires a delete operation on the created resource with timing when informer down and the requires a re-list. With the fabric8 improvement event that is minimized. I would say, let's give it some time think a bit if we can find a better solution, but if not I would prefer go this way. |
I wouldn't proceed with this change - that could just be masking another problem. Do you have a full log including the client debug? If the entry is supposed to be in the informer cache, but isn't we need to know why. |
No we can turn on additional logs, but I think there is clearly this case when this can happen: On:
This scenario corresponds with what I see in the logs. Theoretically we could re-read the cache also on line 210 if the res is empty as a workaround. |
|
added alternative: #3238 |
|
Ok, so to do this consistently we need to read first from TRC, then check the last sync version, then read from the informer cache if needed. |
This section in the test fails in extremely rare occasion:
java-operator-sdk/operator-framework/src/test/java/io/javaoperatorsdk/operator/baseapi/cachingfilteringupdate/CachingFilteringUpdateReconciler.java
Lines 46 to 51 in aeffc77
The only way I was able to explain this based on the logs attached below, is that it can happen that, in the get method, here:
java-operator-sdk/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/event/source/informer/ManagedInformerEventSource.java
Lines 192 to 211 in 5daec11
That we found the resource in temp cache but not in informer cache, then we receive an event that bumps the last sync version, so the temp cache resource is not returned. I added in the PR a failsafe for this case.
It might be actually more efficient to work only with the cached resource and not the lastSyncVersion that would basically provide a "snapshot isolation".
cc @shawkins
Signed-off-by: Attila Mészáros a_meszaros@apple.com