[release/9.0-staging] Add an in-memory cache for CRLs on Linux#127625
Open
bartonjs wants to merge 2 commits intodotnet:release/9.0-stagingfrom
Open
[release/9.0-staging] Add an in-memory cache for CRLs on Linux#127625bartonjs wants to merge 2 commits intodotnet:release/9.0-stagingfrom
bartonjs wants to merge 2 commits intodotnet:release/9.0-stagingfrom
Conversation
Introduce an extra layer of caching for CRLs. * The cache has a fixed size of 30 elements. When full, it evicts the least-recently-used entry. * Using the same finalizable object sentinel approach as ArrayPool, the cache will purge entries every time the GC finalizes. * During a finalize, the current MRU node is marked as what to purge next time. * Using that node moves the purge target to the next-older entry before the node is promoted back to MRU. * On the subsequent finalize, the marked node (and everything after it) are purged. * To avoid finalizing the CRL SafeHandles, the cache does an AddReference on every item that is returned (so the caller must Release it), and it calls Dispose on anything it evicts. * GC/Finalization-triggered cooperative eviction is not performed until the probe object is promoted to the final GC generation (currently Gen2) --------- Co-authored-by: Jan Kotas <jkotas@microsoft.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Contributor
|
Tagging subscribers to this area: @bartonjs, @vcsjones, @dotnet/area-system-security |
Contributor
There was a problem hiding this comment.
Pull request overview
This backport introduces an in-memory MRU cache for OpenSSL CRL handling on Linux to reduce repeated CRL parsing/allocations and mitigate high resident memory growth from frequent revocation checks across threads.
Changes:
- Added an in-memory bounded MRU cache layer for CRLs (with GC-cooperative pruning) and integrated it into the OpenSSL CRL load/attach flow.
- Extended OpenSSL chain ETW diagnostics with new in-memory CRL cache events and clarified existing “disk cache” messages.
- Added an OuterLoop Unix test that corrupts a persisted CRL file and validates that a subsequent process rebuild recovers the disk cache.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/libraries/System.Security.Cryptography/tests/X509Certificates/X509FilesystemTests.Unix.cs | Adds a cross-process disk-cache recovery test and an ETW listener helper to discover the CRL cache filename. |
| src/libraries/System.Security.Cryptography/src/System/Security/Cryptography/X509Certificates/OpenSslX509ChainEventSource.cs | Adds ETW events for in-memory CRL cache hit/miss/expired/prune/full and updates disk-cache message text. |
| src/libraries/System.Security.Cryptography/src/System/Security/Cryptography/X509Certificates/OpenSslCrlCache.cs | Implements and wires up the in-memory MRU CRL cache, refactors disk-cache load to return cached entries, and adds GC-driven pruning. |
| @@ -174,28 +262,16 @@ private static bool AddCachedCrlCore(string crlFile, SafeX509StoreHandle store, | |||
| OpenSslX509ChainEventSource.Log.CrlCacheExpired(nextUpdate, verificationTime); | |||
Comment on lines
+322
to
+350
| // Saving the CRL to the disk is just a performance optimization for later requests to not | ||
| // need to use the network again, so failure to save shouldn't throw an exception or mark | ||
| // the chain as invalid. | ||
| try | ||
| { | ||
| string crlFile = GetCachedCrlPath(crlFileName, mkDir: true); | ||
|
|
||
| Interop.Crypto.ErrClearError(); | ||
| } | ||
| using (SafeBioHandle bio = Interop.Crypto.BioNewFile(crlFile, "wb")) | ||
| { | ||
| if (bio.IsInvalid || Interop.Crypto.PemWriteBioX509Crl(bio, crl) == 0) | ||
| { | ||
| // No bio, or write failed | ||
|
|
||
| if (OpenSslX509ChainEventSource.Log.IsEnabled()) | ||
| { | ||
| OpenSslX509ChainEventSource.Log.CrlCacheWriteFailed(crlFile); | ||
| } | ||
| } | ||
| catch (UnauthorizedAccessException) { } | ||
| catch (IOException) { } | ||
|
|
||
| if (OpenSslX509ChainEventSource.Log.IsEnabled()) | ||
| { | ||
| OpenSslX509ChainEventSource.Log.CrlCacheWriteSucceeded(); | ||
| Interop.Crypto.ErrClearError(); | ||
| } | ||
| } | ||
| } | ||
| catch (UnauthorizedAccessException) { } | ||
| catch (IOException) { } | ||
|
|
||
| if (OpenSslX509ChainEventSource.Log.IsEnabled()) | ||
| { | ||
| OpenSslX509ChainEventSource.Log.CrlCacheWriteSucceeded(); | ||
| } |
| { | ||
| RemoteCertificateValidationCallback = (sender, certificate, chain, errors) => | ||
| { | ||
| getDotNetCert = (X509Certificate2)certificate; |
Comment on lines
+774
to
+790
| if (eventData.Payload[0] is string certName && | ||
| certName == _certificateName && | ||
| eventData.Payload[1] is string cdp && | ||
| eventData.Payload[2] is string cacheName) | ||
| { | ||
| _cacheName = cacheName; | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
||
| internal async Task<string> GetCacheFileNameAsync(CancellationToken cancellationToken) | ||
| { | ||
| while (_cacheName == null) | ||
| { | ||
| await Task.Delay(100, cancellationToken).ConfigureAwait(false); | ||
| } |
Comment on lines
+743
to
+752
| ~GCWatcher() | ||
| { | ||
| GC.ReRegisterForFinalize(this); | ||
|
|
||
| if (GC.GetGeneration(this) == GC.MaxGeneration) | ||
| { | ||
| try | ||
| { | ||
| _owner.PruneForGC(); | ||
| } |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Manual backport/cherry-pick of #123562 to release/9.0-staging
Customer Impact
Customers have been reporting "memory leaks" on Linux related to CRL processing for quite a while. These "leaks" aren't actual leaks, but an interaction with how OpenSSL processes CRLs (using many small calls to malloc), and glibc memory arenas and small-allocation caching -- glibc holds onto the small allocs from free so it can hand them out again later.
Because we handle CRLs by loading them, checking them, and discarding them, a process that does a lot of revocation checks will end up checking the CRL on every thread, and thus can end up with large "reserved" memory for their process. As the size of the CRL goes up, the number of threads goes up, and memory limits come down (e.g. Kubernetes) the reserved memory becomes more of a potential problem.
This change (originally introduced for 11 preview 3) changes the CRL processing to use an in-memory bounded MRU cache with GC cooperation. So, a process that repeatedly hits the same endpoints over and over (or even multiple endpoints from the same CA+CRL) ideally only ever has to load the CRL once (unless it expires). Since it isn't freed while still in use, it doesn't contribute to small-allocation accumulation.
Regression
Testing
As with the PR into main, most of the tests are existing tests. A new test is included to show cross-process disk cache recovery.
Risk
Medium-Low. The MRU cache isn't just a drop-in layering piece, so the volume of code carries inherent risk. The risk is largely mitigated by a large amount of coverage from unit tests, and manual stress tests against the feature when it was written in main (cycling through a few hundred HTTPS hosts randomly across several threads while doing a large amount of background memory allocation/deallocation).
Users experiencing the high RES memory problem on Linux have reported that .NET 11 Preview 3 ameliorated the problem. Otherwise, no feedback has been received regarding the change (implying that it has not caused a problem for anyone).