[release/8.0-staging] Add an in-memory cache for CRLs on Linux #127626
bartonjs wants to merge 4 commits into dotnet:release/8.0-staging
Introduce an extra layer of caching for CRLs.
- The cache has a fixed size of 30 elements. When full, it evicts the least-recently-used entry.
- Using the same finalizable object sentinel approach as ArrayPool, the cache will purge entries every time the GC finalizes.
- During a finalize, the current MRU node is marked as what to purge next time.
- Using that node moves the purge target to the next-older entry before the node is promoted back to MRU.
- On the subsequent finalize, the marked node (and everything after it) is purged.
- To avoid finalizing the CRL SafeHandles, the cache does an AddReference on every item that is returned (so the caller must Release it), and it calls Dispose on anything it evicts.
- GC/Finalization-triggered cooperative eviction is not performed until the probe object is promoted to the final GC generation (currently Gen2).
---------
Co-authored-by: Jan Kotas <jkotas@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
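The eviction and purge-marker behavior described above can be sketched roughly as follows. This is an illustration only, not the code in OpenSslCrlCache.cs: the type name `MruCacheSketch`, the `PurgeOnGc` method, and the `LinkedList` plus `Dictionary` layout are assumptions made for the example, and evicted values are simply disposed rather than reference-counted.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical sketch of a bounded MRU cache with a GC-driven purge marker.
public sealed class MruCacheSketch<TKey, TValue>
    where TKey : notnull
    where TValue : class, IDisposable
{
    private readonly int _capacity;
    private readonly object _lock = new object();

    // Head of the list is the most recently used entry; the tail is the oldest.
    private readonly LinkedList<KeyValuePair<TKey, TValue>> _order = new();
    private readonly Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>> _map = new();

    // First node to drop on the next purge (null means nothing is marked).
    private LinkedListNode<KeyValuePair<TKey, TValue>>? _purgeFrom;

    public MruCacheSketch(int capacity = 30) => _capacity = capacity;

    public int Count { get { lock (_lock) { return _order.Count; } } }

    public bool TryGet(TKey key, out TValue? value)
    {
        lock (_lock)
        {
            if (!_map.TryGetValue(key, out var node))
            {
                value = null;
                return false;
            }

            // Using the marked node moves the purge target to the next-older entry.
            if (ReferenceEquals(node, _purgeFrom))
            {
                _purgeFrom = node.Next;
            }

            // Promote the node back to MRU.
            _order.Remove(node);
            _order.AddFirst(node);
            value = node.Value.Value;
            return true;
        }
    }

    public void Add(TKey key, TValue value)
    {
        lock (_lock)
        {
            if (_map.TryGetValue(key, out var existing))
            {
                Evict(existing);
            }

            _map[key] = _order.AddFirst(new KeyValuePair<TKey, TValue>(key, value));

            if (_order.Count > _capacity)
            {
                Evict(_order.Last!); // least recently used
            }
        }
    }

    // Intended to be called from a GC sentinel: drop the marked node and
    // everything older, then mark the current MRU node for the next purge.
    public void PurgeOnGc()
    {
        lock (_lock)
        {
            while (_purgeFrom != null)
            {
                Evict(_purgeFrom); // Evict advances _purgeFrom to the next-older node
            }

            _purgeFrom = _order.First;
        }
    }

    private void Evict(LinkedListNode<KeyValuePair<TKey, TValue>> node)
    {
        if (ReferenceEquals(node, _purgeFrom))
        {
            _purgeFrom = node.Next;
        }

        _map.Remove(node.Value.Key);
        _order.Remove(node);
        node.Value.Value.Dispose();
    }
}
```

With this shape, an entry that is not touched between two consecutive purges is dropped on the second one, while an entry that keeps getting used survives indefinitely (subject to the size limit).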
Tagging subscribers to this area: @bartonjs, @vcsjones, @dotnet/area-system-security
Pull request overview
Backports the OpenSSL CRL processing changes to reduce Linux “high RES memory” behavior by adding a bounded in-memory MRU cache (with GC-cooperative pruning) on top of the existing on-disk CRL cache, plus additional diagnostics and a new Unix test covering disk-cache recovery.
Changes:
- Add an in-memory MRU cache layer for CRL handles and integrate it into CRL attach/load flow (memory hit/expired/miss → disk cache → download).
- Extend `OpenSslX509ChainEventSource` with new verbose events for in-memory cache behavior and clarify existing disk-cache event messages.
- Add a Unix outerloop test that truncates a persisted CRL file and validates cross-process recovery.
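The "memory hit/expired/miss → disk cache → download" flow in the first bullet can be pictured as a tiered lookup. The sketch below is hypothetical: `CrlEntry`, `CrlLookupSketch`, and the delegate-based tiers are invented for illustration, and the assumption that every tier is checked against `NextUpdate` is a simplification rather than a description of the actual code.

```csharp
#nullable enable
using System;

// Hypothetical stand-in for a loaded CRL; only the expiry matters here.
public sealed record CrlEntry(string Name, DateTime NextUpdate);

public static class CrlLookupSketch
{
    // Each tier returns null on a miss. Tiers are tried in order:
    // in-memory cache, then disk cache, then download.
    public static (CrlEntry? Crl, string Source) Resolve(
        DateTime verificationTime,
        Func<CrlEntry?> fromMemory,
        Func<CrlEntry?> fromDisk,
        Func<CrlEntry?> download)
    {
        (Func<CrlEntry?> Load, string Source)[] tiers =
        {
            (fromMemory, "memory"),
            (fromDisk, "disk"),
            (download, "download"),
        };

        foreach (var (load, source) in tiers)
        {
            CrlEntry? crl = load();

            // An expired entry is treated like a miss, so the next tier is tried.
            if (crl is not null && crl.NextUpdate > verificationTime)
            {
                return (crl, source);
            }
        }

        return (null, "none");
    }
}
```

The point of the ordering is that the cheaper tiers shield the more expensive ones: a current in-memory entry avoids both the disk read and the network request.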
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/libraries/System.Security.Cryptography/tests/X509Certificates/X509FilesystemTests.Unix.cs | Adds an outerloop test for CRL disk-cache recovery and an EventListener helper to observe the cache filename. |
| src/libraries/System.Security.Cryptography/src/System/Security/Cryptography/X509Certificates/OpenSslX509ChainEventSource.cs | Adds new in-memory CRL cache ETW events and clarifies disk-cache messages. |
| src/libraries/System.Security.Cryptography/src/System/Security/Cryptography/X509Certificates/OpenSslCrlCache.cs | Introduces MRU in-memory cache for CRLs and refactors disk-cache/download paths to feed the cache. |
    @@ -174,28 +262,16 @@ private static bool AddCachedCrlCore(string crlFile, SafeX509StoreHandle store,
    OpenSslX509ChainEventSource.Log.CrlCacheExpired(nextUpdate, verificationTime);

    try
    {
        string crlFile = GetCachedCrlPath(crlFileName, mkDir: true);

        Interop.Crypto.ErrClearError();
    }
    using (SafeBioHandle bio = Interop.Crypto.BioNewFile(crlFile, "wb"))
    {
        if (bio.IsInvalid || Interop.Crypto.PemWriteBioX509Crl(bio, crl) == 0)
        {
            // No bio, or write failed

        if (OpenSslX509ChainEventSource.Log.IsEnabled())
        {
            OpenSslX509ChainEventSource.Log.CrlCacheWriteFailed(crlFile);
        }
    }
    catch (UnauthorizedAccessException) { }
    catch (IOException) { }

            if (OpenSslX509ChainEventSource.Log.IsEnabled())
            {
                OpenSslX509ChainEventSource.Log.CrlCacheWriteSucceeded();
                Interop.Crypto.ErrClearError();
            }
        }
    }
    catch (UnauthorizedAccessException) { }
    catch (IOException) { }

    if (OpenSslX509ChainEventSource.Log.IsEnabled())
    {
        OpenSslX509ChainEventSource.Log.CrlCacheWriteSucceeded();
    }
    [OuterLoop]
    [ConditionalFact(typeof(RemoteExecutor), nameof(RemoteExecutor.IsSupported))]
    public static async Task CrlDiskCacheRecovers()
    {
        using X509Certificate2 getDotNetCert = await GetGetDotNetCert();
        string crlFileName;

        using (CrlCacheNameFinderEventListener listener = new(getDotNetCert.Subject))
        using (CancellationTokenSource tokenSource = new CancellationTokenSource(TimeSpan.FromSeconds(10)))
        using (ChainHolder chainHolder = new ChainHolder())
        {
            Task<string> nameTask = listener.GetCacheFileNameAsync(tokenSource.Token);

            _ = chainHolder.Chain.Build(getDotNetCert);
            crlFileName = await nameTask.ConfigureAwait(false);
        }

    SslOptions =
    {
        RemoteCertificateValidationCallback = (sender, certificate, chain, errors) =>
        {
            getDotNetCert = (X509Certificate2)certificate;
            return errors == SslPolicyErrors.None;
        }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        if (eventData.EventName == "CrlIdentifiersDetermined")
        {
            if (eventData.Payload?.Count == 3)
            {
                if (eventData.Payload[0] is string certName &&
                    certName == _certificateName &&
                    eventData.Payload[1] is string cdp &&
                    eventData.Payload[2] is string cacheName)
                {
                    _cacheName = cacheName;
                }
            }
        }
    }

    internal async Task<string> GetCacheFileNameAsync(CancellationToken cancellationToken)
    {
        while (_cacheName == null)
        {
            await Task.Delay(100, cancellationToken).ConfigureAwait(false);
        }
Pull request overview
Backports the OpenSSL (Linux) CRL handling change to introduce a bounded in-memory MRU cache (with GC-cooperative pruning) layered above the existing disk cache, reducing repeated CRL loads and mitigating high “reserved” memory growth caused by OpenSSL+glibc small-allocation behavior during revocation checking.
Changes:
- Add an in-memory MRU cache for CRLs in `OpenSslCrlCache`, with GC-triggered pruning and new diagnostic events.
- Update `OpenSslX509ChainEventSource` messages/events to distinguish disk cache activity and report in-memory cache behavior.
- Add an OuterLoop cross-process test validating recovery when the on-disk CRL cache file becomes corrupted/truncated.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/libraries/System.Security.Cryptography/tests/X509Certificates/X509FilesystemTests.Unix.cs | Adds an OuterLoop test that corrupts the CRL disk cache and validates a separate process repairs it. |
| src/libraries/System.Security.Cryptography/src/System/Security/Cryptography/X509Certificates/OpenSslX509ChainEventSource.cs | Adds in-memory CRL cache ETW events and clarifies existing disk-cache event messages. |
| src/libraries/System.Security.Cryptography/src/System/Security/Cryptography/X509Certificates/OpenSslCrlCache.cs | Implements bounded in-memory MRU CRL cache, integrates it into chain-building flow, and adds GC-cooperative pruning. |
    ~GCWatcher()
    {
        GC.ReRegisterForFinalize(this);

        if (GC.GetGeneration(this) == GC.MaxGeneration)
Manual backport/cherry-pick of #123562 to release/8.0-staging
Customer Impact
Customers have been reporting "memory leaks" on Linux related to CRL processing for quite a while. These "leaks" aren't actual leaks, but an interaction with how OpenSSL processes CRLs (using many small calls to malloc), and glibc memory arenas and small-allocation caching -- glibc holds onto the small allocs from free so it can hand them out again later.
Because we handle CRLs by loading them, checking them, and discarding them, a process that does a lot of revocation checks will end up checking the CRL on every thread, and thus can end up with large "reserved" memory for the process. As the size of the CRL goes up, the number of threads goes up, and memory limits come down (e.g. Kubernetes), the reserved memory becomes more of a potential problem.
This change (originally introduced for .NET 11 Preview 3) switches CRL processing to an in-memory bounded MRU cache with GC cooperation. A process that repeatedly hits the same endpoints (or even multiple endpoints from the same CA+CRL) ideally only ever has to load the CRL once (unless it expires). Since the CRL isn't freed while still in use, it doesn't contribute to small-allocation accumulation.
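The "GC cooperation" mentioned here relies on a finalizable sentinel object that re-registers itself each time it is finalized. A minimal sketch of that pattern follows; the name `GcSentinelSketch` and the callback-based `Register` method are assumptions for the example, and only the finalizer body mirrors the `GC.ReRegisterForFinalize` / `GC.MaxGeneration` check visible in the diff.

```csharp
using System;

// Hypothetical sketch of a GC sentinel: an otherwise unreachable object
// whose finalizer runs after each collection that notices it.
public sealed class GcSentinelSketch
{
    private readonly Action _onFullGc;

    private GcSentinelSketch(Action onFullGc) => _onFullGc = onFullGc;

    // Allocate the sentinel and drop the reference right away so that the
    // GC treats it as garbage and queues it for finalization.
    public static void Register(Action onFullGc) => _ = new GcSentinelSketch(onFullGc);

    ~GcSentinelSketch()
    {
        // Re-register so the finalizer runs again after a later collection.
        GC.ReRegisterForFinalize(this);

        // Skip the callback until the sentinel has been promoted to the
        // oldest generation, so cheap young-generation GCs do not purge.
        if (GC.GetGeneration(this) == GC.MaxGeneration)
        {
            try
            {
                _onFullGc();
            }
            catch
            {
                // Never let an exception escape a finalizer.
            }
        }
    }
}
```

A cache would pass its purge routine as the callback, for example `GcSentinelSketch.Register(cache.PurgeOnGc)` for some hypothetical `cache` object. Because the callback runs on the finalizer thread, whatever it calls needs to be thread-safe and short.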
Regression
Testing
As with the PR into main, most of the tests are existing tests. A new test is included to show cross-process disk cache recovery.
Risk
Medium-Low. The MRU cache isn't just a drop-in layering piece, so the volume of code carries inherent risk. The risk is largely mitigated by a large amount of coverage from unit tests, and manual stress tests against the feature when it was written in main (cycling through a few hundred HTTPS hosts randomly across several threads while doing a large amount of background memory allocation/deallocation).
Users experiencing the high RES memory problem on Linux have reported that .NET 11 Preview 3 ameliorated the problem. Otherwise, no feedback has been received regarding the change (implying that it has not caused a problem for anyone).