Fix LoaderHeap's free list growing more than expected#129203
Fix LoaderHeap's free list growing more than expected#129203eduardo-vp wants to merge 9 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR restructures UnlockedLoaderHeap’s free list to reduce allocation-time overhead in backout-heavy scenarios by replacing a single linear-scanned free list with size-segregated buckets for common small block sizes plus an overflow list for larger blocks.
Changes:
- Replaces
m_pFirstFreeBlockwith 32 size buckets (pointer-size increments) and a separate “large/overflow” free list. - Updates free-block insertion/allocation logic to use bucketed O(1) reuse for small sizes, and retains linear scanning only for the overflow list (including a stress-log warning on long scans).
- Adjusts debug-only free-list dumping and validation to iterate across all buckets and the overflow list.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/coreclr/utilcode/loaderheap.cpp | Implements bucket initialization, bucket-aware allocation/insertion, overflow scan warning, and updates debug dump/validation to traverse buckets. |
| src/coreclr/utilcode/loaderheap_shared.h | Updates LoaderHeapFreeBlock API to no longer take an explicit head pointer (heap chooses bucket internally). |
| src/coreclr/inc/loaderheap.h | Adds bucket/overflow free list fields and related constants to UnlockedLoaderHeap. |
Where is this race condition exactly? Can we add a lock there instead? The freelist in LoaderHeap is meant to be only used in error conditions to backout types that failed to load, or to deal with rare race condition. If you see the freelist growing this much, it means that the loader heap is not used correctly. We should fix that instead. |
|
I'll take a look, this the part where many threads lose runtime/src/coreclr/vm/genmeth.cpp Lines 493 to 542 in 24547a7 |
This reverts commit 96f0039.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
|
Could you please run a perf test that tries to create many different method instantiations on multiple threads, and see whether taking a lock around MethodDesc creation makes it measurably slower? If we find that it is getting measurably slower, we may want to do something else about it - e.g. move more of the work outside the lock. |
|
I used a script to create a Gen.cs file and test the creation of 150k different method instantiations. public static class W
{
[System.Runtime.CompilerServices.MethodImpl(System.Runtime.CompilerServices.MethodImplOptions.NoInlining)]
public static long M1<T>() => System.Runtime.CompilerServices.Unsafe.SizeOf<T>();
[System.Runtime.CompilerServices.MethodImpl(System.Runtime.CompilerServices.MethodImplOptions.NoInlining)]
public static long M2<T>() => System.Runtime.CompilerServices.Unsafe.SizeOf<T>();
[System.Runtime.CompilerServices.MethodImpl(System.Runtime.CompilerServices.MethodImplOptions.NoInlining)]
public static long M3<T>() => System.Runtime.CompilerServices.Unsafe.SizeOf<T>();
[System.Runtime.CompilerServices.MethodImpl(System.Runtime.CompilerServices.MethodImplOptions.NoInlining)]
public static long M4<T>() => System.Runtime.CompilerServices.Unsafe.SizeOf<T>();
[System.Runtime.CompilerServices.MethodImpl(System.Runtime.CompilerServices.MethodImplOptions.NoInlining)]
public static long M5<T>() => System.Runtime.CompilerServices.Unsafe.SizeOf<T>();
}
public struct S0 { }
public struct S1 { }
public struct S2 { }
public struct S3 { }
public struct S4 { }
// ...
public struct S29998 { }
public struct S29999 { }
public static class Gen
{
public static readonly System.Action[] Actions = new System.Action[]
{
static () => { W.M1<S0>(); W.M2<S0>(); W.M3<S0>(); W.M4<S0>(); W.M5<S0>(); },
static () => { W.M1<S1>(); W.M2<S1>(); W.M3<S1>(); W.M4<S1>(); W.M5<S1>(); },
static () => { W.M1<S2>(); W.M2<S2>(); W.M3<S2>(); W.M4<S2>(); W.M5<S2>(); },
static () => { W.M1<S3>(); W.M2<S3>(); W.M3<S3>(); W.M4<S3>(); W.M5<S3>(); },
static () => { W.M1<S4>(); W.M2<S4>(); W.M3<S4>(); W.M4<S4>(); W.M5<S4>(); },
// ...
static () => { W.M1<S29998>(); W.M2<S29998>(); W.M3<S29998>(); W.M4<S29998>(); W.M5<S29998>(); },
static () => { W.M1<S29999>(); W.M2<S29999>(); W.M3<S29999>(); W.M4<S29999>(); W.M5<S29999>(); },
}
}The main programs just runs a loop Parallel.For(0, N, new ParallelOptions { MaxDegreeOfParallelism = Workers }, i => actions[i]());net10 takes ~690 ms while .net10 + this change takes ~705 ms (around 2% slower in this extreme case). I think this should be fine. |
I was working with the test in https://github.com/korchak-aleksandr/net10-regression-repro and found out that the free list in UnlockedLoaderHeap grows to thousands of elements, which makes allocations very slow since we do a linear scan of this free list for each one of them.
In these scenarios multiple threads might need the same generic instantiation simultaneously and they all race to create/publish it. Multiple threads can lose the race and quickly add blocks to the free list since they don't need that memory. Subsequent calls that need generic instantiations do a linear scan of the free list to find a memory block to reuse. This ends up taking a lot of time due to its size (can be up to ten of thousands).
This PR stops making thread race to create/publish such that we don't insert several blocks in the free list.