Fix NativeAOT GC roots after universal transition #127640
MichalStrehovsky merged 4 commits into dotnet:main
When a GC stack walk starts from a hijacked universal-transition frame, the iterator unwinds through the thunk and then yields the managed caller at the post-call IP. That caller is not actually the active frame yet, so reporting scratch registers from its post-call GC info can expose stale thunk state.
In the failing System.Linq.Tests NativeAOT case, the precise GC root came from REGDISPLAY.pRax while CoffNativeCodeManager::EnumGcRefs was called with isActiveStackFrame=true. RAX contained the resolved interface dispatch target, System.Linq.Enumerable.Iterator<int>.System.Collections.IEnumerator.get_Current, so object validation treated a code pointer as a GC object and fail-fast asserted.
Clear ActiveStackFrame after unwinding the non-EH universal-transition thunk sequence so the yielded managed caller still reports its non-scratch roots and the conservative thunk range, but does not report scratch registers until the thunk has completed.
Validation: before the fix, the parallel System.Linq.Tests NativeAOT stress loop completed 69 runs with 63 successes and 6 fail-fast crashes; sampled dumps all showed the same pRax code-pointer root. After rebuilding with this fix, the same loop ran for 612.3 seconds at parallelism 4 and completed 132 runs with 132 successes, 0 crashes, and 0 test failures.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
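As a rough sanity check on the validation numbers, the sketch below estimates how likely 132 clean runs would be if the pre-fix crash rate (6 of 69) still applied. It assumes every run is independent and has the same crash probability, which a parallel stress loop may not satisfy; `ProbabilityAllRunsPass` is a hypothetical helper and not part of the PR.

```cpp
#include <cmath>

// Hypothetical helper: probability that `newRuns` consecutive runs all pass,
// assuming each run fails independently at the rate observed before the fix.
double ProbabilityAllRunsPass(int priorFailures, int priorRuns, int newRuns) {
    const double perRunFailure =
        static_cast<double>(priorFailures) / static_cast<double>(priorRuns);
    return std::pow(1.0 - perRunFailure, newRuns);
}
```

Under those assumptions, `ProbabilityAllRunsPass(6, 69, 132)` comes out on the order of 10^-6, so the clean post-fix loop is very unlikely to be luck alone.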
Tagging subscribers to this area: @agocke, @dotnet/ilc-contrib
If anyone wants to follow along at home: the fix is from GPT-5.5, the analysis is from Claude 4.7. I liked Claude's analysis better, but then it went into the weeds trying to come up with a fix. GPT-5.5 came up with the same root cause and had a good-looking fix.
Root cause
A NativeAOT GC return-address hijack on
What the dump shows
The chain of events
Pull request overview
This PR fixes a NativeAOT GC stack-walk correctness issue when the walk is initiated from a hijacked universal transition thunk frame. After unwinding through the thunk to the next managed caller frame, the iterator could previously still mark that managed frame as the active frame, causing scratch registers to be reported using the caller’s post-call GC state. In the reported failure mode, that exposed stale thunk register contents (e.g., a code pointer in RAX) as a “precise GC root”, leading to fail-fast object validation.
Changes:
- Clear ActiveStackFrame after unwinding a non-EH universal-transition thunk sequence, so the yielded managed caller frame is not treated as the active frame for scratch-register reporting.
- Preserve the existing behavior of publishing the conservative stack range computed while unwinding the thunk sequence.
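The effect of the first change can be sketched with a toy model. This is not the runtime's code: `ToyIterator`, `ToyRegs`, `EnumRoots`, and `kActiveStackFrame` are hypothetical names that only mirror the terms used above, and the conservative thunk range is left out.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical flag bit named after the PR text, not the runtime's definition.
constexpr uint32_t kActiveStackFrame = 1u << 0;

// Toy register display: rax stands in for a scratch register,
// rbx for a callee-saved (non-scratch) register.
struct ToyRegs {
    uintptr_t rax;
    uintptr_t rbx;
};

struct ToyIterator {
    // A walk starting at a hijacked frame begins with the active-frame bit set.
    uint32_t flags = kActiveStackFrame;

    // Models unwinding through a non-EH universal-transition thunk sequence.
    // With applyFix, the yielded managed caller is no longer marked active.
    void UnwindUniversalTransitionThunk(bool applyFix) {
        if (applyFix) {
            flags &= ~kActiveStackFrame;
        }
    }

    bool IsActiveStackFrame() const { return (flags & kActiveStackFrame) != 0; }
};

// Toy stand-in for root enumeration: callee-saved registers are always
// reported; scratch registers are reported only for the active frame.
std::vector<uintptr_t> EnumRoots(const ToyRegs& regs, bool isActiveStackFrame) {
    std::vector<uintptr_t> roots;
    roots.push_back(regs.rbx);
    if (isActiveStackFrame) {
        roots.push_back(regs.rax);  // may be stale thunk state, e.g. a code pointer
    }
    return roots;
}
```

In this model, unwinding with `applyFix == false` leaves the bit set, so `EnumRoots` reports the stale `rax` value alongside `rbx`; with `applyFix == true` only `rbx` is reported.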
/azp run runtime-nativeaot-outerloop
Azure Pipelines successfully started running 1 pipeline(s).
/azp run runtime-nativeaot-outerloop
Azure Pipelines successfully started running 1 pipeline(s).