fix!(core): Make HubSwitchGuard !Send to prevent thread corruption #957

szokeasaurusrex merged 10 commits into master
Conversation
This current implementation is still flawed; if the same span is re-entered on the same thread (possible in async contexts), the second entry overwrites the original HubSwitchGuard, corrupting the behavior. I am working on a fix.

@szokeasaurusrex it overwrites the original guard with the same one, I think, so what's the issue there?

I've tested some scenarios manually and didn't find any issues. I assume you're also testing manually.

@lcian The Codex AI agent identified the potential issue here while I was working on #946. I have a test locally which reproduces the behavior; I will commit it with a more detailed explanation of the problem tomorrow 👍 It is possible that Codex is wrong and the local test is doing something that is never supposed to happen (the scenario is admittedly a bit contrived), but in any case, I think it's worth investigating properly. So, that is what I'm doing now. I will let you know if I need any assistance.
szokeasaurusrex left a comment:
I have made some pretty substantial changes here. Would appreciate a re-review.
Please also see the updated description 🙏
```rust
    {
        let _enter_b = span_b.enter();
    }
}
```
@lcian This reproduces the bug I was talking about. With the original implementation, `span_a` and `span_b` ended up as two separate transactions, because dropping `_reenter_a` restored the original hub. There were two underlying issues causing this behavior; I fixed both of them here.
The first fix was changing `SPAN_GUARDS` so that, rather than holding a single guard per span, it holds a stack of guards that are pushed on each entry and popped on each exit. This way, we only restore the original hub on the final exit from the span.
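The stack-per-span bookkeeping can be sketched with plain std types. This is a minimal sketch only: `SpanId`, `on_enter`, `on_exit`, and `depth` are illustrative names and stand-ins, not the actual sentry-rust internals.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

// Illustrative stand-ins for the real types.
type SpanId = u64;
struct HubSwitchGuard; // the real guard restores the previous hub on drop

thread_local! {
    // One stack of guards per span: pushed on enter, popped on exit.
    static SPAN_GUARDS: RefCell<HashMap<SpanId, Vec<HubSwitchGuard>>> =
        RefCell::new(HashMap::new());
}

fn on_enter(span: SpanId) {
    SPAN_GUARDS.with(|g| {
        g.borrow_mut().entry(span).or_default().push(HubSwitchGuard);
    });
}

fn on_exit(span: SpanId) {
    SPAN_GUARDS.with(|g| {
        let mut map = g.borrow_mut();
        if let Some(stack) = map.get_mut(&span) {
            // LIFO: dropping the most recent guard restores the previous
            // hub; the *original* hub is only restored once the stack is
            // empty, i.e. on the final exit from the span.
            stack.pop();
            if stack.is_empty() {
                map.remove(&span);
            }
        }
    });
}

fn depth(span: SpanId) -> usize {
    SPAN_GUARDS.with(|g| g.borrow().get(&span).map_or(0, |s| s.len()))
}

fn main() {
    on_enter(1);
    on_enter(1); // re-entry pushes a second guard instead of overwriting
    on_exit(1);
    assert_eq!(depth(1), 1); // one guard left: original hub not yet restored
    on_exit(1);
    assert_eq!(depth(1), 0); // final exit: original hub restored
}
```

With a single guard per span, the re-entry would overwrite the first guard and the first exit would already restore the original hub, which is the premature restore described above.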
That first fix was only part of the solution, though. To get the span structure correct here, I also needed to fix #946. For that, we needed to fork the hub, as you suggested last week. Otherwise, if we just use the hub on the span directly, we don't actually get proper scope isolation.
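The isolation that forking buys can be illustrated with a toy hub model. Everything below is a simplified stand-in for illustration: the real `Hub` in sentry-core carries a client and a scope stack, and these field and method names are assumptions, not the crate's API.

```rust
use std::sync::Arc;

// Toy model: a hub owns scope state, here just a parent span ID.
#[derive(Clone, Debug, PartialEq)]
struct Scope {
    parent_span: Option<u64>,
}

#[derive(Debug)]
struct Hub {
    scope: Scope,
}

impl Hub {
    // Fork: copy the parent's current scope into a new hub, so
    // mutations on the fork never leak back into the parent.
    fn fork(parent: &Arc<Hub>) -> Arc<Hub> {
        Arc::new(Hub {
            scope: parent.scope.clone(),
        })
    }
}

fn main() {
    let parent = Arc::new(Hub {
        scope: Scope { parent_span: None },
    });

    // Each span (re-)entry works on its own fork.
    let mut forked = Hub::fork(&parent);
    Arc::get_mut(&mut forked).unwrap().scope.parent_span = Some(42);

    // The parent hub's scope is untouched: no premature
    // "set the parent span back" on the shared hub.
    assert_eq!(parent.scope.parent_span, None);
}
```

Without the fork, every span entry mutates the one shared hub, which is how the span exit ended up setting the parent span back prematurely.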
The PR has changed substantially and should be re-reviewed.
Sounds good @szokeasaurusrex, thank you. Yep, that seems like an acceptable overhead.
## Description

This PR does three important things, which are all interdependent, so I think it makes sense to do them all in a single PR.

### (1) Making `HubSwitchGuard` `!Send`

This PR makes `HubSwitchGuard` `!Send` by adding `PhantomData<MutexGuard<'static, ()>>` while keeping it `Sync`. The type system now prevents the guard from being moved across threads at compile time.

This change is important because `HubSwitchGuard` manages thread-local hub state, but was previously `Send`, allowing it to be moved to another thread. When dropped on the wrong thread, it could corrupt that thread's hub state instead of restoring the original thread's. This change resolves #943.

**Tests related to this change:** `!Send` is enforced by the compiler, so there are no additional tests here.

### (2) A new stack for spans to manage `HubSwitchGuard`s

As `HubSwitchGuard` is now `!Send`, we needed a new way to manage the `HubSwitchGuard` associated with a given span in the `tracing` integration. Previously, the guards lived directly on the span via the `SentrySpanData` type, but this no longer works, since spans need to be `Send`. So, we now declare a thread-local mapping from span IDs to `HubSwitchGuard`s. The guards are stored in a stack, as each span has one guard per span entry (spans can be entered multiple times, even on the same thread, and without the stack, we would restore the original hub too early). We drop the guards on span exit in LIFO order.

**Tests related to this change:** [`span_reentrancy.rs`](https://github.com/getsentry/sentry-rust/pull/957/changes#diff-203b34b6e9f8f4bad9c8c43e05c781e119eea7b27648f257bfe7b35e912ba2b1) contains tests that validate the correct span tree structure when re-entering the same span multiple times. A previous iteration of this PR only allowed a single guard per span; the test failed against that implementation because it produced an incorrect span structure (two transactions instead of one). The test only passes with the guard stack.

### (3) Forking the `Hub` on each span (re-)entry

Change (2) is insufficient to make the [`span_reentrancy.rs`](https://github.com/getsentry/sentry-rust/pull/957/changes#diff-203b34b6e9f8f4bad9c8c43e05c781e119eea7b27648f257bfe7b35e912ba2b1) test pass, because with only that change there remains the fundamental problem that each span does not get its own `Hub` to manage state. Thus, in that test, I believe we were only ever using one hub, because we never forked the hub anywhere. So, on span exit, we prematurely [set the parent span on the hub](https://github.com/getsentry/sentry-rust/pull/957/changes#diff-5acb70e20dc764b608e1acf81b57fea59308624b7c2bc87906b310ff8b1f0eb2L372). Forking the hub ensures proper isolation, so the parent span gets set back at the right time (I also [suspect](https://github.com/getsentry/sentry-rust/pull/957/changes#r2773227449) we don't need to set it back manually, but I am unsure). This change resolves #946.

**Tests related to this change:** [`span_cross_thread.rs`](https://github.com/getsentry/sentry-rust/pull/957/changes#diff-aa629e96442a0995ed2fd39dd68a18dd4be293732d2435b9aa5e03c848e12c38) ensures that entering the same span in multiple threads produces the correct span tree; basically, it is a reproduction of #946. [`span_reentrancy.rs`](https://github.com/getsentry/sentry-rust/pull/957/changes#diff-203b34b6e9f8f4bad9c8c43e05c781e119eea7b27648f257bfe7b35e912ba2b1) also only passes with this change.

## Issues

- Fixes #943
- Fixes [RUST-130](https://linear.app/getsentry/issue/RUST-130/hubswitchguard-should-not-be-send)
- Fixes #946
- Fixes [RUST-132](https://linear.app/getsentry/issue/RUST-132/entering-the-same-span-several-times-causes-esoteric-behavior)
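As a rough illustration of the `PhantomData<MutexGuard<'static, ()>>` trick from change (1): `MutexGuard` is `!Send` but `Sync`, so embedding it in `PhantomData` strips `Send` from the containing type at zero runtime cost. This is a sketch only; the guard's real fields and constructor are elided and the names here are illustrative.

```rust
use std::marker::PhantomData;
use std::sync::MutexGuard;

// Sketch: MutexGuard<'_, ()> is !Send but Sync, and PhantomData
// inherits those auto-trait properties without storing anything.
pub struct HubSwitchGuard {
    // ... real fields elided ...
    _not_send: PhantomData<MutexGuard<'static, ()>>,
}

impl HubSwitchGuard {
    pub fn new() -> Self {
        HubSwitchGuard {
            _not_send: PhantomData,
        }
    }
}

fn assert_sync<T: Sync>() {}

fn main() {
    // Compiles: the guard is still Sync.
    assert_sync::<HubSwitchGuard>();

    // The following would fail to compile, demonstrating !Send:
    // fn assert_send<T: Send>() {}
    // assert_send::<HubSwitchGuard>();

    let _guard = HubSwitchGuard::new();
}
```

Because the check happens at the type level, moving the guard into `std::thread::spawn` or a `Send` future becomes a compile error rather than runtime hub corruption.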
@lcian I would appreciate it if you could check these commits; they contain the fixes for the span enter/exit mismatch detection and warning logging.
Co-authored-by: Lorenzo Cian <17258265+lcian@users.noreply.github.com>
lcian left a comment:
LGTM, might be worth updating the changelog to briefly call out the new hub-forking behavior