Fix late selective_channel retry after EndRPC#3359
Conversation
c2c7d03 to
e27a7e9
Compare
|
@chenBright hello, please help review |
| _main_cntl->SetFailed(ECANCELED, | ||
| "SelectiveChannel balancer is unavailable"); | ||
| return -1; | ||
| } |
There was a problem hiding this comment.
The problem seems unresolved. The process still crashes even if the balancer is deleted by main_cntl after the if statement.
There was a problem hiding this comment.
@chenBright thanks for review, The previous null check was not enough because it did not keep the balancer alive after _lb.get().
I updated the patch so Sender holds its own intrusive_ptr<SharedLoadBalancer> captured from SelectiveChannel at creation time, and IssueRPC() uses that retained reference. This prevents main_cntl->_lb.reset() in EndRPC() from freeing the balancer while IssueRPC() is still using it.
The FLAGS_ENDING_RPC guard is kept to prevent late SubDone from re-entering retry / backup after the main RPC starts ending.
e27a7e9 to
531f2d7
Compare
There was a problem hiding this comment.
Pull request overview
This PR addresses a SelectiveChannel race where late SubDone::Run() callbacks could re-enter retry/backup logic after the main RPC has entered Controller::EndRPC(), potentially operating on partially torn-down controller state (including a previously observed null balancer path).
Changes:
- Introduces an “ending RPC” controller flag set at the beginning of
Controller::EndRPC()and ignores lateOnVersionedRPCReturned()callbacks once teardown has begun. - Makes
schan::Senderhold its own reference to the load balancer and adds a defensive null check before selecting a sub-channel. - Adds a regression test that forces the main RPC to time out while delayed sub-calls finish later, validating late-callback behavior under retry/backup configuration.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
test/brpc_channel_unittest.cpp |
Adds a regression test exercising late sub-call completion after main timeout with retry + backup enabled. |
src/brpc/selective_channel.cpp |
Passes the balancer into schan::Sender and adds a defensive null check before using it. |
src/brpc/controller.h |
Adds FLAGS_ENDING_RPC and a helper accessor to expose “ending RPC” state. |
src/brpc/controller.cpp |
Sets the “ending RPC” flag early in EndRPC() and short-circuits late callbacks in OnVersionedRPCReturned(). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (!initialized()) { | ||
| cntl->SetFailed(EINVAL, "SelectiveChannel=%p is not initialized yet", | ||
| this); | ||
| } | ||
| schan::Sender* sndr = new schan::Sender(cntl, request, response, user_done); | ||
| schan::Sender* sndr = |
There was a problem hiding this comment.
Addressed. SelectiveChannel::CallMethod() now returns immediately when the channel is not initialized, and runs done in-place if provided, matching the behavior of PartitionChannel.
This avoids allocating Sender or dispatching into the inner channel with an unavailable balancer. I also added a regression test covering both sync and async uninitialized calls.
There was a problem hiding this comment.
Since lb already belongs to a Sender, then in Sender::IssueRPC, _main_cntl->_lb.get() should be changed to _lb.get(), otherwise this part looks a bit confusing.
There was a problem hiding this comment.
@yanglimingcn This has already been addressed in the latest patch. Sender::IssueRPC() now uses the Sender-owned _lb.get() and calls SelectChannel() through the local balancer pointer, instead of reading _main_cntl->_lb.get() directly.
4217a8a to
ee4d47a
Compare
|
@chenBright @yanglimingcn comments updated,thanks for another review |
Ignore late selective_channel SubDone callbacks once the main RPC enters EndRPC, and let Sender hold the selective balancer while issuing sub-RPCs. Return early when SelectiveChannel is called before Init, preserving the intended EINVAL result. Add regression tests covering timeout plus delayed sub-call completion and uninitialized SelectiveChannel calls.
ee4d47a to
2587a7f
Compare
What problem does this PR solve?
Issue Number: #3358
SelectiveChannelhas a use-after-teardown race: a lateSubDone::Run()canre-enter the retry/backup path after the main RPC has already run
EndRPC().EndRPC()resets the controller's load balancer (_lb), but a sub-call thatfinishes slightly later still flows through:
so it retries on partially torn-down state — the null-balancer crash in #3358
(
SharedLoadBalancer::SelectServer(this=0x0)). This is a distinct race from #3294.Trigger:
backup_request_ms > 0,max_retry > 0, and several sub-RPCs endingalmost simultaneously (timeout / connection failure) so a sub-call's callback
lands just after the main RPC's
EndRPC().What is changed and the side effects?
Changed:
EndRPC()sets a newFLAGS_ENDING_RPCflag at its very start.OnVersionedRPCReturned()checksis_ending_rpc()and ignores lateSubDonecallbacks instead of re-entering retry/backup.
schan::Sender::IssueRPC()keeps a defensive null check on the balancer as alast-resort guard at the
selective_channelboundary.Test:
Added a regression test using
SelectiveChannelwith backup request + retry thatlets the main RPC time out first and delayed sub-calls finish afterward —
reproducing the late-callback window and asserting it no longer re-enters
retry/backup. Re-ran the related selective / backup-request tests.
Side effects:
teardown path; the hot path is unchanged.
Cluster validation
Also validated on a 3-node, production-like cluster issuing periodic
multi-address meta RPCs over
SelectiveChannel(backup_request_msset,max_retry=3, short timeout). With the target on a remote node, one-way egressdelay (
tc netem delay 150ms) was injected while the per-RPC timeout stayed at50ms, so the main RPC enters
EndRPC()while backup/retry sub-RPCs are still inflight and their
SubDone::Run()lands afterward.The fix is silent by design (a late
SubDonejust hits theis_ending_rpc()guard and is dropped), so to confirm the window was actually exercised I
temporarily added a
LOG(WARNING)at the guard. Over a 4-minute injection itfired 2137 times with 0 crashes and 0 new coredumps, and the cluster
recovered automatically once the delay was cleared. Without this fix each of
those late
SubDones would have re-entered retry/backup and dereferenced thealready-reset
_lb. (The log was temporary instrumentation, not part of this PR.)Check List:
Fixes: #3358