Fix late selective_channel retry after EndRPC by hjwsm1989 · Pull Request #3359 · apache/brpc

hjwsm1989 · 2026-06-25T06:00:15Z

What problem does this PR solve?

Issue Number: #3358

SelectiveChannel has a use-after-teardown race: a late SubDone::Run() can
re-enter the retry/backup path after the main RPC has already run EndRPC().

EndRPC() resets the controller's load balancer (_lb), but a sub-call that
finishes slightly later still flows through:

SubDone::Run
  -> Controller::OnVersionedRPCReturned()
     -> Controller::IssueRPC()
        -> schan::Sender::IssueRPC()   // reads _main_cntl->_lb, now null

so it retries on partially torn-down state — the null-balancer crash in #3358
(SharedLoadBalancer::SelectServer(this=0x0)). This is a distinct race from #3294.

Trigger: backup_request_ms > 0, max_retry > 0, and several sub-RPCs ending
almost simultaneously (timeout / connection failure) so a sub-call's callback
lands just after the main RPC's EndRPC().

What is changed and the side effects?

Changed:

EndRPC() sets a new FLAGS_ENDING_RPC flag at its very start.
OnVersionedRPCReturned() checks is_ending_rpc() and ignores late SubDone
callbacks instead of re-entering retry/backup.
schan::Sender::IssueRPC() keeps a defensive null check on the balancer as a
last-resort guard at the selective_channel boundary.

Test:
Added a regression test using SelectiveChannel with backup request + retry that
lets the main RPC time out first and delayed sub-calls finish afterward —
reproducing the late-callback window and asserting it no longer re-enters
retry/backup. Re-ran the related selective / backup-request tests.

Side effects:

Performance: none. The guard is a single flag check on the already-cold
teardown path; the hot path is unchanged.
Breaking backward compatibility: none.

Cluster validation

Also validated on a 3-node, production-like cluster issuing periodic
multi-address meta RPCs over SelectiveChannel (backup_request_ms set,
max_retry=3, short timeout). With the target on a remote node, one-way egress
delay (tc netem delay 150ms) was injected while the per-RPC timeout stayed at
50ms, so the main RPC enters EndRPC() while backup/retry sub-RPCs are still in
flight and their SubDone::Run() lands afterward.

The fix is silent by design (a late SubDone just hits the is_ending_rpc()
guard and is dropped), so to confirm the window was actually exercised I
temporarily added a LOG(WARNING) at the guard. Over a 4-minute injection it
fired 2137 times with 0 crashes and 0 new coredumps, and the cluster
recovered automatically once the delay was cleared. Without this fix each of
those late SubDones would have re-entered retry/backup and dereferenced the
already-reset _lb. (The log was temporary instrumentation, not part of this PR.)

Check List:

Changes are compilable.
Added a regression test for the new behavior.
Follows the Contributor Covenant Code of Conduct.

Fixes: #3358

hjwsm1989 · 2026-06-25T06:42:41Z

@chenBright hello, please help review

chenBright · 2026-06-27T09:03:53Z

+        _main_cntl->SetFailed(ECANCELED,
+                              "SelectiveChannel balancer is unavailable");
+        return -1;
+    }


The problem seems unresolved. The process still crashes even if the balancer is deleted by main_cntl after the if statement.

@chenBright thanks for review, The previous null check was not enough because it did not keep the balancer alive after _lb.get().

I updated the patch so Sender holds its own intrusive_ptr<SharedLoadBalancer> captured from SelectiveChannel at creation time, and IssueRPC() uses that retained reference. This prevents main_cntl->_lb.reset() in EndRPC() from freeing the balancer while IssueRPC() is still using it.

The FLAGS_ENDING_RPC guard is kept to prevent late SubDone from re-entering retry / backup after the main RPC starts ending.

Copilot

Pull request overview

This PR addresses a SelectiveChannel race where late SubDone::Run() callbacks could re-enter retry/backup logic after the main RPC has entered Controller::EndRPC(), potentially operating on partially torn-down controller state (including a previously observed null balancer path).

Changes:

Introduces an “ending RPC” controller flag set at the beginning of Controller::EndRPC() and ignores late OnVersionedRPCReturned() callbacks once teardown has begun.
Makes schan::Sender hold its own reference to the load balancer and adds a defensive null check before selecting a sub-channel.
Adds a regression test that forces the main RPC to time out while delayed sub-calls finish later, validating late-callback behavior under retry/backup configuration.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File	Description
`test/brpc_channel_unittest.cpp`	Adds a regression test exercising late sub-call completion after main timeout with retry + backup enabled.
`src/brpc/selective_channel.cpp`	Passes the balancer into `schan::Sender` and adds a defensive null check before using it.
`src/brpc/controller.h`	Adds `FLAGS_ENDING_RPC` and a helper accessor to expose “ending RPC” state.
`src/brpc/controller.cpp`	Sets the “ending RPC” flag early in `EndRPC()` and short-circuits late callbacks in `OnVersionedRPCReturned()`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

hjwsm1989 · 2026-06-27T10:17:13Z

    if (!initialized()) {
        cntl->SetFailed(EINVAL, "SelectiveChannel=%p is not initialized yet",
                        this);
    }
-    schan::Sender* sndr = new schan::Sender(cntl, request, response, user_done);
+    schan::Sender* sndr =


Addressed. SelectiveChannel::CallMethod() now returns immediately when the channel is not initialized, and runs done in-place if provided, matching the behavior of PartitionChannel.

This avoids allocating Sender or dispatching into the inner channel with an unavailable balancer. I also added a regression test covering both sync and async uninitialized calls.

Since lb already belongs to a Sender, then in Sender::IssueRPC, _main_cntl->_lb.get() should be changed to _lb.get(), otherwise this part looks a bit confusing.

@yanglimingcn This has already been addressed in the latest patch. Sender::IssueRPC() now uses the Sender-owned _lb.get() and calls SelectChannel() through the local balancer pointer, instead of reading _main_cntl->_lb.get() directly.

hjwsm1989 · 2026-06-29T06:52:51Z

@chenBright @yanglimingcn comments updated，thanks for another review

Ignore late selective_channel SubDone callbacks once the main RPC enters EndRPC, and let Sender hold the selective balancer while issuing sub-RPCs. Return early when SelectiveChannel is called before Init, preserving the intended EINVAL result. Add regression tests covering timeout plus delayed sub-call completion and uninitialized SelectiveChannel calls.

chenBright

LGTM

hjwsm1989 force-pushed the codex/selective-channel-late-subdone-fix branch from c2c7d03 to e27a7e9 Compare June 25, 2026 06:01

chenBright reviewed Jun 27, 2026

View reviewed changes

hjwsm1989 force-pushed the codex/selective-channel-late-subdone-fix branch from e27a7e9 to 531f2d7 Compare June 27, 2026 09:41

chenBright requested a review from Copilot June 27, 2026 09:53

Copilot started reviewing on behalf of chenBright June 27, 2026 09:53 View session

Copilot AI reviewed Jun 27, 2026

View reviewed changes

hjwsm1989 force-pushed the codex/selective-channel-late-subdone-fix branch 2 times, most recently from 4217a8a to ee4d47a Compare June 29, 2026 05:43

chenBright requested a review from Copilot June 30, 2026 02:13

This comment was marked as low quality.

Sign in to view

chenBright reviewed Jun 30, 2026

View reviewed changes

Comment thread src/brpc/selective_channel.cpp Outdated

hjwsm1989 force-pushed the codex/selective-channel-late-subdone-fix branch from ee4d47a to 2587a7f Compare June 30, 2026 02:58

chenBright approved these changes Jun 30, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix late selective_channel retry after EndRPC#3359

Fix late selective_channel retry after EndRPC#3359
hjwsm1989 wants to merge 1 commit into
apache:masterfrom
hjwsm1989:codex/selective-channel-late-subdone-fix

hjwsm1989 commented Jun 25, 2026 •

edited

Loading

Uh oh!

hjwsm1989 commented Jun 25, 2026

Uh oh!

chenBright Jun 27, 2026

Uh oh!

hjwsm1989 Jun 27, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

hjwsm1989 Jun 27, 2026

Uh oh!

yanglimingcn Jun 29, 2026

Uh oh!

hjwsm1989 Jun 29, 2026

Uh oh!

hjwsm1989 commented Jun 29, 2026

Uh oh!

This comment was marked as low quality.

Uh oh!

Uh oh!

chenBright left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

hjwsm1989 commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

What is changed and the side effects?

Cluster validation

Check List:

Uh oh!

hjwsm1989 commented Jun 25, 2026

Uh oh!

chenBright Jun 27, 2026

Choose a reason for hiding this comment

Uh oh!

hjwsm1989 Jun 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

hjwsm1989 Jun 27, 2026

Choose a reason for hiding this comment

Uh oh!

yanglimingcn Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

hjwsm1989 Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

hjwsm1989 commented Jun 29, 2026

Uh oh!

This comment was marked as low quality.

Uh oh!

Uh oh!

chenBright left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

hjwsm1989 commented Jun 25, 2026 •

edited

Loading