fix(ruler): stop per-tenant notifier when rule manager creation fails#7597
Open
sandy2008 wants to merge 2 commits into
Open
fix(ruler): stop per-tenant notifier when rule manager creation fails#7597sandy2008 wants to merge 2 commits into
sandy2008 wants to merge 2 commits into
Conversation
newManager started the per-user notifier (and its discovery/notification goroutines) and registered it in r.notifiers before creating the rule manager. When managerFactory returned an error the user was never added to r.userManagers, so the removal loop in SyncRuleGroups never stopped the notifier, leaking it and its goroutines until the process exited. The same applied to the applyConfig error path in getOrCreateNotifier. Tear down the partially-initialized notifier and per-user metrics registry on any early return from newManager (disarmed on success), and stop the notifier directly when applyConfig fails. Adds regression tests for both leak paths. Fixes cortexproject#7595 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
747c0e4 to
612ad2f
Compare
The lint job's check-modernize step failed because the notifier-leak
regression tests used pre-modern idioms that `make modernize`
(golang.org/x/tools modernize@v0.22.0) rewrites:
- func() interface{} -> func() any (3x)
- for _, x := range bytes.Split(...) -> for x := range bytes.SplitSeq(...)
Apply exactly what the pinned modernize tool produces so
`make check-modernize` reports a clean tree. No behavior change.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
SungJin1212
reviewed
Jun 8, 2026
| @@ -374,6 +399,12 @@ func (r *DefaultMultiTenantManager) getOrCreateNotifier(userID string, userManag | |||
|
|
|||
| // This should never fail, unless there's a programming mistake. | |||
Member
There was a problem hiding this comment.
I think this comment is ambiguous.. wdyt?
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR does
Fixes a per-tenant resource leak in the ruler's
DefaultMultiTenantManager(issue #7595).newManager(pkg/ruler/manager.go) starts a per-user notifier before the rule manager exists:getOrCreateNotifiercallsn.run()— which spawns the Alertmanager service-discovery and notification goroutines — and registers the notifier inr.notifiers. It also adds a per-user metrics registry viaAddUserRegistry. Only after that does it callmanagerFactory.If
managerFactoryreturns an error,createRulesManagerlogs and returnsnil, so the user is never added tor.userManagers. The only place a notifier is stopped during normal operation isremoveNotifier, called from the user-deletion loop inSyncRuleGroups— which iteratesr.userManagers. Because the failed user isn't there, that cleanup never runs: the notifier, its two goroutines, and the per-user metrics registry leak until the process exits.This PR tears the partially-initialized state down on any early return from
newManager, and fixes the same class of leak on theapplyConfigerror path insidegetOrCreateNotifier.Why this is reachable
DefaultTenantManagerFactoryreturns real errors in normal operation — e.g.resolveFrontendClientfails, or neither an internal PromQL engine nor a query-frontend client is configured (pkg/ruler/compat.go). A misconfigured or transiently-unresolvable ruler hits this on every sync for the affected tenant; under tenant churn the leaked notifiers, goroutines and registries accumulate and grow memory unbounded.It was introduced by #6151, which changed
ManagerFactoryto return an error and madenewManagerpropagate it — without adding cleanup for the notifier it had just started. (The earlier #4718 addedremoveNotifieronly to the user-removal path.)How the fix resolves it
In
newManager, asuccessflag +deferruns the cleanup (removeNotifier(userID)+RemoveUserRegistry(userID)) on any early return and is disarmed only once a manager is successfully returned, so future early-returns can't reintroduce the leak. This mirrors the existing cleanup in theSyncRuleGroupsremoval loop. It is deadlock-free:newManagerruns underuserManagerMtxandremoveNotifiertakesnotifiersMtx; those two locks are always acquired in that order and never the reverse.In
getOrCreateNotifier, ifapplyConfigfails aftern.run(), the notifier was started but not yet inserted intor.notifiers, so it is stopped directly withn.stop()(notremoveNotifier, which would re-acquire the already-heldnotifiersMtx). There is no double-stop: a not-yet-inserted notifier is invisible to thenewManagerdefer'sremoveNotifier.Tests
Three regression tests in
pkg/ruler/manager_test.go, each verified to fail without the fix and pass with it (under-race, high-count):TestSyncRuleGroupsCleansUpNotifierOnManagerFactoryError— drives the realSyncRuleGroups→createRulesManager→newManagerpath with a failing factory; asserts the user is not managed, the notifier is removed fromr.notifiers, the per-user registry is removed, and the notifier-run goroutines return to baseline.TestGetOrCreateNotifierStopsNotifierOnApplyConfigError— forcesapplyConfigto fail (bad Alertmanager TLS CA file) and asserts the notifier and its goroutines are stopped.TestSyncRuleGroupsRecoversAfterManagerFactoryError— a user whose first creation failed is created normally once the factory later succeeds (the cleanup doesn't leave it unrecoverable).Goroutine leakage is asserted with a pprof-scoped helper that counts only
(*rulerNotifier).rungoroutines (robust against unrelated goroutines and closure inlining), using a baseline delta.Why not other approaches
defer): equivalent today, but thedeferalso covers a panic inmanagerFactoryand any future early-return added before the manager is returned.SyncRuleGroups: the failed user never entersr.userManagers, so the existing removal loop can't see it; the cleanup has to live where the notifier was created.Scope
Production change is two small edits in
pkg/ruler/manager.go; tests inpkg/ruler/manager_test.go; oneCHANGELOG.mdline. No flags, config, or behavior change beyond the added cleanup.Two related items are intentionally out of scope (noted for follow-up):
getOrCreateNotifier(which re-callsrun()) becomes unreachable for the failure scenario after this fix; its now-misleading comment is left untouched to keep this PR tight.syncRulesToManageralso allocates mapper rule files / external-label map entries / alastReloadSuccessfulseries before manager creation; those likewise persist for a perpetually-failing user. That is a distinct (bounded, non-goroutine) leak and a separate change.Which issue(s) this PR fixes
Fixes #7595
Checklist
CHANGELOG.mdupdated —[BUGFIX] Rulerentry.make docnot required).Test plan
gofmt -l/goimports -local github.com/cortexproject/cortex -l— cleango vet -tags "netgo slicelabels" ./pkg/ruler/— cleango test -tags "netgo slicelabels" -race -count=10 -run 'TestSyncRuleGroupsCleansUpNotifierOnManagerFactoryError|TestGetOrCreateNotifierStopsNotifierOnApplyConfigError|TestSyncRuleGroupsRecoversAfterManagerFactoryError' ./pkg/ruler/— PASSgo test -tags "netgo slicelabels" -count=1 ./pkg/ruler/— PASS (full package, ~35s)manager.gotomastermakes all three new tests fail.🤖 Generated with Claude Code