Skip to content

fix(ruler): stop per-tenant notifier when rule manager creation fails#7597

Open
sandy2008 wants to merge 2 commits into
cortexproject:masterfrom
sandy2008:fix/ruler-notifier-leak-on-managerfactory-error
Open

fix(ruler): stop per-tenant notifier when rule manager creation fails#7597
sandy2008 wants to merge 2 commits into
cortexproject:masterfrom
sandy2008:fix/ruler-notifier-leak-on-managerfactory-error

Conversation

@sandy2008

Copy link
Copy Markdown
Contributor

What this PR does

Fixes a per-tenant resource leak in the ruler's DefaultMultiTenantManager (issue #7595).

newManager (pkg/ruler/manager.go) starts a per-user notifier before the rule manager exists: getOrCreateNotifier calls n.run() — which spawns the Alertmanager service-discovery and notification goroutines — and registers the notifier in r.notifiers. It also adds a per-user metrics registry via AddUserRegistry. Only after that does it call managerFactory.

If managerFactory returns an error, createRulesManager logs and returns nil, so the user is never added to r.userManagers. The only place a notifier is stopped during normal operation is removeNotifier, called from the user-deletion loop in SyncRuleGroups — which iterates r.userManagers. Because the failed user isn't there, that cleanup never runs: the notifier, its two goroutines, and the per-user metrics registry leak until the process exits.

This PR tears the partially-initialized state down on any early return from newManager, and fixes the same class of leak on the applyConfig error path inside getOrCreateNotifier.

Why this is reachable

DefaultTenantManagerFactory returns real errors in normal operation — e.g. resolveFrontendClient fails, or neither an internal PromQL engine nor a query-frontend client is configured (pkg/ruler/compat.go). A misconfigured or transiently-unresolvable ruler hits this on every sync for the affected tenant; under tenant churn the leaked notifiers, goroutines and registries accumulate and grow memory unbounded.

It was introduced by #6151, which changed ManagerFactory to return an error and made newManager propagate it — without adding cleanup for the notifier it had just started. (The earlier #4718 added removeNotifier only to the user-removal path.)

How the fix resolves it

In newManager, a success flag + defer runs the cleanup (removeNotifier(userID) + RemoveUserRegistry(userID)) on any early return and is disarmed only once a manager is successfully returned, so future early-returns can't reintroduce the leak. This mirrors the existing cleanup in the SyncRuleGroups removal loop. It is deadlock-free: newManager runs under userManagerMtx and removeNotifier takes notifiersMtx; those two locks are always acquired in that order and never the reverse.

In getOrCreateNotifier, if applyConfig fails after n.run(), the notifier was started but not yet inserted into r.notifiers, so it is stopped directly with n.stop() (not removeNotifier, which would re-acquire the already-held notifiersMtx). There is no double-stop: a not-yet-inserted notifier is invisible to the newManager defer's removeNotifier.

Tests

Three regression tests in pkg/ruler/manager_test.go, each verified to fail without the fix and pass with it (under -race, high -count):

  • TestSyncRuleGroupsCleansUpNotifierOnManagerFactoryError — drives the real SyncRuleGroupscreateRulesManagernewManager path with a failing factory; asserts the user is not managed, the notifier is removed from r.notifiers, the per-user registry is removed, and the notifier-run goroutines return to baseline.
  • TestGetOrCreateNotifierStopsNotifierOnApplyConfigError — forces applyConfig to fail (bad Alertmanager TLS CA file) and asserts the notifier and its goroutines are stopped.
  • TestSyncRuleGroupsRecoversAfterManagerFactoryError — a user whose first creation failed is created normally once the factory later succeeds (the cleanup doesn't leave it unrecoverable).

Goroutine leakage is asserted with a pprof-scoped helper that counts only (*rulerNotifier).run goroutines (robust against unrelated goroutines and closure inlining), using a baseline delta.

Why not other approaches

  • Explicit cleanup on each error branch (instead of the disarm-defer): equivalent today, but the defer also covers a panic in managerFactory and any future early-return added before the manager is returned.
  • Stopping the notifier from SyncRuleGroups: the failed user never enters r.userManagers, so the existing removal loop can't see it; the cleanup has to live where the notifier was created.

Scope

Production change is two small edits in pkg/ruler/manager.go; tests in pkg/ruler/manager_test.go; one CHANGELOG.md line. No flags, config, or behavior change beyond the added cleanup.

Two related items are intentionally out of scope (noted for follow-up):

  • The pre-existing stale-notifier reuse branch in getOrCreateNotifier (which re-calls run()) becomes unreachable for the failure scenario after this fix; its now-misleading comment is left untouched to keep this PR tight.
  • syncRulesToManager also allocates mapper rule files / external-label map entries / a lastReloadSuccessful series before manager creation; those likewise persist for a perpetually-failing user. That is a distinct (bounded, non-goroutine) leak and a separate change.

Which issue(s) this PR fixes

Fixes #7595

Checklist

  • CHANGELOG.md updated — [BUGFIX] Ruler entry.
  • Documentation updated — N/A; no flags or config changed (make doc not required).
  • Tests: three regression tests added, each fails without the fix.
  • Commit signed off (DCO).

Test plan

  • gofmt -l / goimports -local github.com/cortexproject/cortex -l — clean
  • go vet -tags "netgo slicelabels" ./pkg/ruler/ — clean
  • go test -tags "netgo slicelabels" -race -count=10 -run 'TestSyncRuleGroupsCleansUpNotifierOnManagerFactoryError|TestGetOrCreateNotifierStopsNotifierOnApplyConfigError|TestSyncRuleGroupsRecoversAfterManagerFactoryError' ./pkg/ruler/ — PASS
  • go test -tags "netgo slicelabels" -count=1 ./pkg/ruler/ — PASS (full package, ~35s)
  • Reverting manager.go to master makes all three new tests fail.

🤖 Generated with Claude Code

newManager started the per-user notifier (and its discovery/notification
goroutines) and registered it in r.notifiers before creating the rule
manager. When managerFactory returned an error the user was never added to
r.userManagers, so the removal loop in SyncRuleGroups never stopped the
notifier, leaking it and its goroutines until the process exited. The same
applied to the applyConfig error path in getOrCreateNotifier.

Tear down the partially-initialized notifier and per-user metrics registry
on any early return from newManager (disarmed on success), and stop the
notifier directly when applyConfig fails. Adds regression tests for both
leak paths.

Fixes cortexproject#7595

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
@sandy2008 sandy2008 force-pushed the fix/ruler-notifier-leak-on-managerfactory-error branch from 747c0e4 to 612ad2f Compare June 7, 2026 12:15
The lint job's check-modernize step failed because the notifier-leak
regression tests used pre-modern idioms that `make modernize`
(golang.org/x/tools modernize@v0.22.0) rewrites:

- func() interface{} -> func() any (3x)
- for _, x := range bytes.Split(...) -> for x := range bytes.SplitSeq(...)

Apply exactly what the pinned modernize tool produces so
`make check-modernize` reports a clean tree. No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
@sandy2008 sandy2008 marked this pull request as ready for review June 7, 2026 13:09
@dosubot dosubot Bot added component/rules Bits & bobs todo with rules and alerts: the ruler, config service etc. type/bug labels Jun 7, 2026
@sandy2008 sandy2008 closed this Jun 7, 2026
@sandy2008 sandy2008 reopened this Jun 7, 2026
Comment thread pkg/ruler/manager.go
@@ -374,6 +399,12 @@ func (r *DefaultMultiTenantManager) getOrCreateNotifier(userID string, userManag

// This should never fail, unless there's a programming mistake.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment is ambiguous.. wdyt?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

component/rules Bits & bobs todo with rules and alerts: the ruler, config service etc. size/L type/bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ruler: per-tenant notifier and its service-discovery goroutines leak when managerFactory returns an error

2 participants