
Include cluster UID in CONTROLLER_IDENTITY to prevent cross-cluster conflicts#307

Open
carlydf wants to merge 14 commits into main from dedupe-controller-id
Conversation

@carlydf
Collaborator

@carlydf carlydf commented Apr 27, 2026

Summary

  • Go: at startup, main() reads POD_NAMESPACE, fetches the namespace object via mgr.GetAPIReader(), and appends the namespace UID to CONTROLLER_IDENTITY (format: {release}/{namespace}/{namespace-uid}). No Helm changes required.
  • Go: removed the hardcoded "temporal-worker-controller" fallback from getControllerIdentity() — the env var is now required.
  • Migration: shouldClaimManagerIdentity is extended to reclaim ownership of Worker Deployments that were managed by an older controller version, so existing deployments are not orphaned after an upgrade.
  • Tests: use a dedicated testControllerIdentity constant (not the deprecated default) so tests exercise the normal identity path rather than the migration branch.
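A minimal sketch of the identity construction described in the first bullet. Function and variable names here are illustrative, not the actual code; in the real controller the UID would come from the Namespace object fetched via mgr.GetAPIReader() at startup.

```go
package main

import "fmt"

// buildControllerIdentity appends the controller namespace's UID to the
// Helm-set "{release}/{namespace}" identity, yielding a string that is
// unique per cluster with no Helm templating changes.
func buildControllerIdentity(helmIdentity, nsUID string) string {
	return fmt.Sprintf("%s/%s", helmIdentity, nsUID)
}

func main() {
	// Hypothetical release name, namespace, and UID for illustration.
	fmt.Println(buildControllerIdentity("my-release/temporal-system", "a1b2c3d4"))
}
```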

Why

Two controllers installed in different clusters but with the same release name and namespace shared the same identity string. That allowed them to both claim ownership of the same Temporal Worker Deployment. The controller's namespace UID is stable, unique per cluster, and already available without any Helm templating.

Appending the UID in Go (rather than in the Helm chart) means helm template | kubectl apply workflows work without any override — there is no offline vs. live-cluster split.

Migration

Upgrading changes the controller's identity, so existing Worker Deployments whose ManagerIdentity was set by an older version need to be reclaimed. Without handling this, the new controller would see a non-empty ManagerIdentity it doesn't recognise as its own, skip the claim step, and be unable to manage those deployments going forward.

Two legacy identity formats are handled:

The existing ManagerIdentity values and why they appear:

  • "temporal-worker-controller": hardcoded default used before Helm set the identity
  • "release/namespace": Helm-set identity before the namespace UID was appended

Detection for the second case uses a prefix check: if the new identity is "release/namespace/{uid}" and the existing identity is "release/namespace", the new identity starts with existing + "/", so the controller knows it is looking at its own older identity and reclaims.
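The two reclaim rules plus the prefix check could look roughly like this. This is a sketch under the assumptions stated in this PR description; the function and constant names mirror the description but may not match the real code exactly.

```go
package main

import (
	"fmt"
	"strings"
)

// deprecatedDefaultControllerIdentity is the hardcoded default used
// before Helm set the identity; retained only for migration detection.
const deprecatedDefaultControllerIdentity = "temporal-worker-controller"

// shouldClaimManagerIdentity reports whether the controller should
// (re)claim a Worker Deployment, given the identity currently recorded
// on it and the controller's own (new-format) identity.
func shouldClaimManagerIdentity(existing, current string) bool {
	switch {
	case existing == "":
		// Unowned deployment: claim it.
		return true
	case existing == deprecatedDefaultControllerIdentity:
		// Pre-Helm hardcoded default: reclaim.
		return true
	case strings.HasPrefix(current, existing+"/"):
		// existing is "release/namespace" and current is
		// "release/namespace/{uid}": our own older identity, reclaim.
		return true
	default:
		// Either already our own identity or some other controller's:
		// nothing to reclaim.
		return false
	}
}

func main() {
	fmt.Println(shouldClaimManagerIdentity("release/namespace", "release/namespace/a1b2c3d4"))
}
```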

The old constant is retained in defaults as DeprecatedDefaultControllerIdentity with a comment explaining its purpose, rather than being deleted, to make the intent explicit.

Empty identity enforcement

An empty CONTROLLER_IDENTITY is dangerous in two ways:

  1. SetManagerIdentity with an empty string clears the ManagerIdentity field on the Worker Deployment, leaving it ownerless and claimable by any controller.
  2. Other write operations (SetCurrentVersion, SetRampingVersion, etc.) with an empty Identity cause the SDK to substitute its own default (typically hostname), producing junk in the Temporal audit trail.

Three layers of enforcement:

  • main() — fast-fail at process start for normal Helm deployments.
  • Reconcile() — fallback check for when the reconciler is imported as a library and main() is not in the call path (e.g. embedded controller managers, tests).
  • claimManagerIdentity() — last-resort guard specifically against the dangerous field-clear: refuses to call SetManagerIdentity if identity is empty.
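The last-resort guard in the third layer could be sketched as follows. The setManagerIdentity parameter stands in for the real Temporal API call; names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// claimManagerIdentity refuses to write an empty identity, since calling
// SetManagerIdentity("") would clear the field on the Worker Deployment
// and leave it claimable by any controller.
func claimManagerIdentity(identity string, setManagerIdentity func(string) error) error {
	if identity == "" {
		return errors.New("refusing to claim with empty CONTROLLER_IDENTITY: would clear ManagerIdentity")
	}
	return setManagerIdentity(identity)
}

func main() {
	err := claimManagerIdentity("", func(id string) error { return nil })
	fmt.Println(err != nil)
}
```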

Test plan

  • go build ./... passes
  • helm template renders without error and without any cluster lookup
  • Integration tests pass with CONTROLLER_IDENTITY set via t.Setenv
  • Existing Worker Deployments with old-format ManagerIdentity are reclaimed after upgrade

🤖 Generated with Claude Code

@carlydf carlydf requested review from a team and jlegrone as code owners April 27, 2026 20:54
carlydf and others added 10 commits April 27, 2026 13:57
Controllers that previously set ManagerIdentity as "release/namespace"
(pre-cluster-UID) or "temporal-worker-controller" (pre-Helm default)
would be permanently blocked from managing existing Worker Deployments
after upgrading, since shouldClaimManagerIdentity only triggered on
empty identity.

Now reclaims in both migration cases:
- exact match on the old hardcoded default
- new identity is a longer slash-prefixed version of the existing one

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… distinct test identity

- defaults.ControllerIdentity is now DeprecatedDefaultControllerIdentity to make
  clear it exists only for upgrade migration detection, not as a live default
- integration tests use a dedicated testControllerIdentity constant so they
  exercise the normal identity path rather than accidentally triggering the
  deprecated-identity migration branch

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SetManagerIdentity with an empty identity clears the ManagerIdentity
field on the Worker Deployment, leaving it ownerless. Add an explicit
check and return an error rather than making the call. The startup
check in main() is the primary enforcement; this is a targeted guard
for the one call site where an empty value is actively dangerous.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor

@jaypipes jaypipes left a comment


Instead of adding a new clusterID values.yaml field and using the Helm lookup template function, how about just updating the controller Go code to always append the UID of the namespace that the controller is in to the end of the {release.name}/{release.namespace} existing value of the CONTROLLER_IDENTITY env var?

That way, there'd be no need for this new only-useful-when-running-helm-template values.yaml field, and nothing about the Helm chart would need to change at all.

Comment thread cmd/main.go
setupLog.Error(nil, "CONTROLLER_IDENTITY environment variable must be set")
os.Exit(1)
}
podNamespace := os.Getenv("POD_NAMESPACE")
Collaborator Author


already needed for WRT webhook (SA checks)

Comment thread cmd/main.go
Comment on lines +132 to +135
if err := os.Setenv(controller.IdentityEnvKey, os.Getenv(controller.IdentityEnvKey)+"/"+string(ns.UID)); err != nil {
setupLog.Error(err, "unable to set CONTROLLER_IDENTITY")
os.Exit(1)
}
Member


so i have a feeling that this won't work only because we don't have the worker-controller's RBAC policies updated

i checked here and noticed that the clusterRole does not have namespaces in it

my feeling is that this call would then fail but happy to be proven wrong!

Member


note - i just found out that envTest also won't test this since it spins up a cluster with a permissive auth mode

Collaborator Author


we should just add describe namespace permission to the auth then, right?

Contributor

@jaypipes jaypipes left a comment


Thanks @carlydf! Much cleaner :)

resources:
- namespaces
verbs:
- get
Contributor


👍
