Include cluster UID in CONTROLLER_IDENTITY to prevent cross-cluster conflicts#307
Include cluster UID in CONTROLLER_IDENTITY to prevent cross-cluster conflicts#307
Conversation
…ontroller-on-controller fighting
Controllers that previously set ManagerIdentity as "release/namespace" (pre-cluster-UID) or "temporal-worker-controller" (pre-Helm default) would be permanently blocked from managing existing Worker Deployments after upgrading, since shouldClaimManagerIdentity only triggered on empty identity. Now reclaims in both migration cases: - exact match on the old hardcoded default - new identity is a longer slash-prefixed version of the existing one Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… distinct test identity - defaults.ControllerIdentity is now DeprecatedDefaultControllerIdentity to make clear it exists only for upgrade migration detection, not as a live default - integration tests use a dedicated testControllerIdentity constant so they exercise the normal identity path rather than accidentally triggering the deprecated-identity migration branch Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SetManagerIdentity with an empty identity clears the ManagerIdentity field on the Worker Deployment, leaving it ownerless. Add an explicit check and return an error rather than making the call. The startup check in main() is the primary enforcement; this is a targeted guard for the one call site where an empty value is actively dangerous. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
jaypipes
left a comment
There was a problem hiding this comment.
Instead of adding a new clusterID values.yaml field and using the Helm lookup template function, how about just updating the controller Go code to always append the UID of the namespace that the controller is in to the end of the {release.name}/{release.namespace} existing value of the CONTROLLER_IDENTITY env var?
That way, there'd be no need for this new only-useful-when-running-helm-template values.yaml field and wouldn't need to change anything about the Helm chart at all.
| setupLog.Error(nil, "CONTROLLER_IDENTITY environment variable must be set") | ||
| os.Exit(1) | ||
| } | ||
| podNamespace := os.Getenv("POD_NAMESPACE") |
There was a problem hiding this comment.
already needed for WRT webhook (SA checks)
| if err := os.Setenv(controller.IdentityEnvKey, os.Getenv(controller.IdentityEnvKey)+"/"+string(ns.UID)); err != nil { | ||
| setupLog.Error(err, "unable to set CONTROLLER_IDENTITY") | ||
| os.Exit(1) | ||
| } |
There was a problem hiding this comment.
so i have a feeling that this won't work only because we don't have the worker-controller's RBAC policies updated
i checked here and noticed that the clusterRole does not have namespaces in it
my feeling is that this call would then fail but happy to be proven wrong!
There was a problem hiding this comment.
note - i just found out that envTest also won't test this since it's spins up a cluster with a permissive auth mode
There was a problem hiding this comment.
we should just add describe namespace permission to the auth then, right?
| resources: | ||
| - namespaces | ||
| verbs: | ||
| - get |
Summary
main()readsPOD_NAMESPACE, fetches the namespace object viamgr.GetAPIReader(), and appends the namespace UID toCONTROLLER_IDENTITY(format:{release}/{namespace}/{namespace-uid}). No Helm changes required."temporal-worker-controller"fallback fromgetControllerIdentity()— the env var is now required.shouldClaimManagerIdentityis extended to reclaim ownership of Worker Deployments that were managed by an older controller version, so existing deployments are not orphaned after an upgrade.testControllerIdentityconstant (not the deprecated default) so tests exercise the normal identity path rather than the migration branch.Why
Two controllers installed in different clusters but with the same release name and namespace shared the same identity string. That allowed them to both claim ownership of the same Temporal Worker Deployment. The controller's namespace UID is stable, unique per cluster, and already available without any Helm templating.
Appending the UID in Go (rather than in the Helm chart) means
helm template | kubectl applyworkflows work without any override — there is no offline vs. live-cluster split.Migration
Upgrading changes the controller's identity, so existing Worker Deployments whose
ManagerIdentitywas set by an older version need to be reclaimed. Without handling this, the new controller would see a non-emptyManagerIdentityit doesn't recognise as its own, skip the claim step, and be unable to manage those deployments going forward.Two legacy identity formats are handled:
ManagerIdentity"temporal-worker-controller""release/namespace"Detection for the second case uses a prefix check: if the new identity is
"release/namespace/{uid}"and the existing identity is"release/namespace", the new identity starts withexisting + "/", so the controller knows it is looking at its own older identity and reclaims.The old constant is retained in
defaultsasDeprecatedDefaultControllerIdentitywith a comment explaining its purpose, rather than being deleted, to make the intent explicit.Empty identity enforcement
An empty
CONTROLLER_IDENTITYis dangerous in two ways:SetManagerIdentitywith an empty string clears theManagerIdentityfield on the Worker Deployment, leaving it ownerless and claimable by any controller.SetCurrentVersion,SetRampingVersion, etc.) with an emptyIdentitycause the SDK to substitute its own default (typically hostname), producing junk in the Temporal audit trail.Three layers of enforcement:
main()— fast-fail at process start for normal Helm deployments.Reconcile()— fallback check for when the reconciler is imported as a library andmain()is not in the call path (e.g. embedded controller managers, tests).claimManagerIdentity()— last-resort guard specifically against the dangerous field-clear: refuses to callSetManagerIdentityif identity is empty.Test plan
go build ./...passeshelm templaterenders without error and without any cluster lookupCONTROLLER_IDENTITYset viat.SetenvManagerIdentityare reclaimed after upgrade🤖 Generated with Claude Code