
Implement ClusterPolicy deletion and resource watches #2212

Open

retr0-kernel wants to merge 2 commits into NVIDIA:main from retr0-kernel:main

Conversation

@retr0-kernel

Description

controllers: implement ClusterPolicy TODOs

TODO 1 -> Handle deletion of the active singleton ClusterPolicy

What changed and why:

  • Added a guard that detects when the currently active singleton ClusterPolicy
    has DeletionTimestamp != nil (Kubernetes sets this when the object is being
    deleted but still has finalizers).
  • On detection, resets clusterPolicyCtrl to a zero value so the internal state
    is clean and a new singleton election can happen on the next reconcile.
  • Returns RequeueAfter: 5s (non-deprecated; replaces the originally suggested
    Requeue: true) so the controller re-runs after the object is fully removed
    and can pick up the next available ClusterPolicy from the cluster.
  • The existing duplicate-ignore path is preserved and unchanged; it now runs as
    a separate if block immediately after the deletion guard.

TODO 2 -> Replace placeholder secondary-resource watches

What changed and why:

  • The placeholder comment (TODO(user): Modify this to be the types you create)
    has been removed.
  • Added watches for Deployment, ConfigMap, Service, and ServiceAccount;
    all four are managed by ClusterPolicyController via object_controls.go
    (e.g. Deployment(), Service(), ConfigMaps(), ServiceAccount()).
  • All watches use TypedEnqueueRequestForOwner with OnlyControllerOwner() so
    only resources whose ownerReference.controller=true points to a
    ClusterPolicy trigger a reconcile, matching how ownership is set via
    controllerutil.SetControllerReference in object_controls.go.
  • Without these watches, changes to those secondary resources (e.g. someone
    manually deleting a Service or ConfigMap) would not trigger reconciliation and
    the operator would not self-heal.
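The ownership condition the watches rely on can be modeled in a stdlib-only sketch. This is an assumption-labeled illustration of what handler.OnlyControllerOwner() checks, not the controller-runtime implementation itself; `ownerRef` and `ownedByClusterPolicyController` are hypothetical names.

```go
package main

import "fmt"

// ownerRef models the fields of metav1.OwnerReference that matter here.
type ownerRef struct {
	Kind       string
	Controller bool
}

// ownedByClusterPolicyController reports whether an object's ownerReferences
// contain a controller reference (controller=true) to a ClusterPolicy — the
// condition under which the new watches enqueue a reconcile request.
func ownedByClusterPolicyController(refs []ownerRef) bool {
	for _, r := range refs {
		if r.Controller && r.Kind == "ClusterPolicy" {
			return true
		}
	}
	return false
}

func main() {
	// A Service created via controllerutil.SetControllerReference gets a
	// controller=true owner reference to the ClusterPolicy.
	svc := []ownerRef{{Kind: "ClusterPolicy", Controller: true}}
	fmt.Println(ownedByClusterPolicyController(svc)) // true
}
```

An object with only non-controller owner references (or none at all) is filtered out, which keeps unrelated Deployments, Services, etc. from triggering reconciles.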

Refactor -> Extract buildNodeMapFn and buildNodePredicate

The inline closures inside addWatchNewGPUNode have been extracted into two
named package-level helpers:

  • buildNodeMapFn(r) -> returns the map function that enqueues reconcile requests
    for all ClusterPolicy objects when a relevant Node event fires.
  • buildNodePredicate() -> returns the TypedFuncs[*corev1.Node] predicate
    (Create / Update / Delete filters).

addWatchNewGPUNode now calls these helpers and wraps UpdateFunc to add the
logger call, keeping the predicate itself logger-free and independently testable.

This is a pure refactor with no behaviour change; it exists to make the logic
unit-testable without a real ctrl.Manager.
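The testability benefit of the extraction can be sketched as follows. This is a minimal stdlib-only model: the real helper returns predicate.TypedFuncs[*corev1.Node], and the specific label checked here (`nvidia.com/gpu.present`) is an assumed example of a relevant Node change, not a claim about the actual filter logic.

```go
package main

import "fmt"

// nodeEvent is a minimal stand-in for a Node update event.
type nodeEvent struct {
	OldLabels map[string]string
	NewLabels map[string]string
}

// nodePredicate mirrors the shape of the extracted helper: plain funcs
// that a unit test can call directly, with no manager or cluster needed.
type nodePredicate struct {
	Update func(nodeEvent) bool
}

// buildNodePredicate is an illustrative version of the extracted,
// logger-free helper; here the update filter fires only when the
// GPU-present label changes between old and new object state.
func buildNodePredicate() nodePredicate {
	const gpuLabel = "nvidia.com/gpu.present"
	return nodePredicate{
		Update: func(e nodeEvent) bool {
			return e.OldLabels[gpuLabel] != e.NewLabels[gpuLabel]
		},
	}
}

func main() {
	p := buildNodePredicate()
	changed := p.Update(nodeEvent{
		OldLabels: map[string]string{},
		NewLabels: map[string]string{"nvidia.com/gpu.present": "true"},
	})
	fmt.Println(changed) // true
}
```

Because the predicate carries no logger, the caller (addWatchNewGPUNode) can wrap UpdateFunc to add logging while tests exercise the bare filter.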

Removed TODO comment on Reconcile doc comment

The scaffolding comment TODO(user): Modify the Reconcile function to compare the state specified by... has been replaced with an accurate description of what
the function actually does, since the implementation is complete.

Checklist

  • No secrets, sensitive information, or unrelated changes
  • Lint checks passing (make lint)
  • Generated assets in-sync (make validate-generated-assets)
  • Go mod artifacts in-sync (make validate-modules)
  • Test cases are added for new code paths

Testing

Build verified:

go build ./controllers/...   # exits 0, no errors

Existing tests unaffected:

go test -v -run TestReconcile ./controllers/
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        GPU workload configuration      {"NodeName": "node0", "GpuWorkloadConfig": "container"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Node has GPU(s) {"NodeName": "node0"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node0", "Label": "nvidia.com/gpu.present", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Checking GPU state labels on the node   {"NodeName": "node0"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node0", "Label": "nvidia.com/gpu.deploy.node-status-exporter", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node0", "Label": "nvidia.com/gpu.deploy.operator-validator", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node0", "Label": "nvidia.com/gpu.deploy.driver", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node0", "Label": "nvidia.com/gpu.deploy.gpu-feature-discovery", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node0", "Label": "nvidia.com/gpu.deploy.container-toolkit", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node0", "Label": "nvidia.com/gpu.deploy.device-plugin", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node0", "Label": "nvidia.com/gpu.deploy.dcgm", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node0", "Label": "nvidia.com/gpu.deploy.dcgm-exporter", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Applying correct GPU state labels to the node   {"NodeName": "node0"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        GPU workload configuration      {"NodeName": "node1", "GpuWorkloadConfig": "container"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Node has GPU(s) {"NodeName": "node1"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node1", "Label": "nvidia.com/gpu.present", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Checking GPU state labels on the node   {"NodeName": "node1"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node1", "Label": "nvidia.com/gpu.deploy.node-status-exporter", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node1", "Label": "nvidia.com/gpu.deploy.operator-validator", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node1", "Label": "nvidia.com/gpu.deploy.driver", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node1", "Label": "nvidia.com/gpu.deploy.gpu-feature-discovery", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node1", "Label": "nvidia.com/gpu.deploy.container-toolkit", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node1", "Label": "nvidia.com/gpu.deploy.device-plugin", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node1", "Label": "nvidia.com/gpu.deploy.dcgm", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node1", "Label": "nvidia.com/gpu.deploy.dcgm-exporter", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Applying correct GPU state labels to the node   {"NodeName": "node1"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Number of nodes with GPU label  {"NodeCount": 2}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Found kernel version label      {"Request.Namespace": "default", "Request.Name": "Node", "version": "5.4.0-generic"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Found kernel version label      {"Request.Namespace": "default", "Request.Name": "Node", "version": "5.4.135-generic"}
=== RUN   TestReconcile
=== RUN   TestReconcile/ClusterPolicy_has_driver_CRD_false_→_reconciliation_skips_driver
=== RUN   TestReconcile/ClusterPolicy_has_driver_CRD_true_but_validator_errors
=== RUN   TestReconcile/driver_CRD_true,_no_validator_errors,_use_precompiled_drivers_and_GDS_enabled
--- PASS: TestReconcile (0.00s)
    --- PASS: TestReconcile/ClusterPolicy_has_driver_CRD_false_→_reconciliation_skips_driver (0.00s)
    --- PASS: TestReconcile/ClusterPolicy_has_driver_CRD_true_but_validator_errors (0.00s)
    --- PASS: TestReconcile/driver_CRD_true,_no_validator_errors,_use_precompiled_drivers_and_GDS_enabled (0.00s)
PASS
ok      github.com/NVIDIA/gpu-operator/controllers      1.748s

retr0-kernel and others added 2 commits March 16, 2026 20:43
- Handle deletion of the active singleton ClusterPolicy and cycle to
  the next available one (previously a TODO with no implementation).

- Replace the placeholder secondary-resource watch (TODO comment) with
  proper watches for all resource types owned by ClusterPolicy:
  DaemonSet, Deployment, ConfigMap, Service, ServiceAccount.

- Use RequeueAfter (5 s) instead of the deprecated Requeue field
  throughout the new code path.

Signed-off-by: Krish Srivastava <krish22092003@gmail.com>
controllers: implement ClusterPolicy TODOs
@copy-pr-bot

copy-pr-bot bot commented Mar 16, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@retr0-kernel
Author

Hi reviewers,
Please let me know if any changes or inputs are needed from my side.
