
Implement ClusterPolicy deletion and resource watches #2212

Open

retr0-kernel wants to merge 2 commits into NVIDIA:main from retr0-kernel:main

Conversation

@retr0-kernel

Description

controllers: implement ClusterPolicy TODOs

TODO 1 -> Handle deletion of the active singleton ClusterPolicy

What changed and why:

  • Added a guard that detects when the currently active singleton ClusterPolicy
    has DeletionTimestamp != nil (Kubernetes sets this when the object is being
    deleted but still has finalizers).
  • On detection, resets clusterPolicyCtrl to a zero value so the internal state
    is clean and a new singleton election can happen on the next reconcile.
  • Returns RequeueAfter: 5s (non-deprecated; replaces the originally suggested
    Requeue: true) so the controller re-runs after the object is fully removed
    and can pick up the next available ClusterPolicy from the cluster.
  • The existing duplicate-ignore path is preserved and unchanged; it now runs as
    a separate if block immediately after the deletion guard.

TODO 2 -> Replace placeholder secondary-resource watches

What changed and why:

  • The placeholder comment (TODO(user): Modify this to be the types you create)
    has been removed.
  • Added watches for Deployment, ConfigMap, Service, and ServiceAccount;
    all four are managed by ClusterPolicyController via object_controls.go
    (e.g. Deployment(), Service(), ConfigMaps(), ServiceAccount()).
  • All watches use TypedEnqueueRequestForOwner with OnlyControllerOwner() so
    only resources whose ownerReference.controller=true points to a
    ClusterPolicy trigger a reconcile, matching how ownership is set via
    controllerutil.SetControllerReference in object_controls.go.
  • Without these watches, changes to those secondary resources (e.g. someone
    manually deleting a Service or ConfigMap) would not trigger reconciliation and
    the operator would not self-heal.
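The ownership condition the watches rely on can be modeled in a stdlib-only sketch. This is an assumption-labeled illustration of what handler.OnlyControllerOwner() checks, not the controller-runtime implementation itself; `ownerRef` and `ownedByClusterPolicyController` are hypothetical names.

```go
package main

import "fmt"

// ownerRef models the fields of metav1.OwnerReference that matter here.
type ownerRef struct {
	Kind       string
	Controller bool
}

// ownedByClusterPolicyController reports whether an object's ownerReferences
// contain a controller reference (controller=true) to a ClusterPolicy — the
// condition under which the new watches enqueue a reconcile request.
func ownedByClusterPolicyController(refs []ownerRef) bool {
	for _, r := range refs {
		if r.Controller && r.Kind == "ClusterPolicy" {
			return true
		}
	}
	return false
}

func main() {
	// A Service created via controllerutil.SetControllerReference gets a
	// controller=true owner reference to the ClusterPolicy.
	svc := []ownerRef{{Kind: "ClusterPolicy", Controller: true}}
	fmt.Println(ownedByClusterPolicyController(svc)) // true
}
```

An object with only non-controller owner references (or none at all) is filtered out, which keeps unrelated Deployments, Services, etc. from triggering reconciles.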

Refactor -> Extract buildNodeMapFn and buildNodePredicate

The inline closures inside addWatchNewGPUNode have been extracted into two
named package-level helpers:

  • buildNodeMapFn(r) -> returns the map function that enqueues reconcile requests
    for all ClusterPolicy objects when a relevant Node event fires.
  • buildNodePredicate() -> returns the TypedFuncs[*corev1.Node] predicate
    (Create / Update / Delete filters).

addWatchNewGPUNode now calls these helpers and wraps UpdateFunc to add the
logger call, keeping the predicate itself logger-free and independently testable.

This is a pure refactor with no behaviour change; it exists to make the logic
unit-testable without a real ctrl.Manager.
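The testability benefit of the extraction can be sketched as follows. This is a minimal stdlib-only model: the real helper returns predicate.TypedFuncs[*corev1.Node], and the specific label checked here (`nvidia.com/gpu.present`) is an assumed example of a relevant Node change, not a claim about the actual filter logic.

```go
package main

import "fmt"

// nodeEvent is a minimal stand-in for a Node update event.
type nodeEvent struct {
	OldLabels map[string]string
	NewLabels map[string]string
}

// nodePredicate mirrors the shape of the extracted helper: plain funcs
// that a unit test can call directly, with no manager or cluster needed.
type nodePredicate struct {
	Update func(nodeEvent) bool
}

// buildNodePredicate is an illustrative version of the extracted,
// logger-free helper; here the update filter fires only when the
// GPU-present label changes between old and new object state.
func buildNodePredicate() nodePredicate {
	const gpuLabel = "nvidia.com/gpu.present"
	return nodePredicate{
		Update: func(e nodeEvent) bool {
			return e.OldLabels[gpuLabel] != e.NewLabels[gpuLabel]
		},
	}
}

func main() {
	p := buildNodePredicate()
	changed := p.Update(nodeEvent{
		OldLabels: map[string]string{},
		NewLabels: map[string]string{"nvidia.com/gpu.present": "true"},
	})
	fmt.Println(changed) // true
}
```

Because the predicate carries no logger, the caller (addWatchNewGPUNode) can wrap UpdateFunc to add logging while tests exercise the bare filter.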

Removed TODO comment on Reconcile doc comment

The scaffolding comment TODO(user): Modify the Reconcile function to compare the state specified by... has been replaced with an accurate description of what
the function actually does, since the implementation is complete.

Checklist

  • No secrets, sensitive information, or unrelated changes
  • Lint checks passing (make lint)
  • Generated assets in-sync (make validate-generated-assets)
  • Go mod artifacts in-sync (make validate-modules)
  • Test cases are added for new code paths

Testing

Build verified:

go build ./controllers/...   # exits 0, no errors

Existing tests unaffected:

go test -v -run TestReconcile ./controllers/
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        GPU workload configuration      {"NodeName": "node0", "GpuWorkloadConfig": "container"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Node has GPU(s) {"NodeName": "node0"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node0", "Label": "nvidia.com/gpu.present", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Checking GPU state labels on the node   {"NodeName": "node0"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node0", "Label": "nvidia.com/gpu.deploy.node-status-exporter", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node0", "Label": "nvidia.com/gpu.deploy.operator-validator", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node0", "Label": "nvidia.com/gpu.deploy.driver", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node0", "Label": "nvidia.com/gpu.deploy.gpu-feature-discovery", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node0", "Label": "nvidia.com/gpu.deploy.container-toolkit", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node0", "Label": "nvidia.com/gpu.deploy.device-plugin", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node0", "Label": "nvidia.com/gpu.deploy.dcgm", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node0", "Label": "nvidia.com/gpu.deploy.dcgm-exporter", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Applying correct GPU state labels to the node   {"NodeName": "node0"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        GPU workload configuration      {"NodeName": "node1", "GpuWorkloadConfig": "container"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Node has GPU(s) {"NodeName": "node1"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node1", "Label": "nvidia.com/gpu.present", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Checking GPU state labels on the node   {"NodeName": "node1"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node1", "Label": "nvidia.com/gpu.deploy.node-status-exporter", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node1", "Label": "nvidia.com/gpu.deploy.operator-validator", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node1", "Label": "nvidia.com/gpu.deploy.driver", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node1", "Label": "nvidia.com/gpu.deploy.gpu-feature-discovery", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node1", "Label": "nvidia.com/gpu.deploy.container-toolkit", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node1", "Label": "nvidia.com/gpu.deploy.device-plugin", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node1", "Label": "nvidia.com/gpu.deploy.dcgm", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Setting node label      {"NodeName": "node1", "Label": "nvidia.com/gpu.deploy.dcgm-exporter", "Value": "true"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Applying correct GPU state labels to the node   {"NodeName": "node1"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Number of nodes with GPU label  {"NodeCount": 2}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Found kernel version label      {"Request.Namespace": "default", "Request.Name": "Node", "version": "5.4.0-generic"}
2026-03-16T21:54:39+05:30       INFO    controller.ClusterPolicy        Found kernel version label      {"Request.Namespace": "default", "Request.Name": "Node", "version": "5.4.135-generic"}
=== RUN   TestReconcile
=== RUN   TestReconcile/ClusterPolicy_has_driver_CRD_false_→_reconciliation_skips_driver
=== RUN   TestReconcile/ClusterPolicy_has_driver_CRD_true_but_validator_errors
=== RUN   TestReconcile/driver_CRD_true,_no_validator_errors,_use_precompiled_drivers_and_GDS_enabled
--- PASS: TestReconcile (0.00s)
    --- PASS: TestReconcile/ClusterPolicy_has_driver_CRD_false_→_reconciliation_skips_driver (0.00s)
    --- PASS: TestReconcile/ClusterPolicy_has_driver_CRD_true_but_validator_errors (0.00s)
    --- PASS: TestReconcile/driver_CRD_true,_no_validator_errors,_use_precompiled_drivers_and_GDS_enabled (0.00s)
PASS
ok      github.com/NVIDIA/gpu-operator/controllers      1.748s

retr0-kernel and others added 2 commits March 16, 2026 20:43
- Handle deletion of the active singleton ClusterPolicy and cycle to
  the next available one (previously a TODO with no implementation).

- Replace the placeholder secondary-resource watch (TODO comment) with
  proper watches for all resource types owned by ClusterPolicy:
  DaemonSet, Deployment, ConfigMap, Service, ServiceAccount.

- Use RequeueAfter (5 s) instead of the deprecated Requeue field
  throughout the new code path.

Signed-off-by: Krish Srivastava <krish22092003@gmail.com>
controllers: implement ClusterPolicy TODOs
@copy-pr-bot

copy-pr-bot bot commented Mar 16, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@retr0-kernel
Author

Hi reviewers,
Please let me know if any changes or inputs are needed from my side.
