
[multiple] Co-locate provisionserver with metal3 to prevent DHCP fail…#3738

Open
mnietoji wants to merge 1 commit into openstack-k8s-operators:main from
mnietoji:dhcp_provisioning_with_fix


mnietoji (Contributor) commented Mar 4, 2026

…ures

When the metal3-dnsmasq pod restarts during a node's DHCP lease renewal on the provisioning network (172.23.0.0/24), NetworkManager fails to renew the lease and sets ipv4.method=disabled. The NMState operator then preserves this disabled state, causing permanent loss of provisioning network connectivity on that node.

The issue occurs when the OpenStackProvisionServer and metal3 pods run on different nodes. If metal3 restarts while a node is attempting DHCP renewal, the temporary unavailability of metal3-dnsmasq causes the renewal to fail.

Solution:
Automatically detect the node running the metal3 pod (via the k8s-app=metal3 label) and configure provisionServerNodeSelector in baremetalSetTemplate to schedule OpenStackProvisionServer on the same node. This keeps provisioning network connectivity intact because metal3-static-ip-manager maintains a static IP (172.23.0.3) on the metal3 node regardless of dnsmasq restarts.
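
The detection step described above can be sketched roughly as follows. This is an illustrative Python sketch, not the PR's actual implementation. It assumes the pod list has already been fetched as JSON (for example with `oc get pods -l k8s-app=metal3 -o json`), that pinning is done through the well-known `kubernetes.io/hostname` node label, and that `provisionServerNodeSelector` sits directly under `baremetalSetTemplate`. The function names `metal3_node_name` and `provision_server_selector_patch` are hypothetical.

```python
import json


def metal3_node_name(pod_list: dict) -> str:
    """Return the node of the first Running pod in a `get pods -o json` style list.

    The caller is assumed to have filtered the list by the k8s-app=metal3 label.
    """
    for pod in pod_list.get("items", []):
        node = pod.get("spec", {}).get("nodeName")
        phase = pod.get("status", {}).get("phase")
        if node and phase == "Running":
            return node
    raise LookupError("no running metal3 pod with an assigned node")


def provision_server_selector_patch(node_name: str) -> dict:
    """Build a patch fragment that pins the provision server to one node.

    Assumption: the selector uses the kubernetes.io/hostname label and lives
    under baremetalSetTemplate, as the description suggests.
    """
    return {
        "baremetalSetTemplate": {
            "provisionServerNodeSelector": {
                "kubernetes.io/hostname": node_name,
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical sample data standing in for real cluster output.
    sample = {
        "items": [
            {"spec": {"nodeName": "master-1"}, "status": {"phase": "Running"}},
        ]
    }
    node = metal3_node_name(sample)
    print(json.dumps(provision_server_selector_patch(node), indent=2))
```

Raising an error when no running metal3 pod is found, rather than silently skipping the selector, makes a misdetected deployment visible early. Whether the real change fails or falls back in that case is not stated in the description.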

openshift-ci bot (Contributor) commented Mar 4, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tosky for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@softwarefactory-project-zuul

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/ci-framework for 3738,d660efa12350eb88ab3c89b1d91a04abcbc82293

mnietoji force-pushed the dhcp_provisioning_with_fix branch from d660efa to 369ae18 (March 4, 2026 11:42)
@softwarefactory-project-zuul

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/ci-framework for 3738,369ae185bc2b7d5a266e63c93224f86f1d2723cd

mnietoji force-pushed the dhcp_provisioning_with_fix branch from 369ae18 to 3fa51c9 (March 4, 2026 11:49)
@softwarefactory-project-zuul

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/ci-framework for 3738,3fa51c9d28a6f3c53f0c99dbbdef1baf476724d5

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/3217375613864e0d83f7f88f394dcfaa

openstack-k8s-operators-content-provider FAILURE in 7m 16s
⚠️ podified-multinode-edpm-deployment-crc SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal-minor-update SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
✔️ cifmw-pod-zuul-files SUCCESS in 4m 25s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 9m 21s
✔️ cifmw-pod-k8s-snippets-source SUCCESS in 4m 49s
✔️ cifmw-pod-pre-commit SUCCESS in 8m 44s
✔️ cifmw-architecture-validate-hci SUCCESS in 4m 54s
✔️ cifmw-molecule-ci_gen_kustomize_values SUCCESS in 5m 24s
✔️ cifmw-molecule-kustomize_deploy SUCCESS in 4m 14s

mnietoji force-pushed the dhcp_provisioning_with_fix branch from d0cf92f to 6b9c8b0 (March 4, 2026 14:44)
mnietoji force-pushed the dhcp_provisioning_with_fix branch from 6b9c8b0 to 1339a1d (March 4, 2026 17:48)
mnietoji force-pushed the dhcp_provisioning_with_fix branch from 1339a1d to cf58db9 (March 5, 2026 10:55)
mnietoji force-pushed the dhcp_provisioning_with_fix branch 3 times, most recently from e29b915 to 08a2b2b (March 10, 2026 15:14)
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/0b04bcb1f4f54d518d017da862888f74

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 03m 30s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 22m 08s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 25m 42s
✔️ cifmw-crc-podified-edpm-baremetal-minor-update SUCCESS in 1h 49m 34s
✔️ cifmw-pod-zuul-files SUCCESS in 5m 28s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 8m 35s
✔️ cifmw-pod-k8s-snippets-source SUCCESS in 4m 49s
cifmw-pod-pre-commit TIMED_OUT in 31m 04s
✔️ cifmw-architecture-validate-hci SUCCESS in 4m 29s
✔️ cifmw-molecule-ci_gen_kustomize_values SUCCESS in 5m 33s
✔️ cifmw-molecule-kustomize_deploy SUCCESS in 4m 07s

…ures II

When metal3-dnsmasq pod restarts during a node's DHCP lease renewal on the
provisioning network (172.23.0.0/24), NetworkManager fails to renew and sets
ipv4.method=disabled. NMState operator then preserves this disabled state,
causing permanent loss of provisioning network connectivity on that node.

The issue occurs when OpenStackProvisionServer and metal3 pods run on
different nodes. If metal3 restarts while a node is attempting DHCP renewal,
the temporary unavailability of metal3-dnsmasq causes the renewal to fail.

Solution:
Automatically detect the node running metal3 pod (via k8s-app=metal3 label)
and configure provisionServerNodeSelector in baremetalSetTemplate to schedule
OpenStackProvisionServer on the same node. This ensures provisioning network
connectivity is maintained because metal3-static-ip-manager maintains a static
IP (172.23.0.3) on the metal3 node regardless of dnsmasq restarts.

Signed-off-by: Miguel Angel Nieto Jimenez <mnietoji@redhat.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
mnietoji force-pushed the dhcp_provisioning_with_fix branch from 08a2b2b to a7b18cb (March 12, 2026 21:29)

2 participants