KVM Host HA code improvements#13088
Conversation
@blueorangutan package
Codecov Report: ❌ Patch coverage is

@@             Coverage Diff             @@
##               4.22   #13088     +/-  ##
=========================================
  Coverage     17.67%   17.68%
- Complexity    15789    15793       +4
=========================================
  Files          5922     5922
  Lines        533094   533119      +25
  Branches      65210    65201       -9
=========================================
+ Hits          94246    94259      +13
- Misses       428208   428216       +8
- Partials      10640    10644       +4
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17658
@blueorangutan test
@sureshanaparti a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests
DaanHoogland
left a comment
lgtm and good cleanup (needs testing though)
[SF] Trillian test result (tid-15989)
… progress, and some code improvements
- When a Host HA inspection is in progress, the investigator returns the host status as Up, which cancels the VM HA items
- Don't cancel the VM HA items; instead reschedule them to try again later
@blueorangutan package
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17729
…agent connection status to determine whether the Host HA inspection is in progress, and some code improvements
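The check described in this commit can be sketched as follows. This is a minimal illustration only; the class, enum and method names below are hypothetical, not the actual CloudStack types (the real logic lives in the HA manager implementations):

```java
// Sketch: decide whether a Host HA inspection is in progress by combining
// the host's HA state with its agent connection status, so that a host
// whose agent is merely disconnected is not mistaken for a healthy one.
// All names here are illustrative, not actual CloudStack classes.
public class HostHaInspectionCheck {

    enum HaState { AVAILABLE, SUSPECT, CHECKING, DEGRADED, RECOVERING, FENCING, FENCED }
    enum AgentStatus { CONNECTED, DISCONNECTED }

    static boolean isInspectionInProgress(HaState haState, AgentStatus agentStatus) {
        // Treat the host as "under inspection" only while HA is actively
        // checking or recovering it AND the agent is not connected; with a
        // connected agent, normal investigation can proceed.
        boolean activeHaActivity = haState == HaState.SUSPECT
                || haState == HaState.CHECKING
                || haState == HaState.DEGRADED
                || haState == HaState.RECOVERING
                || haState == HaState.FENCING;
        return activeHaActivity && agentStatus == AgentStatus.DISCONNECTED;
    }

    public static void main(String[] args) {
        System.out.println(isInspectionInProgress(HaState.CHECKING, AgentStatus.DISCONNECTED)); // true
        System.out.println(isInspectionInProgress(HaState.AVAILABLE, AgentStatus.DISCONNECTED)); // false
    }
}
```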
@blueorangutan package
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17741
@blueorangutan test
@sureshanaparti a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests
[SF] Trillian test result (tid-16032)
@blueorangutan package
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17764
@blueorangutan test
@sureshanaparti a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests
@blueorangutan package
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17779
@blueorangutan test
@weizhouapache a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests
[SF] Trillian test result (tid-16044)
kiranchavala
left a comment
LGTM
Tested manually with the following steps:
- Create an HA-enabled offering
- Deploy a VM with the HA-enabled offering
- Configure OOBM on the KVM host with either IPMI or Redfish
- Enable Host HA on the KVM host
- Trigger a kernel panic on the host: echo c > /proc/sysrq-trigger
- Host HA successfully gets triggered
- VM HA also kicks in
Logs
[root@ref-trl-11669-k-Mol8-kiran-chavala-mgmt1 ~]# cat /var/log/cloudstack/management/management-server.log.2026-05-06 |grep -i "logid:9b1252df"
2026-05-06 17:29:42,814 INFO [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) Processing work HAWork[8-HA-7-Running-Investigating]
2026-05-06 17:29:42,816 DEBUG [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) RESTART with HA WORK
2026-05-06 17:29:42,819 DEBUG [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) Checking Host HA inspection is in progress or not for the host 1 from HAConfig, HA state is Fenced
2026-05-06 17:29:42,820 DEBUG [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) SimpleInvestigator unable to determine the state of the host. Moving on.
2026-05-06 17:29:42,820 DEBUG [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) XenServerInvestigator unable to determine the state of the host. Moving on.
2026-05-06 17:29:42,822 DEBUG [o.a.c.h.HAManagerImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) HA: Agent [Host {"id":1,"name":"ref-trl-11669-k-Mol8-kiran-chavala-kvm1","type":"Routing","uuid":"fe5f72ef-5285-4d8d-91f7-c7a951ff2279"}] is fenced.
2026-05-06 17:29:42,823 DEBUG [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) KVMInvestigator was able to determine host Host {"id":1,"name":"ref-trl-11669-k-Mol8-kiran-chavala-kvm1","type":"Routing","uuid":"fe5f72ef-5285-4d8d-91f7-c7a951ff2279"} is in Down
2026-05-06 17:29:42,823 INFO [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) HA on VM instance {"id":7,"instanceName":"i-2-7-VM","state":"Running","type":"User","uuid":"eeacfde8-6019-477d-8551-952c4a2dc7cf"}
2026-05-06 17:29:42,825 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) Wait time setting on com.cloud.agent.api.CheckVirtualMachineCommand is 20 seconds
2026-05-06 17:29:42,827 DEBUG [c.c.h.CheckOnAgentInvestigator] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) Unable to reach the agent for VM instance {"id":7,"instanceName":"i-2-7-VM","state":"Running","type":"User","uuid":"eeacfde8-6019-477d-8551-952c4a2dc7cf"}: Resource [Host:1] is unreachable: Host 1: Host with specified id is not in the right state: Down
2026-05-06 17:29:42,827 INFO [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) SimpleInvestigator could not find VM instance {"id":7,"instanceName":"i-2-7-VM","state":"Running","type":"User","uuid":"eeacfde8-6019-477d-8551-952c4a2dc7cf"}
2026-05-06 17:29:42,827 INFO [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) XenServerInvestigator could not find VM instance {"id":7,"instanceName":"i-2-7-VM","state":"Running","type":"User","uuid":"eeacfde8-6019-477d-8551-952c4a2dc7cf"}
2026-05-06 17:29:42,829 DEBUG [o.a.c.h.HAManagerImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) HA: Host [Host {"id":1,"name":"ref-trl-11669-k-Mol8-kiran-chavala-kvm1","type":"Routing","uuid":"fe5f72ef-5285-4d8d-91f7-c7a951ff2279"}] is fenced.
2026-05-06 17:29:42,829 INFO [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) KVMInvestigator found VM instance {"id":7,"instanceName":"i-2-7-VM","state":"Running","type":"User","uuid":"eeacfde8-6019-477d-8551-952c4a2dc7cf"} to be alive? false
2026-05-06 17:29:42,829 WARN [o.a.c.f.j.AsyncJobExecutionContext] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) Job is executed without a context, setup psudo job for the executing thread
2026-05-06 17:29:42,843 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) Sync job-91 execution on object VmWorkJobQueue.7
2026-05-06 17:29:43,691 DEBUG [c.c.v.ClusteredVirtualMachineManagerImpl] (HA-Worker-4:[ctx-094abc4b, work-8, ctx-b87ca687]) (logid:9b1252df) start parameter value of enterHardwareSetup == null during processing of queued job
2026-05-06 17:29:43,698 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (HA-Worker-4:[ctx-094abc4b, work-8, ctx-b87ca687]) (logid:9b1252df) Sync job-92 execution on object VmWorkJobQueue.7
2026-05-06 17:29:48,376 INFO [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) HA is now restarting VM instance {"id":7,"instanceName":"i-2-7-VM","state":"Running","type":"User","uuid":"eeacfde8-6019-477d-8551-952c4a2dc7cf"} on Host {"id":2,"name":"ref-trl-11669-k-Mol8-kiran-chavala-kvm2","type":"Routing","uuid":"c5bbfd66-800f-4634-ae73-46aa4d823868"}
2026-05-06 17:29:48,379 WARN [c.c.a.AlertManagerImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) alertType=[8] dataCenterId=[1] podId=[1] clusterId=[null] message=[HA starting VM: t5 (i-2-7-VM)].
2026-05-06 17:29:48,390 WARN [c.c.a.AlertManagerImpl] (HA-Worker-4:[ctx-094abc4b, work-8]) (logid:9b1252df) No recipients set in global setting 'alert.email.addresses', skipping sending alert with subject [HA starting VM: t5 (i-2-7-VM)] and content [HA starting VM: t5 (i-2-7-VM)].
2026-0
[SF] Trillian test result (tid-16055)
Description
This PR fixes VM HA items being cancelled while Host HA is enabled and an inspection is in progress, and improves the Host HA code (updates logs and some refactoring/cleanup).
When a Host HA inspection is in progress, the KVM investigator returns the host status as Up, which cancels the VM HA items. Instead of cancelling the VM HA items, reschedule them to try again later.
This addresses #7543, #12922.
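The reschedule-instead-of-cancel behaviour can be sketched as below. This is a minimal illustration under assumed names, not the actual CloudStack implementation (the real code lives in the HighAvailabilityManager; the enum, class and constant here are hypothetical):

```java
// Sketch: when the host is under an active Host HA inspection, a VM HA
// work item is pushed back for a later retry instead of being cancelled
// outright. Names and the delay value are illustrative assumptions.
public class HaWorkScheduler {

    enum Decision { CANCEL, RESCHEDULE, PROCEED }

    // Assumed retry delay mirroring the PR's "try again later" behaviour.
    static final long RETRY_DELAY_SECONDS = 60;

    static Decision decide(boolean hostHaInspectionInProgress, boolean hostReportedUp) {
        if (hostHaInspectionInProgress) {
            // Previously the investigator reported the host as Up here,
            // which cancelled the VM HA item; now it is rescheduled instead.
            return Decision.RESCHEDULE;
        }
        // Outside an inspection, an Up host still cancels the work item,
        // and a down host lets VM HA proceed as before.
        return hostReportedUp ? Decision.CANCEL : Decision.PROCEED;
    }

    public static void main(String[] args) {
        System.out.println(decide(true, true));   // RESCHEDULE
        System.out.println(decide(false, true));  // CANCEL
        System.out.println(decide(false, false)); // PROCEED
    }
}
```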
Types of changes
Feature/Enhancement Scale or Bug Severity
Screenshots (if appropriate):
How Has This Been Tested?
How did you try to break this feature and the system with this change?