ISSUE TYPE
Bug Report
COMPONENT NAME
KVM, HA, Orchestration
CLOUDSTACK VERSION
4.22.x
CONFIGURATION
- KVM cluster
- Host HA enabled
- VM HA enabled
- Shared storage: NetApp NFS 4.1
- 2 management servers
- Host HA provider: kvmhaprovider
- Out-of-band management (OBM) reconfigured to IPMI for this test
- Relevant tuning applied before this test (see the verification query below):
  - wait = 20
  - kvm.ha.activity.check.interval = 20
  - kvm.ha.activity.check.max.attempts = 5
  - kvm.ha.degraded.max.period = 120
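For reference, the values actually in effect can be confirmed against the global settings in the cloud database. A minimal sketch, assuming the standard configuration table and global scope:

-- Sketch: confirm the HA tuning values in effect (global scope assumed).
SELECT name, value
FROM configuration
WHERE name IN ('wait',
               'kvm.ha.activity.check.interval',
               'kvm.ha.activity.check.max.attempts',
               'kvm.ha.degraded.max.period');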
OS / ENVIRONMENT
ACS 4.22
NetApp NFS 4.1
Ubuntu 24.04
mysql Ver 8.0.45-0ubuntu0.24.04.1
libvirt0:amd64 10.0.0-2ubuntu8.11
SUMMARY
When a KVM host is powered off, CloudStack eventually fences the host, but some VMs that were on the failed host remain in Running state on that host and are not restarted elsewhere.
In this case:
- the failed host reaches ha_state = Fenced
- HA work items are created for the affected VMs with reason HostDown
- those HA work items are marked Done
- but the affected VMs still remain:
  - state = Running
  - host_id = failed host
  - power_state = PowerOn
  - power_host = failed host
So CloudStack appears to consider HA recovery complete even though the VMs are still recorded as running on a fenced/offline host.
STEPS TO REPRODUCE
- Configure a KVM cluster with Host HA and VM HA enabled (IPMI out-of-band management).
- Ensure there are user and system VMs running on the host.
- Tune the HA settings as follows and restart the management servers:
  - wait = 20
  - kvm.ha.activity.check.interval = 20
  - kvm.ha.activity.check.max.attempts = 5
  - kvm.ha.degraded.max.period = 120
- Power off the host.
- Observe the management server logs and database state (ha_config, op_ha_work, vm_instance); example queries are sketched below.
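A minimal set of observation queries for the last step, using only the tables and columns shown in the results below:

-- Sketch: track the failed host's HA state transitions.
SELECT resource_id, ha_state, update_count, update_time
FROM ha_config
WHERE resource_type = 'Host';

-- Sketch: watch HA work items being created and processed.
SELECT id, instance_id, vm_type, step, reason, created, updated
FROM op_ha_work
ORDER BY id DESC;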
EXPECTED RESULTS
After the failed host is confirmed down and fenced:
- HA should identify all VMs that were on the failed host as unavailable.
- Those VMs should no longer remain Running on the failed host.
- CloudStack should either:
  - transition them to Stopped and restart them on an eligible host, or
  - at minimum correct their VM/power state to reflect that they are no longer running on the fenced host.
- HA work items should only be marked Done after the VM state and placement are consistent with the actual recovery result (see the invariant sketched below).
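Expressed as a database invariant (a sketch built on the tables dumped below; it should return no rows once HA work has completed correctly):

-- Sketch of the expected invariant: no HostDown work item marked Done
-- while its VM is still recorded as Running on the same failed host.
SELECT w.id AS work_id, w.instance_id, vm.instance_name, vm.state, vm.host_id
FROM op_ha_work w
JOIN vm_instance vm ON vm.id = w.instance_id
WHERE w.step = 'Done'
  AND w.reason = 'HostDown'
  AND vm.state = 'Running'
  AND vm.host_id = w.host_id;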
ACTUAL RESULTS
The failed host reaches Fenced in ha_config:
select * from ha_config;
+----+-------------+---------------+---------+------------+---------------+--------------+---------------------+-----------------+
| id | resource_id | resource_type | enabled | ha_state | provider | update_count | update_time | mgmt_server_id |
+----+-------------+---------------+---------+------------+---------------+--------------+---------------------+-----------------+
| 1 | 6 | Host | 1 | Fenced | kvmhaprovider | 131 | 2026-03-31 06:42:57 | 248902281439561 |
| 2 | 5 | Host | 1 | Ineligible | kvmhaprovider | 19 | 2026-03-31 06:31:21 | 248902281439561 |
+----+-------------+---------------+---------+------------+---------------+--------------+---------------------+-----------------+
(Management server logs attached as logs.txt.)
However, multiple VMs still remain Running on the fenced host 6:
SELECT id, instance_name, state, host_id, last_host_id, power_state, power_host, update_time, power_state_update_time
FROM vm_instance
WHERE host_id = 6
ORDER BY id;
+----+---------------+---------+---------+--------------+-------------+------------+---------------------+-------------------------+
| id | instance_name | state | host_id | last_host_id | power_state | power_host | update_time | power_state_update_time |
+----+---------------+---------+---------+--------------+-------------+------------+---------------------+-------------------------+
| 78 | r-78-VM | Running | 6 | 6 | PowerOn | 6 | 2026-03-31 06:19:55 | 2026-03-31 06:26:12 |
| 80 | v-80-VM | Running | 6 | 6 | PowerOn | 6 | 2026-03-31 06:19:15 | 2026-03-31 06:26:12 |
| 84 | s-84-VM | Running | 6 | 6 | PowerOn | 6 | 2026-03-31 06:19:14 | 2026-03-31 06:26:12 |
| 87 | i-2-87-VM | Running | 6 | 6 | PowerOn | 6 | 2026-03-31 06:21:56 | 2026-03-31 06:26:12 |
| 88 | i-2-88-VM | Running | 6 | 6 | PowerOn | 6 | 2026-03-31 06:21:56 | 2026-03-31 06:26:12 |
+----+---------------+---------+---------+--------------+-------------+------------+---------------------+-------------------------+
At the same time, HA work is created for those affected VMs and marked Done with reason HostDown:
SELECT *
FROM op_ha_work
ORDER BY id DESC
LIMIT 50;
+----+-------------+------+--------------------+---------+-----------------+---------+---------------------+-------+---------------------+------+-------------+---------+----------+
| id | instance_id | type | vm_type | state | mgmt_server_id | host_id | created | tried | taken | step | time_to_try | updated | reason |
+----+-------------+------+--------------------+---------+-----------------+---------+---------------------+-------+---------------------+------+-------------+---------+----------+
| 85 | 88 | HA | User | Running | 248902281439561 | 6 | 2026-03-31 06:31:18 | 0 | 2026-03-31 06:31:18 | Done | 1733338553 | 18 | HostDown |
| 84 | 87 | HA | User | Running | 248902281439561 | 6 | 2026-03-31 06:31:18 | 0 | 2026-03-31 06:31:18 | Done | 1733338553 | 13 | HostDown |
| 83 | 78 | HA | DomainRouter | Running | 248902281439561 | 6 | 2026-03-31 06:31:18 | 3 | 2026-03-31 06:31:18 | Done | 1733338553 | 100 | HostDown |
| 82 | 80 | HA | ConsoleProxy | Running | 248902281439561 | 6 | 2026-03-31 06:31:18 | 0 | 2026-03-31 06:31:18 | Done | 1733338553 | 35 | HostDown |
| 81 | 84 | HA | SecondaryStorageVm | Running | 248902281439561 | 6 | 2026-03-31 06:31:18 | 0 | 2026-03-31 06:31:18 | Done | 1733338553 | 30 | HostDown |
+----+-------------+------+--------------------+---------+-----------------+---------+---------------------+-------+---------------------+------+-------------+---------+----------+
This leaves CloudStack in an inconsistent state:
- the host is fenced,
- the HA work is complete,
- but the VMs are still shown as running on the failed host,
- and no failover/restart occurs for those VMs.
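The inconsistency can be surfaced in a single query (a sketch joining the tables shown above; on a correctly recovered deployment it should return nothing):

-- Sketch: VMs still recorded as Running on a host whose HA state is Fenced.
SELECT vm.id, vm.instance_name, vm.state, vm.power_state,
       vm.host_id, hc.ha_state
FROM vm_instance vm
JOIN ha_config hc
  ON hc.resource_id = vm.host_id
 AND hc.resource_type = 'Host'
WHERE vm.state = 'Running'
  AND hc.ha_state = 'Fenced'
  AND vm.removed IS NULL;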