ISSUE TYPE
Bug Report
COMPONENT NAME
KVM, HA, Orchestration
CLOUDSTACK VERSION
4.22.x
CONFIGURATION
- KVM cluster
- Host HA enabled
- VM HA enabled
- Shared storage: NetApp NFS 4.1
- 2 management servers
- Host HA provider: kvmhaprovider
- Out-of-band management (OBM) reconfigured to IPMI for this test
- Relevant tuning applied before this test (see the verification query below):
  - wait = 20
  - kvm.ha.activity.check.interval = 20
  - kvm.ha.activity.check.max.attempts = 5
  - kvm.ha.degraded.max.period = 120
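For reference, the values actually in effect can be confirmed against the global settings in the cloud database. A minimal sketch, assuming the standard configuration table and global scope:

-- Sketch: confirm the HA tuning values in effect (global scope assumed).
SELECT name, value
FROM configuration
WHERE name IN ('wait',
               'kvm.ha.activity.check.interval',
               'kvm.ha.activity.check.max.attempts',
               'kvm.ha.degraded.max.period');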
OS / ENVIRONMENT
ACS 4.22
NetApp NFS 4.1
Ubuntu 24.04
mysql Ver 8.0.45-0ubuntu0.24.04.1
libvirt0:amd64 10.0.0-2ubuntu8.11
SUMMARY
When a KVM host is powered off, CloudStack eventually fences the host, but some VMs that were on the failed host remain in Running state on that host and are not restarted elsewhere.
In this case:
- the failed host reaches ha_state = Fenced
- HA work items are created for the affected VMs with reason HostDown
- those HA work items are marked Done
- but the affected VMs still remain:
  - state = Running
  - host_id = failed host
  - power_state = PowerOn
  - power_host = failed host
So CloudStack appears to consider HA recovery complete even though the VMs are still recorded as running on a fenced/offline host.
STEPS TO REPRODUCE
- Configure a KVM cluster with Host HA and VM HA enabled (IPMI out-of-band management).
- Ensure there are user and system VMs running on the host.
- Tune the HA settings as follows and restart the management servers:
  - wait = 20
  - kvm.ha.activity.check.interval = 20
  - kvm.ha.activity.check.max.attempts = 5
  - kvm.ha.degraded.max.period = 120
- Power off the host.
- Observe the management server logs and database state (ha_config, op_ha_work, vm_instance); example queries are sketched below.
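A minimal set of observation queries for the last step, using only the tables and columns shown in the results below:

-- Sketch: track the failed host's HA state transitions.
SELECT resource_id, ha_state, update_count, update_time
FROM ha_config
WHERE resource_type = 'Host';

-- Sketch: watch HA work items being created and processed.
SELECT id, instance_id, vm_type, step, reason, created, updated
FROM op_ha_work
ORDER BY id DESC;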
EXPECTED RESULTS
After the failed host is confirmed down and fenced:
- HA should identify all VMs that were on the failed host as unavailable.
- Those VMs should no longer remain Running on the failed host.
- CloudStack should either:
  - transition them to Stopped and restart them on an eligible host, or
  - at minimum correct their VM/power state to reflect that they are no longer running on the fenced host.
- HA work items should only be marked Done after the VM state and placement are consistent with the actual recovery result (see the invariant sketched below).
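Expressed as a database invariant (a sketch built on the tables dumped below; it should return no rows once HA work has completed correctly):

-- Sketch of the expected invariant: no HostDown work item marked Done
-- while its VM is still recorded as Running on the same failed host.
SELECT w.id AS work_id, w.instance_id, vm.instance_name, vm.state, vm.host_id
FROM op_ha_work w
JOIN vm_instance vm ON vm.id = w.instance_id
WHERE w.step = 'Done'
  AND w.reason = 'HostDown'
  AND vm.state = 'Running'
  AND vm.host_id = w.host_id;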
ACTUAL RESULTS
The failed host reaches Fenced in ha_config:
select * from ha_config;
+----+-------------+---------------+---------+------------+---------------+--------------+---------------------+-----------------+
| id | resource_id | resource_type | enabled | ha_state | provider | update_count | update_time | mgmt_server_id |
+----+-------------+---------------+---------+------------+---------------+--------------+---------------------+-----------------+
| 1 | 6 | Host | 1 | Fenced | kvmhaprovider | 131 | 2026-03-31 06:42:57 | 248902281439561 |
| 2 | 5 | Host | 1 | Ineligible | kvmhaprovider | 19 | 2026-03-31 06:31:21 | 248902281439561 |
+----+-------------+---------------+---------+------------+---------------+--------------+---------------------+-----------------+
(Management server logs attached as logs.txt.)
However, multiple VMs still remain Running on the fenced host 6:
SELECT id, instance_name, state, host_id, last_host_id, power_state, power_host, update_time, power_state_update_time
FROM vm_instance
WHERE host_id = 6
ORDER BY id;
+----+---------------+---------+---------+--------------+-------------+------------+---------------------+-------------------------+
| id | instance_name | state | host_id | last_host_id | power_state | power_host | update_time | power_state_update_time |
+----+---------------+---------+---------+--------------+-------------+------------+---------------------+-------------------------+
| 78 | r-78-VM | Running | 6 | 6 | PowerOn | 6 | 2026-03-31 06:19:55 | 2026-03-31 06:26:12 |
| 80 | v-80-VM | Running | 6 | 6 | PowerOn | 6 | 2026-03-31 06:19:15 | 2026-03-31 06:26:12 |
| 84 | s-84-VM | Running | 6 | 6 | PowerOn | 6 | 2026-03-31 06:19:14 | 2026-03-31 06:26:12 |
| 87 | i-2-87-VM | Running | 6 | 6 | PowerOn | 6 | 2026-03-31 06:21:56 | 2026-03-31 06:26:12 |
| 88 | i-2-88-VM | Running | 6 | 6 | PowerOn | 6 | 2026-03-31 06:21:56 | 2026-03-31 06:26:12 |
+----+---------------+---------+---------+--------------+-------------+------------+---------------------+-------------------------+
At the same time, HA work is created for those affected VMs and marked Done with reason HostDown:
SELECT *
FROM op_ha_work
ORDER BY id DESC
LIMIT 50;
+----+-------------+------+--------------------+---------+-----------------+---------+---------------------+-------+---------------------+------+-------------+---------+----------+
| id | instance_id | type | vm_type | state | mgmt_server_id | host_id | created | tried | taken | step | time_to_try | updated | reason |
+----+-------------+------+--------------------+---------+-----------------+---------+---------------------+-------+---------------------+------+-------------+---------+----------+
| 85 | 88 | HA | User | Running | 248902281439561 | 6 | 2026-03-31 06:31:18 | 0 | 2026-03-31 06:31:18 | Done | 1733338553 | 18 | HostDown |
| 84 | 87 | HA | User | Running | 248902281439561 | 6 | 2026-03-31 06:31:18 | 0 | 2026-03-31 06:31:18 | Done | 1733338553 | 13 | HostDown |
| 83 | 78 | HA | DomainRouter | Running | 248902281439561 | 6 | 2026-03-31 06:31:18 | 3 | 2026-03-31 06:31:18 | Done | 1733338553 | 100 | HostDown |
| 82 | 80 | HA | ConsoleProxy | Running | 248902281439561 | 6 | 2026-03-31 06:31:18 | 0 | 2026-03-31 06:31:18 | Done | 1733338553 | 35 | HostDown |
| 81 | 84 | HA | SecondaryStorageVm | Running | 248902281439561 | 6 | 2026-03-31 06:31:18 | 0 | 2026-03-31 06:31:18 | Done | 1733338553 | 30 | HostDown |
+----+-------------+------+--------------------+---------+-----------------+---------+---------------------+-------+---------------------+------+-------------+---------+----------+
This leaves CloudStack in an inconsistent state:
- the host is fenced,
- the HA work is complete,
- but the VMs are still shown as running on the failed host,
- and no failover/restart occurs for those VMs.
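The inconsistency can be surfaced in a single query (a sketch joining the tables shown above; on a correctly recovered deployment it should return nothing):

-- Sketch: VMs still recorded as Running on a host whose HA state is Fenced.
SELECT vm.id, vm.instance_name, vm.state, vm.power_state,
       vm.host_id, hc.ha_state
FROM vm_instance vm
JOIN ha_config hc
  ON hc.resource_id = vm.host_id
 AND hc.resource_type = 'Host'
WHERE vm.state = 'Running'
  AND hc.ha_state = 'Fenced'
  AND vm.removed IS NULL;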