GitHub user jpt1624 created a discussion: KVM HA not functioning
### problem
Hello, I am having trouble getting HA to work with my two KVM hosts.
HA is enabled on the cluster, on both hosts, and on a test virtual machine.
The KVM hosts have OOB management configured using ipmitool.
For testing, I have a virtual machine with an HA-supported policy running on
KVM-02. I power off KVM-02 abruptly to see whether the virtual machine will
automatically migrate over to KVM-01.
What occurs is the following:
**KVM-02 is determined to be Disconnected by the CloudStack management server:**
Host
{"id":46,"name":"kvm-02","type":"Routing","uuid":"d3b323d6-e3bb-4d06-917a-75fba36a5adf"}
has the status [Disconnected].
**KVM-02 is then set to the DOWN status (supposedly):**
_{"id":46,"name":"kvm-02","type":"Routing","uuid":"d3b323d6-e3bb-4d06-917a-75fba36a5adf"}
has the status [Down]._
**KVM-01 then checks connectivity with KVM-02 and also reports a status of
DOWN for it:**
_Neighbouring Host
{"id":43,"name":"kvm-01","type":"Routing","uuid":"025ccefd-1696-43c9-9a2c-e045968d2efa"}
returned status [Down] for the investigated Host
{"id":46,"name":"kvm-02","type":"Routing","uuid":"d3b323d6-e3bb-4d06-917a-75fba36a5adf"}._
**The shared storage volume mounted onto KVM-02 is checked for any recent
writes:**
_Checking VM activity for Host
{"id":46,"name":"kvm-02","type":"Routing","uuid":"d3b323d6-e3bb-4d06-917a-75fba36a5adf"}
on storage pool [StoragePool
{"id":48,"name":"Cloud-KVM-SSD-01","poolType":"NetworkFilesystem","uuid":"f8e97832-44a9-3031-aa1d-0acfc9e32648"}]._
_Host
{"id":46,"name":"kvm-02","type":"Routing","uuid":"d3b323d6-e3bb-4d06-917a-75fba36a5adf"}
does not have activity on storage pool [StoragePool
{"id":48,"name":"Cloud-KVM-SSD-01","poolType":"NetworkFilesystem","uuid":"f8e97832-44a9-3031-aa1d-0acfc9e32648"}]_
**While these checks are running, the API still reports the status of KVM-02
as UP:**
<img width="549" height="215" alt="Image"
src="https://github.com/user-attachments/assets/d5de6cb1-7a0e-4db0-bfbe-eeacfefe74c8"
/>
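To line up the conflicting views (Down in the logs, Up in the API), one host entry from a `listHosts` response could be condensed along these lines. This is only a sketch: the field names (`state`, `resourcestate`, `hostha.hastate`, `outofbandmanagement.powerstate`) are what the API is assumed to return and should be checked against the 4.22 API reference, and the sample values are made up for illustration, not taken from this environment.

```python
def summarize_host(host):
    """Condense one host entry from a listHosts response into the fields
    that matter when the agent state and the HA state disagree."""
    ha = host.get("hostha") or {}
    oobm = host.get("outofbandmanagement") or {}
    return {
        "name": host.get("name"),
        "state": host.get("state"),
        "resourcestate": host.get("resourcestate"),
        "hastate": ha.get("hastate"),
        "powerstate": oobm.get("powerstate"),
    }


# Made-up sample entry (assumed field names, illustrative values only).
sample = {
    "name": "kvm-02",
    "state": "Up",
    "resourcestate": "Enabled",
    "hostha": {"haenable": True, "hastate": "Suspect"},
    "outofbandmanagement": {"enabled": True, "powerstate": "Off"},
}
summary = summarize_host(sample)
print(summary)
```

Missing sections simply come back as `None`, so the same helper works for hosts without HA or OOB management configured.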
After about 10-15 minutes, KVM-02 progresses to the ALERT state. I am not
sure why it takes this long, because the corresponding setting is configured
for 5 checks:
<img width="920" height="61" alt="Image"
src="https://github.com/user-attachments/assets/b3f8af7e-96e0-475f-8c15-0c01d908b6fd"
/>
Once KVM-02 is in the ALERT state, the HA task tries to power it OFF
(presumably to prevent split brain before moving the virtual machines over):
<img width="925" height="260" alt="Image"
src="https://github.com/user-attachments/assets/65a4a7c8-8be4-4cdf-8915-cf476fee37e7"
/>
This command fails because KVM-02 is already OFF.
<img width="249" height="103" alt="Image"
src="https://github.com/user-attachments/assets/099ef8b5-0a4d-42f3-a10d-4230dbe3f158"
/>
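One way to double-check what the BMC itself reports in this situation is to query the chassis power state before any power-off is attempted. The sketch below is not how CloudStack implements fencing; the helper names are hypothetical, and the BMC address and credentials are placeholders. It only assumes the standard `ipmitool ... chassis power status` command, which prints `Chassis Power is on` or `Chassis Power is off`.

```python
import subprocess


def parse_power_state(output):
    """Map the text printed by `ipmitool chassis power status` to
    'on', 'off', or 'unknown'."""
    text = output.strip().lower()
    if text.endswith("is off"):
        return "off"
    if text.endswith("is on"):
        return "on"
    return "unknown"


def chassis_power_state(bmc_host, user, password):
    """Ask the BMC for the current chassis power state over IPMI-over-LAN.
    All three arguments are placeholders for the real OOB settings."""
    cmd = ["ipmitool", "-I", "lanplus", "-H", bmc_host,
           "-U", user, "-P", password, "chassis", "power", "status"]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    if result.returncode != 0:
        return "unknown"
    return parse_power_state(result.stdout)


def fence_needed(state):
    """Hypothetical policy: a host that is already off needs no power-off;
    'on' and 'unknown' still call for a fencing attempt."""
    return state != "off"
```

For example, `fence_needed(chassis_power_state("bmc-kvm-02", "admin", "secret"))` would be `False` for a host that was already powered off, which is the case the power-off command trips over here.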
CloudStack keeps trying to power KVM-02 off until I manually issue the
OOB power-on command. Only then does CloudStack's power OFF command succeed,
after which KVM-02 is finally marked as DOWN:
<img width="925" height="350" alt="Image"
src="https://github.com/user-attachments/assets/2e689194-6824-4223-b892-001702f788a4"
/>
<img width="928" height="93" alt="Image"
src="https://github.com/user-attachments/assets/343ce057-32ea-417f-ab47-03b039245f39"
/>
<img width="431" height="107" alt="Image"
src="https://github.com/user-attachments/assets/6e59b2ee-a18f-4aaf-b91c-06c030908963"
/>
Here are the investigators configured:
<img width="1431" height="261" alt="Image"
src="https://github.com/user-attachments/assets/a2072b97-8e19-472b-93b7-69e0f36bbcd6"
/>
### versions
CloudStack: 4.22.0.0
KVM-01: 4.22.0.0
KVM-02: 4.22.0.0
### The steps to reproduce the bug
1. Create KVM cluster
2. Assign two KVM hosts under cluster
3. Enable HA for cluster and KVM hosts
4. Configure OOB management for KVM hosts
5. Create test VM under a KVM host with HA supported policy
6. Assign SimpleInvestigator, PingInvestigator, and KVMInvestigator under HA
investigators order.
7. Power off KVM host abruptly to simulate failure scenario.
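
For anyone reproducing this without the UI, steps 3 and 4 can be sketched as a sequence of API calls. The code below only builds the list of commands and parameters and sends nothing; the IDs, addresses, and credentials are placeholders, and the `kvmhaprovider` provider name, the `ipmitool` driver name, and port 623 are assumptions that should be checked against the environment and the API reference.

```python
def ha_setup_calls(cluster_id, host_ids, bmc):
    """Return the (command, params) pairs for steps 3 and 4, in that order.
    `bmc` maps each host id to its out-of-band address and credentials."""
    calls = [("enableHAForCluster", {"clusterid": cluster_id})]
    # Step 3: configure and enable HA on every host.
    for host_id in host_ids:
        calls.append(("configureHAForHost",
                      {"hostid": host_id, "provider": "kvmhaprovider"}))
        calls.append(("enableHAForHost", {"hostid": host_id}))
    # Step 4: configure and enable out-of-band management on every host.
    for host_id in host_ids:
        oob = bmc[host_id]
        calls.append(("configureOutOfBandManagement", {
            "hostid": host_id,
            "driver": "ipmitool",
            "address": oob["address"],
            "port": oob.get("port", 623),  # assumed default IPMI port
            "username": oob["username"],
            "password": oob["password"],
        }))
        calls.append(("enableOutOfBandManagementForHost", {"hostid": host_id}))
    return calls


# Placeholder identifiers and credentials, for illustration only.
bmc = {
    "host-1": {"address": "192.0.2.11", "username": "admin", "password": "x"},
    "host-2": {"address": "192.0.2.12", "username": "admin", "password": "x"},
}
calls = ha_setup_calls("cluster-1", ["host-1", "host-2"], bmc)
for command, params in calls:
    print(command, params)
```

Each pair could then be passed to whatever API client is in use (CloudMonkey, a signed HTTP request, and so on).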
### What to do about it?
I am not sure whether my configuration is incorrect or whether there is an
underlying issue. The behavior is confusing. Please let me know if I can
provide anything else.
Thanks!
GitHub link: https://github.com/apache/cloudstack/discussions/12139