I have KVM Host HA enabled and power has been lost to one of the compute
nodes. The host has its state marked as Alert, and the HA state goes from
Degraded to Suspect to Fencing.
The problem is that the host is never fenced: there is no power to it, so none
of the OOBM commands work, which means the VMs are never migrated.
From the management server logs:
2019-03-04 11:02:48,288 WARN [o.a.c.h.t.BaseHATask] (pool-6-thread-9:null) (logid:d0a19f20) Exception occurred while running FenceTask on a resource: org.apache.cloudstack.ha.provider.HAFenceException: OOBM service is not configured or enabled for this host dcp-cscn2.local
org.apache.cloudstack.ha.provider.HAFenceException: OOBM service is not configured or enabled for this host dcp-cscn2.local
    at org.apache.cloudstack.kvm.ha.KVMHAProvider.fence(KVMHAProvider.java:99)
    at org.apache.cloudstack.kvm.ha.KVMHAProvider.fence(KVMHAProvider.java:42)
    at org.apache.cloudstack.ha.task.FenceTask.performAction(FenceTask.java:42)
    at org.apache.cloudstack.ha.task.BaseHATask$1.call(BaseHATask.java:86)
    at org.apache.cloudstack.ha.task.BaseHATask$1.call(BaseHATask.java:83)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: com.cloud.utils.exception.CloudRuntimeException: Out-of-band Management action (OFF) on host (b53122bc-1446-4ffd-a179-e363ad0d541f) failed with error: Get Auth Capabilities error
Error issuing Get Channel Authentication Capabilities request
Error: Unable to establish IPMI v2 / RMCP+ session
    at org.apache.cloudstack.outofbandmanagement.OutOfBandManagementServiceImpl.executePowerOperation(OutOfBandManagementServiceImpl.java:423)
    at sun.reflect.GeneratedMethodAccessor225.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    ... 21 more
This raises the question: how is Host HA meant to work for a host whose power
has failed?
If I turn off KVM Host HA and set the ping interval to 30 and the ping timeout
to 2, then the VMs fail over to another host within 5 minutes.
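For reference, a rough sketch of the arithmetic behind those two settings. It assumes the ping timeout acts as a multiplier on the ping interval (that is how I read the global setting description), and the function name is purely illustrative, not anything from CloudStack itself. It only covers how long the management server waits before treating the agent as timed out; the rest of the 5 minutes would be investigation and VM restart time, which I have not broken down.

```python
def agent_timeout_seconds(ping_interval_s, ping_timeout_multiplier):
    """Seconds of missed pings before the agent is treated as timed out.

    Assumption: ping timeout is a multiplier applied to ping interval.
    """
    return ping_interval_s * ping_timeout_multiplier


# Values from my test: ping interval 30, ping timeout 2.
print(agent_timeout_seconds(30, 2))
```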
I understand what Host HA is meant for, but it does not seem to work for a host
that has failed because of a loss of power.
Jon