andrijapanicsb opened a new issue, #13376:
URL: https://github.com/apache/cloudstack/issues/13376

   # Host-HA never marks a powered-off KVM host `Down` because the fence (OOBM power-off) can't succeed against an already-off chassis; VM-HA only triggers once the dead host is powered back on
   
   ## ISSUE TYPE
   - Bug Report
   
   ## COMPONENT NAME
   ~~~
   HA (host-HA framework), Out-of-band Management (Redfish/IPMI), KVM
   ~~~
   
   ## CLOUDSTACK VERSION
   ~~~
   Confirmed present with identical (or functionally identical) logic on:
     - tag    4.22.1.0  (analyzed in detail)
     - branch 4.22      (origin/4.22 @ 21b2025c): all key files byte-identical to 4.22.1.0
     - branch main      (origin/main @ 6bc83a3c): all key files byte-identical to 4.22.1.0
     - branch 4.20      (origin/4.20 @ a3970bb1): same logic; differences are cosmetic only
                          (method rename getHostStatus() -> getHostStatusFromHAConfig(); logger formatting)
   
   There is NO 4.21 release branch upstream (release branches go 4.20 -> 4.22).
   The host-HA + OOBM fence design predates 4.20, so earlier 4.x releases are very likely affected too.
   
   Per-branch verification of the relevant elements:
     - KVMHAProvider.fence() = OOBM PowerOperation.OFF, returns resp.getSuccess(): same on 4.20 / 4.22 / main
     - FenceTask: only transitions to Fenced on success; retries Fencing otherwise: byte-identical on 4.20 / 4.22 / main
     - HAManagerImpl host-status mapping (Fenced->Down, Fencing->Disconnected): same on 4.20 (getHostStatus) and 4.22/main (getHostStatusFromHAConfig)
     - RedfishWrapper: PowerOperation.OFF -> RedfishResetCmd.GracefulShutdown: byte-identical on 4.20 / 4.22 / main
     - RedfishClient: throws unless HTTP status in 2XX (SC_OK..SC_MULTIPLE_CHOICES): byte-identical on 4.20 / 4.22 / main
   ~~~
   
   ## CONFIGURATION
   - KVM cluster with **host-HA enabled** on the hosts.
   - **Out-of-band Management enabled** per host (reproduced with the 
**Redfish** driver against Dell iDRAC; the same logic applies to the 
**ipmitool** driver).
   - VM-HA enabled (`VmHaEnabled`).
   - Primary storage: Linstor (not material — `isStorageSupportHA() == true`, 
so the legacy investigator is not the bottleneck here).
   
   ## OS / ENVIRONMENT
   - Management servers: Ubuntu 24.04, OpenJDK 21.
   - Hypervisors: KVM.
   - BMC: Dell iDRAC via Redfish (`/redfish/v1/Systems/System.Embedded.1`).
   
   ## SUMMARY
   
   When a KVM host that has host-HA + OOBM enabled is **hard powered off** 
(e.g. forced chassis-off from the BMC console, or a real power/cable failure), 
CloudStack **never transitions the host to `Down`** and therefore **never 
restarts its VMs on other hosts**. The host stays in `Alert`/`Disconnected` 
indefinitely.
   
   Root cause: the host-HA state machine only declares a host dead 
(`HAState.Fenced` → investigator `Status.Down`) **after a successful fence**, 
and the fence is implemented as an **active OOBM power-off**. Against an 
already-off chassis that power-off cannot succeed (the BMC rejects it), so the 
host is pinned in the `Fencing` state and retried forever. The investigator 
maps `Fencing` to `Status.Disconnected`, not `Status.Down`, so VM-HA is never 
invoked.
   
   The perverse result: **the VMs are only recovered once the original (dead) 
host is powered back on** — at which point the pending power-off finally 
succeeds, the host transitions to `Fenced`/`Down`, and HA restarts the VMs 
elsewhere. This defeats the purpose of HA.
   
   **All three current branches are affected by the identical issue:** the relevant code is byte-identical on `4.22` and `main`, and functionally identical on `4.20` (only a method rename and logger formatting differ). There is no `4.21` branch upstream. Per-element diff verification is in the CLOUDSTACK VERSION section above.
   
   ## STEPS TO REPRODUCE
   
   1. KVM cluster, host-HA enabled, OOBM (Redfish or ipmitool) configured and 
enabled on the hosts, VM-HA enabled. Place some HA-enabled VMs (incl. system 
VMs) on `hostA`.
   2. Forcefully power off `hostA` at the BMC (chassis power off / simulate 
power loss). The BMC itself stays reachable.
   3. Observe `hostA` in CloudStack over the next 20+ minutes.
   
   ### EXPECTED RESULTS
   - Health check fails → activity check fails → host is fenced → host marked 
`Down` → VM-HA restarts `hostA`'s VMs on other hosts within a few minutes.
   
   ### ACTUAL RESULTS
   - `hostA` remains in `Alert` (host status) with the host-HA state stuck in 
`Fencing`.
   - The OOBM **STATUS** poll correctly reports the chassis as `Off` the entire 
time, but that knowledge is never used to declare the host down.
   - The agent investigator repeatedly reports the host as `Up` (while HA state 
is `Suspect`) and then `Disconnected` (while HA state is `Fencing`) — **never 
`Down`**.
   - VMs are **not** restarted; the scheduler keeps preferring the VM's last 
host (the dead `hostA`).
   - The instant `hostA` is powered back **on**, the fence power-off finally 
succeeds → host goes `Down` → VM-HA restarts the VMs on other hosts.
   
   ## ROOT CAUSE ANALYSIS
   
   ### Decision chain (only `Fenced` yields `Down`)
   
   1. For an HA-eligible KVM host, the legacy investigator delegates to the host-HA framework:
      - `KVMInvestigator.getHostAgentStatus()` → `haManager.getHostStatusFromHAConfig(host)`
        (`plugins/hypervisors/kvm/src/main/java/com/cloud/ha/KVMInvestigator.java:81`)
   2. `HAManagerImpl.getHostStatusFromHAConfig()` maps HA state → host status
      (`server/src/main/java/org/apache/cloudstack/ha/HAManagerImpl.java:315`):
      - `Fenced` → `Status.Down`
      - `Degraded` / `Recovering` / `Fencing` → `Status.Disconnected`
      - everything else (`Available`/`Suspect`/`Checking`/`Recovered`) → `Status.Up`
   3. `AgentManagerImpl` only fires the `HostDown` event and `scheduleRestartForVmsOnHost(...)` when the investigator returns `Status.Down`
      (`engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java:1147`, `:1200`).
   
   So VM-HA for an HA-eligible KVM host requires the host-HA state machine to 
reach **`Fenced`**.
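
The mapping in step 2 can be restated as a small, self-contained sketch. This is not the CloudStack source: the class `HostStatusMappingSketch` and the helper `triggersVmHa` are hypothetical, and the enum lists only the HA states named in this report.

```java
/** Simplified model of the HA-state to host-status mapping; not the CloudStack classes. */
final class HostStatusMappingSketch {
    // Only the HA states named in this report (assumption: the real enum may have more).
    enum HAState { Available, Suspect, Checking, Degraded, Recovering, Recovered, Fencing, Fenced }

    enum Status { Up, Disconnected, Down }

    /** Mirrors the three-way mapping listed in step 2. */
    static Status statusFor(HAState state) {
        switch (state) {
            case Fenced:
                return Status.Down;
            case Degraded:
            case Recovering:
            case Fencing:
                return Status.Disconnected;
            default:
                return Status.Up;
        }
    }

    /** Step 3: VM restarts are scheduled only when the investigator reports Down. */
    static boolean triggersVmHa(HAState state) {
        return statusFor(state) == Status.Down;
    }

    public static void main(String[] args) {
        for (HAState s : HAState.values()) {
            System.out.println(s + " -> " + statusFor(s)
                    + (triggersVmHa(s) ? " (VM-HA scheduled)" : ""));
        }
    }
}
```

In this model, `Fenced` is the only state for which `triggersVmHa` returns `true`, which is why a host pinned in `Fencing` never has its VMs restarted.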
   
   ### Reaching `Fenced` requires a *successful* power-off
   
   - The state machine only goes `Fencing → Fenced` on `Event.Fenced`
     (`api/src/main/java/org/apache/cloudstack/ha/HAConfig.java:139`).
   - `FenceTask.processResult()` only fires `Event.Fenced` when the fence returned `true`; otherwise it triggers no state transition and the poll loop retries `Fencing` forever via `RetryFencing`
     (`server/src/main/java/org/apache/cloudstack/ha/task/FenceTask.java:45`; retry at `server/src/main/java/org/apache/cloudstack/ha/HAManagerImpl.java:724`).
   - The fence is an active OOBM power-off:
     `KVMHAProvider.fence()` → `outOfBandManagementService.executePowerOperation(host, PowerOperation.OFF, null)` and returns `resp.getSuccess()`
     (`plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KVMHAProvider.java:87`).
   - `executePowerOperation()` **throws** `CloudRuntimeException` whenever the driver response is not successful; it never returns `success=false`
     (`server/src/main/java/org/apache/cloudstack/outofbandmanagement/OutOfBandManagementServiceImpl.java:432`).
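
The retry behaviour in the bullets above can be reproduced with a toy model. Everything here is hypothetical (`FenceRetrySketch`, `FakeBmc`, `pollsUntilFenced`); it only assumes what this report states: a power-off against an already-off chassis is rejected, and only a successful fence moves the state from `Fencing` to `Fenced`.

```java
/** Toy model of the fence/retry loop; not the CloudStack classes. */
final class FenceRetrySketch {
    enum HAState { Fencing, Fenced }

    /** Hypothetical BMC that rejects a power-off while the chassis is already off. */
    static final class FakeBmc {
        boolean chassisOn;

        FakeBmc(boolean chassisOn) {
            this.chassisOn = chassisOn;
        }

        void powerOff() {
            if (!chassisOn) {
                throw new IllegalStateException("409 Conflict: system is already off");
            }
            chassisOn = false;
        }
    }

    /** One fence attempt: any exception counts as an unsuccessful fence. */
    static boolean fenceOnce(FakeBmc bmc) {
        try {
            bmc.powerOff();
            return true;
        } catch (RuntimeException e) {
            return false;
        }
    }

    /**
     * Polls up to maxPolls times. The host is powered back on just before
     * attempt number powerOnAtPoll (pass -1 for "never").
     * Returns the attempt that reached Fenced, or -1 if still Fencing.
     */
    static int pollsUntilFenced(FakeBmc bmc, int maxPolls, int powerOnAtPoll) {
        HAState state = HAState.Fencing;
        int poll = 0;
        while (state == HAState.Fencing && poll < maxPolls) {
            poll++;
            if (poll == powerOnAtPoll) {
                bmc.chassisOn = true; // operator powers the dead host back on
            }
            if (fenceOnce(bmc)) {
                state = HAState.Fenced; // only a successful fence advances the state
            }
        }
        return state == HAState.Fenced ? poll : -1;
    }

    public static void main(String[] args) {
        System.out.println(pollsUntilFenced(new FakeBmc(false), 300, -1));
        System.out.println(pollsUntilFenced(new FakeBmc(false), 300, 250));
    }
}
```

With the chassis off and never powered on, the model stays in `Fencing` for all 300 polls; it reaches `Fenced` only on the poll where the host comes back, matching the observed behaviour.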
   
   ### Why the power-off fails against an already-off host (Redfish)
   
   - The Redfish driver maps `PowerOperation.OFF` → `RedfishResetCmd.GracefulShutdown`
     (`plugins/outofbandmanagement-drivers/redfish/src/main/java/org/apache/cloudstack/outofbandmanagement/driver/redfish/RedfishWrapper.java:34`).
   - `RedfishClient.executeComputerSystemReset()` POSTs to `.../Actions/ComputerSystem.Reset` and throws `RedfishException` if the HTTP status is not 2XX
     (`utils/src/main/java/org/apache/cloudstack/utils/redfish/RedfishClient.java:300-312`).
   - An already-off system returns **HTTP 409 (Conflict)**: a `GracefulShutdown` is invalid because there is no running OS to shut down. 409 ∉ 2XX → `RedfishException` → `CloudRuntimeException` → `HAFenceException` → `FenceTask` sees `result=false` → **no `Fenced` transition** → stuck in `Fencing`.
   - (The ipmitool driver has the analogous failure mode: `chassis power off` against an already-off / unreachable BMC returns a non-zero exit code, judged purely by process exit status with no "already in target state" handling; see `IpmitoolWrapper.executeCommands()` → `result.isSuccess()`.)
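
The Redfish failure path above can be traced with a minimal sketch. The class and method names are local stand-ins, not the real `RedfishClient` or `OutOfBandManagementServiceImpl`; the 204 returned for a running system and the reading of "2XX" as 200 to 299 are assumptions made for illustration.

```java
/** Toy model of the Redfish power-off failure path; names are local stand-ins. */
final class PowerOffFailureSketch {
    /** Assumed reading of "2XX": 200 <= status < 300. */
    static boolean is2xx(int httpStatus) {
        return httpStatus >= 200 && httpStatus < 300;
    }

    /** Hypothetical BMC answer to a GracefulShutdown reset (204 is an assumed success code). */
    static int resetStatusFor(boolean chassisOn) {
        return chassisOn ? 204 : 409;
    }

    /** Driver layer: throws on any non-2XX status, as the report describes. */
    static void driverPowerOff(boolean chassisOn) {
        int status = resetStatusFor(chassisOn);
        if (!is2xx(status)) {
            throw new IllegalStateException(
                    "The expected HTTP status code is '2XX' but it got '" + status + "'.");
        }
    }

    /** Service layer: wraps the driver failure instead of returning success=false. */
    static void servicePowerOff(boolean chassisOn) {
        try {
            driverPowerOff(chassisOn);
        } catch (IllegalStateException e) {
            throw new RuntimeException("Failed to execute power operation", e);
        }
    }

    /** Fence as the HA task sees it: an exception is equivalent to result=false. */
    static boolean fenceResult(boolean chassisOn) {
        try {
            servicePowerOff(chassisOn);
            return true;
        } catch (RuntimeException e) {
            return false;
        }
    }

    /** Walks the cause chain to recover the original driver error. */
    static String rootCauseMessage(Throwable t) {
        while (t.getCause() != null) {
            t = t.getCause();
        }
        return t.getMessage();
    }

    public static void main(String[] args) {
        System.out.println("fence vs running host: " + fenceResult(true));
        System.out.println("fence vs powered-off host: " + fenceResult(false));
    }
}
```

The original 409 stays reachable through the cause chain (`rootCauseMessage`) even after it has been wrapped, so the real reason for the failed fence is available to whichever layer logs it.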
   
   ### Net effect
   
   The fence requires confirming an **active power-off transition**, but a host 
that is already off (precisely the case where restarting its VMs is safe) 
cannot be "powered off successfully." The safety mechanism deadlocks in exactly 
the scenario it exists to handle. VMs recover only when the dead host returns.
   
   ## LOG EVIDENCE (two-MS cluster; host `kvm-host01`, id:1, uuid `aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee`; hostnames/IPs/VM names below are anonymized examples)
   
   OOBM STATUS poll knew the chassis was off the whole time (MS #1 log):
   ~~~
   15:10:47  OutOfBandManagementServiceImpl  Transitioned out-of-band 
management power state from On to Off
             due to event: Off(Chassis Power is Off) for Host {id:1, kvm-host01}
   ~~~
   
   Investigator never returns Down — `Up` while `Suspect`, then `Disconnected` 
while `Fencing` (MS #1 log):
   ~~~
   15:07:51  KVMInvestigator was able to determine host {id:1} is in Up
             ... is considered Up (...). State: Suspect, Most recent health 
check failed.
   15:14:51  HAManagerImpl  HA: Agent [{id:1}] is disconnected. State: Fencing, 
The resource is undergoing fence operation.
   ~~~
   
   The fence itself, on the MS node that owns the HA config (MS #2 log) — 
repeated every ~4s for ~20 min:
   ~~~
   15:14:20  (first) ... it got '409'
   15:14:28  KVMHAProvider  OOBM service is not configured or enabled for this 
host {id:1} error is
             Failed to execute System power command ... 'POST' ...
             '.../Actions/ComputerSystem.Reset' ... The expected HTTP status 
code is '2XX' but it got '409'.
   15:14:28  FenceTask  Exception occurred while running FenceTask ...
             org.apache.cloudstack.ha.provider.HAFenceException ... at 
KVMHAProvider.fence(KVMHAProvider.java:99)
   ~~~
   Counts over the outage: ~618 × `409`, ~308 × `HAFenceException`, 930 × 
`Fencing` state lines, `Starting HA on ... = 1` (only at the very end).
   
   VM-HA only fires after the host is powered back on (MS #2 log):
   ~~~
   15:35:03  HighAvailabilityManagerExtImpl  Scheduling restart for VMs on host 
{id:1, kvm-host01}
   15:35:03  Host [kvm-host01 (id:1) ...] is down.  Starting HA on the 
following VMs: vm-app01 vm-app02
   ~~~
   (chassis Off→On detected ~15:35:05 in MS #1 log.)
   
   ## SECONDARY BUGS surfaced by this incident
   
   1. **Misleading error message.** Every fence failure logs `OOBM service is not configured or enabled for this host ...`, but OOBM *is* configured and working. The catch-all in `KVMHAProvider.fence()` (`plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/kvm/ha/KVMHAProvider.java:97-100`) assumes any exception means "OOBM not configured," hiding the real cause (HTTP 409 / already off). This actively misdirects troubleshooting.
   
   2. **Misleading "fencing performed" alerts.** Each *failed* fence attempt 
emits `alertType=30 — "HA Fencing of host id=1 ... performed"` because 
`FenceTask.processResult()` calls `sendAlert(resource, HAState.Fencing)` 
unconditionally regardless of `result` 
(`server/src/main/java/org/apache/cloudstack/ha/task/FenceTask.java:54`). 
Admins receive a flood of "fencing performed" alerts while fencing is in fact 
failing continuously.
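
To make the second point concrete, here is a small sketch contrasting an alert that ignores the fence result with one that reflects it. The class `FenceAlertSketch`, both method names, and the wording of the failure alert are hypothetical; only the "performed" text is taken from the alert quoted above.

```java
/** Toy comparison of alert wording; not the CloudStack alert code. */
final class FenceAlertSketch {
    /** Behaviour described above: the same text regardless of the fence result. */
    static String unconditionalAlert(long hostId, boolean fenceSucceeded) {
        return "HA Fencing of host id=" + hostId + " performed";
    }

    /** Hypothetical alternative: the wording depends on the result. */
    static String resultAwareAlert(long hostId, boolean fenceSucceeded, String driverError) {
        if (fenceSucceeded) {
            return "HA Fencing of host id=" + hostId + " performed";
        }
        return "HA Fencing of host id=" + hostId + " failed: " + driverError;
    }

    public static void main(String[] args) {
        String error = "The expected HTTP status code is '2XX' but it got '409'.";
        System.out.println(unconditionalAlert(1, false));
        System.out.println(resultAwareAlert(1, false, error));
    }
}
```

With the result-aware variant, an operator reading the alert stream during the outage would see repeated failures carrying the driver error rather than repeated "performed" messages.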
   
   ## SUGGESTED FIX (direction)
   
   Make fencing treat "host is already off" as a successful fence, and stop 
hiding the real error:
   
   1. In `KVMHAProvider.fence()`, query OOBM power **STATUS** first; if the 
chassis is already `Off`, return `true` (host is effectively fenced) instead of 
issuing a power-off that 409s. (A confirmed-off host is safe to declare fenced.)
   2. Redfish driver: treat an idempotent power-off (target state already 
reached, HTTP 409 on `GracefulShutdown`/`ForceOff` when already off) as 
success; and/or prefer `ForceOff` over `GracefulShutdown` for the HA fence path.
   3. Fix the `fence()` catch block to surface the actual driver error rather 
than "OOBM not configured."
   4. Make `FenceTask` alerts reflect actual success/failure of the fence.
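
A minimal sketch of direction 1 (check power status before issuing the power-off) follows. It is not a patch against `KVMHAProvider`: the `Oobm` interface, `FakeOobm`, and `StatusFirstFenceSketch` are hypothetical and reduce the OOBM service to the two calls the idea needs.

```java
/** Sketch of a status-first fence; the Oobm interface is a hypothetical stand-in. */
final class StatusFirstFenceSketch {
    enum PowerState { On, Off, Unknown }

    /** Hypothetical minimal view of an out-of-band management driver. */
    interface Oobm {
        PowerState status();

        void powerOff();
    }

    /** Fake driver that rejects a power-off when the chassis is already off. */
    static final class FakeOobm implements Oobm {
        private PowerState state;
        int powerOffCalls = 0;

        FakeOobm(PowerState initial) {
            this.state = initial;
        }

        @Override
        public PowerState status() {
            return state;
        }

        @Override
        public void powerOff() {
            powerOffCalls++;
            if (state == PowerState.Off) {
                throw new IllegalStateException(
                        "The expected HTTP status code is '2XX' but it got '409'.");
            }
            state = PowerState.Off;
        }
    }

    /**
     * Returns true when the host can be considered fenced.
     * A chassis already reported Off is treated as fenced without issuing a power-off.
     * Driver errors are not swallowed; they propagate with their original message.
     */
    static boolean fence(Oobm oobm) {
        if (oobm.status() == PowerState.Off) {
            return true;
        }
        oobm.powerOff();
        return true;
    }

    public static void main(String[] args) {
        FakeOobm alreadyOff = new FakeOobm(PowerState.Off);
        System.out.println(fence(alreadyOff) + ", power-off calls: " + alreadyOff.powerOffCalls);

        FakeOobm running = new FakeOobm(PowerState.On);
        System.out.println(fence(running) + ", power-off calls: " + running.powerOffCalls);
    }
}
```

One design point worth noting: the sketch only short-circuits on a positively confirmed `Off`. An `Unknown` status (for example an unreachable BMC) still goes through the normal power-off path, so the host is never declared fenced on missing information.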
   
   ## NOTES
   - Analyzed against git tag `4.22.1.0`.
   - Storage (Linstor) is not the bottleneck: 
`LinstorPrimaryDataStoreDriverImpl.isStorageSupportHA()` returns `true`, so the 
legacy KVM investigator does not short-circuit; the host-HA framework path 
(above) is in effect.
   

