andrijapanicsb commented on PR #13377:
URL: https://github.com/apache/cloudstack/pull/13377#issuecomment-4676871111
## TL;DR --> hw/sw setup + what was tested + clock-measured timing results for VM HA to kick in:
- KVM host: HPE DL360 / iLO5 (Ubuntu 24.04 for both mgmt and KVM host)
- Driver: **ipmi** 2.0 (**yet to test Redfish driver**, which is the main
reason for this PR)
- CloudStack: stock 4.22.1.0 vs. 4.22.1.0 + this PR (fat JAR replaced)
- Primary storage: OCFS2 SharedMountPoint
### What was measured/tested:
- Functional testing (feature not broken; **Redfish is yet to be tested, which
was the reason for this patch**)
- NFS Primary storage not tested (and assuming NOT NFSv3 = no locking of
qcow2 = not an important factor/variable)
- Only partial tuning was done (see global configs below), because the focus
was on a completely different thing (and not on minimal VM downtime)
- **PR also reduced VM downtime:**
  - **down from 8 minutes to 2.5 minutes (confirmed by running the test
twice, not only once)**
- Clock-measured timing/results with AND without this patch/PR
### KVM.ha global configs changed:
| Setting | Test Value | Default Value |
|---|---|---|
| `kvm.ha.health.check.timeout` | 15 | 10 |
| `kvm.ha.activity.check.timeout` | 30 | 60 |
| `kvm.ha.activity.check.interval` | 30 | 60 |
| `kvm.ha.activity.check.max.attempts` | 5 | 10 |
| `kvm.ha.activity.check.failure.ratio` | 0.6 | 0.7 |
| `kvm.ha.degraded.max.period` | 180 | 300 |
| `kvm.ha.recover.wait.period` | 180 | 600 |
| `kvm.ha.fence.timeout` | 120 | 60 |
| **`kvm.ha.recover.failure.threshold`** | 0 | 1 |
The last setting ensures that CloudStack skips any attempt to "recover" the
host via the BMC POWER RESET command (i.e. it tries 0 times). Instead, it
fences the host immediately via the BMC POWER OFF command, since the host has
already reached the "Degraded" state and needs help (kill or fix).
- Testing premise: we don't care whether the host is recovered or stays
powered off.
- **We care about minimal VM downtime** when the host is messed up
  - i.e. once the host is declared "I'm messed up", STONITH/fence it
immediately and ensure VM-HA kicks in, instead of retrying one or more times to
reset the host and NOT triggering VM-HA. We can't guarantee that the host will
be fine after the OS reboot, so don't risk long VM downtime during the
recovery period.
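As a convenience for anyone repeating the test, the values from the table above can be turned into CloudMonkey calls. This is only a sketch: it assumes `cmk` is installed and already pointed at the management server, and that each setting is changed through the standard `updateConfiguration` API. The dictionary and the helper name `cmk_commands` are made up for illustration.

```python
# Test values copied from the table above (setting name -> value used in this test).
KVM_HA_TEST_VALUES = {
    "kvm.ha.health.check.timeout": "15",
    "kvm.ha.activity.check.timeout": "30",
    "kvm.ha.activity.check.interval": "30",
    "kvm.ha.activity.check.max.attempts": "5",
    "kvm.ha.activity.check.failure.ratio": "0.6",
    "kvm.ha.degraded.max.period": "180",
    "kvm.ha.recover.wait.period": "180",
    "kvm.ha.fence.timeout": "120",
    "kvm.ha.recover.failure.threshold": "0",
}


def cmk_commands(settings):
    """Render one `cmk update configuration` command line per setting (hypothetical helper)."""
    return [
        f"cmk update configuration name={name} value={value}"
        for name, value in settings.items()
    ]


# Print the commands for review; nothing is executed against CloudStack here.
for command in cmk_commands(KVM_HA_TEST_VALUES):
    print(command)
```

The script only prints the commands, so they can be checked before being run by hand.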
# Host HA fencing improvement: handle already-powered-off hosts and reduce HA VM restart delay
This PR addresses a Host HA fencing scenario observed during testing on a
physical environment using HPE iLO5 / BMC-based out-of-band management with
the IPMI driver (yet to test Redfish, which is the main reason for this PR).
The test environment was based on Apache CloudStack 4.22.1 with KVM. Primary
storage was configured as CloudStack shared mount point storage backed by an
OCFS2 clustered filesystem, which is now supported for Host HA. Host HA was
enabled only on a single selected host for this test.
On that host, we placed two VMs:
| VM type | HA setting | Expected behavior after host failure |
|---|---:|---|
| HA-enabled VM | Created from a compute offering with HA enabled | Should be restarted on another suitable host after fencing |
| Non-HA VM | Created from a compute offering without HA | Should remain stopped and not be restarted automatically |
A fat jar was produced from a branch based directly on the CloudStack 4.22.1
tag. The jar was extracted from the built RPM package and used for testing.
## Scenario being tested
The test intentionally simulated a somewhat unusual but important failure
scenario: the host was manually powered off through the BMC / IPMI / iLO
interface before CloudStack completed its Host HA fencing flow.
This scenario matters because, depending on the out-of-band driver
implementation, sending a power-off command to a chassis that is already
powered off may return an error (Redfish does this; IPMI is not affected) or
otherwise be interpreted as a failed fencing operation.
The important point is that CloudStack should not treat “the host is already
powered off” as a fencing failure. If the final power state is off, the host is
effectively fenced and VM HA can safely proceed.
## Logic introduced by the patch
The patched logic changes the fencing flow to be state-driven instead of
**relying only on the return status** of the BMC power-off command.
The intended behavior (after the host reaches the Degraded state) is:
1. Before sending a power-off command, query the current chassis power state.
2. If the chassis is already powered off, treat the host as already fenced.
3. If the chassis is still powered on, send the power-off command.
4. Do not rely only on the raw command return code.
5. **After the command completes, query the chassis power state again.**
6. If the chassis is confirmed powered off, mark the host as fenced / down
and allow VM HA to proceed.
7. If the chassis is still powered on, fencing should not be considered
successful.
In short: the final observed power state is what matters. If the chassis is
off, the host is fenced.
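The steps above can be sketched roughly as follows. This is not the code from the PR: `fence_host`, `PowerState` and `FakeBmc` are hypothetical names used only to illustrate the state-driven flow, with the BMC driver reduced to two callables.

```python
from enum import Enum
from typing import Callable


class PowerState(Enum):
    ON = "on"
    OFF = "off"


def fence_host(get_power_state: Callable[[], PowerState],
               power_off: Callable[[], bool]) -> bool:
    """Return True only when the chassis is confirmed powered off."""
    # Steps 1-2: an already powered-off chassis counts as fenced.
    if get_power_state() is PowerState.OFF:
        return True
    # Steps 3-4: send power-off, but do not trust its return code or errors.
    try:
        power_off()
    except Exception:
        pass  # a driver error alone does not decide the outcome
    # Steps 5-7: the power state observed afterwards is what decides.
    return get_power_state() is PowerState.OFF


class FakeBmc:
    """Stand-in BMC that errors on power-off when already off (assumed Redfish-like)."""

    def __init__(self, state: PowerState, off_works: bool = True):
        self.state = state
        self.off_works = off_works
        self.power_off_calls = 0

    def status(self) -> PowerState:
        return self.state

    def power_off(self) -> bool:
        self.power_off_calls += 1
        if self.state is PowerState.OFF:
            raise RuntimeError("chassis already off")
        if self.off_works:
            self.state = PowerState.OFF
        return self.off_works


already_off = FakeBmc(PowerState.OFF)
stuck_on = FakeBmc(PowerState.ON, off_works=False)
print(fence_host(already_off.status, already_off.power_off))  # True
print(fence_host(stuck_on.status, stuck_on.power_off))        # False
```

In the already-off case the power-off command is never sent, so a driver that rejects it cannot turn a fenced host into a fencing failure.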
## Test results
The test confirmed the expected VM HA behavior:
| Test case | Manual chassis power-off time | Host reached Alert state | Host marked Down / fenced | HA-caused "VM.START" event | Approx. time until HA restart |
|---|---:|---:|---:|---:|---:|
| Before patch | 16:00:00 | 16:02:30 | 16:07:55 | 16:07:56 | ~7m 56s |
| With patch | 16:16:00 | Not separately recorded | 16:18:39 | 16:18:40 | ~2m 40s |
Before the patch, the host reached Alert state after approximately 2 minutes
and 30 seconds, but it was not marked Down / fenced until 16:07:55. VM-HA then
fired a VM start (for the HA-enabled VM only), i.e. the VM.START event was
observed one second later, at 16:07:56. This means the HA-enabled VM
experienced roughly 8 minutes of downtime before the restart began.
VMs which are not HA-enabled were marked as down (it's debatable whether this
is "OK" behaviour: if the underlying infra dies, the user still expects their
VM to be running).
With the patched logic (replacing the fat jar), the same type of test was
repeated. The chassis was manually powered off at 16:16:00. The host was marked
Down / fenced at 16:18:39, and the HA-enabled VM start event was observed one
second later, at 16:18:40. This reduced the time before HA restart from roughly
8 minutes to roughly 2 minutes and 40 seconds.
The non-HA VM was not restarted in either case, which is the expected
behavior.
## Result
The patch reduced the observed HA VM restart delay by approximately 5
minutes and 16 seconds in this test scenario.
More importantly, it makes the fencing logic safer and more deterministic:
if the host is already powered off, CloudStack should recognize that condition
as a successful fencing state rather than waiting longer or treating the
operation as failed because the power-off command itself did not behave as
expected (as with the Redfish protocol).
This allows Host HA to proceed much sooner while still preserving the
important safety rule: VM HA should only be triggered after the host has been
confirmed powered off / fenced.
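For reference, the delay figures quoted above follow directly from the timestamps in the results table. A small check, where `elapsed` is just an illustrative helper:

```python
from datetime import datetime, timedelta


def elapsed(start: str, end: str) -> timedelta:
    """Difference between two same-day HH:MM:SS wall-clock timestamps."""
    fmt = "%H:%M:%S"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


# Manual chassis power-off -> HA-caused VM.START event, from the table.
before_patch = elapsed("16:00:00", "16:07:56")
with_patch = elapsed("16:16:00", "16:18:40")

print(before_patch)               # 0:07:56
print(with_patch)                 # 0:02:40
print(before_patch - with_patch)  # 0:05:16
```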