jmsperu opened a new pull request, #13090:
URL: https://github.com/apache/cloudstack/pull/13090

   ## Description
   
   Fixes #13089
   
   The KVM agent's storage heartbeat scripts (`kvmheartbeat.sh` and 
`kvmspheartbeat.sh`) hard-code an immediate kernel-level reboot via `echo b > 
/proc/sysrq-trigger` when a heartbeat write to primary storage times out.
   
   This works fine for NFS-backed primary storage where transient I/O latency 
is rare, but causes **false-positive host fencing** on LINSTOR/DRBD (and any 
replicated local storage), because the same disk simultaneously serves 
application I/O, replication I/O and heartbeat I/O. A normal DRBD resync I/O 
burst can transiently delay the heartbeat write enough to trip the fence — and 
the host is force-rebooted with no real fault.
   
   We hit this in production on 4.22.0.0 multiple times during a single 
incident; each false-positive sysrq drops every running VM on the host and 
cascades onto the surviving peer.
   
   ## Change
   
   Adds a new agent property `kvm.heartbeat.fence.action` (read by both 
heartbeat scripts directly from `/etc/cloudstack/agent/agent.properties`):
   
   | Value | Behavior |
   |---|---|
   | `reboot` (default) | Original behavior: `echo b > /proc/sysrq-trigger` |
   | `graceful-reboot` | `systemctl reboot` — allows running VMs to stop 
cleanly |
   | `restart-agent` | Restart `cloudstack-agent` only; running VMs preserved |
   | `log-only` | Log + alert, no automatic action (admin investigates) |
   
   Default is `reboot` so existing deployments keep current behavior. Operators 
on replicated-storage backends can pick a less destructive action.
   
   The existing `reboot.host.and.alert.management.on.heartbeat.timeout` boolean 
continues to work unchanged as a complete Java-side bypass — this PR is 
additive.
   
   ## Files changed
   
   - `scripts/vm/hypervisor/kvm/kvmheartbeat.sh` — read the property, dispatch 
on action
   - `scripts/vm/hypervisor/kvm/kvmspheartbeat.sh` — same
   - `agent/conf/agent.properties` — document the new property
   - `agent/src/main/java/com/cloud/agent/properties/AgentProperties.java` — 
add Java-side property entry for tooling/discoverability
   
   ## Backward compatibility
   
   - Default action is `reboot`, identical to current behavior
   - Property is read with `tail -n 1` so duplicate entries take the last value
   - If the property file is unreadable or the value is unrecognized, falls 
back to `reboot`
   - No Java-side runtime change — the existing boolean 
(`reboot.host.and.alert.management.on.heartbeat.timeout`) continues to work as 
before
   
   ## Testing
   
   - `reboot` (default) — verified produces same output as before via `bash -x` 
trace; sysrq path unchanged
   - `log-only` — verified the script exits 0 with logger entry, no 
reboot/agent-restart attempted
   - `restart-agent` — verified `systemctl restart cloudstack-agent` is invoked
   - `graceful-reboot` — verified `systemctl reboot` is invoked instead of sysrq
   
   In production we have been running with the fence path neutered (equivalent 
to `log-only`) for several hours since the incident, with no impact on cluster 
health — the host stays up while DRBD resyncs background-complete normally, and 
the previous false-positive cascade has not recurred.
   
   ## Related
   
   - Issue: #13089
   - Affected versions: 4.22.0.0 (likely earlier; the script section is 
unchanged for many releases)
   - Triggered by: LINSTOR/DRBD primary storage with active resyncs, but 
applies to any replicated local storage
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to