jmsperu opened a new pull request, #13090: URL: https://github.com/apache/cloudstack/pull/13090
## Description

Fixes #13089

The KVM agent's storage heartbeat scripts (`kvmheartbeat.sh` and `kvmspheartbeat.sh`) hard-code an immediate kernel-level reboot via `echo b > /proc/sysrq-trigger` when a heartbeat write to primary storage times out. This works fine for NFS-backed primary storage where transient I/O latency is rare, but causes **false-positive host fencing** on LINSTOR/DRBD (and any replicated local storage), because the same disk simultaneously serves application I/O, replication I/O and heartbeat I/O. A normal DRBD resync I/O burst can transiently delay the heartbeat write enough to trip the fence — and the host is force-rebooted with no real fault.

We hit this in production on 4.22.0.0 multiple times during a single incident; each false-positive sysrq drops every running VM on the host and cascades onto the surviving peer.

## Change

Adds a new agent property `kvm.heartbeat.fence.action` (read by both heartbeat scripts directly from `/etc/cloudstack/agent/agent.properties`):

| Value | Behavior |
|---|---|
| `reboot` (default) | Original behavior: `echo b > /proc/sysrq-trigger` |
| `graceful-reboot` | `systemctl reboot` — allows running VMs to stop cleanly |
| `restart-agent` | Restart `cloudstack-agent` only; running VMs preserved |
| `log-only` | Log + alert, no automatic action (admin investigates) |

Default is `reboot` so existing deployments keep current behavior. Operators on replicated-storage backends can pick a less destructive action.

The existing `reboot.host.and.alert.management.on.heartbeat.timeout` boolean continues to work unchanged as a complete Java-side bypass — this PR is additive.

## Files changed

- `scripts/vm/hypervisor/kvm/kvmheartbeat.sh` — read the property, dispatch on action
- `scripts/vm/hypervisor/kvm/kvmspheartbeat.sh` — same
- `agent/conf/agent.properties` — document the new property
- `agent/src/main/java/com/cloud/agent/properties/AgentProperties.java` — add Java-side property entry for tooling/discoverability

## Backward compatibility

- Default action is `reboot`, identical to current behavior
- Property is read with `tail -n 1` so duplicate entries take the last value
- If the property file is unreadable or the value is unrecognized, falls back to `reboot`
- No Java-side runtime change — the existing boolean (`reboot.host.and.alert.management.on.heartbeat.timeout`) continues to work as before

## Testing

- `reboot` (default) — verified it produces the same output as before via `bash -x` trace; sysrq path unchanged
- `log-only` — verified the script exits 0 with a logger entry, no reboot/agent-restart attempted
- `restart-agent` — verified `systemctl restart cloudstack-agent` is invoked
- `graceful-reboot` — verified `systemctl reboot` is invoked instead of sysrq

In production we have been running with the fence path neutered (equivalent to `log-only`) for several hours since the incident, with no impact on cluster health — the host stays up while DRBD resyncs complete normally in the background, and the previous false-positive cascade has not recurred.

## Related

- Issue: #13089
- Affected versions: 4.22.0.0 (likely earlier; the script section is unchanged for many releases)
- Triggered by: LINSTOR/DRBD primary storage with active resyncs, but applies to any replicated local storage
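The property lookup and fallback rules the PR describes, as an added illustration that is not the PR's own script change: a bash function that reads `kvm.heartbeat.fence.action` from an agent.properties-style file with last-value-wins semantics and falls back to `reboot` when the file is unreadable or the value is unrecognized, plus a hypothetical `fence_command_for` helper that maps each action to the command from the table above. The key and default path are taken from the PR; the helper name, the sample properties file and the logger message are assumptions.

```bash
#!/usr/bin/env bash
# Added illustration of the lookup semantics described in the PR; not the
# PR's own script code. Property key and file path are taken from the PR.

# Resolve the fence action from an agent.properties-style file.
# Last occurrence wins (the PR uses `tail -n 1`); an unreadable file or an
# unrecognized value falls back to "reboot", preserving the original behavior.
resolve_fence_action() {
    local props="${1:-/etc/cloudstack/agent/agent.properties}"
    local action="reboot"
    local raw=""
    if [ -r "$props" ]; then
        # Key anchored at line start so commented-out copies are ignored;
        # spaces around '=' are tolerated, then stripped.
        raw=$(grep -E '^[[:space:]]*kvm\.heartbeat\.fence\.action[[:space:]]*=' "$props" \
            | tail -n 1 | cut -d= -f2- | tr -d '[:space:]')
        if [ -n "$raw" ]; then action="$raw"; fi
    fi
    case "$action" in
        reboot|graceful-reboot|restart-agent|log-only) ;;
        *) action="reboot" ;;
    esac
    printf '%s\n' "$action"
}

# Map an action to the command the heartbeat script would run on timeout
# (hypothetical helper). The command is returned as a string, not executed,
# so the dispatch can be exercised without fencing the host.
fence_command_for() {
    case "$1" in
        reboot)          printf '%s\n' 'echo b > /proc/sysrq-trigger' ;;
        graceful-reboot) printf '%s\n' 'systemctl reboot' ;;
        restart-agent)   printf '%s\n' 'systemctl restart cloudstack-agent' ;;
        log-only)        printf '%s\n' 'logger -t kvmheartbeat "heartbeat timeout, fence action is log-only"' ;;
        *)               return 1 ;;
    esac
}

# Demo on a constructed properties file: duplicate key, last value wins.
demo() {
    local tmp
    tmp=$(mktemp)
    printf '%s\n' '# constructed sample' 'kvm.heartbeat.fence.action=reboot' \
        'kvm.heartbeat.fence.action = log-only' > "$tmp"
    fence_command_for "$(resolve_fence_action "$tmp")"
    rm -f "$tmp"
}
demo
```

Running the file prints the `logger` command for the constructed sample, since the second, `log-only` entry wins over the first.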
