jmsperu opened a new issue, #13089: URL: https://github.com/apache/cloudstack/issues/13089

### Description

The KVM agent's storage heartbeat scripts (`kvmheartbeat.sh`, `kvmspheartbeat.sh`) hard-code an immediate kernel-level reboot via `echo b > /proc/sysrq-trigger` when a heartbeat write to primary storage times out. This:

1. Bypasses all OS-level shutdown protections (no clean filesystem unmount, no graceful VM stop)
2. Drops ALL running VMs on the host instantly
3. Triggers an HA cascade: surviving hosts get flooded with VM restart requests
4. **Cannot be disabled or configured** without forking the script

### Affected files

- `scripts/vm/hypervisor/kvm/kvmheartbeat.sh` (around line 162)
- `scripts/vm/hypervisor/kvm/kvmspheartbeat.sh` (around line 64)

```sh
# Both scripts contain:
/usr/bin/logger -t heartbeat "...will reboot system..."
sync &
sleep 5
echo b > /proc/sysrq-trigger   # <-- hardcoded panic-reboot
```

### Reproduction

1. CloudStack 4.22.0.0 KVM hypervisor running Ubuntu 24.04
2. Primary storage: **LINSTOR with DRBD** replication (Linstor storage pool)
3. Trigger parallel DRBD resyncs across many resources (e.g. after a peer-node reboot, or `linstor resource resume-sync` after a maintenance pause)
4. While resync I/O contends with normal I/O on the same disks, the heartbeat write to its `hb` file occasionally takes longer than the hardcoded timeout
5. The host immediately force-reboots itself via sysrq, even though no actual fault exists

### Real-world impact

We hit this 5+ times within 4 hours during recovery from an unrelated incident. Cascade pattern:

- Heartbeat times out on host A → sysrq reboot → 90+ VMs dropped
- HA worker reschedules those VMs onto host B
- Host B's I/O spikes → its heartbeat times out → sysrq reboot → all its VMs dropped
- Host A comes back, repeat

Each reboot took ~3 minutes; total customer-visible outage was several hours. Patching the script line out (`echo b > /proc/sysrq-trigger` → log-only) immediately stopped the cascade.

The exact log line preceding each reboot:

```
heartbeat[68685]: kvmspheartbeat.sh will reboot system because it was unable to write the heartbeat to the storage.
```

### Affected versions

- CloudStack: **4.22.0.0** (cloudstack-agent / cloudstack-common 4.22.0.0)
- Hypervisor: Ubuntu 24.04 LTS / KVM, libvirt 10.x
- Storage: LINSTOR 1.33.2 / DRBD 9.3.1 (LINBIT public PPA)
- Reproduces on both single-cluster and multi-cluster zones

### Why this is a design issue

The assumption "heartbeat write timeout = host is dead" was reasonable for **NFS shared storage**, where transient I/O latency is rare and the only failure mode of concern is split-brain during a real network partition. With **LINSTOR/DRBD** (or really any local storage doing replication), the same disk serves application I/O, replication I/O, and heartbeat I/O, so the heartbeat can be transiently delayed without the host being dead.

A fence-on-failure mechanism shouldn't be a binary panic-button: it should at minimum be **configurable**, and ideally **graceful** by default.

### Proposed fix

Make the fence action configurable in `agent.properties`:

```properties
# Action when storage heartbeat write fails persistently
# Values: reboot | graceful-reboot | restart-agent | log-only
kvm.heartbeat.fence.action=graceful-reboot

# Number of consecutive heartbeat failures before fencing (default 1 today)
kvm.heartbeat.fence.consecutive.failures=3

# Cooldown between possible fence actions (avoid reboot loops)
kvm.heartbeat.fence.cooldown.seconds=300
```

Action semantics:

- `reboot`: current behavior (sysrq-trigger), kept for backward compat
- `graceful-reboot`: `systemctl reboot` instead of sysrq, lets VMs stop cleanly
- `restart-agent`: restart `cloudstack-agent` + `libvirtd` only; running VMs survive
- `log-only`: log + alert, take no automatic action (admin investigates)

This lets operators choose a policy appropriate for their storage backend.
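
To make the proposed semantics concrete, here is a minimal sketch of how the three settings could combine into one decision. It is an illustration only, not the existing CloudStack code: the function name `fence_decision` and its argument order are hypothetical, and it prints the command it would run instead of executing it, so it is safe to try on a live host.

```sh
# Hypothetical sketch of the proposed fence policy (not actual CloudStack code).
# fence_decision ACTION FAILURES THRESHOLD NOW LAST_FENCE COOLDOWN
#   ACTION      value of kvm.heartbeat.fence.action
#   FAILURES    consecutive heartbeat write failures observed so far
#   THRESHOLD   value of kvm.heartbeat.fence.consecutive.failures
#   NOW         current time in epoch seconds
#   LAST_FENCE  epoch seconds of the previous fence action (0 if none)
#   COOLDOWN    value of kvm.heartbeat.fence.cooldown.seconds
# Prints the command that would be run, or "none" if no fencing is due.
fence_decision() {
    action=$1; failures=$2; threshold=$3
    now=$4; last_fence=$5; cooldown=$6

    # Too few consecutive failures: treat the delay as transient.
    if [ "$failures" -lt "$threshold" ]; then
        echo "none"
        return 0
    fi

    # Still inside the cooldown window: suppress to avoid reboot loops.
    if [ $((now - last_fence)) -lt "$cooldown" ]; then
        echo "none"
        return 0
    fi

    case "$action" in
        reboot)          echo "echo b > /proc/sysrq-trigger" ;;
        graceful-reboot) echo "systemctl reboot" ;;
        restart-agent)   echo "systemctl restart libvirtd cloudstack-agent" ;;
        log-only)        echo "logger -t heartbeat 'heartbeat write failed; no action taken'" ;;
        *)               echo "unknown fence action: $action" >&2; return 1 ;;
    esac
}
```

For example, `fence_decision graceful-reboot 3 3 1000 0 300` prints `systemctl reboot`, while the same call with only one failure, or with a fence action 100 seconds earlier, prints `none`. Keeping the decision separate from the execution would also make the policy easy to unit-test.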

`log-only` would have prevented the cascade we hit; `restart-agent` would have been a reasonable middle ground.

### Workarounds (in case anyone else hits this before a fix lands)

1. **Patch both scripts** to replace `echo b > /proc/sysrq-trigger` with a `logger` line. Survives until the next `cloudstack-common` package upgrade.
2. **Move heartbeat target to host-local storage** that's not under DRBD I/O contention.
3. **Increase heartbeat timeout** via agent properties (mitigates frequency but doesn't eliminate the binary failure mode).

### Willingness to contribute

We have a working in-place patch in production right now (lines neutered, fail2ban added to protect SSH). Happy to open a follow-up PR implementing the configurable fence-action design above if there's interest.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
