jmsperu opened a new issue, #13089:
URL: https://github.com/apache/cloudstack/issues/13089

   ### Description
   
   The KVM agent's storage heartbeat scripts (`kvmheartbeat.sh`, 
`kvmspheartbeat.sh`) hard-code an immediate kernel-level reboot via `echo b > 
/proc/sysrq-trigger` when a heartbeat write to primary storage times out. This:
   
   1. Bypasses all OS-level shutdown protections (no clean filesystem unmount, 
no graceful VM stop)
   2. Drops ALL running VMs on the host instantly
   3. Triggers HA cascade — surviving hosts get flooded with VM restart requests
   4. **Cannot be disabled or configured** without forking the script
   
   ### Affected files
   
   - `scripts/vm/hypervisor/kvm/kvmheartbeat.sh` (around line 162)
   - `scripts/vm/hypervisor/kvm/kvmspheartbeat.sh` (around line 64)
   
   ```sh
   # Both scripts contain:
   /usr/bin/logger -t heartbeat "...will reboot system..."
   sync &
   sleep 5
   echo b > /proc/sysrq-trigger     # <-- hardcoded panic-reboot
   ```
   
   ### Reproduction
   
   1. CloudStack 4.22.0.0 KVM hypervisor running Ubuntu 24.04
   2. Primary storage: **LINSTOR with DRBD** replication (Linstor storage pool)
   3. Trigger parallel DRBD resyncs across many resources (e.g. after a 
peer-node reboot, or `linstor resource resume-sync` after a maintenance pause)
   4. While resync I/O contends with normal I/O on the same disks, the 
heartbeat write to its `hb` file occasionally takes longer than the hardcoded 
timeout
   5. The host immediately force-reboots itself via sysrq, even though no 
actual fault exists
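
   The failing step in (4) boils down to a timed write-plus-sync against the primary-storage mount. A minimal sketch of that path, safe to run as-is (the `HB_FILE` location and the 5-second budget here are illustrative stand-ins, not the agent's actual values):

   ```sh
   #!/bin/sh
   # Sketch: the heartbeat write is a small timed write + sync. Under DRBD
   # resync contention this can exceed the budget with the host perfectly alive.
   HB_FILE="${HB_FILE:-$(mktemp)}"   # illustrative; the real path is on primary storage

   if timeout 5 sh -c "date +%s > '$HB_FILE' && sync '$HB_FILE'"; then
       echo "heartbeat ok"
   else
       # this is the branch that currently ends in `echo b > /proc/sysrq-trigger`
       echo "heartbeat write exceeded budget"
   fi
   ```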
   
   ### Real-world impact
   
   We hit this 5+ times within 4 hours during recovery from an unrelated 
incident. Cascade pattern:
   
   - Heartbeat times out on host A → sysrq reboot → 90+ VMs dropped
   - HA worker reschedules those VMs onto host B
   - Host B's I/O spikes → its heartbeat times out → sysrq reboot → all its VMs 
dropped
   - Host A comes back, repeat
   
   Each reboot took ~3 minutes; total customer-visible outage was several 
hours. Patching the script line out (`echo b > /proc/sysrq-trigger` → log-only) 
immediately stopped the cascade.
   
   The exact log line preceding each reboot:
   
   ```
   heartbeat[68685]: kvmspheartbeat.sh will reboot system because it was unable 
to write the heartbeat to the storage.
   ```
   
   ### Affected versions
   
   - CloudStack: **4.22.0.0** (cloudstack-agent / cloudstack-common 4.22.0.0)
   - Hypervisor: Ubuntu 24.04 LTS / KVM, libvirt 10.x
   - Storage: LINSTOR 1.33.2 / DRBD 9.3.1 (LINBIT public PPA)
   - Reproduces on both single-cluster and multi-cluster zones
   
   ### Why this is a design issue
   
   The assumption "heartbeat write timeout = host is dead" was reasonable for 
**NFS shared storage** where transient I/O latency is rare and the only failure 
mode of concern is split-brain during a real network partition.
   
   With **LINSTOR/DRBD** (or really any local storage doing replication), the 
same disk serves application I/O, replication I/O, and heartbeat I/O — 
heartbeat can be transiently delayed without the host being dead. A 
fence-on-failure mechanism shouldn't be a binary panic-button: it should at 
minimum be **configurable**, and ideally **graceful** by default.
   
   ### Proposed fix
   
   Make the fence action configurable in `agent.properties`:
   
   ```properties
   # Action when storage heartbeat write fails persistently
   # Values: reboot | graceful-reboot | restart-agent | log-only
   kvm.heartbeat.fence.action=graceful-reboot
   
   # Number of consecutive heartbeat failures before fencing (default 1 today)
   kvm.heartbeat.fence.consecutive.failures=3
   
   # Cooldown between possible fence actions (avoid reboot loops)
   kvm.heartbeat.fence.cooldown.seconds=300
   ```
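
   Of these, the consecutive-failure threshold could be tracked across script invocations with a small state file; a sketch, with an illustrative state path and threshold (not proposed defaults):

   ```sh
   #!/bin/sh
   # Sketch: count consecutive heartbeat failures across invocations and only
   # fence once the threshold is reached. Path and threshold are illustrative.
   STATE="${STATE:-/tmp/hb-failure-count}"
   THRESHOLD="${THRESHOLD:-3}"

   count=$(( $(cat "$STATE" 2>/dev/null || echo 0) + 1 ))
   echo "$count" > "$STATE"

   if [ "$count" -ge "$THRESHOLD" ]; then
       echo "fence: $count consecutive heartbeat failures"
       echo 0 > "$STATE"   # reset counter after taking the fence action
   else
       echo "tolerated: failure $count of $THRESHOLD"
   fi
   ```

   A successful heartbeat write would reset the counter to 0, so only genuinely persistent failures reach the fence action.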
   
   Action semantics:
   
   - `reboot` — current behavior (sysrq-trigger), kept for backward compat
   - `graceful-reboot` — `systemctl reboot` instead of sysrq, lets VMs stop 
cleanly
   - `restart-agent` — restart `cloudstack-agent` + `libvirtd` only; running 
VMs survive
   - `log-only` — log + alert, take no automatic action (admin investigates)
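
   The four actions above could dispatch from a single `case` statement that replaces the hardcoded sysrq line in both scripts. A sketch (the `log-only` default here just keeps the sketch safe to run; per the proposal the shipped default would be `graceful-reboot`):

   ```sh
   #!/bin/sh
   # Sketch of a fence-action dispatcher; the action value would come from
   # agent.properties (kvm.heartbeat.fence.action).
   fence_action() {
       case "$1" in
           reboot)
               # current behavior, kept for backward compat
               sync &
               sleep 5
               echo b > /proc/sysrq-trigger
               ;;
           graceful-reboot)
               systemctl reboot
               ;;
           restart-agent)
               systemctl restart cloudstack-agent libvirtd
               ;;
           log-only|*)
               echo "heartbeat fence suppressed (action=$1); admin action required"
               ;;
       esac
   }

   fence_action "${FENCE_ACTION:-log-only}"
   ```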
   
   This lets operators choose a policy appropriate for their storage backend. 
`log-only` would have prevented the cascade we hit; `restart-agent` would have 
been a reasonable middle ground.
   
   ### Workarounds (in case anyone else hits this before a fix lands)
   
   1. **Patch both scripts** to replace `echo b > /proc/sysrq-trigger` with a 
`logger` line. The patch lasts only until the next `cloudstack-common` package 
upgrade restores the stock scripts.
   2. **Move heartbeat target to host-local storage** that's not under DRBD I/O 
contention.
   3. **Increase heartbeat timeout** via agent properties (mitigates frequency 
but doesn't eliminate the binary failure mode).
   
   ### Willingness to contribute
   
   We have a working in-place patch in production right now (lines neutered, 
fail2ban added to protect SSH). Happy to open a follow-up PR implementing the 
configurable fence-action design above if there's interest.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
