Hello everyone,

I would like to open a discussion about changing the default value of the agent property `reboot.host.and.alert.management.on.heartbeat.timeout` to false.
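
For reference, this is a minimal sketch of how a host can be opted out explicitly today. It assumes the property is set in the agent's `agent.properties` file (typically `/etc/cloudstack/agent/agent.properties` on a package-based KVM install, which is an assumption about your layout) and that the agent is restarted afterwards so the new value is picked up:

```properties
# agent.properties on the KVM host (location assumed; adjust to your install)
# Do not reboot the host when the storage heartbeat check times out.
reboot.host.and.alert.management.on.heartbeat.timeout=false
```

With the proposed change, this explicit line would no longer be needed to get the safe behaviour; setting the value to true would instead be the explicit opt-in.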
The default behaviour of the KVM agent is to check the storage heartbeat and, if the check times out, to read the `reboot.host.and.alert.management.on.heartbeat.timeout` property and reboot the host if it is true. The default value of this property is true. This behaviour is independent of the host HA setting in the management server: the agent will reboot the host even if host HA is not enabled.

Such behaviour can create several problems. If the primary storage is temporarily inaccessible, all hosts could reboot. In HCI deployments there is a further issue: a temporary problem with the storage or with the heartbeat check will cause a cyclic reboot of all hosts, preventing the cluster from restoring its operational state.

Note that this parameter is not part of the host HA mechanism. The CloudStack management server has other mechanisms to reboot and fence a host when host HA is enabled. Self-rebooting the host by the agent has very specific use cases, if any, and is not suitable for typical setups.

Thus, the proposal is to change the default value to false and let users explicitly enable agent-initiated reboots if they need them. The change is expected to improve the overall availability of deployed CloudStack clouds.

Please let me know your thoughts about the proposal.

Best regards,
Slavka