Hello everyone,

I would like to open a discussion to change the default value of the agent
property `reboot.host.and.alert.management.on.heartbeat.timeout` to false.

The default behaviour of the KVM agent is to check the storage heartbeat
and, if the check times out, to read
`reboot.host.and.alert.management.on.heartbeat.timeout` and reboot the
host if it is true. The default value of this property is true.

This behaviour is independent of the host HA setting in management, and the
agent will reboot the host even if the host HA is not enabled.
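
To make the decision concrete, here is a minimal sketch of the logic as
described above. It is not the agent's actual code: the function and
parameter names are hypothetical, and only the property semantics are taken
from this mail.

```python
def should_reboot_host(heartbeat_timed_out: bool,
                       reboot_on_timeout: bool = True,
                       host_ha_enabled: bool = False) -> bool:
    """Return True if the agent would reboot the host (illustrative only).

    reboot_on_timeout stands in for the agent property
    `reboot.host.and.alert.management.on.heartbeat.timeout`,
    whose current default is true.
    """
    # host_ha_enabled is deliberately unused: per the description above,
    # the host HA setting in management does not affect this decision.
    return heartbeat_timed_out and reboot_on_timeout


# Hypothetical hosts sharing one primary storage that becomes unreachable.
hosts = ["host-1", "host-2", "host-3"]
rebooting = [h for h in hosts if should_reboot_host(heartbeat_timed_out=True)]
```

With the current default, every host in `hosts` ends up in `rebooting`,
whether or not host HA is enabled.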

Such behaviour can create several problems. If the primary storage is
temporarily not accessible, all hosts could reboot.

HCI deployments have a further issue: a temporary problem with the storage
or with the heartbeat check will cause a cyclic reboot of all hosts, which
prevents the cluster from restoring its operational state.

Note that this parameter is not part of the host HA mechanism. The
CloudStack management server has other mechanisms to reboot and fence the
host in case host HA is enabled.

Self-rebooting the host by the agent has very specific use cases, if any,
and is not suitable for typical setups. Thus, the proposal is to change
the default value to false, so that the agent reboots the host only when
the user explicitly enables it. The proposal is expected to improve the
overall availability of deployed CloudStack clouds.
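
For illustration, this is how the setting would look when written
explicitly, assuming the property lives in the agent's usual
`agent.properties` file (the file name is my assumption here; the property
name is the one discussed above). Users who want the current behaviour
after the change would set it to true instead.

```
# Do not let the agent reboot the host when the storage heartbeat times out
reboot.host.and.alert.management.on.heartbeat.timeout=false
```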

Please let me know your thoughts about the proposal.

Best regards,

Slavka
