[ https://issues.apache.org/jira/browse/CLOUDSTACK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585619#comment-16585619 ]
ASF subversion and git services commented on CLOUDSTACK-10310:
--------------------------------------------------------------

Commit 023dcec5ef2e38091c0aacda1e0fae67fd6c4553 in cloudstack's branch refs/heads/4.11 from Slair1
[ https://gitbox.apache.org/repos/asf?p=cloudstack.git;h=023dcec ]

CLOUDSTACK-10310 Fix KVM reboot on storage issue (#2722)

> KVM hosts reboot if there is a short transient storage error
> ------------------------------------------------------------
>
>                 Key: CLOUDSTACK-10310
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
>             Project: CloudStack
>          Issue Type: Improvement
>      Security Level: Public(Anyone can view this level - this is the default.)
>          Components: KVM
>    Affects Versions: 4.9.0, 4.10.0.0
>            Reporter: Sean Lair
>            Priority: Major
>
> If the KVM heartbeat file can't be written to, the host is rebooted, taking
> down all VMs running on it. The code does try 5 times before the reboot, but
> there is no delay between the retries, so they are effectively 5 simultaneous
> attempts, which doesn't help. Standard SAN storage HA operations or a quick
> network blip could cause this reboot to occur.
> Some discussions on the dev mailing list revealed that some people are just
> commenting out the reboot line in their version of the CloudStack source.
> A better option (and a new PR is being issued) would be to have it sleep
> between tries so they aren't 5 almost simultaneous attempts. Plus, instead of
> rebooting, the cloudstack-agent could just be stopped on the host. This will
> cause alerts to be issued, and if the host is disconnected long enough,
> depending on the HA code in use, VM HA could handle the host failure.
> The built-in reboot of the host seemed drastic.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
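
For illustration only, the retry behaviour proposed in the issue description (sleep between heartbeat attempts, then stop the agent rather than reboot the host) could be sketched roughly as below. This is not the code from the commit: the function name, the callbacks, and the 1-second delay are all hypothetical, and only the figure of 5 attempts comes from the description.

```python
import time
from typing import Callable


def write_heartbeat_with_retries(
    write_heartbeat: Callable[[], bool],   # hypothetical: returns True if the write succeeded
    on_give_up: Callable[[], None],        # hypothetical: e.g. stop the agent instead of rebooting
    max_attempts: int = 5,                 # the issue mentions 5 tries
    delay_seconds: float = 1.0,            # assumed value; the issue does not state a delay
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try the heartbeat write up to max_attempts times, pausing between tries.

    Returns True as soon as one write succeeds. If every attempt fails,
    calls on_give_up() once and returns False.
    """
    for attempt in range(1, max_attempts + 1):
        if write_heartbeat():
            return True
        # Pause only between attempts, not after the final failure.
        if attempt < max_attempts:
            sleep(delay_seconds)
    on_give_up()
    return False


# Example: storage is unavailable for the first two attempts, then recovers.
results = iter([False, False, True])
naps = []      # records the delays instead of actually sleeping
actions = []   # records whether the give-up action was triggered
ok = write_heartbeat_with_retries(
    write_heartbeat=lambda: next(results),
    on_give_up=lambda: actions.append("stop-agent"),
    sleep=naps.append,
)
```

With a pause between attempts, a transient storage error that clears within a few seconds no longer exhausts all retries at once, and the give-up action is only reached when the failure persists across the whole retry window.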