Hi Ulrich. You have summed it up exactly. The chances seem small, but in the real world (Murphy's Law, I guess) I have hit this many times, twice to the point where a production VM was mangled into garbage. The amount of free memory available on the cluster as a whole seems to make a big difference: the more there is, the greater the chance that the cluster decides to move a "dead" VM while it is rebooting.
Thanks for paying attention to this issue (not really a bug), as I am sure I am not the only one affected by it. For now I have set all my VMs to "destroy" so that the cluster is the only thing managing them, but this is not super clean: I get failures in my logs that are not really failures.

Tom

On 09/30/2013 07:56 AM, Ulrich Windl wrote:
> Hi!
>
> With Xen paravirtualization, when a VM (guest) is rebooted (e.g. via guest's
> "reboot"), the actual "VM" (which doesn't really exist as a concept in
> paravirtualization) is destroyed for a moment and then is recreated (AFAIK).
> That's why "xm console" does not survive a guest reboot, and that's why a RA
> may see the guest is gone for a moment before it's recreated.
>
> A clean fix would be in Xen to keep the guest in "xm list" during reboot.
>
> The chances to be hit by the problem are small, but when hit, the
> consequences are bad.
>
> Regards,
> Ulrich
>
>>>> Ferenc Wagner <[email protected]> wrote on 17.09.2013 at 11:38 in message
> <[email protected]>:
>> Lars Marowsky-Bree <[email protected]> writes:
>>
>>> The RA thinks the guest is gone, the cluster reacts and schedules it
>>> to be started (perhaps elsewhere); and then the hypervisor starts it
>>> locally again *too*.
>>>
>>> I think changing those libvirt settings to "destroy" could work - the
>>> cluster will then restart the guest appropriately, not the hypervisor.
>> Maybe the RA is just too picky about the reported VM state. This is one
>> of the reasons* I'm using my own RA for managing libvirt virtual
>> domains: mine does not care about the fine points, if the domain is
>> active in any state, it's running, as far as the RA is concerned, so a
>> domain reset is not a cluster event in any case.
>>
>> On the other hand, doesn't the recover action after a monitor failure
>> consist of a stop action on the original host before the new start, just
>> to make sure? Or maybe I'm confusing things...
>>
>> Regards,
>> Feri.
>>
>> * Another is that mine gets the VM definition as a parameter, not via
>> some shared filesystem.
>> _______________________________________________
>> Linux-HA mailing list
>> [email protected]
>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>> See also: http://linux-ha.org/ReportingProblems
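
For anyone wanting to try the "destroy" workaround discussed in this thread, here is a minimal sketch. It assumes the guests are libvirt domains whose XML definition carries the <on_reboot> lifecycle element; the helper name set_on_reboot_destroy and the sample guest1 definition are hypothetical and only there for illustration. With on_reboot set to destroy, a reboot issued inside the guest leaves the domain stopped, so only the cluster brings it back, at the cost of the spurious monitor failures mentioned above.

```python
import xml.etree.ElementTree as ET


def set_on_reboot_destroy(domain_xml: str) -> str:
    """Return the domain XML with <on_reboot> forced to 'destroy'.

    Hypothetical helper: it only edits the XML text and does not talk
    to libvirt or the hypervisor.
    """
    root = ET.fromstring(domain_xml)
    elem = root.find("on_reboot")
    if elem is None:
        # No lifecycle element present yet, so add one.
        elem = ET.SubElement(root, "on_reboot")
    elem.text = "destroy"
    return ET.tostring(root, encoding="unicode")


# Made-up, heavily trimmed domain definition used only as sample input.
EXAMPLE = """<domain type='xen'>
  <name>guest1</name>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
</domain>"""

if __name__ == "__main__":
    print(set_on_reboot_destroy(EXAMPLE))
```

The edited XML would still have to be loaded back into libvirt by whatever means the site already uses for domain definitions; that step is left out here on purpose.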
