Hi Ulrich.  You have summed it up exactly.  The chances seem small, but
in the real world (Murphy's Law, I guess) I have hit this many times,
twice badly enough to mangle a production VM into garbage.  The amount
of free memory available on the cluster as a whole seems to make a big
difference: the more there is, the greater the chance that the cluster
decides to move a seemingly dead VM while it is rebooting.
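
For illustration only, here is a minimal sketch of a monitor check that
rides out the short window in which the guest is missing from the
hypervisor's list.  It assumes a hypothetical probe() callable that
returns the domain's state string, or None when the domain is not
listed.  The function name, retry count and delay are made up for this
example and are not taken from any existing RA.

```python
import time
from typing import Callable, Optional


def domain_is_up(
    probe: Callable[[], Optional[str]],
    attempts: int = 3,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the domain shows up within `attempts` polls.

    `probe` is a hypothetical callable: it returns the domain's state
    string, or None when the hypervisor does not list the domain.
    Any listed state counts as "up", so a reboot or reset that only
    hides the domain briefly is not reported as a failure.
    """
    for attempt in range(attempts):
        if probe() is not None:
            return True
        # Wait before the next poll, but not after the last one.
        if attempt < attempts - 1:
            sleep(delay)
    return False


if __name__ == "__main__":
    # Simulated reboot: the guest is missing for two polls, then returns.
    states = iter([None, None, "running"])
    print(domain_is_up(lambda: next(states), delay=0.1))
```

The obvious trade-off is that a guest that is really gone is detected
up to (attempts - 1) * delay seconds later.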

Thanks for paying attention to this issue (not really a bug); I am sure
I am not the only one who runs into it.  For now I have set all my VMs
to "destroy" so that the cluster is the only thing managing them, but
this is not very clean, as I get failures in my logs that are not really
failures.
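
As a rough illustration, and assuming the libvirt settings in question
are the lifecycle elements of the domain XML, the change looks
something like the fragment below.  Which of the three elements need
changing depends on the setup; on_reboot is the one that matters for
the reboot case discussed here.

```xml
<!-- Fragment of a libvirt domain definition (sketch, not a full file). -->
<!-- With "destroy", the hypervisor does not bring the guest back itself; -->
<!-- the cluster notices the missing guest and restarts it. -->
<on_poweroff>destroy</on_poweroff>
<on_reboot>destroy</on_reboot>
<on_crash>destroy</on_crash>
```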

Tom


On 09/30/2013 07:56 AM, Ulrich Windl wrote:
> Hi!
>
> With Xen paravirtualization, when a VM (guest) is rebooted (e.g. via the guest's 
> "reboot"), the actual "VM" (which doesn't really exist as a concept in 
> paravirtualization) is destroyed for a moment and then recreated (AFAIK). 
> That's why "xm console" does not survive a guest reboot, and that's why an RA 
> may see the guest as gone for a moment before it's recreated.
>
> A clean fix would be in Xen to keep the guest in "xm list" during reboot.
>
> The chances of being hit by the problem are small, but when it hits, the 
> consequences are bad.
>
> Regards,
> Ulrich
>
>>>> Ferenc Wagner <[email protected]> wrote on 17.09.2013 at 11:38 in message
> <[email protected]>:
>> Lars Marowsky-Bree <[email protected]> writes:
>>
>>> The RA thinks the guest is gone, the cluster reacts and schedules it
>>> to be started (perhaps elsewhere); and then the hypervisor starts it
>>> locally again *too*.
>>>
>>> I think changing those libvirt settings to "destroy" could work - the
>>> cluster will then restart the guest appropriately, not the hypervisor.
>> Maybe the RA is just too picky about the reported VM state.  This is one
>> of the reasons* I'm using my own RA for managing libvirt virtual
>> domains: mine does not care about the fine points.  If the domain is
>> active in any state, it's running as far as the RA is concerned, so a
>> domain reset is not a cluster event in any case.
>>
>> On the other hand, doesn't the recover action after a monitor failure
>> consist of a stop action on the original host before the new start, just
>> to make sure?  Or maybe I'm confusing things...
>>
>> Regards,
>> Feri.
>>
>> * Another is that mine gets the VM definition as a parameter, not via
>>   some shared filesystem.
>> _______________________________________________
>> Linux-HA mailing list
>> [email protected] 
>> http://lists.linux-ha.org/mailman/listinfo/linux-ha 
>> See also: http://linux-ha.org/ReportingProblems 
