Ok, I'm going to make a bit of noise about this. I hope you will all chip in so we
can make some progress on HA in future versions.

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro

----- Original Message -----
> From: "Simon Weller" <swel...@ena.com>
> To: dev@cloudstack.apache.org
> Sent: Friday, 9 October, 2015 23:46:26
> Subject: Re: slow nfs = reboot all hosts (((

> Andrei,
> 
> In a failure scenario you want to get rid of the problematic server as quickly
> as possible. Effectively, this action is fencing the host in question.
> 
> Nux brought up a good point earlier in this thread: ultimately we need to
> figure out a much better way of handling KVM failure conditions. The current
> 'wait until it comes back up' approach is very much flawed, and it is something
> we've been thinking about internally a lot lately.
> 
> In your case, it sounds like you might need to separate your primary and
> secondary NFS storage to avoid saturating the primary storage and ending up in
> a situation where the agent believes the primary NFS is unresponsive.
> 
> We've certainly run into situations previously where the I/O wait state was too
> high on some iSCSI-connected hosts and we saw nodes being shot due to access
> times. Our approach to fixing that was to reduce the number of VMs being run on
> those hosts and to move to higher-speed connectivity between the hosts and our
> storage (e.g. FC, 10Gb Ethernet).
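> 
> If anyone wants to check whether they are in the same boat, the usual tools show
> it quickly enough (just an illustration; the interval and mount point below are
> placeholders):
> 
>   # per-device latency/utilisation; watch 'await' and '%util' climb under load
>   iostat -x 5
>   # NFS-specific view of per-mount RTT and execution times
>   nfsiostat 5 /mnt/primary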
> 
> - Si
> 
> ________________________________________
> From: Andrei Mikhailovsky <and...@arhont.com>
> Sent: Friday, October 9, 2015 5:37 PM
> To: dev@cloudstack.apache.org
> Subject: Re: slow nfs = reboot all hosts (((
> 
> I think there should be as much of the REISUB sequence as possible when trying to
> reboot a broken server. Doing only the last 'B' bit is a bit dangerous, imho.
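> 
> For what it's worth, a script can't run the whole REISUB sequence on itself ('e'
> and 'i' would terminate the script before it reaches the later steps), but even
> just syncing and remounting read-only before the final 'b' would help. A rough
> sketch of what that could look like (the sleep values are placeholders, not
> tuned recommendations):
> 
>   echo s > /proc/sysrq-trigger   # sync dirty pages to disk
>   sleep 5
>   echo u > /proc/sysrq-trigger   # remount all filesystems read-only
>   sleep 5
>   echo b > /proc/sysrq-trigger   # hard reboot, as the script does today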
> 
> Andrei
> ----- Original Message -----
> 
> From: . "Nux!" <n...@li.nux.ro>
> To: dev@cloudstack.apache.org
> Sent: Friday, 9 October, 2015 6:53:43 PM
> Subject: Re: slow nfs = reboot all hosts (((
> 
> Andrei,
> 
> Yes, that command will just reboot without flushing anything to disk, like
> cutting power.
> It is done that way because under load many servers are slow to respond to a
> normal reboot command, if they respond at all, and that could lead to corrupted
> data and so on.
> The sysrq switch is a much better choice from this point of view.
> 
> We really need to look at a proper way of doing HA with KVM.
> 
> --
> Sent from the Delta quadrant using Borg technology!
> 
> Nux!
> www.nux.ro
> 
> ----- Original Message -----
>> From: "Andrei Mikhailovsky" <and...@arhont.com>
>> To: dev@cloudstack.apache.org
>> Sent: Friday, 9 October, 2015 16:47:46
>> Subject: Re: slow nfs = reboot all hosts (((
> 
>> Thanks guys, I am not sure how I've missed that. Probably the coffee hadn't
>> kicked in yet )))
>>
>> Anyway, am I right in saying that the host server reboot is now forced
>> without stopping services or unmounting filesystems with potentially open and
>> unsynced data, etc.?
>>
>> Isn't it rather bad and dangerous to do this simply because one of possibly
>> many NFS servers is slow or unresponsive? Not only that, but the heartbeat also
>> reboots the servers that are not running VMs with NFS volumes. In my case it
>> just rebooted every single host server.
>>
>> Very worrying indeed.
>>
>> Andrei
>>
>>
>> ----- Original Message -----
>>
>> From: "Nux!" <n...@li.nux.ro>
>> To: dev@cloudstack.apache.org
>> Sent: Friday, 9 October, 2015 12:58:19 PM
>> Subject: Re: slow nfs = reboot all hosts (((
>>
>> Hello,
>>
>> Instead of commenting out 'echo b > /proc/sysrq-trigger', and thereby also
>> disabling your HA, perhaps there's a way to tweak the timeouts to be more
>> generous with lazy NFS servers.
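>>
>> As a rough illustration of the idea (hypothetical, not the actual script logic;
>> the path, retry count and timeouts below are made up), a more forgiving check
>> would retry for a while before giving up:
>>
>>   #!/bin/sh
>>   # hypothetical sketch: retry the heartbeat write before declaring storage dead
>>   HB_FILE=/mnt/primary/hb-$(hostname)   # placeholder heartbeat file
>>   tries=5
>>   while [ $tries -gt 0 ]; do
>>       if timeout 60 sh -c "date +%s > $HB_FILE"; then
>>           exit 0                        # heartbeat written, storage is alive
>>       fi
>>       tries=$((tries - 1))
>>       sleep 10                          # give a slow NFS server time to catch up
>>   done
>>   # only after every retry fails would the fencing path run:
>>   # echo b > /proc/sysrq-trigger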
>>
>> Can you go through the logs and see what is happening before the reboot? I am
>> not sure exactly which timeout the script cares about; it's worth investigating.
>>
>> Lucian
>>
>> --
>> Sent from the Delta quadrant using Borg technology!
>>
>> Nux!
>> www.nux.ro
>>
>> ----- Original Message -----
>>> From: "Andrija Panic" <andrija.pa...@gmail.com>
>>> To: dev@cloudstack.apache.org
>>> Sent: Friday, 9 October, 2015 10:25:05
>>> Subject: Re: slow nfs = reboot all hosts (((
>>
>>> I managed this problem the following way:
>>> http://admintweets.com/cloudstack-disable-agent-rebooting-kvm-host/
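>>>
>>> In short, the workaround boils down to stopping the agent's heartbeat script
>>> from fencing the host. A minimal sketch of the idea, assuming the stock script
>>> location (the exact path can differ between distros/packages):
>>>
>>>   # /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh
>>>   # comment out the line that pulls the trigger, so the failure is only logged:
>>>   # echo b > /proc/sysrq-trigger
>>>
>>> Obviously this also disables the fencing that HA relies on, so treat it as a
>>> stop-gap rather than a fix.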
>>>
>>> Cheers
>>> On Oct 9, 2015 10:21 AM, "Andrei Mikhailovsky" <and...@arhont.com> wrote:
>>>
>>>> Hello
>>>>
>>>> My issue is that whenever my NFS server becomes slow to respond, ACS just
>>>> bloody reboots ALL host servers, not just the ones running VMs with
>>>> volumes attached to the slow NFS server. Recently, I decided to remove
>>>> some of the old snapshots to free up some disk space. I deleted about a
>>>> dozen snapshots and was monitoring the NFS server for progress. At no
>>>> point did the NFS server lose connectivity; it just became a bit slow
>>>> and loaded. By slow I mean I was still able to list files on the NFS
>>>> mount point and the SSH session was still working okay. It was just
>>>> taking a few more seconds to respond to NFS file listings, creation,
>>>> deletion, etc. However, the ACS agent rebooted every single host
>>>> server, killing all running guests and system VMs. In my case, I only
>>>> have two guests with volumes on the NFS server. The rest of the VMs run
>>>> off RBD storage. Yet all host servers were rebooted, even those which
>>>> were not running guests with NFS volumes.
>>>>
>>>> Ever since I started using ACS, it has always been pretty dumb at
>>>> correctly determining whether the NFS storage is still alive. I would say
>>>> it has done the maniac reboot-everything type of behaviour at least 5
>>>> times in the past 3 years. So, in previous versions of ACS I just
>>>> modified kvmheartbeat.sh and commented out the line with "reboot", as
>>>> these reboots were just pissing everyone off.
>>>>
>>>> After upgrading to ACS 4.5.x, that script no longer contains a "reboot"
>>>> command, and I was wondering whether it is still possible to instruct the
>>>> kvmheartbeat script not to reboot the host servers.
>>>>
>>>> Thanks for your advice.
>>>>
>>>> Andrei
