Ok, I'm gonna make a bit of noise about this. Hope you guys will chip in so we can make some progress re HA in future versions.
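
To have something concrete to argue over, here is a rough sketch of the two ideas raised in the thread below: be more patient with a slow NFS server before fencing, and do at least the S and U steps of REISUB before the final B. This is not the real kvmheartbeat.sh. Every name in it (write_heartbeat, check_storage, fence_host, HB_TIMEOUT, HB_RETRIES, HB_SLEEP, DRY_RUN) is made up for illustration, and the defaults are guesses rather than tested values.

```shell
#!/bin/sh
# Hypothetical sketch only, NOT the actual CloudStack kvmheartbeat.sh.
# All variable and function names here are assumptions for illustration.

HB_TIMEOUT="${HB_TIMEOUT:-10}"   # seconds allowed for a single heartbeat write
HB_RETRIES="${HB_RETRIES:-5}"    # attempts before the storage is declared dead
HB_SLEEP="${HB_SLEEP:-5}"        # pause between attempts, in seconds

# Try once to write a timestamp to the heartbeat file, bounded by a timeout.
write_heartbeat() {
    hb_file="$1"
    timeout "$HB_TIMEOUT" sh -c 'date +%s > "$1"' sh "$hb_file"
}

# Retry the heartbeat write; succeed on the first good write,
# fail only after HB_RETRIES consecutive failures.
check_storage() {
    hb_file="$1"
    attempt=1
    while [ "$attempt" -le "$HB_RETRIES" ]; do
        if write_heartbeat "$hb_file"; then
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$HB_SLEEP"
    done
    return 1
}

# Fence the host. Dry run by default so the sketch is safe to try out.
fence_host() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "DRY RUN: would sync, remount read-only and reboot via sysrq"
        return 0
    fi
    echo s > /proc/sysrq-trigger   # S: emergency sync
    sleep 1
    echo u > /proc/sysrq-trigger   # U: remount filesystems read-only
    sleep 1
    echo b > /proc/sysrq-trigger   # B: immediate reboot
}
```

Usage would be something like `check_storage /path/to/heartbeat/file || fence_host`, with the path being whatever the agent really uses. One likely limitation: on a hard-mounted NFS share a blocked write may sit in uninterruptible sleep, so `timeout` may not be able to cut it short.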

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro

----- Original Message -----
> From: "Simon Weller" <swel...@ena.com>
> To: dev@cloudstack.apache.org
> Sent: Friday, 9 October, 2015 23:46:26
> Subject: Re: slow nfs = reboot all hosts (((

> Andrei,
>
> In a failure scenario you want to get rid of that problematic server as
> quickly as possible. Effectively this action is fencing the host in question.
>
> Nux brought up a good point earlier in this thread where ultimately we need to
> figure out a much better way of handling KVM failure conditions. The current
> 'wait until it comes back up' is very much a flawed approach and something
> we've been thinking about internally a lot lately.
>
> In your case, it sounds like you might need to separate your NFS storage for
> primary and secondary to avoid saturating the primary storage and causing a
> case where the agent believes that the primary NFS is unresponsive.
>
> We've certainly run into situations previously where the I/O wait state was
> too high on some iSCSI-connected hosts and we saw nodes being shot due to
> access times. Our approach to fixing that was to reduce the number of VMs
> being run on those hosts and move to higher speed connectivity between the
> hosts and our storage (i.e. FC, 10Gb ethernet).
>
> - Si
>
> ________________________________________
> From: Andrei Mikhailovsky <and...@arhont.com>
> Sent: Friday, October 9, 2015 5:37 PM
> To: dev@cloudstack.apache.org
> Subject: Re: slow nfs = reboot all hosts (((
>
> I think there should be as much REISUB as possible when trying to reboot a
> broken server. Doing only the last B bit is a bit dangerous imho.
>
> Andrei
> ----- Original Message -----
>
> From: "Nux!" <n...@li.nux.ro>
> To: dev@cloudstack.apache.org
> Sent: Friday, 9 October, 2015 6:53:43 PM
> Subject: Re: slow nfs = reboot all hosts (((
>
> Andrei,
>
> Yes, that command will just reboot without flushing anything to disk, like
> cutting power.
> It is done this way because many servers are slow to respond to normal reboot
> commands under load, if at all, and this could lead to corrupted data and so on.
> The sysrq switch is a much better choice from this pov.
>
> We really need to look at a proper way of doing HA with KVM.
>
> --
> Sent from the Delta quadrant using Borg technology!
>
> Nux!
> www.nux.ro
>
> ----- Original Message -----
>> From: "Andrei Mikhailovsky" <and...@arhont.com>
>> To: dev@cloudstack.apache.org
>> Sent: Friday, 9 October, 2015 16:47:46
>> Subject: Re: slow nfs = reboot all hosts (((
>
>> Thanks guys, I am not sure how I've missed that. Probably the coffee didn't
>> kick in yet )))
>>
>> Anyway, am I right in saying that the host server reboot is now forced
>> without stopping the services, unmounting filesystems with potentially open
>> and unsynced data, etc?
>>
>> Isn't this rather bad and dangerous to perform simply because one of possibly
>> many nfs servers is slow/unresponsive? Not only that, the heartbeat also
>> reboots the servers that are not running vms with nfs volumes? In
>> my case it just rebooted every single host server.
>>
>> Very worrying indeed.
>>
>> Andrei
>>
>>
>> ----- Original Message -----
>>
>> From: "Nux!" <n...@li.nux.ro>
>> To: dev@cloudstack.apache.org
>> Sent: Friday, 9 October, 2015 12:58:19 PM
>> Subject: Re: slow nfs = reboot all hosts (((
>>
>> Hello,
>>
>> Instead of commenting out 'echo b > /proc/sysrq-trigger' and also disabling
>> your HA at the same time, perhaps there's a way to tweak the timeouts to be
>> more generous with lazy NFS servers.
>>
>> Can you go through the logs and see what is happening before the reboot? I am
>> not sure exactly which timeout the script cares about, worth investigating.
>>
>> Lucian
>>
>> --
>> Sent from the Delta quadrant using Borg technology!
>>
>> Nux!
>> www.nux.ro
>>
>> ----- Original Message -----
>>> From: "Andrija Panic" <andrija.pa...@gmail.com>
>>> To: dev@cloudstack.apache.org
>>> Sent: Friday, 9 October, 2015 10:25:05
>>> Subject: Re: slow nfs = reboot all hosts (((
>>
>>> I managed this problem the following way:
>>> http://admintweets.com/cloudstack-disable-agent-rebooting-kvm-host/
>>>
>>> Cheers
>>> On Oct 9, 2015 10:21 AM, "Andrei Mikhailovsky" <and...@arhont.com> wrote:
>>>
>>>> Hello
>>>>
>>>> My issue is whenever my nfs server becomes slow to respond, ACS just
>>>> bloody reboots ALL host servers, not just the ones running vms with
>>>> volumes attached to the slow nfs server. Recently, I decided to remove
>>>> some of the old snapshots to free up some disk space. I deleted about a
>>>> dozen snapshots and I was monitoring the nfs server for progress. At no
>>>> point did the nfs server lose connectivity, it just became a bit slow
>>>> and under load. By slow I mean I was still able to list files on the nfs
>>>> mount point and the ssh session was still working okay. It was just taking
>>>> a few more seconds to respond when it comes to nfs file listings, creation,
>>>> deletion, etc. However, the ACS agent just rebooted every single host
>>>> server, killing all running guests and system vms. In my case, I only have
>>>> two guests with volumes on the nfs server. The rest of the vms are running
>>>> off rbd storage. Yet, all host servers were rebooted, even those which were
>>>> not running guests with nfs volumes.
>>>>
>>>> Ever since I started using ACS, it has always been pretty dumb at correctly
>>>> determining if the nfs storage is still alive. I would say it has done the
>>>> maniac reboot-everything type of behaviour at least 5 times in the past 3
>>>> years. So, in the previous versions of ACS I just modified
>>>> kvmheartbeat.sh and hashed out the line with "reboot" as these reboots were
>>>> just pissing everyone off.
>>>>
>>>> After upgrading to ACS 4.5.x that script has no reboot command and I was
>>>> wondering if it is still possible to instruct the kvmheartbeat script not
>>>> to reboot the host servers?
>>>>
>>>> Thanks for your advice.
>>>>
>
>
>> Andrei