I created a bug covering this: https://bugzilla.redhat.com/show_bug.cgi?id=1266099
----- Original Message -----
> From: "Martin Sivak" <[email protected]>
> To: "Michael Hölzl" <[email protected]>
> Cc: "Martin Perina" <[email protected]>, [email protected]
> Sent: Thursday, September 24, 2015 2:59:52 PM
> Subject: Re: [ovirt-users] HA - Fencing not working when host with engine gets shutdown
>
> Hi Michael,
>
> Martin summed the situation up neatly. I would just add that this issue is not limited to the size of your setup. The same would happen to HA VMs running on the same host as the hosted engine even if the cluster had 50 hosts...
>
> About the recommended way of engine deployment: it really comes down to whether you can tolerate your engine being down for a longer time (starting another host using a backup db).
>
> Hosted engine restores your management in an automated way and without any data loss. However, I agree that having to tend to your HA VMs manually after an engine restart is not nice. Fortunately, that should only happen when your host (or vdsm) dies and does not come up for an extended period of time.
>
> The summary would be: there will be no HA handling if the host running the engine is down, independently of whether the deployment is hosted engine or standalone engine. If the issue is related to the software only, then there is no real difference.
>
> - When a host with the standalone engine dies, the VMs are fine, but if anything happens while the engine is down (and reinstalling a standalone engine takes time + you need a very fresh db backup) you might again face issues with HA VMs being down or not starting when the engine comes up.
>
> - When a hosted engine dies because of a host failure, some VMs generally disappear with it. The engine will come up automatically, and HA VMs from the original host have to be manually pushed to work. This requires some manual action, but I see it as less demanding than the first case.
> - When a hosted engine VM is stopped properly by the tooling, it will be restarted elsewhere and will be able to connect to the original host just fine. The engine will then make sure that all HA VMs are up even if the VMs died while the engine was down.
>
> So I would recommend a hosted engine based deployment. And I ask for a bit of patience, as we have a plan to mitigate the second case to some extent without compromising the fencing storm prevention.
>
> Best regards
>
> --
> Martin Sivak
> [email protected]
> SLA RHEV-M
>
>
> On Thu, Sep 24, 2015 at 2:31 PM, Michael Hölzl <[email protected]> wrote:
> > Ok, thanks!
> >
> > So, I would still like to know whether you would recommend not using hosted engines but rather a separate machine for the engine?
> >
> > On 09/24/2015 01:24 PM, Martin Perina wrote:
> >>
> >> ----- Original Message -----
> >>> From: "Michael Hölzl" <[email protected]>
> >>> To: "Martin Perina" <[email protected]>, "Eli Mesika" <[email protected]>
> >>> Cc: "Doron Fediuck" <[email protected]>, [email protected]
> >>> Sent: Thursday, September 24, 2015 12:35:13 PM
> >>> Subject: Re: [ovirt-users] HA - Fencing not working when host with engine gets shutdown
> >>>
> >>> Hi,
> >>>
> >>> thanks for the detailed answer! In principle, I understand the issue now. However, I cannot fully follow the argument that this is a corner case. In a smaller or medium sized company, I would assume that such a setup, consisting of two machines with a hosted engine, is not uncommon. Especially as there is documentation online which describes how to deploy this setup. Does that mean that hosted engines are in general not recommended?
> >>>
> >>> I am also wondering why the fencing could not be triggered by the hosted engine after the DisableFenceAtStartupInSec timeout? In the events log of the engine I keep on getting the message "Host hosted_engine_2 is not responding.
> >>> It will stay in Connecting state for a grace period of 120 seconds and after that an attempt to fence the host will be issued.", which would indicate that the engine is actually trying to fence the non responsive host.
> >> Unfortunately this is a bit of a misleading message: it's shown every time we start handling a network exception for the host, and it's fired before the logic which decides to start/skip the fencing process (this misleading message is fixed in 3.6). But in the current logic we really execute fencing only when the host status is about to change from Connecting to NonResponsive, and this happens only the 1st time, while we are still in the DisableFenceAtStartupInSec interval. During all other attempts the host is already in status Non Responsive, so fencing is skipped.
> >>
> >>> On 09/24/2015 11:50 AM, Martin Perina wrote:
> >>>> ----- Original Message -----
> >>>>> From: "Eli Mesika" <[email protected]>
> >>>>> To: "Martin Perina" <[email protected]>, "Doron Fediuck" <[email protected]>
> >>>>> Cc: "Michael Hölzl" <[email protected]>, [email protected]
> >>>>> Sent: Thursday, September 24, 2015 11:38:39 AM
> >>>>> Subject: Re: [ovirt-users] HA - Fencing not working when host with engine gets shutdown
> >>>>>
> >>>>> ----- Original Message -----
> >>>>>> From: "Martin Perina" <[email protected]>
> >>>>>> To: "Michael Hölzl" <[email protected]>
> >>>>>> Cc: [email protected]
> >>>>>> Sent: Thursday, September 24, 2015 11:02:21 AM
> >>>>>> Subject: Re: [ovirt-users] HA - Fencing not working when host with engine gets shutdown
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> sorry for the late response, but you hit a "corner case" :-(
> >>>>>>
> >>>>>> Let me start by explaining a few things first:
> >>>>>>
> >>>>>> After startup of the engine there's an interval during which fencing is disabled.
> >>>>>> It's called DisableFenceAtStartupInSec and by default it's set to 5 minutes. It can be changed using
> >>>>>>
> >>>>>> engine-config -s DisableFenceAtStartupInSec
> >>>>>>
> >>>>>> but please do that with caution.
> >>>>>>
> >>>>>> Why do we have such a timeout? It's a prevention against a fencing storm, which could happen during power issues in the whole DC: when both engine and hosts are started, huge hosts may take a lot of time until they come up and VDSM starts to communicate with the engine. So usually the engine is started first, and without this interval the engine would start fencing hosts which are just starting...
> >>>>>>
> >>>>>> Another thing: if we cannot properly fence the host, we cannot determine whether there isn't just a communication issue between engine and host, so we cannot restart HA VMs on another host. The only thing we can do is to offer the manual "Mark host as rebooted" option to the administrator. If the administrator executes this option, we try to restart HA VMs on a different host ASAP, because the admin took the responsibility of validating that the VMs are really not running.
> >>>>>>
> >>>>>> When the engine is started, the following actions related to fencing are taken:
> >>>>>>
> >>>>>> 1. Get the status of all hosts from the DB and schedule Non Responding Treatment after the DisableFenceAtStartupInSec timeout has passed
> >>>>>>
> >>>>>> 2. Try to communicate with all hosts and refresh their status
> >>>>>>
> >>>>>> If some host becomes Non Responsive during the DisableFenceAtStartupInSec interval, we skip fencing and the administrator will see a message in the Events tab that the host is Non Responsive, but fencing is disabled due to the startup interval. So the administrator has to take care of such a host manually.
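[Editorial aside: the startup interval discussed above is an ordinary engine-config key, so it can be inspected and changed on the engine machine. The value 600 below is purely illustrative, not a recommendation, and the restart step reflects the usual engine-config behavior; as the thread warns, change this key with caution.]

```shell
# Illustrative only: run on the machine hosting ovirt-engine.
# Show the current fencing-disabled-at-startup interval (in seconds):
engine-config -g DisableFenceAtStartupInSec

# Raise it to 10 minutes (600 is an example value, not a recommendation):
engine-config -s DisableFenceAtStartupInSec=600

# engine-config changes typically take effect only after the engine restarts:
service ovirt-engine restart
```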
> >>>>>>
> >>>>>> Now what happened in your case:
> >>>>>>
> >>>>>> 1. Hosted engine VM is running on host1 with other VMs
> >>>>>> 2. Status of host1 and host2 is Up
> >>>>>> 3. You kill/shutdown host1 -> hosted engine VM is also shut down -> no engine is running to detect the issue with host1 and change its status to Non Responsive
> >>>>>> 4. In the meantime the hosted engine VM is started on host2 -> it will read host statuses from the DB, but all hosts are Up -> it will try to communicate with host1, but it's unreachable -> so it changes host1's status to Non Responsive and starts Non Responsive Treatment for host1 -> Non Responsive Treatment is aborted because the engine is still in the DisableFenceAtStartupInSec interval
> >>>>>>
> >>>>>> So in a normal deployment (without hosted engine) the admin is notified that the host where the engine was running crashed and was rebooted, so he has to take a look and do manual steps if needed.
> >>>>>>
> >>>>>> In a hosted engine deployment it's an issue, because the hosted engine VM can be restarted on a different host also in cases other than crashes (for example, if a host is overloaded, hosted engine can stop the hosted engine VM and restart it on a different host, but this shouldn't happen too often).
> >>>>>>
> >>>>>> At the moment the only solution for this is manual: let the administrator be notified that the hosted engine VM was restarted on a different host, so the administrator can check manually what caused the restart and execute manual steps if needed.
> >>>>>>
> >>>>>> So to summarize: at the moment I don't see any reliable automatic solution for this :-( and fencing storm prevention is more important.
> >>>>>> But feel free to create a bug for this issue; maybe we can think of at least some improvement for this use case.
> >>>>> Thanks for the detailed explanation Martin
> >>>>> Really a corner case, let's see if we get more input on that from other users
> >>>>> Maybe when the hosted engine VM is restarted on another node we can ask for the reason and act accordingly
> >>>>> Doron, with the current implementation, is the reason for the hosted engine VM restart stored anywhere?
> >>>> I have already discussed this with Martin Sivak, and hosted engine doesn't touch the engine db at all. We discussed this possible solution with Martin, which we could do in master and maybe in 3.6 if agreed:
> >>>>
> >>>> 1. Just after the start of the engine we can read from the db the name of the host which the hosted engine VM is running on and store it somewhere in memory for Non Responding Treatment
> >>>>
> >>>> 2. As a part of Non Responding Treatment we can add some hosted engine specific logic:
> >>>>        IF we are running as hosted engine AND
> >>>>           we are inside the DisableFenceAtStartupInSec interval AND
> >>>>           the non responsive host is the host stored above in step 1 AND
> >>>>           the hosted engine VM is running on a different host
> >>>>        THEN
> >>>>           execute fencing for the non responsive host even when we are inside the DisableFenceAtStartupInSec interval
> >>>>
> >>>> But it can cause unnecessary fencing in the case where the whole datacenter recovers from a power failure.
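[Editorial aside: the four-way condition proposed above can be sketched as a predicate. This is purely illustrative; the real logic would live inside the engine's Non Responding Treatment (which is Java), and every name below is hypothetical.]

```shell
#!/bin/sh
# Sketch of the proposed "fence anyway during startup" check.
# All parameter names are invented for illustration; none of this is
# actual oVirt code.
should_fence_during_startup() {
    running_as_hosted_engine=$1   # "yes" / "no"
    inside_startup_interval=$2    # "yes" while DisableFenceAtStartupInSec runs
    nonresponsive_host=$3         # host that just went Non Responsive
    he_host_at_startup=$4         # HE VM's host as read from the DB (step 1)
    he_host_now=$5                # host the HE VM runs on now
    # Fence only if all four proposed conditions hold:
    [ "$running_as_hosted_engine" = yes ] &&
    [ "$inside_startup_interval" = yes ] &&
    [ "$nonresponsive_host" = "$he_host_at_startup" ] &&
    [ "$he_host_now" != "$he_host_at_startup" ]
}
```

As the thread notes, even this narrow condition can still misfire after a whole-datacenter power failure, when the old HE host is merely slow to boot rather than dead.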
> >>>>
> >>>>>> Thanks
> >>>>>>
> >>>>>> Martin Perina
> >>>>>>
> >>>>>> ----- Original Message -----
> >>>>>>> From: "Michael Hölzl" <[email protected]>
> >>>>>>> To: "Martin Perina" <[email protected]>
> >>>>>>> Cc: [email protected]
> >>>>>>> Sent: Monday, September 21, 2015 4:47:06 PM
> >>>>>>> Subject: Re: [ovirt-users] HA - Fencing not working when host with engine gets shutdown
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> The whole engine.log including the shutdown time (was performed around 9:19): http://pastebin.com/cdY9uTkJ
> >>>>>>>
> >>>>>>> vdsm.log of host01 (the host which kept on running and took over the engine) split into 3 uploads (limit of 512 kB on pastebin):
> >>>>>>> 1: http://pastebin.com/dr9jNTek
> >>>>>>> 2: http://pastebin.com/cuyHL6ne
> >>>>>>> 3: http://pastebin.com/7x2ZQy1y
> >>>>>>>
> >>>>>>> Michael
> >>>>>>>
> >>>>>>> On 09/21/2015 03:00 PM, Martin Perina wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> could you please post the whole engine.log (from the time at which you turned off the host with the engine VM) and also vdsm.log from both hosts?
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>>
> >>>>>>>> Martin Perina
> >>>>>>>>
> >>>>>>>> ----- Original Message -----
> >>>>>>>>> From: "Michael Hölzl" <[email protected]>
> >>>>>>>>> To: [email protected]
> >>>>>>>>> Sent: Monday, September 21, 2015 10:27:08 AM
> >>>>>>>>> Subject: [ovirt-users] HA - Fencing not working when host with engine gets shutdown
> >>>>>>>>>
> >>>>>>>>> Hi all,
> >>>>>>>>>
> >>>>>>>>> we are trying to set up an oVirt environment with two hosts, both connected to an iSCSI storage device, a hosted engine, and power management configured over iLO. So far it seems to work fine in our testing setup, and starting/stopping VMs works smoothly with proper scheduling between those hosts.
> >>>>>>>>> So we wanted to test HA for the VMs now and started to manually shut down a host while there are still VMs running on that machine (to simulate a power failure or a kernel panic). The expected outcome was that all machines where HA is enabled are booted again. This works if the machine with the failure does not have the engine running. If the machine with the hosted engine VM gets shut down, the host gets into the "Not Responsive" state and all VMs end up in an unknown state. However, the engine itself starts correctly on the second host and it seems like it tries to fence the other host (as expected) - events which we get in the open virtualization manager:
> >>>>>>>>> 1. Host hosted_engine_2 is non responsive
> >>>>>>>>> 2. Host hosted_engine_1 from cluster Default was chosen as a proxy to execute Status command on Host hosted_engine_2.
> >>>>>>>>> 3. Host hosted_engine_2 became non responsive. It has no power management configured. Please check the host status, manually reboot it, and click "Confirm Host Has Been Rebooted"
> >>>>>>>>> 4. Host hosted_engine_2 is not responding. It will stay in Connecting state for a grace period of 124 seconds and after that an attempt to fence the host will be issued.
> >>>>>>>>>
> >>>>>>>>> Event 4 keeps coming every 3 minutes. Complete engine.log file during engine boot up: http://pastebin.com/D6xS3Wfy
> >>>>>>>>> So the host detects that the machine is not responding and wants to fence it. But although the host has power management configured over iLO, the engine thinks that it is not.
> >>>>>>>>> As a result the second host does not get fenced and VMs are not migrated to the running machine.
> >>>>>>>>> In the log files there are also a lot of timeout exceptions. But I guess that this is because the host cannot connect to the other machine.
> >>>>>>>>>
> >>>>>>>>> Did anybody face similar problems with HA? Or any clue what the problem might be?
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Michael
> >>>>>>>>>
> >>>>>>>>> ----
> >>>>>>>>> ovirt version: 3.5.4
> >>>>>>>>> Hosted engine VM OS: CentOS 6.5
> >>>>>>>>> Host machines OS: CentOS 7
> >>>>>>>>>
> >>>>>>>>> P.S. We also have to note that we had problems with the command fence_ipmilan at the beginning. We were receiving the message "Unable to obtain correct plug status or plug is not available" whenever the command fence_ipmilan was called. However, the command fence_ilo4 worked. So we use a simple script for fence_ipmilan now that calls fence_ilo4 and passes the arguments.
> >>>>>>>>> _______________________________________________
> >>>>>>>>> Users mailing list
> >>>>>>>>> [email protected]
> >>>>>>>>> http://lists.ovirt.org/mailman/listinfo/users
_______________________________________________
Users mailing list
[email protected]
http://lists.ovirt.org/mailman/listinfo/users
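[Editorial aside: the workaround described in the P.S. — a stand-in for fence_ipmilan that simply delegates to fence_ilo4 — could look roughly like the sketch below. The FENCE_AGENT override hook is invented here purely so the sketch can be exercised without iLO hardware; a production wrapper would hard-code the fence_ilo4 path and `exec` it. Verify that fence_ilo4 accepts the same options your power management configuration passes to fence_ipmilan before installing anything like this.]

```shell
#!/bin/sh
# Hypothetical sketch of the fence_ipmilan -> fence_ilo4 wrapper
# mentioned in the P.S. The whole job is to pass every argument
# through to fence_ilo4 unchanged.
delegate_fence() {
    # FENCE_AGENT is an illustrative test hook, not part of any fence
    # agent API; the real script would just exec /usr/sbin/fence_ilo4.
    "${FENCE_AGENT:-/usr/sbin/fence_ilo4}" "$@"
}
```

Installed in place of fence_ipmilan, the script body would reduce to `exec /usr/sbin/fence_ilo4 "$@"`.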

