Ok, thanks! So, I would still like to know whether you would recommend not using a hosted engine but rather a separate machine for the engine?
On 09/24/2015 01:24 PM, Martin Perina wrote:
>
> ----- Original Message -----
>> From: "Michael Hölzl" <[email protected]>
>> To: "Martin Perina" <[email protected]>, "Eli Mesika" <[email protected]>
>> Cc: "Doron Fediuck" <[email protected]>, [email protected]
>> Sent: Thursday, September 24, 2015 12:35:13 PM
>> Subject: Re: [ovirt-users] HA - Fencing not working when host with engine
>> gets shutdown
>>
>> Hi,
>>
>> thanks for the detailed answer! In principle, I understand the issue
>> now. However, I cannot fully follow the argument that this is a corner
>> case. In a smaller or medium-sized company, I would assume that such a
>> setup, consisting of two machines with a hosted engine, is not uncommon.
>> Especially as there is documentation online which describes how to
>> deploy this setup. Does that mean that hosted engines are in general not
>> recommended?
>>
>> I am also wondering why the fencing could not be triggered by the hosted
>> engine after the DisableFenceAtStartupInSec timeout? In the events log
>> of the engine I keep on getting the message "Host hosted_engine_2 is not
>> responding. It will stay in Connecting state for a grace period of 120
>> seconds and after that an attempt to fence the host will be issued.",
>> which would indicate that the engine is actually trying to fence the non
>> responsive host.
> Unfortunately this message is a bit misleading: it's shown every time we
> start handling a network exception for the host, and it's fired before
> the logic which decides whether to start or skip the fencing process (this
> misleading message is fixed in 3.6). But in the current logic we really
> execute fencing only when the host status is about to change from
> Connecting to NonResponsive, and this happens only the first time, while
> we are still in the DisableFenceAtStartupInSec interval. During all other
> attempts the host is already in status Non Responsive, so fencing is
> skipped.
>
>> On 09/24/2015 11:50 AM, Martin Perina wrote:
>>> ----- Original Message -----
>>>> From: "Eli Mesika" <[email protected]>
>>>> To: "Martin Perina" <[email protected]>, "Doron Fediuck" <[email protected]>
>>>> Cc: "Michael Hölzl" <[email protected]>, [email protected]
>>>> Sent: Thursday, September 24, 2015 11:38:39 AM
>>>> Subject: Re: [ovirt-users] HA - Fencing not working when host with engine
>>>> gets shutdown
>>>>
>>>> ----- Original Message -----
>>>>> From: "Martin Perina" <[email protected]>
>>>>> To: "Michael Hölzl" <[email protected]>
>>>>> Cc: [email protected]
>>>>> Sent: Thursday, September 24, 2015 11:02:21 AM
>>>>> Subject: Re: [ovirt-users] HA - Fencing not working when host with engine
>>>>> gets shutdown
>>>>>
>>>>> Hi,
>>>>>
>>>>> sorry for the late response, but you hit a "corner case" :-(
>>>>>
>>>>> Let me explain a few things first:
>>>>>
>>>>> After startup of the engine there's an interval during which fencing is
>>>>> disabled. It's called DisableFenceAtStartupInSec and by default it's
>>>>> set to 5 minutes. It can be changed using
>>>>>
>>>>> engine-config -s DisableFenceAtStartupInSec
>>>>>
>>>>> but please do that with caution.
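>>>>> For example, to read the current value and raise it to 10 minutes
>>>>> (600 here is just an illustrative value; the ovirt-engine service has
>>>>> to be restarted for the change to take effect):
>>>>>
>>>>>     engine-config -g DisableFenceAtStartupInSec
>>>>>     engine-config -s DisableFenceAtStartupInSec=600
>>>>>     service ovirt-engine restart
>>>>>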
>>>>> Why do we have such a timeout? It's a prevention against a fencing
>>>>> storm, which could happen during power issues in the whole DC: when
>>>>> both engine and hosts are started, huge hosts may take a long time to
>>>>> come up and for VDSM to start communicating with the engine. So usually
>>>>> the engine is started first, and without this interval the engine would
>>>>> start fencing hosts which are just starting ...
>>>>>
>>>>> Another thing: if we cannot properly fence the host, we cannot determine
>>>>> whether there is just a communication issue between engine and host, so
>>>>> we cannot restart HA VMs on another host. The only thing we can do is to
>>>>> offer the "Mark host as rebooted" manual option to the administrator. If
>>>>> the administrator executes this option, we try to restart HA VMs on a
>>>>> different host ASAP, because the admin took the responsibility of
>>>>> validating that the VMs are really not running.
>>>>>
>>>>> When the engine is started, the following actions related to fencing
>>>>> are taken:
>>>>>
>>>>> 1. Get the status of all hosts from the DB and schedule Non Responding
>>>>>    Treatment after the DisableFenceAtStartupInSec timeout has passed
>>>>>
>>>>> 2. Try to communicate with all hosts and refresh their status
>>>>>
>>>>> If some host becomes Non Responsive during the DisableFenceAtStartupInSec
>>>>> interval, we skip fencing, and the administrator will see a message in
>>>>> the Events tab that the host is Non Responsive but fencing is disabled
>>>>> due to the startup interval. So the administrator has to take care of
>>>>> such a host manually.
>>>>>
>>>>> Now what happened in your case:
>>>>>
>>>>> 1. Hosted engine VM is running on host1 with other VMs
>>>>> 2. Status of host1 and host2 is Up
>>>>> 3. You kill/shutdown host1 -> hosted engine VM is also shut down -> no
>>>>>    engine is running to detect the issue with host1 and change its
>>>>>    status to Non Responsive
>>>>> 4. In the meantime the hosted engine VM is started on host2 -> it will
>>>>>    read host statuses from the DB, but all hosts are Up -> it will try
>>>>>    to communicate with host1, but it's unreachable -> so it changes
>>>>>    host1 status to Non Responsive and starts Non Responsive Treatment
>>>>>    for host1 -> Non Responsive Treatment is aborted because the engine
>>>>>    is still in the DisableFenceAtStartupInSec interval
>>>>>
>>>>> So in a normal deployment (without hosted engine) the admin is notified
>>>>> that the host where the engine was running crashed and was rebooted, so
>>>>> he has to take a look and do manual steps if needed.
>>>>>
>>>>> In a hosted engine deployment it's an issue, because the hosted engine
>>>>> VM can be restarted on a different host also in cases other than crashes
>>>>> (for example, if the host is overloaded, hosted engine can stop the
>>>>> hosted engine VM and restart it on a different host, but this shouldn't
>>>>> happen too often).
>>>>>
>>>>> At the moment the only solution for this is manual: the administrator is
>>>>> notified that the hosted engine VM was restarted on a different host, so
>>>>> the administrator can check manually what the cause of this restart was
>>>>> and execute manual steps if needed.
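>>>>> (To check manually which host the hosted engine VM currently runs on,
>>>>> and to look for the reason of a restart, something like this on either
>>>>> host should help - paths and commands as in a default hosted engine
>>>>> setup:)
>>>>>
>>>>>     # state of the hosted engine VM and the score of each HA host
>>>>>     hosted-engine --vm-status
>>>>>     # the HA agent log usually records why the VM was restarted
>>>>>     less /var/log/ovirt-hosted-engine-ha/agent.log
>>>>>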
>>>>> So to summarize: at the moment I don't see any reliable automatic
>>>>> solution for this :-( and fencing storm prevention is more important.
>>>>> But feel free to create a bug for this issue; maybe we can think of at
>>>>> least some improvement for this use case.
>>>>
>>>> Thanks for the detailed explanation Martin
>>>> Really a corner case, let's see if we get more input on that from other
>>>> users
>>>> Maybe when the hosted engine VM is restarted on another node we can ask
>>>> for the reason and act accordingly
>>>> Doron, with the current implementation, is the reason for the hosted
>>>> engine VM restart stored anywhere?
>>> I have already discussed this with Martin Sivak and hosted engine doesn't
>>> touch the engine db at all. We discussed this possible solution with
>>> Martin, which we could do in master and maybe in 3.6 if agreed:
>>>
>>> 1. Just after the start of the engine we can read from the db the name of
>>>    the host which the hosted engine VM is running on and store it
>>>    somewhere in memory for Non Responding Treatment
>>>
>>> 2. As a part of Non Responding Treatment we can add some hosted engine
>>>    specific logic:
>>>
>>>        IF we are running as hosted engine AND
>>>           we are inside the DisableFenceAtStartupInSec interval AND
>>>           the non responsive host is the host stored above in step 1. AND
>>>           the hosted engine VM is running on a different host
>>>        THEN
>>>           execute fencing for the non responsive host even when we are
>>>           inside the DisableFenceAtStartupInSec interval
>>>
>>> But it can cause an unnecessary fence in the case that the whole
>>> datacenter recovers from a power failure.
>>>
>>>>> Thanks
>>>>>
>>>>> Martin Perina
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Michael Hölzl" <[email protected]>
>>>>>> To: "Martin Perina" <[email protected]>
>>>>>> Cc: [email protected]
>>>>>> Sent: Monday, September 21, 2015 4:47:06 PM
>>>>>> Subject: Re: [ovirt-users] HA - Fencing not working when host with
>>>>>> engine gets shutdown
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> The whole engine.log including the shutdown time (was performed around
>>>>>> 9:19): http://pastebin.com/cdY9uTkJ
>>>>>>
>>>>>> vdsm.log of host01 (the host which kept on running and took over the
>>>>>> engine) split into 3 uploads (limit of 512 kB on pastebin):
>>>>>> 1 : http://pastebin.com/dr9jNTek
>>>>>> 2 : http://pastebin.com/cuyHL6ne
>>>>>> 3 : http://pastebin.com/7x2ZQy1y
>>>>>>
>>>>>> Michael
>>>>>>
>>>>>> On 09/21/2015 03:00 PM, Martin Perina wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> could you please post the whole engine.log (from the time at which
>>>>>>> you turned off the host with the engine VM) and also vdsm.log from
>>>>>>> both hosts?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Martin Perina
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> From: "Michael Hölzl" <[email protected]>
>>>>>>>> To: [email protected]
>>>>>>>> Sent: Monday, September 21, 2015 10:27:08 AM
>>>>>>>> Subject: [ovirt-users] HA - Fencing not working when host with
>>>>>>>> engine gets shutdown
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> we are trying to set up an oVirt environment with two hosts, both
>>>>>>>> connected to an iSCSI storage device, a hosted engine, and power
>>>>>>>> management configured over iLO. So far it seems to work fine in our
>>>>>>>> testing setup, and starting/stopping VMs works smoothly with proper
>>>>>>>> scheduling between those hosts. So we wanted to test HA for the VMs
>>>>>>>> now and started to manually shut down a host while there were still
>>>>>>>> VMs running on that machine (to simulate a power failure or a kernel
>>>>>>>> panic). The expected outcome was that all machines where HA is
>>>>>>>> enabled are booted again. This works if the machine with the failure
>>>>>>>> does not have the engine running. If the machine with the hosted
>>>>>>>> engine VM gets shut down, the host gets into the "Not Responsive"
>>>>>>>> state and all VMs end up in an unknown state. However, the engine
>>>>>>>> itself starts correctly on the second host and it seems like it
>>>>>>>> tries to fence the other host (as expected) - Events which we get in
>>>>>>>> the open virtualization manager:
>>>>>>>> 1. Host hosted_engine_2 is non responsive
>>>>>>>> 2. Host hosted_engine_1 from cluster Default was chosen as a proxy
>>>>>>>>    to execute Status command on Host hosted_engine_2.
>>>>>>>> 3. Host hosted_engine_2 became non responsive. It has no power
>>>>>>>>    management configured. Please check the host status, manually
>>>>>>>>    reboot it, and click "Confirm Host Has Been Rebooted"
>>>>>>>> 4. Host hosted_engine_2 is not responding. It will stay in
>>>>>>>>    Connecting state for a grace period of 124 seconds and after that
>>>>>>>>    an attempt to fence the host will be issued.
>>>>>>>>
>>>>>>>> Event 4 keeps coming every 3 minutes. Complete engine.log file
>>>>>>>> during engine boot up: http://pastebin.com/D6xS3Wfy
>>>>>>>> So the engine detects that the machine is not responding and wants
>>>>>>>> to fence it. But although the host has power management configured
>>>>>>>> over iLO, the engine thinks that it does not. As a result the second
>>>>>>>> host does not get fenced and VMs are not migrated to the running
>>>>>>>> machine. In the log files there are also a lot of timeout
>>>>>>>> exceptions. But I guess that this is because the host cannot connect
>>>>>>>> to the other machine.
>>>>>>>>
>>>>>>>> Did anybody face similar problems with HA? Or any clue what the
>>>>>>>> problem might be?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Michael
>>>>>>>>
>>>>>>>> ----
>>>>>>>> oVirt version: 3.5.4
>>>>>>>> Hosted engine VM OS: CentOS 6.5
>>>>>>>> Host machines OS: CentOS 7
>>>>>>>>
>>>>>>>> P.S. We also have to note that we had problems with the command
>>>>>>>> fence_ipmilan at the beginning. We were receiving the message
>>>>>>>> "Unable to obtain correct plug status or plug is not available"
>>>>>>>> whenever the command fence_ipmilan was called. However, the command
>>>>>>>> fence_ilo4 worked. So we use a simple script for fence_ipmilan now
>>>>>>>> that calls fence_ilo4 and passes the arguments through, along the
>>>>>>>> lines of the sketch below.
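>>>>>>>> (Illustrative sketch only - the actual path to fence_ilo4 may
>>>>>>>> differ; fence agents also read options from stdin, which exec
>>>>>>>> preserves:)
>>>>>>>>
>>>>>>>>     #!/bin/sh
>>>>>>>>     # stand-in for fence_ipmilan: forward all arguments (and stdin
>>>>>>>>     # options) unchanged to fence_ilo4
>>>>>>>>     exec /usr/sbin/fence_ilo4 "$@"
>>>>>>>>
>>>>>>>> A quick manual test from the other host would be something like:
>>>>>>>>
>>>>>>>>     fence_ilo4 -a <ilo-address> -l <user> -p <password> -o status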

