Thanks for the help! I will definitely stay tuned for updates on this matter.
Michael

On 09/24/2015 03:13 PM, Martin Perina wrote:
> I created a bug covering this:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1266099
>
> ----- Original Message -----
>> From: "Martin Sivak" <[email protected]>
>> To: "Michael Hölzl" <[email protected]>
>> Cc: "Martin Perina" <[email protected]>, [email protected]
>> Sent: Thursday, September 24, 2015 2:59:52 PM
>> Subject: Re: [ovirt-users] HA - Fencing not working when host with engine gets shutdown
>>
>> Hi Michael,
>>
>> Martin summed up the situation neatly; I would just add that this issue
>> is not limited to the size of your setup. The same would happen to HA
>> VMs running on the same host as the hosted engine even if the cluster
>> had 50 hosts...
>>
>> About the recommended way of engine deployment: it really comes down to
>> whether you can tolerate your engine being down for a longer time
>> (starting another host using a backup db).
>>
>> Hosted engine restores your management in an automated way and without
>> any data loss. However, I agree that having to tend to your HA VMs
>> manually after an engine restart is not nice. Fortunately, that should
>> only happen when your host (or vdsm) dies and does not come up for an
>> extended period of time.
>>
>> The summary would be: there will be no HA handling if the host running
>> the engine is down, regardless of whether the deployment is hosted
>> engine or standalone engine. If the issue is software-only, there is no
>> real difference.
>>
>> - When a host with the standalone engine dies, the VMs are fine, but if
>> anything happens while the engine is down (and reinstalling a
>> standalone engine takes time, plus you need a very fresh db backup),
>> you might again face issues with HA VMs being down or not starting when
>> the engine comes up.
>>
>> - When a hosted engine dies because of a host failure, some VMs
>> generally disappear with it.
>> The engine will come up automatically, and HA VMs from the original
>> host have to be manually pushed back to work. This requires some manual
>> action, but I see it as less demanding than the first case.
>>
>> - When a hosted engine VM is stopped properly by the tooling, it will
>> be restarted elsewhere and will be able to connect to the original host
>> just fine. The engine will then make sure that all HA VMs are up, even
>> if the VMs died while the engine was down.
>>
>> So I would recommend a hosted-engine-based deployment. And I ask for a
>> bit of patience, as we have a plan for how to mitigate the second case
>> to some extent without compromising the fencing storm prevention.
>>
>> Best regards
>>
>> --
>> Martin Sivak
>> [email protected]
>> SLA RHEV-M
>>
>>
>> On Thu, Sep 24, 2015 at 2:31 PM, Michael Hölzl <[email protected]> wrote:
>>> Ok, thanks!
>>>
>>> So, I would still like to know whether you would recommend not using
>>> hosted engines but rather a separate machine for the engine?
>>>
>>> On 09/24/2015 01:24 PM, Martin Perina wrote:
>>>> ----- Original Message -----
>>>>> From: "Michael Hölzl" <[email protected]>
>>>>> To: "Martin Perina" <[email protected]>, "Eli Mesika" <[email protected]>
>>>>> Cc: "Doron Fediuck" <[email protected]>, [email protected]
>>>>> Sent: Thursday, September 24, 2015 12:35:13 PM
>>>>> Subject: Re: [ovirt-users] HA - Fencing not working when host with engine gets shutdown
>>>>>
>>>>> Hi,
>>>>>
>>>>> thanks for the detailed answer! In principle, I understand the issue
>>>>> now. However, I cannot fully follow the argument that this is a
>>>>> corner case. In a small or medium-sized company, I would assume that
>>>>> such a setup, consisting of two machines with a hosted engine, is
>>>>> not uncommon, especially as there is documentation online which
>>>>> describes how to deploy this setup. Does that mean that hosted
>>>>> engines are in general not recommended?
>>>>>
>>>>> I am also wondering why the fencing could not be triggered by the
>>>>> hosted engine after the DisableFenceAtStartupInSec timeout? In the
>>>>> events log of the engine I keep getting the message "Host
>>>>> hosted_engine_2 is not responding. It will stay in Connecting state
>>>>> for a grace period of 120 seconds and after that an attempt to fence
>>>>> the host will be issued.", which would indicate that the engine is
>>>>> actually trying to fence the non-responsive host.
>>>> Unfortunately this message is a bit misleading: it's shown every time
>>>> we start handling a network exception for the host, and it's fired
>>>> before the logic which decides whether to start or skip the fencing
>>>> process (this misleading message is fixed in 3.6). But in the current
>>>> logic we really execute fencing only when the host status is about to
>>>> change from Connecting to Non Responsive, and that happens only the
>>>> first time, while we are still inside the DisableFenceAtStartupInSec
>>>> interval. During all subsequent attempts the host is already in
>>>> status Non Responsive, so fencing is skipped.
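The decision Martin describes, where fencing fires only on the Connecting -> Non Responsive transition and is suppressed inside the startup grace interval, can be sketched roughly as follows. This is a simplified illustration, not the actual engine code; all names are made up:

```python
from datetime import datetime, timedelta

# Default per the thread: DisableFenceAtStartupInSec is 5 minutes.
DISABLE_FENCE_AT_STARTUP = timedelta(seconds=300)

def should_fence(prev_status, new_status, engine_start, now):
    """Decide whether Non Responding Treatment should actually fence a host.

    Fencing is attempted only on the Connecting -> NonResponsive transition,
    and even then it is skipped while the engine is still inside the
    DisableFenceAtStartupInSec grace interval.
    """
    if not (prev_status == "Connecting" and new_status == "NonResponsive"):
        # Host is already NonResponsive: treatment runs, but fencing is skipped.
        return False
    if now - engine_start < DISABLE_FENCE_AT_STARTUP:
        # Startup grace interval: fencing-storm prevention wins.
        return False
    return True
```

In the reported scenario the transition happens right after the freshly restarted engine comes up, i.e. inside the grace interval, so the host is never fenced, and every later check finds it already Non Responsive.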
>>>>
>>>>> On 09/24/2015 11:50 AM, Martin Perina wrote:
>>>>>> ----- Original Message -----
>>>>>>> From: "Eli Mesika" <[email protected]>
>>>>>>> To: "Martin Perina" <[email protected]>, "Doron Fediuck" <[email protected]>
>>>>>>> Cc: "Michael Hölzl" <[email protected]>, [email protected]
>>>>>>> Sent: Thursday, September 24, 2015 11:38:39 AM
>>>>>>> Subject: Re: [ovirt-users] HA - Fencing not working when host with engine gets shutdown
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> From: "Martin Perina" <[email protected]>
>>>>>>>> To: "Michael Hölzl" <[email protected]>
>>>>>>>> Cc: [email protected]
>>>>>>>> Sent: Thursday, September 24, 2015 11:02:21 AM
>>>>>>>> Subject: Re: [ovirt-users] HA - Fencing not working when host with engine gets shutdown
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> sorry for the late response, but you hit a "corner case" :-(
>>>>>>>>
>>>>>>>> Let me explain a few things first:
>>>>>>>>
>>>>>>>> After engine startup there's an interval during which fencing is
>>>>>>>> disabled. It's called DisableFenceAtStartupInSec and by default
>>>>>>>> it's set to 5 minutes. It can be changed using
>>>>>>>>
>>>>>>>> engine-config -s DisableFenceAtStartupInSec
>>>>>>>>
>>>>>>>> but please do that with caution.
>>>>>>>>
>>>>>>>> Why do we have such a timeout? It prevents a fencing storm, which
>>>>>>>> could happen during power issues in the whole DC: when both the
>>>>>>>> engine and the hosts are started, huge hosts may take a long time
>>>>>>>> to come up and for VDSM to start communicating with the engine.
>>>>>>>> So usually the engine is started first, and without this interval
>>>>>>>> it would start fencing hosts which are just starting ...
>>>>>>>>
>>>>>>>> Another thing: if we cannot properly fence the host, we cannot
>>>>>>>> determine whether there is merely a communication issue between
>>>>>>>> engine and host, so we cannot restart HA VMs on another host. The
>>>>>>>> only thing we can do is offer the manual "Mark host as rebooted"
>>>>>>>> option to the administrator. If the administrator executes this
>>>>>>>> option, we try to restart HA VMs on a different host ASAP,
>>>>>>>> because the admin took responsibility for validating that the VMs
>>>>>>>> are really not running.
>>>>>>>>
>>>>>>>> When the engine is started, the following fencing-related actions
>>>>>>>> are taken:
>>>>>>>>
>>>>>>>> 1. Get the status of all hosts from the DB and schedule Non
>>>>>>>> Responding Treatment after the DisableFenceAtStartupInSec
>>>>>>>> timeout has passed
>>>>>>>>
>>>>>>>> 2. Try to communicate with all hosts and refresh their status
>>>>>>>>
>>>>>>>> If some host becomes Non Responsive during the
>>>>>>>> DisableFenceAtStartupInSec interval, we skip fencing, and the
>>>>>>>> administrator will see a message in the Events tab that the host
>>>>>>>> is Non Responsive but fencing is disabled due to the startup
>>>>>>>> interval. So the administrator has to take care of such a host
>>>>>>>> manually.
>>>>>>>>
>>>>>>>> Now what happened in your case:
>>>>>>>>
>>>>>>>> 1. The hosted engine VM is running on host1 with other VMs
>>>>>>>> 2. The status of host1 and host2 is Up
>>>>>>>> 3. You kill/shutdown host1 -> the hosted engine VM is also shut
>>>>>>>> down -> no engine is running to detect the issue with host1 and
>>>>>>>> change its status to Non Responsive
>>>>>>>> 4.
>>>>>>>> In the meantime the hosted engine VM is started on host2 -> it
>>>>>>>> will read host statuses from the DB, but all hosts are Up -> it
>>>>>>>> will try to communicate with host1, but it's unreachable -> so it
>>>>>>>> changes host1's status to Non Responsive and starts Non
>>>>>>>> Responsive Treatment for host1 -> Non Responsive Treatment is
>>>>>>>> aborted because the engine is still inside
>>>>>>>> DisableFenceAtStartupInSec
>>>>>>>>
>>>>>>>> So in a normal deployment (without hosted engine) the admin is
>>>>>>>> notified that the host where the engine is running crashed and
>>>>>>>> was rebooted, so he has to take a look and perform manual steps
>>>>>>>> if needed.
>>>>>>>>
>>>>>>>> In a hosted engine deployment it's an issue, because the hosted
>>>>>>>> engine VM can be restarted on a different host also in cases
>>>>>>>> other than crashes (for example, if the host is overloaded,
>>>>>>>> hosted engine can stop the hosted engine VM and restart it on a
>>>>>>>> different host, but this shouldn't happen too often).
>>>>>>>>
>>>>>>>> At the moment the only solution for this is manual: let the
>>>>>>>> administrator be notified that the hosted engine VM was restarted
>>>>>>>> on a different host, so the administrator can check manually what
>>>>>>>> caused the restart and execute manual steps if needed.
>>>>>>>>
>>>>>>>> So to summarize: at the moment I don't see any reliable automatic
>>>>>>>> solution for this :-( and fencing storm prevention is more
>>>>>>>> important. But feel free to create a bug for this issue; maybe we
>>>>>>>> can think of at least some improvement for this use case.
>>>>>>> Thanks for the detailed explanation Martin
>>>>>>> Really a corner case, let's see if we get more input on that from
>>>>>>> other users
>>>>>>> Maybe when the hosted engine VM is restarted on another node we
>>>>>>> can ask for the reason and act accordingly
>>>>>>> Doron, with the current implementation, is the reason for the
>>>>>>> hosted engine VM restart stored anywhere?
>>>>>> I have already discussed this with Martin Sivak, and hosted engine
>>>>>> doesn't touch the engine db at all. We discussed this possible
>>>>>> solution, which we could do in master and maybe in 3.6 if agreed:
>>>>>>
>>>>>> 1. Just after engine start we can read from the db the name of the
>>>>>> host which the hosted engine VM is running on and store it
>>>>>> somewhere in memory for Non Responding Treatment
>>>>>>
>>>>>> 2. As a part of Non Responding Treatment we can add some hosted
>>>>>> engine specific logic:
>>>>>> IF we are running as hosted engine AND
>>>>>> we are inside the DisableFenceAtStartupInSec interval AND
>>>>>> the non responsive host is the host stored above in step 1. AND
>>>>>> the hosted engine VM is running on a different host
>>>>>> THEN
>>>>>> execute fencing for the non responsive host even when we are
>>>>>> inside the DisableFenceAtStartupInSec interval
>>>>>>
>>>>>> But it can cause an unnecessary fence in the case where the whole
>>>>>> datacenter recovers from a power failure.
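The exception Martin proposes in steps 1 and 2 could look roughly like this. It is a hypothetical sketch of the proposed check only, not code that exists in the engine, and the parameter names are invented for illustration:

```python
# DisableFenceAtStartupInSec default, per the thread.
STARTUP_GRACE_SEC = 300

def fence_despite_startup_grace(running_as_hosted_engine,
                                engine_uptime_sec,
                                nonresponsive_host,
                                previous_engine_host,
                                current_engine_host):
    """Proposed exception to the startup fencing grace (step 2 above).

    Fence a non-responsive host during the grace interval only when it is
    the host the hosted engine VM was running on before the engine
    restarted, and the engine VM is now running somewhere else.
    """
    return (running_as_hosted_engine
            and engine_uptime_sec < STARTUP_GRACE_SEC
            and nonresponsive_host == previous_engine_host
            and nonresponsive_host != current_engine_host)
```

As Martin notes right after, this check cannot distinguish the failover case from a whole-datacenter power failure, where the previous engine host may simply be slow to boot and would be fenced unnecessarily.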
>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> Martin Perina
>>>>>>>>
>>>>>>>> ----- Original Message -----
>>>>>>>>> From: "Michael Hölzl" <[email protected]>
>>>>>>>>> To: "Martin Perina" <[email protected]>
>>>>>>>>> Cc: [email protected]
>>>>>>>>> Sent: Monday, September 21, 2015 4:47:06 PM
>>>>>>>>> Subject: Re: [ovirt-users] HA - Fencing not working when host with engine gets shutdown
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> The whole engine.log including the shutdown time (it was
>>>>>>>>> performed around 9:19):
>>>>>>>>> http://pastebin.com/cdY9uTkJ
>>>>>>>>>
>>>>>>>>> vdsm.log of host01 (the host which kept running and took over
>>>>>>>>> the engine) split into 3 uploads (512 kB limit on pastebin):
>>>>>>>>> 1 : http://pastebin.com/dr9jNTek
>>>>>>>>> 2 : http://pastebin.com/cuyHL6ne
>>>>>>>>> 3 : http://pastebin.com/7x2ZQy1y
>>>>>>>>>
>>>>>>>>> Michael
>>>>>>>>>
>>>>>>>>> On 09/21/2015 03:00 PM, Martin Perina wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> could you please post the whole engine.log (from the time you
>>>>>>>>>> turned off the host with the engine VM) and also vdsm.log from
>>>>>>>>>> both hosts?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> Martin Perina
>>>>>>>>>>
>>>>>>>>>> ----- Original Message -----
>>>>>>>>>>> From: "Michael Hölzl" <[email protected]>
>>>>>>>>>>> To: [email protected]
>>>>>>>>>>> Sent: Monday, September 21, 2015 10:27:08 AM
>>>>>>>>>>> Subject: [ovirt-users] HA - Fencing not working when host with engine gets shutdown
>>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> we are trying to set up an oVirt environment with two hosts,
>>>>>>>>>>> both connected to an iSCSI storage device, a hosted engine,
>>>>>>>>>>> and power management configured over iLO. So far it seems to
>>>>>>>>>>> work fine in our testing setup, and starting/stopping VMs
>>>>>>>>>>> works smoothly with proper scheduling between those hosts.
>>>>>>>>>>> So we wanted to test HA for the VMs now and started to
>>>>>>>>>>> manually shut down a host while there were still VMs running
>>>>>>>>>>> on that machine (to simulate a power failure or a kernel
>>>>>>>>>>> panic). The expected outcome was that all machines where HA is
>>>>>>>>>>> enabled are booted again. This works if the machine with the
>>>>>>>>>>> failure does not have the engine running. If the machine with
>>>>>>>>>>> the hosted engine VM gets shut down, the host gets into the
>>>>>>>>>>> "Not Responsive" state and all VMs end up in an unknown state.
>>>>>>>>>>> However, the engine itself starts correctly on the second
>>>>>>>>>>> host, and it seems like it tries to fence the other host (as
>>>>>>>>>>> expected). Events which we get in the open virtualization
>>>>>>>>>>> manager:
>>>>>>>>>>> 1. Host hosted_engine_2 is non responsive
>>>>>>>>>>> 2. Host hosted_engine_1 from cluster Default was chosen as a
>>>>>>>>>>> proxy to execute Status command on Host hosted_engine_2.
>>>>>>>>>>> 3. Host hosted_engine_2 became non responsive. It has no power
>>>>>>>>>>> management configured. Please check the host status, manually
>>>>>>>>>>> reboot it, and click "Confirm Host Has Been Rebooted"
>>>>>>>>>>> 4. Host hosted_engine_2 is not responding. It will stay in
>>>>>>>>>>> Connecting state for a grace period of 124 seconds and after
>>>>>>>>>>> that an attempt to fence the host will be issued.
>>>>>>>>>>>
>>>>>>>>>>> Event 4 keeps coming every 3 minutes. Complete engine.log file
>>>>>>>>>>> during engine boot-up: http://pastebin.com/D6xS3Wfy
>>>>>>>>>>> So the host detects that the machine is not responding and
>>>>>>>>>>> wants to fence it. But although the host has power management
>>>>>>>>>>> configured over iLO, the engine thinks that it does not.
>>>>>>>>>>> As a result the second host does not get fenced and VMs are
>>>>>>>>>>> not migrated to the running machine.
>>>>>>>>>>> In the log files there are also a lot of timeout exceptions,
>>>>>>>>>>> but I guess that this is because the host cannot connect to
>>>>>>>>>>> the other machine.
>>>>>>>>>>>
>>>>>>>>>>> Did anybody face similar problems with HA? Or any clue what
>>>>>>>>>>> the problem might be?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Michael
>>>>>>>>>>>
>>>>>>>>>>> ----
>>>>>>>>>>> ovirt version: 3.5.4
>>>>>>>>>>> Hosted engine VM OS: CentOS 6.5
>>>>>>>>>>> Host machines OS: CentOS 7
>>>>>>>>>>>
>>>>>>>>>>> P.S. We also have to note that we had problems with the
>>>>>>>>>>> command fence_ipmilan at the beginning. We were receiving the
>>>>>>>>>>> message "Unable to obtain correct plug status or plug is not
>>>>>>>>>>> available" whenever the command fence_ipmilan was called.
>>>>>>>>>>> However, the command fence_ilo4 worked. So we use a simple
>>>>>>>>>>> script for fence_ipmilan now that calls fence_ilo4 and passes
>>>>>>>>>>> the arguments.

_______________________________________________
Users mailing list
[email protected]
http://lists.ovirt.org/mailman/listinfo/users
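For reference, the fence_ipmilan workaround mentioned in the P.S. could be as simple as the following wrapper. This is a guess at what was used (the actual script was not posted): installed in place of fence_ipmilan, it re-executes fence_ilo4 with the same arguments.

```python
#!/usr/bin/env python
"""Drop-in stand-in for fence_ipmilan that delegates to fence_ilo4."""
import os
import sys

def build_fence_command(argv):
    # Keep every argument the engine passed, but run fence_ilo4 instead.
    return ["fence_ilo4"] + argv[1:]

# Guarded behind an env var so the file can be imported or inspected
# without actually triggering a fence operation.
if os.environ.get("RUN_FENCE_WRAPPER"):
    os.execvp("fence_ilo4", build_fence_command(sys.argv))
```

Since os.execvp replaces the current process while keeping its file descriptors, any fencing options the engine writes to stdin are still seen by fence_ilo4.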

