On Mon, Jul 19, 2021 at 1:54 PM Christoph Timm <[email protected]> wrote:
>
> On 19.07.21 at 10:52, Yedidyah Bar David wrote:
> > On Mon, Jul 19, 2021 at 11:39 AM Christoph Timm <[email protected]> wrote:
> >>
> >> On 19.07.21 at 10:25, Yedidyah Bar David wrote:
> >>> On Mon, Jul 19, 2021 at 11:02 AM Christoph Timm <[email protected]> wrote:
> >>>> On 19.07.21 at 09:27, Yedidyah Bar David wrote:
> >>>>> On Mon, Jul 19, 2021 at 10:04 AM Christoph Timm <[email protected]> wrote:
> >>>>>> Hi Didi,
> >>>>>>
> >>>>>> thank you for the quick response.
> >>>>>>
> >>>>>> On 19.07.21 at 07:59, Yedidyah Bar David wrote:
> >>>>>>> On Mon, Jul 19, 2021 at 8:39 AM Christoph Timm <[email protected]> wrote:
> >>>>>>>> Hi List,
> >>>>>>>>
> >>>>>>>> I'm trying to understand why my hosted engine is moved from one node
> >>>>>>>> to another from time to time.
> >>>>>>>> It sometimes happens multiple times a day, but there are also days
> >>>>>>>> without it.
> >>>>>>>>
> >>>>>>>> I can see the following in the ovirt-hosted-engine-ha/agent.log:
> >>>>>>>> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
> >>>>>>>> Penalizing score by 1600 due to network status
> >>>>>>>>
> >>>>>>>> After that the engine is shut down and started on another host.
> >>>>>>>> The oVirt Admin portal shows the following around the same time:
> >>>>>>>> Invalid status on Data Center Default. Setting status to Non Responsive.
> >>>>>>>>
> >>>>>>>> But the whole cluster is working normally during that time.
> >>>>>>>>
> >>>>>>>> I believe that I somehow have a network issue on my side, but I have
> >>>>>>>> no clue what kind of check is causing the network status to be
> >>>>>>>> penalized.
> >>>>>>>>
> >>>>>>>> Does anyone have an idea how to investigate this further?
> >>>>>>> Please check also broker.log. Do you see 'dig' failures?
> >>>>>> Yes I found them as well.
> >>>>>>
> >>>>>> Thread-1::WARNING::2021-07-19 08:02:00,032::network::120::network.Network::(_dns) DNS query failed:
> >>>>>> ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
> >>>>>> ;; global options: +cmd
> >>>>>> ;; connection timed out; no servers could be reached
> >>>>>>
> >>>>>>> This happened several times already on our CI infrastructure, but
> >>>>>>> yours is the first report from an actual real user. See also:
> >>>>>>>
> >>>>>>> https://lists.ovirt.org/archives/list/[email protected]/thread/LIGS5WXGEKWACY5GCK7Z6Q2JYVWJ6JBF/
> >>>>>> So I understand that the following command is triggered to test the
> >>>>>> network: "dig +tries=1 +time=5"
> >>>>> Indeed.
> >>>>>
> >>>>>>> I didn't open a bug for this (yet?), also because I never reproduced
> >>>>>>> on my own machines and am not sure about the exact failing flow. If
> >>>>>>> this is reproducible reliably for you, you might want to test the
> >>>>>>> patch I pushed:
> >>>>>>>
> >>>>>>> https://gerrit.ovirt.org/c/ovirt-hosted-engine-ha/+/115596
> >>>>>> I'm happy to give it a try.
> >>>>>> Please confirm that I need to replace this file (network.py) on all my
> >>>>>> nodes (CentOS 8.4 based) which can host my engine.
> >>>>> It definitely makes sense to do so, but in principle there is no problem
> >>>>> with applying it only on some of them. That's especially useful if you
> >>>>> try this first on a test env and try to enforce a reproduction somehow
> >>>>> (overload the network, disconnect stuff, etc.).
> >>>> OK will give it a try and report back.
> >>> Thanks and good luck.
> Do I need to restart anything after that change?

Yes, the broker. This might restart some other services there, so it is best
to put the host into maintenance during this.

> Also please confirm that the comma after TCP is correct as there wasn't
> one before after the timeout in row 110.

It is correct, but not mandatory. We (my team, at least) often add it in such
cases so that a theoretical future patch that adds another parameter does not
need to add it again (thus making the patch smaller and hopefully cleaner).

> >>>
> >>>>>>> Other ideas/opinions about how to enhance this part of the monitoring
> >>>>>>> are most welcome.
> >>>>>>>
> >>>>>>> If this phenomenon is new for you, and you can reliably say it's not
> >>>>>>> due to a recent "natural" higher network load, I wonder if it's due to
> >>>>>>> some weird bug/change somewhere.
> >>>>>> I'm quite sure that I see this since we moved to 4.4.(4).
> >>>>>> Just for house keeping I'm running 4.4.7 now.
> >>>>> We use 'dig' as the network monitor since 4.3.5, around one year before
> >>>>> 4.4 was released: https://bugzilla.redhat.com/1659052
> >>>>>
> >>>>> Which version did you use before 4.4?
> >>>> The last 4.3 versions have been 4.3.7, 4.3.9 and 4.3.10 before migrating
> >>>> to 4.4.4.
> >>> I now realize that in above-linked bug we only changed the default, for
> >>> new setups. So if you deployed HE before 4.3.5, upgrade to later 4.3 would
> >>> not change the default (as opposed to upgrade to 4.4, which was actually a
> >>> new deployment with engine backup/restore). Do you know which version
> >>> your cluster was originally deployed with?
> >> Hm, I'm sorry but I don't recall this. I'm quite sure that we started
> > OK, thanks for trying.
> >
> >> with 4.0 something. But we moved to a HE setup around September 2019.
> >> But I don't recall the version. But we installed also the backup from
> >> the old installation into the HE environment if I'm not wrong.
> > If indeed this change was the trigger for you, you can rather easily try to
> > change this to 'ping' and see if this helps - I think it's enough to change
> > 'network_test' to 'ping' in /etc/ovirt-hosted-engine/hosted-engine.conf
> > and restart the broker - didn't try, though. But generally speaking, I do
> > not think we want to change the default back to 'ping', but rather make
> > 'dns' work better/well. We had valid reasons to move away from ping...
> OK I will try this if the tcp change does not help me.

Ok. In parallel, especially if this is reproducible, you might want to do
some general monitoring of your network - packet losses, etc. - and correlate
this with the failures you see.

Best regards,
-- 
Didi
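
As a rough illustration of the correlation suggested above, here is a minimal Python sketch that counts 'DNS query failed' lines from broker.log and 'Penalizing score ... due to network status' lines from agent.log per hour. The timestamp layout is taken from the broker.log line quoted in this thread; that agent.log lines carry the same timestamp layout is an assumption, and the function names (event_times, per_hour, correlate) are made up for this example and are not part of ovirt-hosted-engine-ha.

```python
import re
from collections import Counter
from datetime import datetime

# Timestamp layout as seen in the broker.log line quoted in this thread:
#   Thread-1::WARNING::2021-07-19 08:02:00,032::network::120::...
# Assumption: agent.log lines use the same layout.
TIMESTAMP_RE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}")

# Substrings copied from the log messages quoted in this thread.
DNS_FAILURE = "DNS query failed"
SCORE_PENALTY = "due to network status"


def event_times(lines, marker):
    """Return the timestamps of all lines that contain `marker`."""
    times = []
    for line in lines:
        if marker not in line:
            continue
        match = TIMESTAMP_RE.search(line)
        if match:
            times.append(
                datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            )
    return times


def per_hour(times):
    """Count events per hour (minutes and seconds are truncated)."""
    return Counter(t.replace(minute=0, second=0) for t in times)


def correlate(broker_lines, agent_lines):
    """Map each hour to (DNS failures, score penalties) seen in that hour."""
    dns = per_hour(event_times(broker_lines, DNS_FAILURE))
    penalties = per_hour(event_times(agent_lines, SCORE_PENALTY))
    hours = sorted(set(dns) | set(penalties))
    return {hour: (dns.get(hour, 0), penalties.get(hour, 0)) for hour in hours}
```

Passing the lines of both log files to correlate() gives one (DNS failures, penalties) pair per hour, which can then be compared against whatever packet-loss data your network monitoring collects for the same hours.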

