On Mon, Aug 22, 2022 at 9:52 AM <markec...@gmail.com> wrote: > > Hi! > In the last few days I am having problem with Hosted-Engine, it keeps > restarting. Sometimes after few minutes, sometimes after few hours.. > I haven't done any changes on oVirt or network in that time. > Version is 4.4.10.7-1.el8. (this was also installation version) > > Here are the logs: > Agent.log------------------------------ > MainThread::INFO::2022-08-21 > 09:48:36,200::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) > Current state EngineUp (score: 2440)
2440 is pretty low - did you check what lowers it, other than the single failed network test below? > MainThread::INFO::2022-08-21 > 09:48:36,200::hosted_engine::525::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) > Best remote host node3.ovirt.example.org (id: 3, score: 2440) > MainThread::ERROR::2022-08-21 > 09:48:46,212::states::398::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) > Host node3.ovirt.example.org (id 3) score is significantly better than local > score, shutting down VM on this host > MainThread::INFO::2022-08-21 > 09:48:46,641::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) > Success, was notification of state_transition (EngineUp-EngineStop) sent? > ignored > MainThread::INFO::2022-08-21 > 09:48:46,706::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) > Current state EngineStop (score: 3400) > MainThread::INFO::2022-08-21 > 09:48:46,706::hosted_engine::525::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) > Best remote host node3.ovirt.example.org (id: 3, score: 3400) > MainThread::INFO::2022-08-21 > 09:48:56,714::hosted_engine::934::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) > Shutting down vm using `/usr/sbin/hosted-engine --vm-shutdown` > MainThread::INFO::2022-08-21 > 09:48:56,871::hosted_engine::941::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) > stdout: > MainThread::INFO::2022-08-21 > 09:48:56,871::hosted_engine::942::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) > stderr: > MainThread::ERROR::2022-08-21 > 09:48:56,871::hosted_engine::950::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) > Engine VM stopped on localhost > MainThread::INFO::2022-08-21 > 09:48:56,880::state_decorators::102::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) > Timeout set to Sun Aug 21 09:53:56 2022 while transitioning <class > 'ovirt_hosted_engine_ha.agent.states.EngineStop'> -> <class > 'ovirt_hosted_engine_ha.agent.states.EngineStop'> > MainThread::INFO::2022-08-21 > 09:48:56,959::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) > Current state EngineStop (score: 3400) > MainThread::INFO::2022-08-21 > 09:49:06,977::states::537::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) > Engine vm not running on local host > MainThread::INFO::2022-08-21 > 09:49:06,983::state_decorators::95::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) > Timeout cleared while transitioning <class > 'ovirt_hosted_engine_ha.agent.states.EngineStop'> -> <class > 'ovirt_hosted_engine_ha.agent.states.EngineDown'> > MainThread::INFO::2022-08-21 > 09:49:07,173::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) > Success, was notification of state_transition (EngineStop-EngineDown) sent? > ignored > MainThread::INFO::2022-08-21 > 09:49:07,795::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) > Current state EngineDown (score: 3400) > MainThread::INFO::2022-08-21 > 09:49:16,811::states::472::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) > Engine down and local host has best score (3400), attempting to start engine > VM > MainThread::INFO::2022-08-21 > 09:49:16,998::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) > Success, was notification of state_transition (EngineDown-EngineStart) sent? > ignored > MainThread::INFO::2022-08-21 > 09:49:17,179::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) > Current state EngineStart (score: 3400) > MainThread::INFO::2022-08-21 > 09:49:17,195::hosted_engine::895::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state) > Ensuring VDSM state is clear for engine VM > MainThread::INFO::2022-08-21 > 09:49:17,200::hosted_engine::915::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state) > Cleaning state for non-running VM > MainThread::INFO::2022-08-21 > 09:49:18,211::hosted_engine::907::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state) > Vdsm state for VM clean > MainThread::INFO::2022-08-21 > 09:49:18,212::hosted_engine::853::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) > Starting vm using `/usr/sbin/hosted-engine --vm-start` > MainThread::INFO::2022-08-21 > 09:49:18,814::hosted_engine::862::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) > stdout: VM in WaitForLaunch > > MainThread::INFO::2022-08-21 > 09:49:18,814::hosted_engine::863::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) > stderr: Command VM.getStats with args {'vmID': > 'cc7931ff-8124-4724-9242-abea2ab5bf42'} failed: > (code=1, message=Virtual machine does not exist: {'vmId': > 'cc7931ff-8124-4724-9242-abea2ab5bf42'}) > > MainThread::INFO::2022-08-21 > 09:49:18,814::hosted_engine::875::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) > Engine VM started on localhost > MainThread::INFO::2022-08-21 > 09:49:18,999::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) > Success, was notification of state_transition (EngineStart-EngineStarting) > sent? ignored > MainThread::INFO::2022-08-21 > 09:49:19,008::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) > Current state EngineStarting (score: 3400) > MainThread::INFO::2022-08-21 > 09:49:29,027::states::741::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) > VM is powering up.. > MainThread::INFO::2022-08-21 > 09:49:29,033::state_decorators::102::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) > Timeout set to Sun Aug 21 09:59:29 2022 while transitioning <class > 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class > 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> > MainThread::INFO::2022-08-21 > 09:49:29,109::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) > Current state EngineStarting (score: 3400) > MainThread::INFO::2022-08-21 > 09:49:38,121::states::741::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) > VM is powering up.. > MainThread::INFO::2022-08-21 > 09:49:38,195::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) > Current state EngineStarting (score: 3400) > MainThread::INFO::2022-08-21 > 09:49:48,218::state_decorators::95::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) > Timeout cleared while transitioning <class > 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class > 'ovirt_hosted_engine_ha.agent.states.EngineUp'> > MainThread::INFO::2022-08-21 > 09:49:48,403::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) > Success, was notification of state_transition (EngineStarting-EngineUp) > sent? ignored > MainThread::INFO::2022-08-21 > 09:49:48,713::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) > Current state EngineUp (score: 3400) > MainThread::INFO::2022-08-21 > 09:49:58,725::states::406::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) > Engine vm running on localhost > > > > Broker.log------------------------------ > Thread-4::INFO::2022-08-21 > 09:47:59,342::cpu_load_no_engine::142::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) > System load total=0.0241, engine=0.0013, non-engine=0.0228 > Thread-3::INFO::2022-08-21 > 09:48:01,311::mem_free::51::mem_free.MemFree::(action) memFree: 96106 > Thread-5::INFO::2022-08-21 > 09:48:05,612::engine_health::246::engine_health.EngineHealth::(_result_from_stats) > VM is up on this host with healthy engine > Thread-2::INFO::2022-08-21 > 09:48:08,591::mgmt_bridge::65::mgmt_bridge.MgmtBridge::(action) Found bridge > ovirtmgmt in up state > Thread-1::WARNING::2022-08-21 > 09:48:10,352::network::121::network.Network::(_dns) DNS query failed: > ; <<>> DiG 9.11.36-RedHat-9.11.36-3.el8 <<>> +tries=1 +time=5 +tcp > ;; global options: +cmd > ;; connection timed out; no servers could be reached > > Thread-1::WARNING::2022-08-21 > 09:48:10,352::network::92::network.Network::(action) Failed to verify network > status, (4 out of 5) > Thread-3::INFO::2022-08-21 > 09:48:11,389::mem_free::51::mem_free.MemFree::(action) memFree: 96089 > Thread-5::INFO::2022-08-21 > 09:48:15,707::engine_health::246::engine_health.EngineHealth::(_result_from_stats) > VM is up on this host with healthy engine > Thread-2::INFO::2022-08-21 > 09:48:18,662::mgmt_bridge::65::mgmt_bridge.MgmtBridge::(action) Found bridge > ovirtmgmt in up state > Thread-1::WARNING::2022-08-21 > 09:48:18,879::network::121::network.Network::(_dns) DNS query failed: > ; <<>> DiG 9.11.36-RedHat-9.11.36-3.el8 <<>> +tries=1 +time=5 +tcp > ;; global options: +cmd > ;; connection timed out; no servers could be reached > > Thread-3::INFO::2022-08-21 > 09:48:21,467::mem_free::51::mem_free.MemFree::(action) memFree: 96072 > Thread-1::WARNING::2022-08-21 > 09:48:24,904::network::121::network.Network::(_dns) DNS query failed: > ; <<>> DiG 9.11.36-RedHat-9.11.36-3.el8 <<>> +tries=1 +time=5 +tcp > ;; global options: +cmd > ;; connection timed out; no servers could be reached > > Thread-5::INFO::2022-08-21 > 09:48:25,808::engine_health::246::engine_health.EngineHealth::(_result_from_stats) > VM is up on this host with healthy engine > Thread-2::INFO::2022-08-21 > 09:48:28,740::mgmt_bridge::65::mgmt_bridge.MgmtBridge::(action) Found bridge > ovirtmgmt in up state > Thread-1::WARNING::2022-08-21 > 09:48:30,416::network::121::network.Network::(_dns) DNS query failed: > ; <<>> DiG 9.11.36-RedHat-9.11.36-3.el8 <<>> +tries=1 +time=5 +tcp > ;; global options: +cmd > ;; connection timed out; no servers could be reached > > Thread-1::WARNING::2022-08-21 > 09:48:30,416::network::92::network.Network::(action) Failed to verify network > status, (2 out of 5) > Thread-3::INFO::2022-08-21 > 09:48:31,545::mem_free::51::mem_free.MemFree::(action) memFree: 96064 > Thread-5::INFO::2022-08-21 > 09:48:35,909::engine_health::246::engine_health.EngineHealth::(_result_from_stats) > VM is up on this host with healthy engine > Thread-1::WARNING::2022-08-21 > 09:48:35,940::network::121::network.Network::(_dns) DNS query failed: > ; <<>> DiG 9.11.36-RedHat-9.11.36-3.el8 <<>> +tries=1 +time=5 +tcp > ;; global options: +cmd > ;; connection timed out; no servers could be reached > > Thread-1::WARNING::2022-08-21 > 09:48:37,480::network::92::network.Network::(action) Failed to verify network > status, (4 out of 5) > Thread-2::INFO::2022-08-21 > 09:48:38,809::mgmt_bridge::65::mgmt_bridge.MgmtBridge::(action) Found bridge > ovirtmgmt in up state > Thread-3::INFO::2022-08-21 > 09:48:41,623::mem_free::51::mem_free.MemFree::(action) memFree: 96014 > Thread-1::INFO::2022-08-21 > 09:48:42,549::network::88::network.Network::(action) Successfully verified > network status > Thread-5::INFO::2022-08-21 > 09:48:46,011::engine_health::246::engine_health.EngineHealth::(_result_from_stats) > VM is up on this host with healthy engine > Listener::ERROR::2022-08-21 > 09:48:46,639::notifications::42::ovirt_hosted_engine_ha.broker.notifications.Notifications::(send_email) > (530, b'5.7.1 Authentication required', 'ale...@example.org.hr') > > At first I thought that it is related to this bugs. > https://lists.ovirt.org/archives/list/users@ovirt.org/thread/2HTD5WR43M5MUTEDMM4HRFBADIXEQNB4/ > https://bugzilla.redhat.com/show_bug.cgi?id=1984356 > > But in this oVirt version that bug should already be solved. Indeed - as you can see in your log snippet above, dig was called with '+tcp'. > I was trying to monitor network, but this error keeps happening even if > network load is low. It might be something else. > I did try to do continuous dig and ping commands on VM-s running on same host > as Hosted Engine, and did not have any network problems, not even one > connection drop. What exactly do you test? Before fixing the above bug, we added to ovirt-system-tests loops of 'dig', and did see drops - not many, but enough, apparently, and often. The 'dig' test is not very configurable, from -ha's POV - but you do have control over it from elsewhere - resolv.conf, your name server, etc. Also, note that it runs 'dig' without passing a query, and the default query is for '.' - the root - perhaps your name server has some problem with this? > > Any solutions or next steps i should try? You can configure the agent/broker to log at DEBUG level, to see some more details. You can also change the network monitoring method, and/or configure options for methods that do have them - e.g. 'tcp' and 'tcp_t_address', 'tcp_t_port'. See e.g.: https://www.ovirt.org/documentation/administration_guide/index.html#Administering_the_Manager_Virtual_Machine https://www.ovirt.org/develop/release-management/features/sla/hosted-engine-network-check.html Best regards, -- Didi _______________________________________________ Users mailing list -- users@ovirt.org To unsubscribe send an email to users-le...@ovirt.org Privacy Statement: https://www.ovirt.org/privacy-policy.html oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/ List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/2CNLAZZF2D5PGEGINZVQ4QV22J27BN65/