[ovirt-users] Re: Hosted Engine Abruptly Stopped Responding - Unexpected Shutdown

Strahil Thu, 06 Jun 2019 07:04:43 -0700

On Jun 6, 2019 12:52, [email protected] wrote:
>
> Hello, 
>
> I ran into a problem last month that I think is worth discussing here. 
> I'm sorry I didn't post earlier, but time got away from me. 
>
> I have set up a Gluster-backed, hyperconverged oVirt environment for 
> experimental use, to observe its behaviour and get used to its management and 
> performance before setting it up as a production environment for use in our 
> organization. The environment has been up and running since October 2018. The 
> three nodes are HP ProLiant DL380 G7 and have the following characteristics: 
>
> Mem: 22GB 
> CPU: 2x Hexa Core - Intel Xeon Hexa Core E56xx 
> HDD: 5x 300GB 
> Network: BCM5709C with dual-port Gigabit 
> OS: Linux RedHat 7.5.1804(Core 3.10.0-862.3.2.el7.x86_64 x86_64) - Ovirt Node 
> 4.2.3.1 
>
> As I was working on the environment, the engine stopped working. 
> Not long before the time the HE stopped, I was in the web interface managing 
> my VMs, when the browser froze and the HE was also not responding to ICMP 
> requests. 
>
> The first thing I did was to connect via ssh to all nodes and run the command 
> #hosted-engine --vm-status 
> which showed that the HE was down in nodes 1 and 2 and up on the 3rd node. 
>
> After executing 
> #virsh -r list 
> the VM list that was shown contained two of the VMs I had previously created 
> and were up; the HE was nowhere. 
>
> I tried to restart the HE with the 
> #hosted-engine --vm-start 
> but it didn't work. 
>
> I then put all nodes in maintenance mode with the command 
> #hosted-engine --set-maintenance --mode=global 
> (I guess I should have done that earlier) and re-ran 
> #hosted-engine --vm-start 
> which had the same result as before. 
>
> After checking the mails the system sent to the root user, I saw there were 
> several mails on the 3rd node (where the HE had been), informing of the HE's 
> state. The messages were changing between EngineDown-EngineStart, 
> EngineStart-EngineStarting, EngineStarting-EngineMaybeAway, 
> EngineMaybeAway-EngineUnexpectedlyDown, EngineUnexpectedlyDown-EngineDown, 
> EngineDown-EngineStart and so forth. 
>
> I continued by searching the following logs in all nodes : 
> /var/log/libvirt/qemu/HostedEngine.log 
> /var/log/libvirt/qemu/win10.log 
> /var/log/libvirt/qemu/DNStest.log 
> /var/log/vdsm/vdsm.log 
> /var/log/ovirt-hosted-engine-ha/agent.log 
>
> After that I spotted an error that had started appearing almost a month 
> earlier on node #2: 
> ERROR Internal server error
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
>     res = method(**params)
>   File "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in _dynamicMethod
>     result = fn(*methodArgs)
>   File "/usr/lib/python2.7/site-packages/vdsm/gluster/apiwrapper.py", line 85, in logicalVolumeList
>     return self._gluster.logicalVolumeList()
>   File "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 90, in wrapper
>     rv = func(*args, **kwargs)
>   File "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 808, in logicalVolumeList
>     status = self.svdsmProxy.glusterLogicalVolumeList()
>   File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__
>     return callMethod()
>   File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 52, in <lambda>
>     getattr(self._supervdsmProxy._svdsm, self._funcName)(*args,
> AttributeError: 'AutoProxy[instance]' object has no attribute 'glusterLogicalVolumeList'
>
>
> The outputs of the following commands were also checked as a way to see if 
> there was a mandatory process missing/killed, a memory problem or even disk 
> space shortage that led to the sudden death of a process 
> #ps -A 
> #top 
> #free -h 
> #df -hT 
>
> Finally, after some time delving in the logs, the output of the 
> #journalctl --dmesg 
> showed the following message 
> "Out of memory: Kill process 5422 (qemu-kvm) score 514 or sacrifice child. 
> Killed process 5422 (qemu-kvm) total-vm:17526548kB, anon-rss:9310396kB, 
> file-rss:2336kB, shmem-rss:12kB" 
> after which ovirtmgmt stopped responding. 
If you run out of memory, you should take that seriously. Dropping the cache is 
a workaround, not a fix.
Check whether KSM is enabled. It merges identical memory pages across your VMs 
in exchange for CPU cycles, which is still better than getting a VM killed.
You can also protect the HostedEngine from the OOM killer.
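
As a rough illustration of the KSM and OOM-killer suggestions, here is a minimal sketch. It assumes the standard sysfs location /sys/kernel/mm/ksm and the /proc/<pid>/oom_score_adj interface; the function names and the -500 default are made up for this example. On oVirt hosts KSM is normally driven by other services, and libvirt or vdsm may reset the OOM adjustment when the VM restarts, so treat this as a way to inspect and experiment rather than a permanent fix.

```shell
#!/bin/sh
# Sketch only: inspect KSM state and lower a process's OOM score.
# KSM_DIR defaults to the standard sysfs location; override it for testing.
KSM_DIR="${KSM_DIR:-/sys/kernel/mm/ksm}"

# Map the value of $KSM_DIR/run to a human-readable label.
ksm_state_label() {
    case "$1" in
        0) echo "stopped" ;;
        1) echo "running" ;;
        2) echo "stopped, pages unmerged" ;;
        *) echo "unknown" ;;
    esac
}

# Print the current KSM state and the number of shared pages.
ksm_report() {
    run=$(cat "$KSM_DIR/run" 2>/dev/null) || run=""
    sharing=$(cat "$KSM_DIR/pages_sharing" 2>/dev/null) || sharing="?"
    echo "KSM: $(ksm_state_label "$run") (pages_sharing=$sharing)"
}

# Lower the OOM score of one PID (needs root). -1000 exempts the process
# entirely; a milder value such as -500 only makes it a less likely victim.
# (Hypothetical helper; the default of -500 is an arbitrary choice.)
protect_pid_from_oom() {
    pid="$1"
    adj="${2:--500}"
    echo "$adj" > "/proc/$pid/oom_score_adj"
}

ksm_report
```

To apply the second helper you first need the PID of the engine's qemu-kvm process. Something like `pgrep -f 'qemu-kvm.*HostedEngine'` may find it, but that pattern is an assumption about how the process is named on your hosts, so check it with `ps` before relying on it.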


> I tried to restart vhostmd by executing 
> #/etc/rc.d/init.d/vhostmd start 
> but it didn't work. 
>
> Finally, I decided to run the HE restart command on the other nodes as well 
> (I'd figured that since the HE was last running on node #3, that's where 
> I should try to restart it). So, I ran 
> #hosted-engine --vm-start 
> and the output was 
> "Command VM.getStats with args {'vmID':'...<the HE's ID>....'} failed: 
> (code=1,message=Virtual machine does not exist: {'vmID':'...<the HE's 
> ID>....'})" 
> And then I ran the command again and the output was 
> "VM exists and its status is Powering Up." 
>
> After that I executed 
> #virsh -r list 
> and the output was the following: 
> Id     Name                   State 
> ---------------------------------------------------- 
> 2      HostedEngine      running 
>
> After the HE's restart two mails came that stated: 
> ReinitializeFSMEngineStarting and EngineStarting-EngineUp 
>
> After that and after checking that we had access to the web interface again, 
> we executed 
> hosted-engine --set-maintenance --mode=none 
> to get out of the maintenance mode. 
>
> The thing is, I still am not 1000% sure what the problem was that led to the 
> shutdown of the hosted engine and I think that maybe some of the steps I took 
> were not needed. I believe it was because the process qemu-kvm was killed 
> after there was not enough memory for it but is this the real cause? I wasn't 
> doing anything unusual before the shutdown to believe it was because of the 
> new VM that was still in shutdown mode or anything of the sort. Also, I 
> believe it may be because of memory shortage because I hadn't executed the 
> #sync ; echo 3 > /proc/sys/vm/drop_caches 
> command for a couple of weeks. 
>
> What are your thoughts on this? Could you point me to where to search for 
> more information on the topic or tell me what is the right process to follow 
> when something like this happens? 

Check sar (there is a graphical utility called 'ksar') and look at CPU, memory, 
swap, context switches, I/O and network usage.
Create a simple systemd service to monitor your nodes, or better yet deploy real 
monitoring software so you can act proactively.
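
A minimal sketch of the kind of check such a service could run, assuming the kernel exposes MemAvailable in /proc/meminfo. The names MEMINFO, MIN_AVAIL_PCT, mem_available_pct and check_memory are invented for this example, and the 10% threshold is an arbitrary starting point.

```shell
#!/bin/sh
# Sketch: report when available memory drops below a threshold.
# MEMINFO and MIN_AVAIL_PCT are illustrative knobs, not oVirt settings.
MEMINFO="${MEMINFO:-/proc/meminfo}"
MIN_AVAIL_PCT="${MIN_AVAIL_PCT:-10}"

# Print MemAvailable as an integer percentage of MemTotal (-1 if unknown).
mem_available_pct() {
    awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2}
         END { if (t > 0) printf "%d\n", a * 100 / t; else print -1 }' "$MEMINFO"
}

# Print a WARNING line when memory is below the threshold, an OK line otherwise.
check_memory() {
    pct=$(mem_available_pct)
    if [ "$pct" -ge 0 ] && [ "$pct" -lt "$MIN_AVAIL_PCT" ]; then
        echo "WARNING: only ${pct}% of memory available"
    else
        echo "OK: ${pct}% of memory available"
    fi
}

# Run the check once if the meminfo file is readable.
if [ -r "$MEMINFO" ]; then
    check_memory
fi
```

Run periodically from a systemd timer or cron and piped to `logger`, this would leave a trail in the journal before the OOM killer steps in. For history, `sar -r` (memory), `sar -S` (swap), `sar -w` (context switches) and `sar -n DEV` (network) cover most of the metrics listed above.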


> Also, I have set up a few VMs, but only three are Up and they have no users 
> yet. Even so, the buffers fill almost to the brim when usage is almost 
> non-existent. If you have an environment that has some users, or you use the 
> VMs as virtual servers of some sort, how much memory does it consume? 
> What's the optimal amount of memory? 

What is your tuned profile? Any customizations there?

Best Regards,
Strahil Nikolov
> Thank you all very much.
> _______________________________________________
> Users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
> oVirt Code of Conduct: 
> https://www.ovirt.org/community/about/community-guidelines/
> List Archives: 
> https://lists.ovirt.org/archives/list/[email protected]/message/PKRB26GSDQ5JVHD75HEPK346NTI7UQK2/
List Archives: 
https://lists.ovirt.org/archives/list/[email protected]/message/R6KLODYO4T5TKCSIULXQD2SEWGS74WTQ/
