On Jun 6, 2019 12:52, [email protected] wrote:
>
> Hello,
>
> I ran into a problem last month that I think is worth discussing here. I'm
> sorry I didn't post earlier; time got away from me.
>
> I have set up a glustered, hyperconverged oVirt environment for experimental
> use, to see how it behaves and to get used to its management and
> performance before setting it up as a production environment in our
> organization. The environment has been up and running since October 2018. The
> three nodes are HP ProLiant DL380 G7 and have the following characteristics:
>
> Mem: 22GB
> CPU: 2x Hexa Core - Intel Xeon Hexa Core E56xx
> HDD: 5x 300GB
> Network: BCM5709C with dual-port Gigabit
> OS: Linux RedHat 7.5.1804(Core 3.10.0-862.3.2.el7.x86_64 x86_64) - Ovirt Node
> 4.2.3.1
>
> As I was working on the environment, the engine stopped working.
> Shortly before the HE stopped, I was managing my VMs in the web interface
> when the browser froze; the HE was also not responding to ICMP requests.
>
> The first thing I did was to connect via ssh to all nodes and run the command
> #hosted-engine --vm-status
> which showed that the HE was down in nodes 1 and 2 and up on the 3rd node.
>
> After executing
> #virsh -r list
> the list showed two of the VMs I had previously created, both up; the HE
> was nowhere to be seen.
>
> I tried to restart the HE with the
> #hosted-engine --vm-start
> but it didn't work.
>
> I then put all nodes in maintenance mode with the command
> #hosted-engine --set-maintenance --mode=global
> (I guess I should have done that earlier) and re-ran
> #hosted-engine --vm-start
> with the same result as before.
>
> After checking the mails the system sent to the root user, I saw there were
> several mails on the 3rd node (where the HE had been) reporting the HE's
> state. The messages cycled through the transitions EngineDown-EngineStart,
> EngineStart-EngineStarting, EngineStarting-EngineMaybeAway,
> EngineMaybeAway-EngineUnexpectedlyDown, EngineUnexpectedlyDown-EngineDown,
> EngineDown-EngineStart and so forth.
>
> I continued by searching the following logs on all nodes:
> /var/log/libvirt/qemu/HostedEngine.log
> /var/log/libvirt/qemu/win10.log
> /var/log/libvirt/qemu/DNStest.log
> /var/log/vdsm/vdsm.log
> /var/log/ovirt-hosted-engine-ha/agent.log
>
> After that I spotted an error that had started appearing almost a month
> earlier on node #2:
> ERROR Internal server error
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
>     res = method(**params)
>   File "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in _dynamicMethod
>     result = fn(*methodArgs)
>   File "/usr/lib/python2.7/site-packages/vdsm/gluster/apiwrapper.py", line 85, in logicalVolumeList
>     return self._gluster.logicalVolumeList()
>   File "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 90, in wrapper
>     rv = func(*args, **kwargs)
>   File "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 808, in logicalVolumeList
>     status = self.svdsmProxy.glusterLogicalVolumeList()
>   File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__
>     return callMethod()
>   File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 52, in <lambda>
>     getattr(self._supervdsmProxy._svdsm, self._funcName)(*args,
> AttributeError: 'AutoProxy[instance]' object has no attribute 'glusterLogicalVolumeList'
>
>
> The outputs of the following commands were also checked as a way to see if
> there was a mandatory process missing/killed, a memory problem or even disk
> space shortage that led to the sudden death of a process
> #ps -A
> #top
> #free -h
> #df -hT
>
> Finally, after some time delving in the logs, the output of the
> #journalctl --dmesg
> showed the following message
> "Out of memory: Kill process 5422 (qemu-kvm) score 514 or sacrifice child.
> Killed process 5422 (qemu-kvm) total-vm:17526548kB, anon-rss:9310396kB,
> file-rss:2336kB, shmem-rss:12kB"
> after which ovirtmgmt stopped responding.
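To pull just the OOM kills out of the kernel log, a minimal sketch along these
lines may help. The function name oom_summary is hypothetical, and the pattern
assumes the message format quoted above (one line containing
"Killed process <pid> (<name>) ... anon-rss:<n>kB"); other kernel versions may
word it differently.

```shell
#!/bin/bash
# Hypothetical helper: read kernel log lines on stdin and print one summary
# line per OOM kill, with the resident anonymous memory converted to MiB.
oom_summary() {
    sed -n 's/.*Killed process \([0-9]*\) (\([^)]*\)).*anon-rss:\([0-9]*\)kB.*/\1 \2 \3/p' |
    awk '{ printf "pid=%s name=%s anon_rss_mib=%d\n", $1, $2, $3 / 1024 }'
}

# Usage on a node (not run here):
#   journalctl -k --no-pager | oom_summary
```

For the kill quoted above this gives roughly 9 GiB of resident memory for a
single qemu-kvm process on a 22GB host, which is worth comparing with the
memory assigned to the VMs that were running on that node.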
If you run out of memory, you should take that seriously. Dropping the cache
seems like a workaround and not a fix.
Check if KSM is enabled - this will merge your VMs' memory pages in
exchange for CPU cycles - still better than getting a VM killed.
Also, you can protect the HostedEngine from the OOM killer.
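Both checks can be sketched roughly as below. The function names are
hypothetical; /sys/kernel/mm/ksm/run and /proc/<pid>/oom_score_adj are the
standard kernel interfaces. The pgrep pattern in the usage comment assumes the
qemu-kvm command line contains the VM name "HostedEngine", and the value -500
is only an example. The adjustment is lost when the VM restarts, and a manual
KSM change may be overridden by the host's own tuning services, so treat this
as a way to inspect and experiment rather than a permanent setting.

```shell
#!/bin/bash
# Report the KSM state. The directory is a parameter only so the function can
# be pointed elsewhere for testing; the default is the real sysfs location.
ksm_status() {
    local dir="${1:-/sys/kernel/mm/ksm}"
    local run
    run=$(cat "$dir/run" 2>/dev/null) || { echo "unknown"; return 1; }
    case "$run" in
        0) echo "stopped" ;;
        1) echo "running" ;;
        2) echo "stopped, pages unmerged" ;;
        *) echo "unknown" ;;
    esac
}

# Make a process a less likely OOM-killer target (needs root).
# The valid range is -1000..1000; -1000 exempts the process entirely.
protect_from_oom() {
    local pid="$1" adj="${2:--500}"
    echo "$adj" > "/proc/$pid/oom_score_adj"
}

# Usage on a node (not run here; the pattern is an assumption):
#   ksm_status
#   pid=$(pgrep -f 'qemu-kvm.*HostedEngine' | head -n1)
#   [ -n "$pid" ] && protect_from_oom "$pid" -500
```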
> I tried to restart vhostmd by executing
> #/etc/rc.d/init.d/vhostmd start
> but it didn't work.
>
> Finally, I decided to run the HE restart command on the other nodes as well
> (I had assumed that since the HE was last running on node #3, that was where
> I should try to restart it). So I ran
> #hosted-engine --vm-start
> and the output was
> "Command VM.getStats with args {'vmID':'...<the HE's ID>....'} failed:
> (code=1,message=Virtual machine does not exist: {'vmID':'...<the HE's
> ID>....'})"
> I then ran the command again and the output was
> "VM exists and its status is Powering Up."
>
> After that I executed
> #virsh -r list
> and the output was the following:
> Id Name State
> ----------------------------------------------------
> 2 HostedEngine running
>
> After the HE restarted, two mails arrived reporting:
> ReinitializeFSM-EngineStarting and EngineStarting-EngineUp
>
> After that, and after checking that we had access to the web interface again,
> we executed
> #hosted-engine --set-maintenance --mode=none
> to leave maintenance mode.
>
> The thing is, I am still not 1000% sure what led to the shutdown of the
> hosted engine, and I think some of the steps I took may not have been
> needed. I believe the qemu-kvm process was killed because there was not
> enough memory for it, but is this the real cause? I wasn't doing anything
> unusual before the shutdown that would suggest it was caused by the
> new VM that was still in shutdown mode or anything of the sort. I also
> suspect a memory shortage because I hadn't executed the
> #sync ; echo 3 > /proc/sys/vm/drop_caches
> command for a couple of weeks.
>
> What are your thoughts on this? Could you point me to where to look for
> more information on the topic, or tell me what the right process is to
> follow when something like this happens?
Check sar (there is a graphical util called 'ksar') and look at CPU, memory,
swap, context switches, I/O and network usage.
Create a simple systemd service to monitor your nodes, or even better, set up
real monitoring software so you can act proactively.
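As a starting point for such a check, here is a minimal sketch that a systemd
service or timer could call. The function names and the 10% default threshold
are assumptions for illustration, and it relies on the MemAvailable field
being present in /proc/meminfo on the node. MemAvailable already counts
reclaimable buffers and cache, so it is a better signal than "free" memory
when the page cache looks full.

```shell
#!/bin/bash
# Print available memory as an integer percentage of total memory.
# The file is a parameter only so the function can be tested on sample data.
mem_available_pct() {
    local file="${1:-/proc/meminfo}"
    awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2}
         END { if (t > 0) printf "%d\n", a * 100 / t; else exit 1 }' "$file"
}

# Print a status line; return 1 when below the threshold (percent),
# 2 when the memory figures could not be read.
check_memory() {
    local threshold="${1:-10}" file="${2:-/proc/meminfo}" pct
    pct=$(mem_available_pct "$file") || return 2
    if [ "$pct" -lt "$threshold" ]; then
        echo "WARNING: only ${pct}% of memory available"
        return 1
    fi
    echo "OK: ${pct}% of memory available"
}

# Usage on a node (not run here):
#   check_memory 10 || logger -p user.warning "low memory on $(hostname)"
```

For the historical view, the matching sar reports are sar -u (CPU), sar -r
(memory), sar -S (swap), sar -w (context switches), sar -b (I/O) and
sar -n DEV (network).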
> Also, I have set up a few VMs, but only three are Up and they have no users
> yet. Even so, the buffers fill almost to the brim while usage is almost
> non-existent. If you have an environment that has some users, or you use the
> VMs as virtual servers of some sort, what does memory consumption look like?
> What is the optimal amount of memory?
What is your tuned profile? Any customizations there?
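A quick way to answer that on each node is sketched below. The function name
is hypothetical and the parsing assumes the usual "Current active profile:
<name>" output of tuned-adm active. Custom profiles, if any, normally live
under /etc/tuned/<name>/tuned.conf.

```shell
#!/bin/bash
# Extract the profile name from `tuned-adm active` output read on stdin.
active_tuned_profile() {
    sed -n 's/^Current active profile: //p'
}

# Usage on a node (not run here):
#   tuned-adm active | active_tuned_profile
#   tuned-adm list      # profiles available on the host
#   ls /etc/tuned/      # locally customized profiles, if any
```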
Best Regards,
Strahil Nikolov
> Thank you all very much.
> _______________________________________________
> Users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
> oVirt Code of Conduct:
> https://www.ovirt.org/community/about/community-guidelines/
> List Archives:
> https://lists.ovirt.org/archives/list/[email protected]/message/PKRB26GSDQ5JVHD75HEPK346NTI7UQK2/
List Archives:
https://lists.ovirt.org/archives/list/[email protected]/message/R6KLODYO4T5TKCSIULXQD2SEWGS74WTQ/