On Mon, Mar 18, 2019 at 4:15 PM levin <[email protected]> wrote:
>
> Hi Sahina,
>
> My cluster does not have fencing enabled. Is there somewhere I can disable
> the restart policy in vdsm completely? That way I can observe this case the
> next time it happens and do a first investigation on the unresponsive node.

+Martin Perina - do you know if this is possible?
> Regards,
> Levin
>
>
> On 18/3/2019, 17:40, "Sahina Bose" <[email protected]> wrote:
>
> On Sun, Mar 17, 2019 at 12:56 PM <[email protected]> wrote:
> >
> > Hi, I have twice experienced a total outage of a 3-node hyper-converged
> > 4.2.8 oVirt cluster because vdsm reactivated an unresponsive node and
> > caused multiple glusterfs daemon restarts. As a result, all VMs were
> > paused and some disk images were corrupted.
> >
> > At the very beginning, one of the oVirt nodes was overloaded (high memory
> > and CPU usage), so the hosted engine had trouble collecting status from
> > vdsm, marked the node as unresponsive and started migrating its workload
> > to a healthy node. However, while that migration was running, the second
> > oVirt node also became unresponsive, because vdsm tried to reactivate the
> > first unresponsive node and restarted its glusterd. The gluster domain
> > was therefore re-acquiring quorum and waiting for the timeout.
> >
> > If the reactivation of the first node had succeeded and every other node
> > had survived the timeout, that would have been the ideal case.
> > Unfortunately, the second node could not pick up the VMs being migrated
> > due to gluster I/O timeouts, so at that moment the second node was also
> > marked as unresponsive, and so on... vdsm then restarted glusterd on the
> > second node, which caused a disaster. All nodes were racing on gluster
> > volume self-healing, and I couldn't put the cluster into maintenance mode
> > either. What I could do was try to resume the paused VMs via virsh and
> > issue a shutdown for each domain, plus a hard shutdown for the
> > un-resumable VMs.
> >
> > After shutting down a number of VMs and waiting for the gluster healing
> > to complete, the cluster state went back to normal, and I tried to start
> > the VMs I had stopped manually. Most of them started normally, but a
> > number of VMs had crashed or were un-startable. I quickly found that the
> > image files of the un-startable VMs were owned by root (I can't explain
> > why), and they could be started again after a chmod. Two of them still
> > cannot start, failing with a "bad volume specification" error. One of
> > them gets as far as the boot loader, but its LVM metadata was lost.
> >
> > The impact is huge when vdsm restarts glusterd without human
> > intervention.
>
> Is this even with the fencing policies set for ensuring gluster quorum is
> not lost?
>
> There are 2 policies that you need to enable at the cluster level -
> Skip fencing if Gluster bricks are UP
> Skip fencing if Gluster quorum not met
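For anyone who prefers the API over the Admin Portal (Edit Cluster -> Fencing
Policy), here is a rough sketch of checking and enabling those two flags. The
engine URL, cluster ID and XML element names are assumptions based on the 4.2
REST API model, so verify them against your engine's /ovirt-engine/api first:

    # inspect the current fencing policy of the cluster
    # (engine.example.com and <cluster-id> are placeholders)
    curl -k -u admin@internal \
      https://engine.example.com/ovirt-engine/api/clusters/<cluster-id>

    # enable the two gluster-aware fencing policies
    curl -k -u admin@internal -X PUT -H 'Content-Type: application/xml' \
      -d '<cluster><fencing_policy>
            <skip_if_gluster_bricks_up>true</skip_if_gluster_bricks_up>
            <skip_if_gluster_quorum_not_met>true</skip_if_gluster_quorum_not_met>
          </fencing_policy></cluster>' \
      https://engine.example.com/ovirt-engine/api/clusters/<cluster-id>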

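In case it helps with the kind of manual recovery described in the original
report, this is roughly the virsh sequence for dealing with paused VMs. The VM
name is a placeholder, and on an oVirt host the read-write commands need
libvirt SASL credentials (the read-only query does not):

    # list all domains, including paused ones (read-only, no credentials needed)
    virsh -r list --all

    # try to resume a paused VM, then ask the guest to shut down cleanly
    virsh resume myvm01
    virsh shutdown myvm01

    # hard stop a VM that cannot be resumed (last resort)
    virsh destroy myvm01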

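And a sketch of the gluster-side checks plus the image-ownership fix mentioned
above, to run before starting VMs again. The volume name, mount path and UUIDs
are examples, not taken from the original report:

    # per-volume self-heal backlog; wait for it to drain before starting VMs
    gluster volume heal myvolume info

    # confirm the client- and server-side quorum settings on the volume
    gluster volume get myvolume cluster.quorum-type
    gluster volume get myvolume cluster.server-quorum-type

    # oVirt expects image files to be owned by vdsm:kvm (36:36);
    # list any that ended up owned by root, then restore the ownership
    find /rhev/data-center/mnt/glusterSD/server1:_myvolume/<sd-uuid>/images -user root -ls
    chown -R 36:36 /rhev/data-center/mnt/glusterSD/server1:_myvolume/<sd-uuid>/images/<image-uuid>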