On Mon, Mar 18, 2019 at 4:15 PM levin <[email protected]> wrote:
>
> Hi Sahina,
>
> My cluster does not have fencing enabled. Is there somewhere I can disable
> the restart policy in vdsm completely? That way I can observe this case the
> next time it happens and do a first investigation on the unresponsive node.

+Martin Perina - do you know if this is possible?
> Regards,
> Levin
>
>
> On 18/3/2019, 17:40, "Sahina Bose" <[email protected]> wrote:
>
> On Sun, Mar 17, 2019 at 12:56 PM <[email protected]> wrote:
> >
> > Hi, I have twice experienced a total outage of a 3-node hyper-converged
> > 4.2.8 oVirt cluster because vdsm reactivated an unresponsive node and
> > caused multiple glusterfs daemon restarts. As a result, all VMs were
> > paused and some disk images were corrupted.
> >
> > At the very beginning, one of the oVirt nodes was overloaded (high memory
> > and CPU usage), so the hosted engine had trouble collecting status from
> > vdsm, marked the node as unresponsive and started migrating its workload
> > to a healthy node. However, while that migration was running, the second
> > oVirt node also became unresponsive, because vdsm tried to reactivate the
> > first unresponsive node and restarted its glusterd. The gluster domain
> > was therefore re-acquiring quorum and waiting for the timeout.
> >
> > If the reactivation of the first node had succeeded and every other node
> > had survived the timeout, that would have been the ideal case.
> > Unfortunately, the second node could not pick up the VMs being migrated
> > due to gluster I/O timeouts, so at that moment the second node was also
> > marked as unresponsive, and so on... vdsm then restarted glusterd on the
> > second node, which caused a disaster. All nodes were racing on gluster
> > volume self-healing, and I couldn't put the cluster into maintenance mode
> > either. What I could do was try to resume the paused VMs via virsh and
> > issue a shutdown for each domain, plus a hard shutdown for the
> > un-resumable VMs.
> >
> > After shutting down a number of VMs and waiting for the gluster healing
> > to complete, the cluster state went back to normal, and I tried to start
> > the VMs I had stopped manually. Most of them started normally, but a
> > number of VMs had crashed or were un-startable. I quickly found that the
> > image files of the un-startable VMs were owned by root (I can't explain
> > why), and they could be started again after a chmod. Two of them still
> > cannot start, failing with a "bad volume specification" error. One of
> > them gets as far as the boot loader, but its LVM metadata was lost.
> >
> > The impact is huge when vdsm restarts glusterd without human
> > intervention.
>
> Is this even with the fencing policies set for ensuring gluster quorum is
> not lost?
>
> There are 2 policies that you need to enable at the cluster level -
> Skip fencing if Gluster bricks are UP
> Skip fencing if Gluster quorum not met
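For anyone who prefers the API over the Admin Portal (Edit Cluster -> Fencing
Policy), here is a rough sketch of checking and enabling those two flags. The
engine URL, cluster ID and XML element names are assumptions based on the 4.2
REST API model, so verify them against your engine's /ovirt-engine/api first:

    # inspect the current fencing policy of the cluster
    # (engine.example.com and <cluster-id> are placeholders)
    curl -k -u admin@internal \
      https://engine.example.com/ovirt-engine/api/clusters/<cluster-id>

    # enable the two gluster-aware fencing policies
    curl -k -u admin@internal -X PUT -H 'Content-Type: application/xml' \
      -d '<cluster><fencing_policy>
            <skip_if_gluster_bricks_up>true</skip_if_gluster_bricks_up>
            <skip_if_gluster_quorum_not_met>true</skip_if_gluster_quorum_not_met>
          </fencing_policy></cluster>' \
      https://engine.example.com/ovirt-engine/api/clusters/<cluster-id>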

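In case it helps with the kind of manual recovery described in the original
report, this is roughly the virsh sequence for dealing with paused VMs. The VM
name is a placeholder, and on an oVirt host the read-write commands need
libvirt SASL credentials (the read-only query does not):

    # list all domains, including paused ones (read-only, no credentials needed)
    virsh -r list --all

    # try to resume a paused VM, then ask the guest to shut down cleanly
    virsh resume myvm01
    virsh shutdown myvm01

    # hard stop a VM that cannot be resumed (last resort)
    virsh destroy myvm01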

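And a sketch of the gluster-side checks plus the image-ownership fix mentioned
above, to run before starting VMs again. The volume name, mount path and UUIDs
are examples, not taken from the original report:

    # per-volume self-heal backlog; wait for it to drain before starting VMs
    gluster volume heal myvolume info

    # confirm the client- and server-side quorum settings on the volume
    gluster volume get myvolume cluster.quorum-type
    gluster volume get myvolume cluster.server-quorum-type

    # oVirt expects image files to be owned by vdsm:kvm (36:36);
    # list any that ended up owned by root, then restore the ownership
    find /rhev/data-center/mnt/glusterSD/server1:_myvolume/<sd-uuid>/images -user root -ls
    chown -R 36:36 /rhev/data-center/mnt/glusterSD/server1:_myvolume/<sd-uuid>/images/<image-uuid>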