Hi Chris,

Replies inline..

On 09/22/2015 09:31 AM, Sahina Bose wrote:



-------- Forwarded Message --------
Subject:        Re: [ovirt-users] urgent issue
Date:   Wed, 9 Sep 2015 08:31:07 -0700
From:   Chris Liebman <[email protected]>
To:     users <[email protected]>



Ok - I think I'm going to switch to local storage - I've had way too many unexplainable issues with glusterfs :-(. Is there any reason I can't add local storage to the existing shared-storage cluster? I see that the menu item is greyed out....



What version of gluster and ovirt are you using?




On Tue, Sep 8, 2015 at 4:19 PM, Chris Liebman <[email protected]> wrote:

    It's possible that this is specific to just one gluster volume... I've
    moved a few VM disks off of that volume and am able to start them fine.
    My recollection is that any VM started on the "bad" volume causes it to
    be disconnected and forces the ovirt node to be marked down until
    Maint->Activate.

    On Tue, Sep 8, 2015 at 3:52 PM, Chris Liebman
    <[email protected]> wrote:

        In attempting to put an ovirt cluster into production I'm
        running into some odd errors, with gluster it looks like. It's
        12 hosts, each with one brick, in distributed-replicate
        (actually 2 bricks, but they are in separate volumes).


These 12 nodes in dist-rep config: are they replica 2 or replica 3? The latter is what is recommended for VM use-cases. Could you give the output of `gluster volume info`?
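For example (using the LADC-TBX-V02 name from your logs below; run it for each of your volumes):

    gluster volume info LADC-TBX-V02

The "Number of Bricks" line shows the layout, e.g. "6 x 2 = 12" would be replica 2 and "4 x 3 = 12" would be replica 3.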

        [root@ovirt-node268 glusterfs]# rpm -qa | grep vdsm

        vdsm-jsonrpc-4.16.20-0.el6.noarch

        vdsm-gluster-4.16.20-0.el6.noarch

        vdsm-xmlrpc-4.16.20-0.el6.noarch

        vdsm-yajsonrpc-4.16.20-0.el6.noarch

        vdsm-4.16.20-0.el6.x86_64

        vdsm-python-zombiereaper-4.16.20-0.el6.noarch

        vdsm-python-4.16.20-0.el6.noarch

        vdsm-cli-4.16.20-0.el6.noarch


        Everything was fine last week; however, today various
        clients in the gluster cluster seem to get "client quorum not
        met" periodically - when they get this they take one of the
        bricks offline - this causes the VMs to attempt migration -
        sometimes 20 at a time. That takes a long time :-(. I've
        tried disabling automatic migration and the VMs get paused
        when this happens - resuming gets nothing at that point as the
        volume's mount on the server hosting the VM is not connected:


        From rhev-data-center-mnt-glusterSD-ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02.log:

        [2015-09-08 21:18:42.920771] W [MSGID: 108001] [afr-common.c:4043:afr_notify] 2-LADC-TBX-V02-replicate-2: Client-quorum is not met


When client-quorum is not met (due to network disconnects, gluster brick processes going down, etc.), gluster makes the volume read-only. This is expected behavior and prevents split-brains. It's probably a bit late, but do you have the gluster fuse mount logs to confirm that this was indeed the issue?
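If the setup is still around, you can check which quorum options are in effect (on gluster >= 3.7 `gluster volume get` is available; on older versions any non-default values appear under "Options Reconfigured" in `gluster volume info`):

    gluster volume get LADC-TBX-V02 cluster.quorum-type
    gluster volume get LADC-TBX-V02 cluster.server-quorum-type

and grep the fuse mount log on the hypervisor (path assumes the default /var/log/glusterfs location) for the brick disconnects that preceded the quorum loss:

    grep -E "disconnected from|quorum" /var/log/glusterfs/rhev-data-center-mnt-glusterSD-ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02.log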

        [2015-09-08 21:18:42.931751] I [fuse-bridge.c:4900:fuse_thread_proc] 0-fuse: unmounting /rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02

        [2015-09-08 21:18:42.931836] W [glusterfsd.c:1219:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7a51) [0x7f1bebc84a51] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x405e4d] -->/usr/sbin/glusterfs(cleanup_and_exit+0x65) [0x4059b5] ) 0-: received signum (15), shutting down

        [2015-09-08 21:18:42.931858] I [fuse-bridge.c:5595:fini] 0-fuse: Unmounting '/rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02'.


The VM pause you saw could be because of the unmount. I understand that a fix (https://gerrit.ovirt.org/#/c/40240/) went in for oVirt 3.6 (vdsm-4.17) to prevent vdsm from unmounting the gluster volume when vdsm exits/restarts. Is it possible to run a test setup on 3.6 and see if this is still happening?
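To see which side of that fix a node is on, checking the vdsm package is enough (your rpm output above already shows vdsm-4.16.20, i.e. oVirt 3.5):

    rpm -q vdsm

Anything at vdsm-4.17 or newer (oVirt 3.6) should carry that change.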


        And the mount is broken at that point:

        [root@ovirt-node267 ~]# df

        df: `/rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02': Transport endpoint is not connected


Yes, because it received a SIGTERM above.
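If you just need to clear the stale mountpoint by hand, one common way is a lazy unmount of the dead fuse mount (Maintenance -> Activate, as you note below, remains the supported way to get it remounted):

    umount -l /rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02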

Thanks,
Ravi

        Filesystem                                              1K-blocks       Used  Available Use% Mounted on
        /dev/sda3                                                51475068    1968452   46885176   5% /
        tmpfs                                                   132210244          0  132210244   0% /dev/shm
        /dev/sda2                                                  487652      32409     429643   8% /boot
        /dev/sda1                                                  204580        260     204320   1% /boot/efi
        /dev/sda5                                              1849960960  156714056 1599267616   9% /data1
        /dev/sdb1                                              1902274676   18714468 1786923588   2% /data2
        ovirt-node268.la.taboolasyndication.com:/LADC-TBX-V01
                                                               9249804800  727008640 8052899712   9% /rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V01
        ovirt-node251.la.taboolasyndication.com:/LADC-TBX-V03
                                                               1849960960      73728 1755907968   1% /rhev/data-center/mnt/glusterSD/ovirt-node251.la.taboolasyndication.com:_LADC-TBX-V03

        The fix for that is to put the server in maintenance mode and then
        activate it again. But all VMs need to be migrated or stopped
        for that to work.


        I'm not seeing any obvious network or disk errors...

        Are there configuration options I'm missing?






_______________________________________________
Users mailing list
[email protected]
http://lists.ovirt.org/mailman/listinfo/users
