Hi,
Did you test on a pure ocfs2 volume, or with Ceph RBD?
I tried your steps on my side, just without Ceph RBD, and didn't see your
issue.
n1:~ # mkdir /mnt/shared/test1
n1:~ # cd /mnt/shared/test1/
n2:~ # mv /mnt/shared/test1/ /mnt/shared/test2
n1:/mnt/shared/test1 # ll /mnt/shared/
drwxr-xr-x 2 root root 3896 Oct 26 21:18 lost+found
drwxr-xr-x 2 root root 3896 Oct 27 09:46 test2
n2:~ # ll /mnt/shared/
drwxr-xr-x 2 root root 3896 Oct 26 21:18 lost+found
drwxr-xr-x 2 root root 3896 Oct 27 09:46 test2
Hope you can further isolate your problem. Again, first make sure you
have an ocfs2 cluster in good condition! A quick sanity check is sketched below.
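For example, something along these lines (a rough sketch; /etc/init.d/o2cb
and mounted.ocfs2 ship with ocfs2-tools, but paths and service names may
differ on your distro):

# /etc/init.d/o2cb status                      # stack loaded, cluster online, heartbeat active?
# mounted.ocfs2 -f                             # which nodes currently have the volume mounted?
# dmesg | grep -iE 'o2net|o2dlm|o2cb|ocfs2'    # any fencing or reconnect noise?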
BTW, please respond to what I asked in my previous email, if possible ;-)
Thanks,
Eric
On 10/26/15 16:28, gjprabu wrote:
Hi Eric,
We identified the issue. When we access the same directory
simultaneously from two nodes, we get I/O errors. Normally a cluster
filesystem should handle this, but in our case it's not working. ocfs2
version: ocfs2-tools-1.8.0-16.
Example
Node1 : cd /home/downloads/test
Node2 : mv /home/downloads/test /home/downloads/test1
Node1
ls -al /home/downloads/
d????????? ? ? ? ? ? test1
Node2
ls -al /home/downloads/
drwxr-xr-x 2 root root 3.9K Oct 26 12:06 test1
Regards
Prabu
---- On Mon, 26 Oct 2015 08:10:06 +0530 *Eric Ren <z...@suse.com>* wrote ----
Hi,
On 10/22/15 21:00, gjprabu wrote:
Hi Eric,
Thanks for your reply. We are still facing the same issue. We
found these dmesg logs, but they are expected messages: we brought
node1 down and back up ourselves, and that is what shows in the
log. Other than that, we found no error messages. We also have a
problem while unmounting: the umount process goes into "D" state,
and fsck fails with "fsck.ocfs2: I/O error". If any other
command is required, please let me know.
1. System log across boots:
# journalctl --list-boots
If there is just one boot record, please see "man journald.conf" for how
to configure keeping system logs across boots. Then you can use
"journalctl -b xxx" to see the log of any specific boot (a sketch
follows this list).
I can't see exactly what steps lead to that error message. Better
to tidy up your problem, starting again from a clean state.
2. The umount issue may be caused by the cluster being in bad
condition, with communication between nodes hung up.
3. Please use the device instead of the mount point (see the example after this list).
4. Did you build your Ceph RBD + ocfs2 setup starting from an ocfs2
cluster in good condition? It's better to test the cluster thoroughly
while it is good, before working on top of it.
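For item 1, a minimal sketch of what I mean (standard systemd journald
knobs; adjust to taste):

# mkdir -p /var/log/journal            # this directory existing enables persistent storage
# vi /etc/systemd/journald.conf        # in the [Journal] section, set: Storage=persistent
# systemctl restart systemd-journald
# journalctl --list-boots              # lists multiple boots after the next reboot
# journalctl -b -1                     # the log from the previous boot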
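And for item 3, fsck.ocfs2 should be pointed at the block device rather
than the mount point; something like the following, where /dev/rbd0 is
only a placeholder for your actual RBD device, and the volume must be
unmounted on all nodes first:

# umount /home/build/downloads         # on every node
# fsck.ocfs2 -fy /dev/rbd0             # placeholder device; substitute yours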
Thanks,
Eric
*ocfs2 version*
debugfs.ocfs2 1.8.0
*# cat /etc/sysconfig/o2cb*
#
# This is a configuration file for automatic startup of the O2CB
# driver. It is generated by running /etc/init.d/o2cb configure.
# On Debian based systems the preferred method is running
# 'dpkg-reconfigure ocfs2-tools'.
#
# O2CB_STACK: The name of the cluster stack backing O2CB.
O2CB_STACK=o2cb
# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=ocfs2
# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=31
# O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.
O2CB_IDLE_TIMEOUT_MS=30000
# O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent
O2CB_KEEPALIVE_DELAY_MS=2000
# O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts
O2CB_RECONNECT_DELAY_MS=2000
*# fsck.ocfs2 -fy /home/build/downloads/*
fsck.ocfs2 1.8.0
fsck.ocfs2: I/O error on channel while opening "/zoho/build/downloads/"
*dmesg logs*
[ 4229.886284] o2dlm: Joining domain A895BC216BE641A8A7E20AA89D57E051 ( 5 ) 1 nodes
[ 4251.437451] o2dlm: Node 3 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 3 5 ) 2 nodes
[ 4267.836392] o2dlm: Node 1 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 1 3 5 ) 3 nodes
[ 4292.755589] o2dlm: Node 2 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 5 ) 4 nodes
[ 4306.262165] o2dlm: Node 4 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 5 ) 5 nodes
[316476.505401] (kworker/u192:0,95923,0):dlm_do_assert_master:1717 ERROR: Error -112 when sending message 502 (key 0xc3460ae7) to node 1
[316476.505470] o2cb: o2dlm has evicted node 1 from domain A895BC216BE641A8A7E20AA89D57E051
[316480.437231] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
[316480.442389] o2cb: o2dlm has evicted node 1 from domain A895BC216BE641A8A7E20AA89D57E051
[316480.442412] (kworker/u192:0,95923,20):dlm_begin_reco_handler:2765 A895BC216BE641A8A7E20AA89D57E051: dead_node previously set to 1, node 3 changing it to 1
[316480.541237] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
[316480.541241] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
[316485.542733] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
[316485.542740] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
[316485.542742] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
[316490.544535] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
[316490.544538] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
[316490.544539] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
[316495.546356] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
[316495.546362] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
[316495.546364] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
[316500.548135] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
[316500.548139] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
[316500.548140] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
[316505.549947] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
[316505.549951] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
[316505.549952] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
[316510.551734] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
[316510.551739] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
[316510.551740] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
[316515.553543] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
[316515.553547] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
[316515.553548] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
[316520.555337] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
[316520.555341] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
[316520.555343] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
[316525.557131] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
[316525.557136] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
[316525.557153] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
[316530.558952] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
[316530.558955] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
[316530.558957] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
[316535.560781] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
[316535.560789] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
[316535.560792] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
[319419.525609] o2dlm: Node 1 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 5 ) 5 nodes
*ps -auxxxxx | grep umount*
root     32083 21.8  0.0 125620  2828 pts/14   D+   19:37   0:18 umount /home/build/repository
root     32196  0.0  0.0 112652  2264 pts/8    S+   19:38   0:00 grep --color=auto umount
*cat /proc/32083/stack*
[<ffffffff8132ad7d>] o2net_send_message_vec+0x71d/0xb00
[<ffffffff81352148>] dlm_send_remote_unlock_request.isra.2+0x128/0x410
[<ffffffff813527db>] dlmunlock_common+0x3ab/0x9e0
[<ffffffff81353088>] dlmunlock+0x278/0x800
[<ffffffff8131f765>] o2cb_dlm_unlock+0x35/0x50
[<ffffffff8131ecfe>] ocfs2_dlm_unlock+0x1e/0x30
[<ffffffff812a8776>] ocfs2_drop_lock.isra.29.part.30+0x1f6/0x700
[<ffffffff812ae40d>] ocfs2_simple_drop_lockres+0x2d/0x40
[<ffffffff8129b43c>] ocfs2_dentry_lock_put+0x5c/0x80
[<ffffffff8129b4a2>] ocfs2_dentry_iput+0x42/0x1d0
[<ffffffff81204dc2>] __dentry_kill+0x102/0x1f0
[<ffffffff81205294>] shrink_dentry_list+0xe4/0x2a0
[<ffffffff81205aa8>] shrink_dcache_parent+0x38/0x90
[<ffffffff81205b16>] do_one_tree+0x16/0x50
[<ffffffff81206e9f>] shrink_dcache_for_umount+0x2f/0x90
[<ffffffff811efb15>] generic_shutdown_super+0x25/0x100
[<ffffffff811eff57>] kill_block_super+0x27/0x70
[<ffffffff811f02a9>] deactivate_locked_super+0x49/0x60
[<ffffffff811f089e>] deactivate_super+0x4e/0x70
[<ffffffff8120da83>] cleanup_mnt+0x43/0x90
[<ffffffff8120db22>] __cleanup_mnt+0x12/0x20
[<ffffffff81093ba4>] task_work_run+0xc4/0xe0
[<ffffffff81013c67>] do_notify_resume+0x97/0xb0
[<ffffffff817d2ee7>] int_signal+0x12/0x17
[<ffffffffffffffff>] 0xffffffffffffffff
Regards
Prabu
---- On Wed, 21 Oct 2015 08:32:15 +0530 *Eric Ren <z...@suse.com>* wrote ----
Hi Prabu,
I guess others, like me, are not familiar with this setup that
combines Ceph RBD and OCFS2.
We'd really like to help you, but I think the ocfs2 developers
cannot get any information about what happened to ocfs2 from your
descriptions.
So, I'm wondering if you can reproduce it and tell us the
steps. Once developers can reproduce it, it's likely to be
resolved ;-) BTW, any dmesg log about ocfs2, especially the
initial error message and stack backtrace, will be helpful! A
sketch of how to capture those follows.
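For example, a rough sketch (replace <pid> with the PID of the
stuck process):

# dmesg | grep -iE 'ocfs2|o2dlm|o2net|o2cb'    # initial ocfs2/dlm error messages
# cat /proc/<pid>/stack                        # kernel stack of a hung (D-state) task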
Thanks,
Eric
On 10/20/15 17:29, gjprabu wrote:
Hi
We are looking forward to your input on this.
Regards
Prabu
---- On Fri, 09 Oct 2015 12:08:19 +0530 *gjprabu <gjpr...@zohocorp.com>* wrote ----
Hi All,
Can anybody please help me with this issue?
Regards
Prabu
---- On Thu, 08 Oct 2015 12:33:57 +0530 *gjprabu <gjpr...@zohocorp.com>* wrote ----
Hi All,
We have servers with OCFS2 mounted on top of Ceph RBD.
We are facing I/O errors while moving data within the
same disk (copying does not show any problem). As a
temporary fix we remount the partition and the issue is
resolved, but after some time the problem reproduces
again. If anybody has faced the same issue, please help us.
Note: We have 5 nodes in total; two nodes are working
fine, while the other nodes show input/output errors
like below.
ls -althr
ls: cannot access LITE_3_0_M4_1_TEST: Input/output error
ls: cannot access LITE_3_0_M4_1_OLD: Input/output error
total 0
d????????? ? ? ? ? ? LITE_3_0_M4_1_TEST
d????????? ? ? ? ? ? LITE_3_0_M4_1_OLD
cluster:
        node_count = 5
        heartbeat_mode = local
        name = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.113.42
        number = 1
        name = integ-hm9
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.112.115
        number = 2
        name = integ-hm2
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.113.43
        number = 3
        name = integ-ci-1
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.112.217
        number = 4
        name = integ-hm8
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.112.192
        number = 5
        name = integ-hm5
        cluster = ocfs2
Regards
Prabu
_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users