Done. http://oss.oracle.com/bugzilla/show_bug.cgi?id=1333
Thanks.

Regards,
Sérgio Surkamp

On Fri, 02 Sep 2011 10:12:18 -0700, Sunil Mushran
<sunil.mush...@oracle.com> wrote:

> Log what you have in a bz. I can take a look. I doubt you will be
> able to attach that file though. You'll need to provide me with a
> link.
>
> On 09/02/2011 07:28 AM, Sérgio Surkamp wrote:
> > Hello,
> >
> > We ran into a problem with our cluster this morning.
> >
> > Cluster setup:
> >
> > Servers:
> >  * Two Dell R800 servers running CentOS 5.5 and ULEK
> >    2.6.32-100.0.19.el5, with 8 GB of RAM each;
> >  * OCFS2 1.6.4;
> >  * iSCSI connection using two bonded Gbit NICs.
> >
> > Storage:
> >  * Dell EqualLogic 4000VX -- iSCSI.
> >
> > Network:
> >  * Two Dell 1 Gbit trunked switches.
> >
> > Problem description:
> >
> > Node #1 has hung while accessing the filesystem, and the hung tasks
> > show almost the same stack trace as one of the following:
> >
> > -------------------------------------------------------------------------------
> > INFO: task maildirsize:17252 blocked for more than 120 seconds.
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > maildirsize   D 0000000000000004     0 17252  17249 0x00000080
> >  ffff8800b181d7f8 0000000000000086 0000000000000000 ffff880098de3c40
> >  ffff88001293a1c0 ffff88022bb7c4c0 ffff88001293a598 ffffffffa036a654
> >  ffff8800b181d7e8 ffffffff81043d10 ffff88001293a1f8 7fffffffffffffff
> > Call Trace:
> >  [<ffffffffa036a654>] ? dlmlock+0x12e2/0x13bb [ocfs2_dlm]
> >  [<ffffffff81043d10>] ? update_curr+0xc9/0xd2
> >  [<ffffffff8143798a>] schedule_timeout+0x36/0xe7
> >  [<ffffffff810425b3>] ? need_resched+0x23/0x2d
> >  [<ffffffff814377fc>] wait_for_common+0xb7/0x12c
> >  [<ffffffff8104bc15>] ? default_wake_function+0x0/0x19
> >  [<ffffffffa03b07fe>] ? lockres_clear_flags+0x15/0x17 [ocfs2]
> >  [<ffffffff81437914>] wait_for_completion+0x1d/0x1f
> >  [<ffffffffa03b15ad>] ocfs2_wait_for_mask+0x1a/0x29 [ocfs2]
> >  [<ffffffffa03b1f59>] __ocfs2_cluster_lock+0x83c/0x861 [ocfs2]
> >  [<ffffffffa03c6319>] ? ocfs2_inode_cache_io_unlock+0x12/0x14 [ocfs2]
> >  [<ffffffffa03f48f5>] ? ocfs2_metadata_cache_io_unlock+0x1e/0x20 [ocfs2]
> >  [<ffffffffa03c344c>] ? ocfs2_validate_inode_block+0x0/0x1cd [ocfs2]
> >  [<ffffffffa03c3242>] ? ocfs2_read_inode_block_full+0x3e/0x5a [ocfs2]
> >  [<ffffffffa03b38dc>] ocfs2_inode_lock_full_nested+0x194/0xb8d [ocfs2]
> >  [<ffffffffa03d3f8c>] ? ocfs2_rename+0x49e/0x183d [ocfs2]
> >  [<ffffffffa03c344c>] ? ocfs2_validate_inode_block+0x0/0x1cd [ocfs2]
> >  [<ffffffffa03d3f8c>] ocfs2_rename+0x49e/0x183d [ocfs2]
> >  [<ffffffffa03a6792>] ? brelse+0x13/0x15 [ocfs2]
> >  [<ffffffffa03b0692>] ? init_completion+0x1f/0x21 [ocfs2]
> >  [<ffffffffa03b0741>] ? ocfs2_init_mask_waiter+0x26/0x3f [ocfs2]
> >  [<ffffffffa03b0692>] ? init_completion+0x1f/0x21 [ocfs2]
> >  [<ffffffffa03b202c>] ? ocfs2_should_refresh_lock_res+0x8f/0x1ad [ocfs2]
> >  [<ffffffff810425b3>] ? need_resched+0x23/0x2d
> >  [<ffffffff810e8512>] ? kstrdup+0x2b/0xc0
> >  [<ffffffff811229eb>] vfs_rename+0x221/0x3c0
> >  [<ffffffff81124968>] sys_renameat+0x18b/0x201
> >  [<ffffffff81075a7c>] ? autoremove_wake_function+0x0/0x3d
> >  [<ffffffff811178fc>] ? fsnotify_modify+0x6c/0x74
> >  [<ffffffff8112186b>] ? path_put+0x22/0x27
> >  [<ffffffff811249f9>] sys_rename+0x1b/0x1d
> >  [<ffffffff81011db2>] system_call_fastpath+0x16/0x1b
> >
> > INFO: task imapd:17386 blocked for more than 120 seconds.
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > imapd         D 000000000000000b     0 17386   4367 0x00000080
> >  ffff880208709c08 0000000000000086 0000000000000000 0000000000000286
> >  ffff8801501ac800 ffff88012b008180 ffff8801501acbd8 00000001bbc4c49f
> >  0000000000000000 0000000000000000 0000000000000000 ffff880127bf8d9c
> > Call Trace:
> >  [<ffffffff81437e31>] __mutex_lock_common+0x12f/0x1a1
> >  [<ffffffff81437ef2>] __mutex_lock_slowpath+0x19/0x1b
> >  [<ffffffff81437f5b>] mutex_lock+0x23/0x3a
> >  [<ffffffff81121aaa>] do_lookup+0x85/0x162
> >  [<ffffffff8112445c>] __link_path_walk+0x49e/0x5fb
> >  [<ffffffff81238204>] ? __strncpy_from_user+0x31/0x4a
> >  [<ffffffff8112460c>] path_walk+0x53/0x9c
> >  [<ffffffff81124727>] do_path_lookup+0x2f/0x7a
> >  [<ffffffff8112514f>] user_path_at+0x57/0x91
> >  [<ffffffff810f0001>] ? handle_mm_fault+0x14b/0x7d9
> >  [<ffffffff8111c366>] vfs_fstatat+0x3a/0x67
> >  [<ffffffffa03b0567>] ? ocfs2_inode_unlock+0x140/0x1a5 [ocfs2]
> >  [<ffffffff8111c479>] vfs_stat+0x1b/0x1d
> >  [<ffffffff8111c49a>] sys_newstat+0x1f/0x39
> >  [<ffffffff8143b1e3>] ? do_page_fault+0x25d/0x26c
> >  [<ffffffff810a7d03>] ? audit_syscall_entry+0x103/0x12f
> >  [<ffffffff81011db2>] system_call_fastpath+0x16/0x1b
> > -------------------------------------------------------------------------------
> >
> > When we rebooted node #1, the following recovery messages were
> > logged by node #0:
> >
> > -------------------------------------------------------------------------------
> > o2net: connection to node XXXXXXXXXX (num 1) at ip.ip.ip.2:7777 has
> > been idle for 60.0 seconds, shutting it down.
> > (swapper,0,0):o2net_idle_timer:1498 here are some times that might
> > help debug the situation: (tmr 1314962116.650772 now 1314962176.650058
> > dr 1314962116.650749 adv 1314962116.650781:1314962116.650782 func
> > (3f8ab666:504) 1314962114.651682:1314962114.651687)
> > o2net: no longer connected to node XXXXXXXXXX (num 1) at ip.ip.ip.2:7777
> > (dlm_thread,4143,2):dlm_send_proxy_ast_msg:456 ERROR: Error -112
> > when sending message 505 (key 0x3f8ab666) to node 1
> > (dlm_thread,4143,2):dlm_flush_asts:599 ERROR: status = -112
> > (o2net,4052,0):o2net_connect_expired:1659 ERROR: no connection
> > established with node 1 after 60.0 seconds, giving up and returning
> > errors.
> > (o2hb-E4FCE3DA14,4121,1):o2dlm_eviction_cb:267 o2dlm has evicted
> > node 1 from group E4FCE3DA14E94515B327E8558F4641DA
> > (ocfs2rec,13693,10):ocfs2_replay_journal:1605 Recovering node 1 from
> > slot 1 on device (8,16)
> > (dlm_reco_thread,4144,0):dlm_get_lock_resource:836
> > E4FCE3DA14E94515B327E8558F4641DA:$RECOVERY: at least one node (1) to
> > recover before lock mastery can begin
> > (dlm_reco_thread,4144,0):dlm_get_lock_resource:870
> > E4FCE3DA14E94515B327E8558F4641DA: recovery map is not empty, but
> > must master $RECOVERY lock now
> > (dlm_reco_thread,4144,0):dlm_do_recovery:523 (4144) Node 0 is the
> > Recovery Master for the Dead Node 1 for Domain
> > E4FCE3DA14E94515B327E8558F4641DA
> > (ocfs2rec,13693,7):ocfs2_begin_quota_recovery:407 Beginning quota
> > recovery in slot 1
> > (ocfs2_wq,4124,1):ocfs2_finish_quota_recovery:598 Finishing quota
> > recovery in slot 1
> > -------------------------------------------------------------------------------
> >
> > The processes hung on node #1 had been waiting for I/O for more than
> > an hour, so it seems to me that node #0 never replied to the dlmlock
> > request shown in the first stack trace from node #1, but the message
> > was queued to be sent, as shown in node #0's dmesg when node #1
> > rebooted.
> > The question is: what could prevent the DLM on node #0 from
> > replying to the lock request from node #1?
> >
> > We made copies of all debug files from /sys/kernel/debug on node #1
> > before rebooting it, and we can provide them if needed (9.3 MiB of
> > compressed data).
> >
> > Regards,
> > --
> > Sérgio Surkamp | Network Manager
> > ser...@gruposinternet.com.br
> >
> > *Grupos Internet S.A.*
> > R. Lauro Linhares, 2123 Torre B - Sala 201
> > Trindade - Florianópolis - SC
> > +55 48 3234-4109
> > http://www.gruposinternet.com.br

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users
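
For reference, the kind of debug state mentioned above (the files under
/sys/kernel/debug plus the filesystem and DLM lock dumps) could be gathered
with a short script along the following lines. This is only a sketch: the
device path /dev/sdb, the output location under /tmp, and the availability
of the "fs_locks" and "dlm_locks" commands in the installed build of
debugfs.ocfs2 are assumptions, so adjust them to the actual setup.

  #!/bin/sh
  # Sketch: collect OCFS2/O2DLM debug state from a node before rebooting it.
  # Assumes debugfs is mounted at /sys/kernel/debug and that debugfs.ocfs2
  # from ocfs2-tools supports the "fs_locks" and "dlm_locks" commands.

  DEVICE=/dev/sdb                      # placeholder: the OCFS2 block device
  OUT=/tmp/ocfs2-debug-$(hostname)-$(date +%Y%m%d%H%M%S)
  mkdir -p "$OUT"

  # Copy the o2dlm/o2net/o2hb/ocfs2 state files exposed through debugfs.
  cp -a /sys/kernel/debug "$OUT/debug" 2>/dev/null

  # Dump the filesystem and DLM lock state for the mounted volume.
  debugfs.ocfs2 -R "fs_locks"  "$DEVICE" > "$OUT/fs_locks.txt"  2>&1
  debugfs.ocfs2 -R "dlm_locks" "$DEVICE" > "$OUT/dlm_locks.txt" 2>&1

  # Keep the kernel log and the configured o2cb cluster timeouts as well.
  dmesg > "$OUT/dmesg.txt"
  cp /etc/sysconfig/o2cb "$OUT/o2cb.sysconfig" 2>/dev/null

  # Pack everything up so it can be attached to the bug or shared via a link.
  tar czf "$OUT.tar.gz" -C "$(dirname "$OUT")" "$(basename "$OUT")"
  echo "debug data written to $OUT.tar.gz"

On a node that is already hung, some debugfs reads may block, so it can be
worth wrapping the collection in a timeout and skipping any files that do
not return promptly.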