Cannot say for sure. It could just as well be a deadlock (a bug). That is, I don't want to blame any one component without knowing more.
If it were up to me, I'd start with the dlm. Find the lock resource the nodes are waiting on and see which node currently holds it. Then figure out why that node is unable to downconvert the lock: if the lock has holders, determine the pids holding it and see where those processes are stuck. In mainline you can do "cat /proc/PID/stack" to look at the kernel stack of a pid.
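Roughly, the sequence looks like the below. This is only a sketch: /dev/sdX, <LOCKNAME> and <PID> are placeholders, and it assumes the o2cb stack with debugfs mounted and a debugfs.ocfs2 recent enough to know the fs_locks/dlm_locks commands.

# debugfs.ocfs2 pulls its information out of debugfs, so make sure it is mounted
mount -t debugfs debugfs /sys/kernel/debug

# 1. On each node, list the busy lock resources (the ones with requests in
#    flight). The stacks below are all waiting in __ocfs2_cluster_lock, so the
#    same lockname should show up on the waiting nodes. If your debugfs.ocfs2
#    does not take -B, run plain "fs_locks" and look for the locks flagged busy.
debugfs.ocfs2 -R "fs_locks -B" /dev/sdX

# 2. Ask the dlm about that lockname. The output shows which node masters the
#    resource and which node(s) currently hold it, and at what level.
debugfs.ocfs2 -R "dlm_locks <LOCKNAME>" /dev/sdX

# 3. On the node holding the lock, find the processes that have it and look at
#    their stacks to see why the downconvert is not happening.
cat /proc/<PID>/stack

If the holders on that node are themselves stuck waiting on the dlm, that points more towards a deadlock; if they are sitting in the I/O path, I'd look at the storage first.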
Marco wrote:
> Hello,
>
> today I noticed the following on *only* one node:
>
> ----- cut here -----
> Apr 29 11:01:18 node06 kernel: [2569440.616036] INFO: task ocfs2_wq:5214 blocked for more than 120 seconds.
> Apr 29 11:01:18 node06 kernel: [2569440.616056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Apr 29 11:01:18 node06 kernel: [2569440.616080] ocfs2_wq D 0000000000000002 0 5214 2 0x00000000
> Apr 29 11:01:18 node06 kernel: [2569440.616101] ffff88014fa63880 0000000000000046 ffffffffa01878a5 ffffffffa020f0fc
> Apr 29 11:01:18 node06 kernel: [2569440.616131] 0000000000000000 000000000000f8a0 ffff88014baebfd8 00000000000155c0
> Apr 29 11:01:18 node06 kernel: [2569440.616161] 00000000000155c0 ffff88014ca38e20 ffff88014ca39118 00000001a0187b86
> Apr 29 11:01:18 node06 kernel: [2569440.616192] Call Trace:
> Apr 29 11:01:18 node06 kernel: [2569440.616223] [<ffffffffa01878a5>] ? scsi_done+0x0/0xc [scsi_mod]
> Apr 29 11:01:18 node06 kernel: [2569440.616245] [<ffffffffa020f0fc>] ? qla2xxx_queuecommand+0x171/0x1de [qla2xxx]
> Apr 29 11:01:18 node06 kernel: [2569440.616273] [<ffffffffa018d290>] ? scsi_request_fn+0x429/0x506 [scsi_mod]
> Apr 29 11:01:18 node06 kernel: [2569440.616291] [<ffffffffa02ab0a7>] ? o2dlm_blocking_ast_wrapper+0x0/0x17 [ocfs2_stack_o2cb]
> Apr 29 11:01:18 node06 kernel: [2569440.616317] [<ffffffffa02ab090>] ? o2dlm_lock_ast_wrapper+0x0/0x17 [ocfs2_stack_o2cb]
> Apr 29 11:01:18 node06 kernel: [2569440.616345] [<ffffffff812ee253>] ? schedule_timeout+0x2e/0xdd
> Apr 29 11:01:18 node06 kernel: [2569440.616362] [<ffffffff8118d99a>] ? vsnprintf+0x40a/0x449
> Apr 29 11:01:18 node06 kernel: [2569440.616378] [<ffffffff812ee118>] ? wait_for_common+0xde/0x14f
> Apr 29 11:01:18 node06 kernel: [2569440.616396] [<ffffffff8104a188>] ? default_wake_function+0x0/0x9
> Apr 29 11:01:18 node06 kernel: [2569440.616421] [<ffffffffa0fbac46>] ? __ocfs2_cluster_lock+0x8a4/0x8c5 [ocfs2]
> Apr 29 11:01:18 node06 kernel: [2569440.616445] [<ffffffff812ee517>] ? out_of_line_wait_on_bit+0x6b/0x77
> Apr 29 11:01:18 node06 kernel: [2569440.616468] [<ffffffffa0fbe8ff>] ? ocfs2_inode_lock_full_nested+0x1a3/0xb2c [ocfs2]
> Apr 29 11:01:18 node06 kernel: [2569440.616497] [<ffffffffa0ffacc1>] ? ocfs2_lock_global_qf+0x28/0x81 [ocfs2]
> Apr 29 11:01:18 node06 kernel: [2569440.616519] [<ffffffffa0ffacc1>] ? ocfs2_lock_global_qf+0x28/0x81 [ocfs2]
> Apr 29 11:01:18 node06 kernel: [2569440.616540] [<ffffffffa0ffb3a3>] ? ocfs2_acquire_dquot+0x8d/0x105 [ocfs2]
> Apr 29 11:01:18 node06 kernel: [2569440.616557] [<ffffffff812ee7b5>] ? mutex_lock+0xd/0x31
> Apr 29 11:01:18 node06 kernel: [2569440.616574] [<ffffffff8112c2b2>] ? dqget+0x2ce/0x318
> Apr 29 11:01:18 node06 kernel: [2569440.616589] [<ffffffff8112cbad>] ? dquot_initialize+0x51/0x115
> Apr 29 11:01:18 node06 kernel: [2569440.616611] [<ffffffffa0fcaab8>] ? ocfs2_delete_inode+0x0/0x1640 [ocfs2]
> Apr 29 11:01:18 node06 kernel: [2569440.616630] [<ffffffff810fee1f>] ? generic_delete_inode+0xd7/0x168
> Apr 29 11:01:18 node06 kernel: [2569440.616652] [<ffffffffa0fca061>] ? ocfs2_drop_inode+0xc0/0x123 [ocfs2]
> Apr 29 11:01:18 node06 kernel: [2569440.616669] [<ffffffff810fdfa8>] ? iput+0x27/0x60
> Apr 29 11:01:18 node06 kernel: [2569440.616689] [<ffffffffa0fd0a8f>] ? ocfs2_complete_recovery+0x82b/0xa3f [ocfs2]
> Apr 29 11:01:18 node06 kernel: [2569440.616715] [<ffffffff8106144b>] ? worker_thread+0x188/0x21d
> Apr 29 11:01:18 node06 kernel: [2569440.616736] [<ffffffffa0fd0264>] ? ocfs2_complete_recovery+0x0/0xa3f [ocfs2]
> Apr 29 11:01:18 node06 kernel: [2569440.616761] [<ffffffff81064a36>] ? autoremove_wake_function+0x0/0x2e
> Apr 29 11:01:18 node06 kernel: [2569440.616778] [<ffffffff810612c3>] ? worker_thread+0x0/0x21d
> Apr 29 11:01:18 node06 kernel: [2569440.616793] [<ffffffff81064769>] ? kthread+0x79/0x81
> Apr 29 11:01:18 node06 kernel: [2569440.616810] [<ffffffff81011baa>] ? child_rip+0xa/0x20
> Apr 29 11:01:18 node06 kernel: [2569440.616825] [<ffffffff810646f0>] ? kthread+0x0/0x81
> Apr 29 11:01:18 node06 kernel: [2569440.616840] [<ffffffff81011ba0>] ? child_rip+0x0/0x20
> ----- cut here -----
>
> On all the others I had the following:
>
> ----- cut here -----
> Apr 29 11:00:23 node01 kernel: [2570880.752038] INFO: task o2quot/0:2971 blocked for more than 120 seconds.
> Apr 29 11:00:23 node01 kernel: [2570880.752059] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Apr 29 11:00:23 node01 kernel: [2570880.752083] o2quot/0 D 0000000000000000 0 2971 2 0x00000000
> Apr 29 11:00:23 node01 kernel: [2570880.752104] ffffffff814451f0 0000000000000046 0000000000000000 0000000000000002
> Apr 29 11:00:23 node01 kernel: [2570880.752134] ffff880249e28d20 000000000000f8a0 ffff88024cda3fd8 00000000000155c0
> Apr 29 11:00:23 node01 kernel: [2570880.752164] 00000000000155c0 ffff88024ce4e9f0 ffff88024ce4ece8 000000004cda3a60
> Apr 29 11:00:23 node01 kernel: [2570880.752195] Call Trace:
> Apr 29 11:00:23 node01 kernel: [2570880.752214] [<ffffffff812ee253>] ? schedule_timeout+0x2e/0xdd
> Apr 29 11:00:23 node01 kernel: [2570880.752233] [<ffffffff8110baff>] ? __find_get_block+0x176/0x186
> Apr 29 11:00:23 node01 kernel: [2570880.752261] [<ffffffffa04fd29c>] ? ocfs2_validate_quota_block+0x0/0x88 [ocfs2]
> Apr 29 11:00:23 node01 kernel: [2570880.752286] [<ffffffff812ee118>] ? wait_for_common+0xde/0x14f
> Apr 29 11:00:23 node01 kernel: [2570880.752304] [<ffffffff8104a188>] ? default_wake_function+0x0/0x9
> Apr 29 11:00:23 node01 kernel: [2570880.752326] [<ffffffffa04bbc46>] ? __ocfs2_cluster_lock+0x8a4/0x8c5 [ocfs2]
> Apr 29 11:00:23 node01 kernel: [2570880.752351] [<ffffffff81044e0e>] ? find_busiest_group+0x3af/0x874
> Apr 29 11:00:23 node01 kernel: [2570880.752373] [<ffffffffa04bf8ff>] ? ocfs2_inode_lock_full_nested+0x1a3/0xb2c [ocfs2]
> Apr 29 11:00:23 node01 kernel: [2570880.752402] [<ffffffffa04fbcc1>] ? ocfs2_lock_global_qf+0x28/0x81 [ocfs2]
> Apr 29 11:00:23 node01 kernel: [2570880.752424] [<ffffffffa04fbcc1>] ? ocfs2_lock_global_qf+0x28/0x81 [ocfs2]
> Apr 29 11:00:23 node01 kernel: [2570880.752446] [<ffffffffa04fc8f8>] ? ocfs2_sync_dquot_helper+0xca/0x300 [ocfs2]
> Apr 29 11:00:23 node01 kernel: [2570880.752474] [<ffffffffa04fc82e>] ? ocfs2_sync_dquot_helper+0x0/0x300 [ocfs2]
> Apr 29 11:00:23 node01 kernel: [2570880.752500] [<ffffffff8112ce8e>] ? dquot_scan_active+0x78/0xd0
> Apr 29 11:00:23 node01 kernel: [2570880.752521] [<ffffffffa04fbc2b>] ? qsync_work_fn+0x24/0x42 [ocfs2]
> Apr 29 11:00:23 node01 kernel: [2570880.752539] [<ffffffff8106144b>] ? worker_thread+0x188/0x21d
> Apr 29 11:00:23 node01 kernel: [2570880.752559] [<ffffffffa04fbc07>] ? qsync_work_fn+0x0/0x42 [ocfs2]
> Apr 29 11:00:23 node01 kernel: [2570880.752576] [<ffffffff81064a36>] ? autoremove_wake_function+0x0/0x2e
> Apr 29 11:00:23 node01 kernel: [2570880.752593] [<ffffffff810612c3>] ? worker_thread+0x0/0x21d
> Apr 29 11:00:23 node01 kernel: [2570880.752608] [<ffffffff81064769>] ? kthread+0x79/0x81
> Apr 29 11:00:23 node01 kernel: [2570880.752625] [<ffffffff81011baa>] ? child_rip+0xa/0x20
> Apr 29 11:00:23 node01 kernel: [2570880.752640] [<ffffffff810646f0>] ? kthread+0x0/0x81
> Apr 29 11:00:23 node01 kernel: [2570880.752655] [<ffffffff81011ba0>] ? child_rip+0x0/0x20
> ----- cut here -----
>
> By looking at the timestamps it seems that o2quot got stuck before ocfs2_wq, but right now I can't guarantee that they are 100% exact...
>
> Am I right in thinking it was a hardware failure?
>
> Best regards,