Someone around the cubes mentioned using a tool called iozone to test disk I/O on these LUNs that are formatted with ocfs2. With the collective knowledge here: is this OK to do? Smart or not smart, useful or not useful? These stalled I/Os are hurting me too, but on storage that's not EMC... so this thread is also relevant to my current pains.
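To make the question concrete, the kind of run I had in mind is sketched below; the mount point, file name and sizes are placeholders rather than anything from our actual setup, and I would only do it in a quiet window since a heavy run adds exactly the kind of I/O load being discussed in the thread below:

    # write/rewrite (-i 0) and read/reread (-i 1) on a 4 GB test file with 1 MB records;
    # -I requests O_DIRECT to bypass the page cache, -e/-c include fsync()/close() in the timings
    iozone -i 0 -i 1 -s 4g -r 1m -I -e -c -f /mnt/ocfs2/iozone.tmp

Running it from one node first, then from both nodes at once (with different -f files), would presumably show whether the stalls are tied to concurrent cluster access or just to raw load on the LUN.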
On Fri, Feb 5, 2010 at 3:16 AM, Mailing List SVR <li...@svrinformatica.it> wrote:
> EMC installed a statistics collection tool, but so far nothing strange has been
> observed and the hangs continue ...
>
> I was forced to install a watchdog that reboots the servers when touch <file>
> doesn't complete within 30 seconds.
>
> Don't the logs posted in the previous emails suggest any problem on the ocfs2
> side?
>
> thanks
> Nicola
>
> On Wednesday 27 January 2010 17:14:39, Mailing List SVR wrote:
>> Please note that at 15:29:30 I did a manual reboot on node1, so the
>> hang on node2 lasted from 14:59:34 to 15:29:30.
>>
>> regards
>> Nicola
>>
>> On Wednesday 27 January 2010 17:03:07, Mailing List SVR wrote:
>> > Hi Sunil,
>> >
>> > another hang here:
>> >
>> > - node2 (doing heavy I/O at that time) was up and ping to it was ok
>> >   (following your suggestion I started a ping from node1 to node2 some
>> >   days ago and no packets were lost, so the switch is ok), but I could not
>> >   log in via ssh or from the console (hang)
>> > - node1 was up with low load but was unable to write to the ocfs2
>> >   partition (a simple touch <file> hung)
>> > - after rebooting node1, node2 started working again and I was able to
>> >   access it
>> >
>> > I'm investigating possible hardware problems with EMC too, but this seems
>> > to me to be an ocfs2 problem.
>> >
>> > Here are some logs:
>> >
>> > node1:
>> >
>> > Jan 27 14:59:33 nvr1-rc kernel: o2net: connection to node nvr2-rc.minint.it (num 0) at 1.1.1.6:7777 has been idle for 35.0 seconds, shutting it down.
>> > Jan 27 14:59:33 nvr1-rc kernel: (0,7):o2net_idle_timer:1503 here are some times that might help debug the situation: (tmr 1264600738.986852 now 1264600773.986647 dr 1264600738.989741 adv 1264600738.989753:1264600738.989754 func (75554989:507) 1264600736.951519:1264600736.951521)
>> > Jan 27 14:59:33 nvr1-rc kernel: o2net: no longer connected to node nvr2-rc.minint.it (num 0) at 1.1.1.6:7777
>> > Jan 27 14:59:33 nvr1-rc kernel: (4025,5):dlm_drop_lockres_ref:2211 ERROR: status = -112
>> > Jan 27 14:59:33 nvr1-rc kernel: (4025,5):dlm_purge_lockres:206 ERROR: status = -112
>> > Jan 27 14:59:33 nvr1-rc kernel: (4025,5):dlm_drop_lockres_ref:2211 ERROR: status = -107
>> > Jan 27 14:59:33 nvr1-rc kernel: (4025,5):dlm_purge_lockres:206 ERROR: status = -107
>> > Jan 27 14:59:33 nvr1-rc kernel: (4025,5):dlm_drop_lockres_ref:2211 ERROR: status = -107
>> > Jan 27 14:59:33 nvr1-rc kernel: (4025,5):dlm_purge_lockres:206 ERROR: status = -107
>> > Jan 27 14:59:33 nvr1-rc kernel: (4025,5):dlm_drop_lockres_ref:2211 ERROR: status = -107
>> > Jan 27 14:59:33 nvr1-rc kernel: (4025,5):dlm_purge_lockres:206 ERROR: status = -107
>> > Jan 27 14:59:33 nvr1-rc kernel: (4025,5):dlm_drop_lockres_ref:2211 ERROR: status = -107
>> > Jan 27 14:59:33 nvr1-rc kernel: (4025,5):dlm_purge_lockres:206 ERROR: status = -107
>> > Jan 27 14:59:33 nvr1-rc kernel: (4025,5):dlm_send_proxy_ast_msg:458 ERROR: status = -107
>> > Jan 27 14:59:33 nvr1-rc kernel: (4025,5):dlm_flush_asts:600 ERROR: status = -107
>> > Jan 27 14:59:33 nvr1-rc kernel: o2net: connected to node nvr2-rc.minint.it (num 0) at 1.1.1.6:7777
>> >
>> > node2:
>> >
>> > Jan 27 14:59:34 nvr2-rc kernel: o2net: no longer connected to node nvr1-rc.minint.it (num 1) at 1.1.1.5:7777
>> > Jan 27 14:59:34 nvr2-rc kernel: (4000,4):dlm_drop_lockres_ref:2211 ERROR: status = -112
>> > Jan 27 14:59:34 nvr2-rc kernel: (6208,5):dlm_send_remote_unlock_request:359 ERROR: status = -112
>> > Jan 27 14:59:34 nvr2-rc kernel: (3344,3):dlm_do_master_request:1334 ERROR: link to 1 went down!
>> > Jan 27 14:59:34 nvr2-rc kernel: (6056,1):dlm_do_master_request:1334 ERROR: link to 1 went down!
>> > Jan 27 14:59:34 nvr2-rc kernel: (6056,1):dlm_get_lock_resource:917 ERROR: status = -112
>> > Jan 27 14:59:34 nvr2-rc kernel: (3344,3):dlm_get_lock_resource:917 ERROR: status = -112
>> > Jan 27 14:59:34 nvr2-rc kernel: (4000,4):dlm_purge_lockres:206 ERROR: status = -112
>> > Jan 27 14:59:34 nvr2-rc kernel: (5214,4):dlm_send_remote_unlock_request:359 ERROR: status = -112
>> > Jan 27 14:59:34 nvr2-rc kernel: o2net: accepted connection from node nvr1-rc.minint.it (num 1) at 1.1.1.5:7777
>> > Jan 27 15:29:30 nvr2-rc kernel: o2net: connection to node nvr1-rc.minint.it (num 1) at 1.1.1.5:7777 has been idle for 35.0 seconds, shutting it down.
>> > Jan 27 15:29:30 nvr2-rc kernel: (0,5):o2net_idle_timer:1503 here are some times that might help debug the situation: (tmr 1264602535.751483 now 1264602570.751723 dr 1264602535.751469 adv 1264602535.751491:1264602535.751491 func (75554989:505) 1264602517.753992:1264602517.753999)
>> > Jan 27 15:29:30 nvr2-rc kernel: o2net: no longer connected to node nvr1-rc.minint.it (num 1) at 1.1.1.5:7777
>> > Jan 27 15:30:05 nvr2-rc kernel: (3943,1):o2net_connect_expired:1664 ERROR: no connection established with node 1 after 35.0 seconds, giving up and returning errors.
>> > Jan 27 15:30:40 nvr2-rc kernel: (3943,5):o2net_connect_expired:1664 ERROR: no connection established with node 1 after 35.0 seconds, giving up and returning errors.
>> > Jan 27 15:30:58 nvr2-rc kernel: (3985,1):ocfs2_dlm_eviction_cb:98 device (120,1): dlm has evicted node 1
>> > Jan 27 15:30:58 nvr2-rc kernel: (6983,5):dlm_get_lock_resource:844 3AE0B7F3BAB749D09D37DAE16FA38042:M00000000000000000000122d757fa5: at least one node (1) to recover before lock mastery can begin
>> > Jan 27 15:30:58 nvr2-rc kernel: (6056,1):dlm_restart_lock_mastery:1223 ERROR: node down! 1
>> > Jan 27 15:30:58 nvr2-rc kernel: (6056,1):dlm_wait_for_lock_mastery:1040 ERROR: status = -11
>> > Jan 27 15:30:58 nvr2-rc kernel: (3344,3):dlm_restart_lock_mastery:1223 ERROR: node down! 1
>> > Jan 27 15:30:58 nvr2-rc kernel: (3344,3):dlm_wait_for_lock_mastery:1040 ERROR: status = -11
>> > Jan 27 15:30:59 nvr2-rc kernel: (6983,5):dlm_get_lock_resource:898 3AE0B7F3BAB749D09D37DAE16FA38042:M00000000000000000000122d757fa5: at least one node (1) to recover before lock mastery can begin
>> > Jan 27 15:30:59 nvr2-rc kernel: (6056,1):dlm_get_lock_resource:898 3AE0B7F3BAB749D09D37DAE16FA38042:N00000000000201de: at least one node (1) to recover before lock mastery can begin
>> > Jan 27 15:30:59 nvr2-rc kernel: (3344,3):dlm_get_lock_resource:898 3AE0B7F3BAB749D09D37DAE16FA38042:M000000000000004dc281b900000000: at least one node (1) to recover before lock mastery can begin
>> > Jan 27 15:31:01 nvr2-rc kernel: (4001,6):dlm_get_lock_resource:844 3AE0B7F3BAB749D09D37DAE16FA38042:$RECOVERY: at least one node (1) to recover before lock mastery can begin
>> > Jan 27 15:31:01 nvr2-rc kernel: (4001,6):dlm_get_lock_resource:878 3AE0B7F3BAB749D09D37DAE16FA38042: recovery map is not empty, but must master $RECOVERY lock now
>> > Jan 27 15:31:01 nvr2-rc kernel: (4001,6):dlm_do_recovery:524 (4001) Node 0 is the Recovery Master for the Dead Node 1 for Domain 3AE0B7F3BAB749D09D37DAE16FA38042
>> > Jan 27 15:31:10 nvr2-rc kernel: (6983,1):ocfs2_replay_journal:1183 Recovering node 1 from slot 0 on device (120,1)
>> >
>> > thanks
>> > Nicola
>> >
>> > On Thursday 14 January 2010 21:13:15, Sunil Mushran wrote:
>> > > Mailing List SVR wrote:
>> > > > Hi,
>> > > >
>> > > > periodically one of the nodes in my two-node cluster is fenced; here are
>> > > > the logs:
>> > > >
>> > > > Jan 14 07:01:44 nvr1-rc kernel: o2net: no longer connected to node nvr2-rc.minint.it (num 0) at 1.1.1.6:7777
>> > > > Jan 14 07:01:44 nvr1-rc kernel: (21534,1):dlm_do_master_request:1334 ERROR: link to 0 went down!
>> > > > Jan 14 07:01:44 nvr1-rc kernel: (4007,4):dlm_send_proxy_ast_msg:458 ERROR: status = -112
>> > > > Jan 14 07:01:44 nvr1-rc kernel: (4007,4):dlm_flush_asts:600 ERROR: status = -112
>> > > > Jan 14 07:01:44 nvr1-rc kernel: (21534,1):dlm_get_lock_resource:917 ERROR: status = -112
>> > > > Jan 14 07:02:19 nvr1-rc kernel: (3950,5):o2net_connect_expired:1664 ERROR: no connection established with node 0 after 35.0 seconds, giving up and returning errors.
>> > > > Jan 14 07:02:54 nvr1-rc kernel: (3950,5):o2net_connect_expired:1664 ERROR: no connection established with node 0 after 35.0 seconds, giving up and returning errors.
>> > > > Jan 14 07:03:10 nvr1-rc kernel: (4007,4):dlm_send_proxy_ast_msg:458 ERROR: status = -107
>> > > > Jan 14 07:03:10 nvr1-rc kernel: (4007,4):dlm_flush_asts:600 ERROR: status = -107
>> > > > Jan 14 07:03:29 nvr1-rc kernel: (3950,5):o2net_connect_expired:1664 ERROR: no connection established with node 0 after 35.0 seconds, giving up and returning errors.
>> > > > Jan 14 07:03:50 nvr1-rc kernel: (31,5):o2quo_make_decision:146 ERROR: fencing this node because it is connected to a half-quorum of 1 out of 2 nodes which doesn't include the lowest active node 0
>> > > > Jan 14 07:03:50 nvr1-rc kernel: (31,5):o2hb_stop_all_regions:1967 ERROR: stopping heartbeat on all active regions.
>> > > >
>> > > > I'm sure there are no network connectivity problems, but it is possible
>> > > > that there were heavy I/O loads. Is this the intended behaviour? Why is
>> > > > the loaded node fenced under heavy load?
>> > > >
>> > > > I'm using ocfs2-1.4.4 on rhel5 kernel-2.6.18-164.6.1.el5
>> > >
>> > > So the network connection snapped. What it means is that the nodes
>> > > could not ping each other for 35 seconds. In fact node 1 (this one)
>> > > tried to reconnect to node 0 but got no reply back. So the network
>> > > issue lasted for over 2 minutes.
>> > >
>> > > Switch could be one culprit. See if the switch logs say something.
>> > > Other possibility is that node 0 was paging heavily, or kswapd was
>> > > pegged at 100%. This is hard to determine after the fact. Something to
>> > > keep in mind the next time you see the same issue. If that is the case,
>> > > then that needs to be fixed. Maybe add more memory. Or, if you are
>> > > running a database, ensure you are using hugepages, etc.
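Sunil's point above, that paging or a pegged kswapd is hard to determine after the fact, suggests leaving a lightweight collector running on both nodes so the data is already there the next time a node hangs. A rough sketch, assuming the sysstat package is installed; the log paths are arbitrary, and the jobs should be started under nohup or screen so they survive the login session:

    # sample memory/swap activity and paging statistics every 10 seconds
    vmstat 10 >> /var/log/vmstat-$(hostname).log &
    sar -B 10 >> /var/log/sar-paging-$(hostname).log &
    # keep a record of kswapd CPU usage so a pegged kswapd shows up in the history
    top -b -d 10 | grep --line-buffered kswapd >> /var/log/kswapd-$(hostname).log &

sar stamps each sample with a time of day, so those samples can be lined up against the o2net idle-timer messages quoted above when the next hang happens.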
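On the watchdog Nicola mentions near the top of the thread (reboot when a touch on the ocfs2 mount doesn't complete within 30 seconds), a minimal sketch of that kind of probe is below; the probe path, flag file and reboot method are my own placeholders, not what he actually runs:

    #!/bin/bash
    # probe the ocfs2 mount in the background and record success on local disk
    PROBE=/mnt/ocfs2/.wd_probe          # placeholder path on the shared ocfs2 mount
    FLAG=/tmp/ocfs2_probe_ok            # local flag, written only if the probe completes
    rm -f "$FLAG"
    ( touch "$PROBE" && touch "$FLAG" ) &
    sleep 30
    if [ ! -e "$FLAG" ]; then
        logger -t ocfs2-watchdog "touch on $PROBE did not complete within 30s"
        # echo b > /proc/sysrq-trigger  # uncomment (as root) to hard-reboot a hung node
    fi

Run from cron every minute or so, this at least turns a silent hang into a logged event before anything as drastic as an automatic reboot is wired in.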