What do you mean by loss of the file system? Are you referring to the time
when the file system is hung? If so, that is unavoidable. The nodes have to
talk to each other whenever file system metadata is changing. When the
network went down, they waited for it to come back. Since it was taking too
long, the quorum logic kicked in and took down one of the nodes (the node
with the higher node number). If the network had come back up before the
quorum logic kicked in, they would have been fine.
May I ask what your expectations are when the network goes down?
thanks,
--Srini
Thompson, Mark wrote:
Hi,
I have done some more tests today, and I observed the following:
Test 1:
node 0 - ifdown eth2
node 0 - OCFS2 filesystem stalls on both nodes
node 1 - Decides to reboot
node 0 - Resumes OCFS2 service (while still off the network); OCFS2
filesystem back online
node 1 - Cannot re-join cluster as node 0 is off the network and has
the fs lock (Transport endpoint error)
node 0 - ifup eth2
node 1 - Re-joins the cluster and re-mounts OCFS2 filesystem.
Test 2:
node 1 - ifdown eth2
node 0 - OCFS2 filesystem stalls on both nodes
node 1 - Decides to reboot
node 0 - Resumes OCFS2 service, OCFS2 filesystem back online
node 1 - Boots up, re-joins cluster and re-mounts OCFS2 filesystem.
Is this the expected behaviour? And if it is, is there anything we can
do to avoid the loss of the OCFS2 filesystems?
Here's the messages file outputs.
Test 1 - Node 0
Nov 17 11:00:26 my_node0 kernel: ocfs2: Unmounting device (253,9) on
(node 0)
Nov 17 11:02:21 my_node0 modprobe: FATAL: Module ocfs2_stackglue not
found.
Nov 17 11:02:21 my_node0 kernel: OCFS2 Node Manager 1.4.4 Tue Sep 8
11:56:46 PDT 2009 (build 18a3a72794aaca6c0334f456bca873cd)
Nov 17 11:02:21 my_node0 kernel: OCFS2 DLM 1.4.4 Tue Sep 8 11:56:46
PDT 2009 (build e6e41b84c785deeea891e5873dbf19ab)
Nov 17 11:02:21 my_node0 kernel: OCFS2 DLMFS 1.4.4 Tue Sep 8 11:56:46
PDT 2009 (build e6e41b84c785deeea891e5873dbf19ab)
Nov 17 11:02:21 my_node0 kernel: OCFS2 User DLM kernel interface loaded
Nov 17 11:02:46 my_node0 kernel: OCFS2 1.4.4 Tue Sep 8 11:56:43 PDT
2009 (build 3a5bffa75b910d5bcdd5c607c4394b1e)
Nov 17 11:02:46 my_node0 kernel: ocfs2_dlm: Nodes in domain
("21751145F96E45649324C9EEF5485248"): 0
Nov 17 11:02:46 my_node0 kernel: ocfs2: Mounting device (253,9) on
(node 0, slot 0) with ordered data mode.
Nov 17 11:02:59 my_node0 kernel: ocfs2_dlm: Node 1 joins domain
21751145F96E45649324C9EEF5485248
Nov 17 11:02:59 my_node0 kernel: ocfs2_dlm: Nodes in domain
("21751145F96E45649324C9EEF5485248"): 0 1
Nov 17 11:07:51 my_node0 kernel: (15,1):dlm_do_master_request:1334
ERROR: link to 1 went down!
Nov 17 11:07:51 my_node0 kernel: (15,1):dlm_get_lock_resource:917
ERROR: status = -107
Nov 17 11:09:34 my_node0 kernel: (22108,1):ocfs2_dlm_eviction_cb:98
device (253,9): dlm has evicted node 1
Nov 17 11:09:34 my_node0 kernel: (29443,1):dlm_get_lock_resource:844
21751145F96E45649324C9EEF5485248:M000000000000000000001f96e7b609: at
least one node (1) to recover before lock mastery can begin
Nov 17 11:09:35 my_node0 kernel: (29443,1):dlm_get_lock_resource:898
21751145F96E45649324C9EEF5485248:M000000000000000000001f96e7b609: at
least one node (1) to recover before lock mastery can begin
Nov 17 11:09:36 my_node0 kernel: (15,1):dlm_restart_lock_mastery:1223
ERROR: node down! 1
Nov 17 11:09:36 my_node0 kernel: (15,1):dlm_wait_for_lock_mastery:1040
ERROR: status = -11
Nov 17 11:09:36 my_node0 kernel: (22167,0):dlm_get_lock_resource:844
21751145F96E45649324C9EEF5485248:$RECOVERY: at least one node (1) to
recover before lock mastery can begin
Nov 17 11:09:36 my_node0 kernel: (22167,0):dlm_get_lock_resource:878
21751145F96E45649324C9EEF5485248: recovery map is not empty, but must
master $RECOVERY lock now
Nov 17 11:09:36 my_node0 kernel: (22167,0):dlm_do_recovery:524 (22167)
Node 0 is the Recovery Master for the Dead Node 1 for Domain
21751145F96E45649324C9EEF5485248
Nov 17 11:09:46 my_node0 kernel: (29443,1):ocfs2_replay_journal:1183
Recovering node 1 from slot 1 on device (253,9)
Nov 17 11:12:27 my_node0 kernel: ocfs2_dlm: Node 1 joins domain
21751145F96E45649324C9EEF5485248
Nov 17 11:12:27 my_node0 kernel: ocfs2_dlm: Nodes in domain
("21751145F96E45649324C9EEF5485248"): 0 1
Test 1 - Node 1
Nov 17 11:00:26 my_node1 kernel: ocfs2_dlm: Node 0 leaves domain
21751145F96E45649324C9EEF5485248
Nov 17 11:00:26 my_node1 kernel: ocfs2_dlm: Nodes in domain
("21751145F96E45649324C9EEF5485248"): 1
Nov 17 11:00:46 my_node1 kernel: ocfs2: Unmounting device (253,9) on
(node 1)
Nov 17 11:02:30 my_node1 modprobe: FATAL: Module ocfs2_stackglue not
found.
Nov 17 11:02:30 my_node1 kernel: OCFS2 Node Manager 1.4.4 Tue Sep 8
11:56:46 PDT 2009 (build 18a3a72794aaca6c0334f456bca873cd)
Nov 17 11:02:30 my_node1 kernel: OCFS2 DLM 1.4.4 Tue Sep 8 11:56:46
PDT 2009 (build e6e41b84c785deeea891e5873dbf19ab)
Nov 17 11:02:30 my_node1 kernel: OCFS2 DLMFS 1.4.4 Tue Sep 8 11:56:46
PDT 2009 (build e6e41b84c785deeea891e5873dbf19ab)
Nov 17 11:02:30 my_node1 kernel: OCFS2 User DLM kernel interface loaded
Nov 17 11:02:59 my_node1 kernel: OCFS2 1.4.4 Tue Sep 8 11:56:43 PDT
2009 (build 3a5bffa75b910d5bcdd5c607c4394b1e)
Nov 17 11:02:59 my_node1 kernel: ocfs2_dlm: Nodes in domain
("21751145F96E45649324C9EEF5485248"): 0 1
Nov 17 11:02:59 my_node1 kernel: ocfs2: Mounting device (253,9) on
(node 1, slot 1) with ordered data mode.
Nov 17 11:07:27 my_node1 kernel:
(7351,3):dlm_send_remote_convert_request:395 ERROR: status = -112
Nov 17 11:07:27 my_node1 kernel: (7351,3):dlm_wait_for_node_death:370
21751145F96E45649324C9EEF5485248: waiting 5000ms for notification of
death of node 0
Nov 17 11:07:57 my_node1 kernel:
(7351,3):dlm_send_remote_convert_request:395 ERROR: status = -107
Nov 17 11:07:57 my_node1 kernel: (7351,3):dlm_wait_for_node_death:370
21751145F96E45649324C9EEF5485248: waiting 5000ms for notification of
death of node 0
Nov 17 11:08:27 my_node1 kernel: (15,1):dlm_do_master_request:1334
ERROR: link to 0 went down!
Nov 17 11:08:27 my_node1 kernel:
(7351,3):dlm_send_remote_convert_request:395 ERROR: status = -107
Nov 17 11:08:27 my_node1 kernel: (7351,3):dlm_wait_for_node_death:370
21751145F96E45649324C9EEF5485248: waiting 5000ms for notification of
death of node 0
Nov 17 11:08:27 my_node1 kernel: (15,1):dlm_get_lock_resource:917
ERROR: status = -107
Nov 17 11:11:31 my_node1 modprobe: FATAL: Module ocfs2_stackglue not
found.
Nov 17 11:11:32 my_node1 kernel: OCFS2 Node Manager 1.4.4 Tue Sep 8
11:56:46 PDT 2009 (build 18a3a72794aaca6c0334f456bca873cd)
Nov 17 11:11:32 my_node1 kernel: OCFS2 DLM 1.4.4 Tue Sep 8 11:56:46
PDT 2009 (build e6e41b84c785deeea891e5873dbf19ab)
Nov 17 11:11:32 my_node1 kernel: OCFS2 DLMFS 1.4.4 Tue Sep 8 11:56:46
PDT 2009 (build e6e41b84c785deeea891e5873dbf19ab)
Nov 17 11:11:32 my_node1 kernel: OCFS2 User DLM kernel interface loaded
Nov 17 11:11:40 my_node1 kernel: OCFS2 1.4.4 Tue Sep 8 11:56:43 PDT
2009 (build 3a5bffa75b910d5bcdd5c607c4394b1e)
Nov 17 11:12:06 my_node1 kernel: (6282,0):dlm_request_join:1036 ERROR:
status = -107
Nov 17 11:12:06 my_node1 kernel: (6282,0):dlm_try_to_join_domain:1210
ERROR: status = -107
Nov 17 11:12:06 my_node1 kernel: (6282,0):dlm_join_domain:1488 ERROR:
status = -107
Nov 17 11:12:06 my_node1 kernel: (6282,0):dlm_register_domain:1754
ERROR: status = -107
Nov 17 11:12:06 my_node1 kernel: (6282,0):ocfs2_dlm_init:2723 ERROR:
status = -107
Nov 17 11:12:06 my_node1 kernel: (6282,0):ocfs2_mount_volume:1437
ERROR: status = -107
Nov 17 11:12:06 my_node1 kernel: ocfs2: Unmounting device (253,9) on
(node 1)
Nov 17 11:12:27 my_node1 kernel: ocfs2_dlm: Nodes in domain
("21751145F96E45649324C9EEF5485248"): 0 1
Nov 17 11:12:27 my_node1 kernel: ocfs2: Mounting device (253,9) on
(node 1, slot 1) with ordered data mode.
Test 2 - Node 0
Nov 17 11:16:37 my_node0 kernel: (22166,3):dlm_send_proxy_ast_msg:458
ERROR: status = -107
Nov 17 11:16:37 my_node0 kernel: (22166,3):dlm_flush_asts:600 ERROR:
status = -107
Nov 17 11:17:35 my_node0 kernel: (22108,1):ocfs2_dlm_eviction_cb:98
device (253,9): dlm has evicted node 1
Nov 17 11:17:35 my_node0 kernel: (6515,1):ocfs2_replay_journal:1183
Recovering node 1 from slot 1 on device (253,9)
Nov 17 11:17:36 my_node0 kernel: (22167,0):dlm_get_lock_resource:844
21751145F96E45649324C9EEF5485248:$RECOVERY: at least one node (1) to
recover before lock mastery can begin
Nov 17 11:17:36 my_node0 kernel: (22167,0):dlm_get_lock_resource:878
21751145F96E45649324C9EEF5485248: recovery map is not empty, but must
master $RECOVERY lock now
Nov 17 11:17:36 my_node0 kernel: (22167,0):dlm_do_recovery:524 (22167)
Node 0 is the Recovery Master for the Dead Node 1 for Domain
21751145F96E45649324C9EEF5485248
Nov 17 11:19:31 my_node0 kernel: ocfs2_dlm: Node 1 joins domain
21751145F96E45649324C9EEF5485248
Nov 17 11:19:31 my_node0 kernel: ocfs2_dlm: Nodes in domain
("21751145F96E45649324C9EEF5485248"): 0 1
Test 2 - Node 1
Nov 17 11:19:22 my_node1 modprobe: FATAL: Module ocfs2_stackglue not
found.
Nov 17 11:19:23 my_node1 kernel: OCFS2 Node Manager 1.4.4 Tue Sep 8
11:56:46 PDT 2009 (build 18a3a72794aaca6c0334f456bca873cd)
Nov 17 11:19:23 my_node1 kernel: OCFS2 DLM 1.4.4 Tue Sep 8 11:56:46
PDT 2009 (build e6e41b84c785deeea891e5873dbf19ab)
Nov 17 11:19:23 my_node1 kernel: OCFS2 DLMFS 1.4.4 Tue Sep 8 11:56:46
PDT 2009 (build e6e41b84c785deeea891e5873dbf19ab)
Nov 17 11:19:23 my_node1 kernel: OCFS2 User DLM kernel interface loaded
Nov 17 11:19:31 my_node1 kernel: OCFS2 1.4.4 Tue Sep 8 11:56:43 PDT
2009 (build 3a5bffa75b910d5bcdd5c607c4394b1e)
Nov 17 11:19:31 my_node1 kernel: ocfs2_dlm: Nodes in domain
("21751145F96E45649324C9EEF5485248"): 0 1
Nov 17 11:19:31 my_node1 kernel: ocfs2: Mounting device (253,8) on
(node 1, slot 1) with ordered data mode.
Regards,
Mark
*From:* Srinivas Eeda [mailto:[email protected]]
*Sent:* 16 November 2009 16:05
*To:* Thompson, Mark
*Cc:* [email protected]
*Subject:* Re: [Ocfs2-users] 2 node OCFS2 clusters
Thompson, Mark wrote:
Hi Srini,
Thanks for the response.
So are the following statements correct:
If I stop the networking on node 1, node 0 will continue to allow
OCFS2 filesystems to work and not reboot itself.
If I stop the networking on node 0, node 1 (now being the lowest
node?) will continue to allow OCFS2 filesystems to work and not reboot
itself.
In both cases node 0 will survive, because that's the node that has the
lowest node number (defined in cluster.conf). This applies to the scenario
where the interconnect went down but the nodes are healthy and are still
heartbeating to the disk.
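For reference, the node numbers come from /etc/ocfs2/cluster.conf. A minimal
two-node example (names, IP addresses and the cluster name are placeholders,
not your actual values):

node:
        ip_port = 7777
        ip_address = 192.168.1.10
        number = 0
        name = my_node0
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.1.11
        number = 1
        name = my_node1
        cluster = ocfs2

cluster:
        node_count = 2
        name = ocfs2

In this example my_node0 has number = 0, so it is the node that would
survive an interconnect split.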
I guess I just need to know if it's possible to have a 2 node OCFS2
cluster that will cope with either one of the nodes dying, and have
the remaining node still provide service.
If node 0 itself panics or reboots, then node 1 will survive.
Regards,
Mark
*From:* Srinivas Eeda [mailto:[email protected]]
*Sent:* 16 November 2009 14:57
*To:* Thompson, Mark
*Cc:* [email protected] <mailto:[email protected]>
*Subject:* Re: [Ocfs2-users] 2 node OCFS2 clusters
In a cluster with more than two nodes, if the network on one node goes
down, that node will evict itself but the other nodes will survive. But in
a two-node cluster, the node with the lowest node number will survive, no
matter on which node the network went down.
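If you want to check which node the quorum code will favour on a running
cluster, o2cb exposes the registered node numbers (and the network timeouts)
through configfs. A quick sketch, assuming the default cluster name "ocfs2"
(adjust the path to your cluster name):

# list the nodes the cluster manager knows about
ls /sys/kernel/config/cluster/ocfs2/node/
# print each node's number; the lowest-numbered node survives a two-node split
for n in /sys/kernel/config/cluster/ocfs2/node/*; do
    echo "$(basename $n): $(cat $n/num)"
done
# current network idle timeout (ms), if your version exposes it here
cat /sys/kernel/config/cluster/ocfs2/idle_timeout_ms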
thanks,
--Srini
Thompson, Mark wrote:
Hi,
This is my first post here so please be gentle with me.
My question is, can you have a 2 node OCFS2 cluster, disconnect one
node from the network, and have the remaining node continue to
function normally? Currently we have a 2 node cluster, and if we take down
the NIC that carries the OCFS2 o2cb network connection, the other node will
reboot itself. I have researched having a 2 node OCFS2
cluster but so far I have been unable to find a clear solution. I have
looked at the FAQ regarding quorum, and my OCFS2 init scripts are
enabled etc.
Is this possible, or should we look at alternative solutions?
Regards,
Mark