On Thu, Mar 22, 2012 at 3:07 PM, Hisashi Osanai <osanai.hisa...@jp.fujitsu.com> wrote:
> Hello,
>
> I have a three-node cluster using pacemaker/corosync. When I reboot one node,
> the node is unable to rejoin the cluster. I see this kind of split brain in
> roughly 10-20% of attempts when I shut down a node.
>
> What do you think of this problem?

It depends on whether corosync sees all three nodes: if it does, it's a pacemaker problem; if not, it's a corosync problem. There are newer versions of both, so perhaps try an upgrade?
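To tell the two apart, something along these lines on the rebooted node should show the corosync view versus the pacemaker view (the commands assume the corosync 1.x / pacemaker 1.0 tools from the package versions you list, and the log path is only a guess, yours may differ):

    # ring status as corosync itself sees it
    corosync-cfgtool -s

    # membership changes logged by totem
    grep -i "new membership" /var/log/cluster/corosync.log

    # the pacemaker view, for comparison
    crm_mon -1

If corosync never reports a membership containing all three nodes, the problem is below pacemaker.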
>
> My questions are:
> - Is this a known problem?
> - Is there any workaround to avoid it?
> - How can I solve this problem?
>
> [testserver001]
> ============
> Last updated: Sat Mar 10 14:18:49 2012
> Stack: openais
> Current DC: NONE
> 3 Nodes configured, 3 expected votes
> 4 Resources configured.
> ============
>
> OFFLINE: [ testserver001 testserver002 testserver003 ]
>
> Migration summary:
>
> [testserver002]
> ============
> Last updated: Sat Mar 10 14:15:17 2012
> Stack: openais
> Current DC: testserver002 - partition with quorum
> Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
> 3 Nodes configured, 3 expected votes
> 4 Resources configured.
> ============
>
> Online: [ testserver002 testserver003 ]
> OFFLINE: [ testserver001 ]
>
> Resource Group: testgroup
>     testrsc                (lsb:testmgr):            Started testserver002
>     stonith-testserver002  (stonith:external/ipmi):  Started testserver003
>     stonith-testserver003  (stonith:external/ipmi):  Started testserver002
>     stonith-testserver001  (stonith:external/ipmi):  Started testserver003
>
> Migration summary:
> * Node testserver003:
> * Node testserver002:
>
> [testserver003]
> ============
> Last updated: Sat Mar 10 14:19:07 2012
> Stack: openais
> Current DC: testserver002 - partition with quorum
> Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
> 3 Nodes configured, 3 expected votes
> 4 Resources configured.
> ============
>
> Online: [ testserver002 testserver003 ]
> OFFLINE: [ testserver001 ]
>
> Resource Group: testgroup
>     testrsc                (lsb:testmgr):            Started testserver002
>     stonith-testserver002  (stonith:external/ipmi):  Started testserver003
>     stonith-testserver003  (stonith:external/ipmi):  Started testserver002
>     stonith-testserver001  (stonith:external/ipmi):  Started testserver003
>
> Migration summary:
> * Node testserver003:
> * Node testserver002:
>
> - Checked information
>   + https://bugzilla.redhat.com/show_bug.cgi?id=525589
>     It looks like the packages I use already include this fix.
>   + http://comments.gmane.org/gmane.linux.highavailability.user/36101
>     I checked the entries in /etc/hosts but did not find a wrong entry.
>     ===
>     127.0.0.1   testserver001 localhost
>     ::1         localhost6.localdomain6 localhost6
>     ===
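I can't say this is your problem, but note that the first line above makes testserver001 resolve to 127.0.0.1. The layout I usually see recommended for cluster nodes keeps only localhost on the loopback lines and puts each node name on its cluster address; purely as an illustration (the 172.27.4.x mapping below is assumed from your captures and may not match your network):

    127.0.0.1   localhost
    ::1         localhost6.localdomain6 localhost6
    172.27.4.1  testserver001
    172.27.4.2  testserver002
    172.27.4.3  testserver003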
>
> - Looking into this with tcpdump
>   OK case: after MESSAGE_TYPE_ORF_TOKEN is received, the node sends
>   MESSAGE_TYPE_MCAST. (This trace was taken in a VMware environment.)
>
>   + MESSAGE_TYPE_ORF_TOKEN
>     No.   Time                        Source      Destination  Protocol  Length  Info
>     119   2012-03-19 22:00:15.250310  172.27.4.1  172.27.4.2   UDP       112     Source port: 23489  Destination port: 23490
>
>     Frame 119: 112 bytes on wire (896 bits), 112 bytes captured (896 bits)
>     Ethernet II, Src: Vmware_6b:b9:9a (00:0c:29:6b:b9:9a), Dst: Vmware_8e:74:92 (00:0c:29:8e:74:92)
>     Internet Protocol Version 4, Src: 172.27.4.1 (172.27.4.1), Dst: 172.27.4.2 (172.27.4.2)
>     User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490 (23490)
>     Data (70 bytes)
>
>     0000  00 00 22 ff ac 1b 04 01 00 00 00 00 0c 00 00 00   ..".............
>     0010  00 00 00 00 00 00 00 00 ac 1b 04 01 02 00 ac 1b   ................
>     (snip)
>
>   + MESSAGE_TYPE_MCAST
>     No.   Time                        Source      Destination   Protocol  Length  Info
>     5141  2012-03-19 22:01:19.198346  172.27.4.2  226.94.16.16  UDP       1486    Source port: 23489  Destination port: 23490
>
>     Frame 5141: 1486 bytes on wire (11888 bits), 1486 bytes captured (11888 bits)
>     Ethernet II, Src: Vmware_8e:74:92 (00:0c:29:8e:74:92), Dst: IPv4mcast_5e:10:10 (01:00:5e:5e:10:10)
>     Internet Protocol Version 4, Src: 172.27.4.2 (172.27.4.2), Dst: 226.94.16.16 (226.94.16.16)
>     User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490 (23490)
>     Data (1444 bytes)
>
>     0000  01 02 22 ff ac 1b 04 02 ac 1b 04 02 02 00 ac 1b   ..".............
>     0010  04 02 08 00 02 00 ac 1b 04 02 08 00 04 00 ac 1b   ................
>     (snip)
>
>   NG case: MESSAGE_TYPE_ORF_TOKEN is sent and received repeatedly, and I see
>   the messages below in pacemaker.log.
>
>   + MESSAGE_TYPE_ORF_TOKEN
>     No.    Time                        Source      Destination  Protocol  Length  Info
>     39605  2012-03-10 14:18:13.826778  172.27.4.2  172.27.4.3   UDP       112     Source port: 23489  Destination port: 23490
>
>     Frame 39605: 112 bytes on wire (896 bits), 112 bytes captured (896 bits)
>     Ethernet II, Src: FujitsuT_98:79:4b (00:19:99:98:79:4b), Dst: FujitsuT_97:8d:15 (00:19:99:97:8d:15)
>     Internet Protocol Version 4, Src: 172.27.4.2 (172.27.4.2), Dst: 172.27.4.3 (172.27.4.3)
>     User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490 (23490)
>     Data (70 bytes)
>
>     0000  00 00 22 ff ac 1b 04 01 00 00 00 00 01 00 00 00   ..".............
>     0010  ff ff ff ff ac 1b 04 01 ac 1b 04 01 02 00 ac 1b   ................
>     (snip)
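Just to make the classification above explicit: the first byte of the UDP payload in these dumps is the totemsrp message type, matching the message_type enum you quote further down (0x00 for the tokens, 0x01 for the multicast in the OK case). A throwaway sketch in case it helps when sorting a larger capture; the helper below is mine, not anything shipped with corosync:

    #!/usr/bin/env python
    # Classify a totem packet by the first byte of its UDP payload,
    # using the values from totemsrp's message_type enum.
    MESSAGE_TYPES = {
        0: "MESSAGE_TYPE_ORF_TOKEN",
        1: "MESSAGE_TYPE_MCAST",
        2: "MESSAGE_TYPE_MEMB_MERGE_DETECT",
        3: "MESSAGE_TYPE_MEMB_JOIN",
        4: "MESSAGE_TYPE_MEMB_COMMIT_TOKEN",
        5: "MESSAGE_TYPE_TOKEN_HOLD_CANCEL",
    }

    def classify(data_hex):
        """data_hex: the 'Data' bytes as shown by Wireshark, e.g. '00 00 22 ff ...'."""
        first = int(data_hex.split()[0], 16)
        return MESSAGE_TYPES.get(first, "unknown (0x%02x)" % first)

    print(classify("00 00 22 ff ac 1b 04 01"))  # -> MESSAGE_TYPE_ORF_TOKEN
    print(classify("01 02 22 ff ac 1b 04 02"))  # -> MESSAGE_TYPE_MCAST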
>
>   + pacemaker.log
>     Mar 10 14:20:09 testserver001 crmd: [7551]: info: crm_timer_popped: Election Trigger (I_DC_TIMEOUT) just popped!
>     Mar 10 14:20:09 testserver001 crmd: [7551]: WARN: do_log: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
>     Mar 10 14:20:09 testserver001 crmd: [7551]: info: do_state_transition: State transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT cause=C_TIMER_POPPED origin=crm_timer_popped ]
>     Mar 10 14:22:09 testserver001 crmd: [7551]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped!
>     Mar 10 14:22:09 testserver001 crmd: [7551]: info: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED origin=crm_timer_popped ]
>     Mar 10 14:22:09 testserver001 crmd: [7551]: info: do_te_control: Registering TE UUID: b2bb3cc4-cead-475c-bb73-3adbb60142ae
>     Mar 10 14:22:09 testserver001 crmd: [7551]: WARN: cib_client_add_notify_callback: Callback already present
>     Mar 10 14:22:09 testserver001 crmd: [7551]: info: set_graph_functions: Setting custom graph functions
>     Mar 10 14:22:09 testserver001 crmd: [7551]: info: unpack_graph: Unpacked transition -1: 0 actions in 0 synapses
>     Mar 10 14:22:09 testserver001 crmd: [7551]: info: do_dc_takeover: Taking over DC status for this partition
>     Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_readwrite: We are now in R/W mode
>     Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_master for section 'all' (origin=local/crmd/6, version=0.143.0): ok (rc=0)
>     Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/7, version=0.143.0): ok (rc=0)
>     Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/9, version=0.143.0): ok (rc=0)
>     Mar 10 14:22:09 testserver001 crmd: [7551]: info: do_dc_join_offer_all: join-1: Waiting on 1 outstanding join acks
>     Mar 10 14:22:09 testserver001 crmd: [7551]: info: ais_dispatch: Membership 516: quorum still lost
>     Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/11, version=0.143.0): ok (rc=0)
>     Mar 10 14:22:09 testserver001 crmd: [7551]: info: crm_ais_dispatch: Setting expected votes to 3
>     Mar 10 14:22:09 testserver001 crmd: [7551]: info: config_query_callback: Checking for expired actions every 900000ms
>     Mar 10 14:22:09 testserver001 crmd: [7551]: info: config_query_callback: Sending expected-votes=3 to corosync
>     Mar 10 14:22:09 testserver001 crmd: [7551]: info: ais_dispatch: Membership 516: quorum still lost
>     Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/14, version=0.143.0): ok (rc=0)
>     Mar 10 14:22:09 testserver001 crmd: [7551]: info: crm_ais_dispatch: Setting expected votes to 3
>     Mar 10 14:22:09 testserver001 crmd: [7551]: info: te_connect_stonith: Attempting connection to fencing daemon...
>     Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/16, version=0.143.0): ok (rc=0)
>     Mar 10 14:22:10 testserver001 crmd: [7551]: info: te_connect_stonith: Connected
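The DC timeout and the repeated "Membership 516: quorum still lost" above say the crmd on testserver001 never heard from the existing DC, which would be consistent with corosync on that node forming a membership of its own. The corosync log for the same time window should show whether totem ever formed a membership containing the other two nodes; roughly (the file name depends on your logging{} configuration, so adjust the path):

    grep -iE "new membership|unable to form a cluster" /var/log/cluster/corosync.log

If totem keeps complaining that it is unable to form a cluster, or only ever forms single-node memberships, this is the corosync/network side of the split rather than pacemaker.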
>
>   + enum message_type {
>         MESSAGE_TYPE_ORF_TOKEN = 0,          /* Ordering, Reliability, Flow (ORF) control Token */
>         MESSAGE_TYPE_MCAST = 1,              /* ring ordered multicast message */
>         MESSAGE_TYPE_MEMB_MERGE_DETECT = 2,  /* merge rings if there are available rings */
>         MESSAGE_TYPE_MEMB_JOIN = 3,          /* membership join message */
>         MESSAGE_TYPE_MEMB_COMMIT_TOKEN = 4,  /* membership commit token */
>         MESSAGE_TYPE_TOKEN_HOLD_CANCEL = 5,  /* cancel the holding of the token */
>     };
>
> - Packages on CentOS 5.6
>   + pacemaker-1.0.10-1.4.el5
>   + corosync-1.2.5-1.3.el5
>
> Thank you in advance,
> Hisashi Osanai
>
> Hisashi Osanai (osanai.hisa...@jp.fujitsu.com)

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org