On Thu, Mar 22, 2012 at 3:07 PM, Hisashi Osanai <osanai.hisa...@jp.fujitsu.com> wrote:
> Hello,
>
> I have a three-node cluster using pacemaker/corosync. When I reboot one node,
> the node is unable to rejoin the cluster. I see this kind of split brain in
> roughly 10-20% of attempts when I shut down a node.
>
> What do you think of this problem?

It depends on whether corosync sees all three nodes: if it does, it's a pacemaker problem; if not, it's a corosync problem. There are newer versions of both, so perhaps try an upgrade?
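To tell the two apart, something along these lines on the rebooted node should show the corosync view versus the pacemaker view (the commands assume the corosync 1.x / pacemaker 1.0 tools from the package versions you list, and the log path is only a guess, yours may differ):

    # ring status as corosync itself sees it
    corosync-cfgtool -s

    # membership changes logged by totem
    grep -i "new membership" /var/log/cluster/corosync.log

    # the pacemaker view, for comparison
    crm_mon -1

If corosync never reports a membership containing all three nodes, the problem is below pacemaker.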
>
> My questions are:
> - Is this a known problem?
> - Is there any workaround to avoid it?
> - How can I solve this problem?
>
> [testserver001]
> ============
> Last updated: Sat Mar 10 14:18:49 2012
> Stack: openais
> Current DC: NONE
> 3 Nodes configured, 3 expected votes
> 4 Resources configured.
> ============
>
> OFFLINE: [ testserver001 testserver002 testserver003 ]
>
> Migration summary:
>
> [testserver002]
> ============
> Last updated: Sat Mar 10 14:15:17 2012
> Stack: openais
> Current DC: testserver002 - partition with quorum
> Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
> 3 Nodes configured, 3 expected votes
> 4 Resources configured.
> ============
>
> Online: [ testserver002 testserver003 ]
> OFFLINE: [ testserver001 ]
>
> Resource Group: testgroup
>     testrsc                (lsb:testmgr):            Started testserver002
>     stonith-testserver002  (stonith:external/ipmi):  Started testserver003
>     stonith-testserver003  (stonith:external/ipmi):  Started testserver002
>     stonith-testserver001  (stonith:external/ipmi):  Started testserver003
>
> Migration summary:
> * Node testserver003:
> * Node testserver002:
>
> [testserver003]
> ============
> Last updated: Sat Mar 10 14:19:07 2012
> Stack: openais
> Current DC: testserver002 - partition with quorum
> Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
> 3 Nodes configured, 3 expected votes
> 4 Resources configured.
> ============
>
> Online: [ testserver002 testserver003 ]
> OFFLINE: [ testserver001 ]
>
> Resource Group: testgroup
>     testrsc                (lsb:testmgr):            Started testserver002
>     stonith-testserver002  (stonith:external/ipmi):  Started testserver003
>     stonith-testserver003  (stonith:external/ipmi):  Started testserver002
>     stonith-testserver001  (stonith:external/ipmi):  Started testserver003
>
> Migration summary:
> * Node testserver003:
> * Node testserver002:
>
> - Checked information
>   + https://bugzilla.redhat.com/show_bug.cgi?id=525589
>     It looks like the packages I use already include this fix.
>   + http://comments.gmane.org/gmane.linux.highavailability.user/36101
>     I checked the entries in /etc/hosts but did not find a wrong entry.
>     ===
>     127.0.0.1   testserver001 localhost
>     ::1         localhost6.localdomain6 localhost6
>     ===
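I can't say this is your problem, but note that the first line above makes testserver001 resolve to 127.0.0.1. The layout I usually see recommended for cluster nodes keeps only localhost on the loopback lines and puts each node name on its cluster address; purely as an illustration (the 172.27.4.x mapping below is assumed from your captures and may not match your network):

    127.0.0.1   localhost
    ::1         localhost6.localdomain6 localhost6
    172.27.4.1  testserver001
    172.27.4.2  testserver002
    172.27.4.3  testserver003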
>
> - Looking into this with tcpdump
>   OK case: after MESSAGE_TYPE_ORF_TOKEN is received, the node sends
>   MESSAGE_TYPE_MCAST. (This trace was taken in a VMware environment.)
>
>   + MESSAGE_TYPE_ORF_TOKEN
>     No.   Time                        Source      Destination  Protocol  Length  Info
>     119   2012-03-19 22:00:15.250310  172.27.4.1  172.27.4.2   UDP       112     Source port: 23489  Destination port: 23490
>
>     Frame 119: 112 bytes on wire (896 bits), 112 bytes captured (896 bits)
>     Ethernet II, Src: Vmware_6b:b9:9a (00:0c:29:6b:b9:9a), Dst: Vmware_8e:74:92 (00:0c:29:8e:74:92)
>     Internet Protocol Version 4, Src: 172.27.4.1 (172.27.4.1), Dst: 172.27.4.2 (172.27.4.2)
>     User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490 (23490)
>     Data (70 bytes)
>
>     0000  00 00 22 ff ac 1b 04 01 00 00 00 00 0c 00 00 00   ..".............
>     0010  00 00 00 00 00 00 00 00 ac 1b 04 01 02 00 ac 1b   ................
>     (snip)
>
>   + MESSAGE_TYPE_MCAST
>     No.   Time                        Source      Destination   Protocol  Length  Info
>     5141  2012-03-19 22:01:19.198346  172.27.4.2  226.94.16.16  UDP       1486    Source port: 23489  Destination port: 23490
>
>     Frame 5141: 1486 bytes on wire (11888 bits), 1486 bytes captured (11888 bits)
>     Ethernet II, Src: Vmware_8e:74:92 (00:0c:29:8e:74:92), Dst: IPv4mcast_5e:10:10 (01:00:5e:5e:10:10)
>     Internet Protocol Version 4, Src: 172.27.4.2 (172.27.4.2), Dst: 226.94.16.16 (226.94.16.16)
>     User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490 (23490)
>     Data (1444 bytes)
>
>     0000  01 02 22 ff ac 1b 04 02 ac 1b 04 02 02 00 ac 1b   ..".............
>     0010  04 02 08 00 02 00 ac 1b 04 02 08 00 04 00 ac 1b   ................
>     (snip)
>
>   NG case: MESSAGE_TYPE_ORF_TOKEN is sent and received repeatedly, and I see
>   the messages below in pacemaker.log.
>
>   + MESSAGE_TYPE_ORF_TOKEN
>     No.    Time                        Source      Destination  Protocol  Length  Info
>     39605  2012-03-10 14:18:13.826778  172.27.4.2  172.27.4.3   UDP       112     Source port: 23489  Destination port: 23490
>
>     Frame 39605: 112 bytes on wire (896 bits), 112 bytes captured (896 bits)
>     Ethernet II, Src: FujitsuT_98:79:4b (00:19:99:98:79:4b), Dst: FujitsuT_97:8d:15 (00:19:99:97:8d:15)
>     Internet Protocol Version 4, Src: 172.27.4.2 (172.27.4.2), Dst: 172.27.4.3 (172.27.4.3)
>     User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490 (23490)
>     Data (70 bytes)
>
>     0000  00 00 22 ff ac 1b 04 01 00 00 00 00 01 00 00 00   ..".............
>     0010  ff ff ff ff ac 1b 04 01 ac 1b 04 01 02 00 ac 1b   ................
>     (snip)
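Just to make the classification above explicit: the first byte of the UDP payload in these dumps is the totemsrp message type, matching the message_type enum you quote further down (0x00 for the tokens, 0x01 for the multicast in the OK case). A throwaway sketch in case it helps when sorting a larger capture; the helper below is mine, not anything shipped with corosync:

    #!/usr/bin/env python
    # Classify a totem packet by the first byte of its UDP payload,
    # using the values from totemsrp's message_type enum.
    MESSAGE_TYPES = {
        0: "MESSAGE_TYPE_ORF_TOKEN",
        1: "MESSAGE_TYPE_MCAST",
        2: "MESSAGE_TYPE_MEMB_MERGE_DETECT",
        3: "MESSAGE_TYPE_MEMB_JOIN",
        4: "MESSAGE_TYPE_MEMB_COMMIT_TOKEN",
        5: "MESSAGE_TYPE_TOKEN_HOLD_CANCEL",
    }

    def classify(data_hex):
        """data_hex: the 'Data' bytes as shown by Wireshark, e.g. '00 00 22 ff ...'."""
        first = int(data_hex.split()[0], 16)
        return MESSAGE_TYPES.get(first, "unknown (0x%02x)" % first)

    print(classify("00 00 22 ff ac 1b 04 01"))  # -> MESSAGE_TYPE_ORF_TOKEN
    print(classify("01 02 22 ff ac 1b 04 02"))  # -> MESSAGE_TYPE_MCAST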
>
>   + pacemaker.log
>     Mar 10 14:20:09 testserver001 crmd: [7551]: info: crm_timer_popped: Election Trigger (I_DC_TIMEOUT) just popped!
>     Mar 10 14:20:09 testserver001 crmd: [7551]: WARN: do_log: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
>     Mar 10 14:20:09 testserver001 crmd: [7551]: info: do_state_transition: State transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT cause=C_TIMER_POPPED origin=crm_timer_popped ]
>     Mar 10 14:22:09 testserver001 crmd: [7551]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped!
>     Mar 10 14:22:09 testserver001 crmd: [7551]: info: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED origin=crm_timer_popped ]
>     Mar 10 14:22:09 testserver001 crmd: [7551]: info: do_te_control: Registering TE UUID: b2bb3cc4-cead-475c-bb73-3adbb60142ae
>     Mar 10 14:22:09 testserver001 crmd: [7551]: WARN: cib_client_add_notify_callback: Callback already present
>     Mar 10 14:22:09 testserver001 crmd: [7551]: info: set_graph_functions: Setting custom graph functions
>     Mar 10 14:22:09 testserver001 crmd: [7551]: info: unpack_graph: Unpacked transition -1: 0 actions in 0 synapses
>     Mar 10 14:22:09 testserver001 crmd: [7551]: info: do_dc_takeover: Taking over DC status for this partition
>     Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_readwrite: We are now in R/W mode
>     Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_master for section 'all' (origin=local/crmd/6, version=0.143.0): ok (rc=0)
>     Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/7, version=0.143.0): ok (rc=0)
>     Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/9, version=0.143.0): ok (rc=0)
>     Mar 10 14:22:09 testserver001 crmd: [7551]: info: do_dc_join_offer_all: join-1: Waiting on 1 outstanding join acks
>     Mar 10 14:22:09 testserver001 crmd: [7551]: info: ais_dispatch: Membership 516: quorum still lost
>     Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/11, version=0.143.0): ok (rc=0)
>     Mar 10 14:22:09 testserver001 crmd: [7551]: info: crm_ais_dispatch: Setting expected votes to 3
>     Mar 10 14:22:09 testserver001 crmd: [7551]: info: config_query_callback: Checking for expired actions every 900000ms
>     Mar 10 14:22:09 testserver001 crmd: [7551]: info: config_query_callback: Sending expected-votes=3 to corosync
>     Mar 10 14:22:09 testserver001 crmd: [7551]: info: ais_dispatch: Membership 516: quorum still lost
>     Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/14, version=0.143.0): ok (rc=0)
>     Mar 10 14:22:09 testserver001 crmd: [7551]: info: crm_ais_dispatch: Setting expected votes to 3
>     Mar 10 14:22:09 testserver001 crmd: [7551]: info: te_connect_stonith: Attempting connection to fencing daemon...
>     Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/16, version=0.143.0): ok (rc=0)
>     Mar 10 14:22:10 testserver001 crmd: [7551]: info: te_connect_stonith: Connected
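The DC timeout and the repeated "Membership 516: quorum still lost" above say the crmd on testserver001 never heard from the existing DC, which would be consistent with corosync on that node forming a membership of its own. The corosync log for the same time window should show whether totem ever formed a membership containing the other two nodes; roughly (the file name depends on your logging{} configuration, so adjust the path):

    grep -iE "new membership|unable to form a cluster" /var/log/cluster/corosync.log

If totem keeps complaining that it is unable to form a cluster, or only ever forms single-node memberships, this is the corosync/network side of the split rather than pacemaker.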
>
>   + enum message_type {
>         MESSAGE_TYPE_ORF_TOKEN = 0,          /* Ordering, Reliability, Flow (ORF) control Token */
>         MESSAGE_TYPE_MCAST = 1,              /* ring ordered multicast message */
>         MESSAGE_TYPE_MEMB_MERGE_DETECT = 2,  /* merge rings if there are available rings */
>         MESSAGE_TYPE_MEMB_JOIN = 3,          /* membership join message */
>         MESSAGE_TYPE_MEMB_COMMIT_TOKEN = 4,  /* membership commit token */
>         MESSAGE_TYPE_TOKEN_HOLD_CANCEL = 5,  /* cancel the holding of the token */
>     };
>
> - Packages on CentOS 5.6
>   + pacemaker-1.0.10-1.4.el5
>   + corosync-1.2.5-1.3.el5
>
> Thank you in advance,
> Hisashi Osanai
>
> Hisashi Osanai (osanai.hisa...@jp.fujitsu.com)

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org