On Mon, Mar 19, 2012 at 10:53 AM, Roman Sidler <roman.sid...@sidler-se.net> wrote:
> Hi all
> No, we have no crossover cable between the nodes.
> The 2 nodes are linked by a switched network, and this works really fine
> except in the case mentioned.
>
> It's rather easy to reproduce:
> 1. Activate a 2-node cluster
> 2. Disconnect the network connection (e.g. by disconnecting the network adapter on the VM)
> 3. Wait until both nodes are active and each acts as DC
> 4. Reconnect the nodes
> 5. A new DC is elected
>
> When step 4 encounters an unsteady network, sometimes step 5 is never reached
> and both nodes stay DC. They are sending and receiving heartbeat status messages.
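A note on reproducing step 4: instead of pulling the virtual adapter by hand, the "unsteady network" can be approximated by flapping the heartbeat interface in a loop. A minimal sketch, assuming a Linux node, root privileges, the "ip" tool, and eth0 as the cluster link; the interface name, cycle count and timings are placeholders, not taken from the report above:

    #!/usr/bin/env python
    # Rough sketch: approximate an "unsteady network" by repeatedly taking the
    # heartbeat interface down and bringing it back up after short, random pauses.
    # Assumptions (placeholders): run as root, "ip" available, eth0 is the cluster link.
    import random
    import subprocess
    import time

    IFACE = "eth0"   # placeholder: the interface heartbeat uses on this node
    CYCLES = 20      # number of disconnect/reconnect rounds

    for _ in range(CYCLES):
        subprocess.call(["ip", "link", "set", IFACE, "down"])
        time.sleep(random.uniform(0.5, 5.0))   # stay disconnected for a while
        subprocess.call(["ip", "link", "set", IFACE, "up"])
        time.sleep(random.uniform(0.5, 5.0))   # give the link only a short, shaky recovery

Running something like this on one node while watching the logs on both may make the stuck-membership window described above easier to hit.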
The CCM (part of heartbeat) is known not to be particularly awesome in such scenarios, and that appears to be the case again here:

Mar 16 01:35:31 oan1 crmd: [7601]: info: ccm_event_detail: NEW MEMBERSHIP: trans=15, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3

As far as pacemaker knows, it's still all alone.

Perhaps give corosync 1.4.x a go instead of heartbeat?
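One way to confirm the stuck state from the logs is to watch the member count on the CCM "NEW MEMBERSHIP" events on each node: after the link comes back, a successful rejoin should eventually report nodes=2, whereas in the stuck case both nodes keep logging nodes=1 memberships like the one above. A small sketch along those lines, assuming the messages land in /var/log/messages (the path is a placeholder) and match the exact "ccm_event_detail: NEW MEMBERSHIP: ..." format quoted in this thread:

    #!/usr/bin/env python
    # Rough sketch: print the node count of every CCM "NEW MEMBERSHIP" event in a
    # heartbeat/pacemaker log, to check whether this node ever sees its peer again
    # after the link is reconnected.
    import re
    import sys

    logfile = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"  # placeholder path
    pattern = re.compile(r"ccm_event_detail: NEW MEMBERSHIP: trans=(\d+), nodes=(\d+)")

    with open(logfile) as f:
        for line in f:
            match = pattern.search(line)
            if match:
                trans, nodes = match.groups()
                note = "  <-- still alone" if nodes == "1" else ""
                print("membership trans=%s: %s node(s)%s" % (trans, nodes, note))

If neither node ever reports a two-node membership after the reconnect, you are looking at the same CCM behaviour as in the logs below.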
>
> The situation may be simulated by some repeated disconnects/reconnects.
>
> Versions:
> pacemaker 1.1.6 (and 1.0.5)
> heartbeat 3.0.7 (and 3.0.0)
>
> Are there any ideas?
> Or does somebody have any hints on how I could localize or debug this problem?
>
> Thanks!
> Kind regards
> Roman
>
>
> The log output is a little different every time; here are 2 examples.
>
> *** pacemaker 1.0.5 ***
> Mar 8 07:05:50 LAB19 heartbeat: [2979]: CRIT: Cluster node lab13 returning after partition.
> Mar 8 07:05:50 LAB19 heartbeat: [2979]: info: For information on cluster partitions, See URL: http://linux-ha.org/SplitBrain
> Mar 8 07:05:50 LAB19 heartbeat: [2979]: WARN: Deadtime value may be too small.
> Mar 8 07:05:50 LAB19 heartbeat: [2979]: info: See FAQ for information on tuning deadtime.
> Mar 8 07:05:50 LAB19 heartbeat: [2979]: info: URL: http://linux-ha.org/FAQ#heavy_load
> Mar 8 07:05:50 LAB19 heartbeat: [2979]: info: Link lab13:eth0 up.
> Mar 8 07:05:50 LAB19 heartbeat: [2979]: WARN: Late heartbeat: Node lab13: interval 244130 ms
> Mar 8 07:05:50 LAB19 heartbeat: [2979]: info: Status update for node lab13: status active
> Mar 8 07:05:50 LAB19 crmd: [3083]: notice: crmd_ha_status_callback: Status update: Node lab13 now has status [active] (DC=true)
> Mar 8 07:05:50 LAB19 crmd: [3083]: info: crm_update_peer_proc: lab13.ais is now online
> Mar 8 07:05:50 LAB19 cib: [3079]: WARN: cib_peer_callback: Discarding cib_apply_diff message (14a33) from lab13: not in our membership
> Mar 8 07:05:50 LAB19 heartbeat: [2979]: info: all clients are now paused
> Mar 8 07:05:51 LAB19 ccm: [3078]: info: Break tie for 2 nodes cluster
> Mar 8 07:05:51 LAB19 crmd: [3083]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
> Mar 8 07:05:51 LAB19 crmd: [3083]: info: mem_handle_event: no mbr_track info
> Mar 8 07:05:51 LAB19 crmd: [3083]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
> Mar 8 07:05:51 LAB19 crmd: [3083]: info: mem_handle_event: instance=15, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
> Mar 8 07:05:51 LAB19 crmd: [3083]: info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=15)
> Mar 8 07:05:51 LAB19 crmd: [3083]: info: ccm_event_detail: NEW MEMBERSHIP: trans=15, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3
> Mar 8 07:05:51 LAB19 crmd: [3083]: info: ccm_event_detail: CURRENT: lab19 [nodeid=1, born=15]
> Mar 8 07:05:51 LAB19 crmd: [3083]: info: populate_cib_nodes_ha: Requesting the list of configured nodes
> Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
> Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: no mbr_track info
> Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
> Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: instance=15, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
> Mar 8 07:05:51 LAB19 cib: [3079]: info: cib_ccm_msg_callback: Processing CCM event=NEW MEMBERSHIP (id=15)
> Mar 8 07:05:51 LAB19 heartbeat: [2979]: info: all clients are now resumed
> Mar 8 07:05:51 LAB19 ccm: [3078]: info: Break tie for 2 nodes cluster
> Mar 8 07:05:51 LAB19 ccm: [3078]: info: Break tie for 2 nodes cluster
> Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
> Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: no mbr_track info
> Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
> Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: instance=16, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
> Mar 8 07:05:51 LAB19 cib: [3079]: info: cib_ccm_msg_callback: Processing CCM event=NEW MEMBERSHIP (id=16)
> Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
> Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: no mbr_track info
> Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
> Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: instance=17, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
> Mar 8 07:05:51 LAB19 cib: [3079]: info: cib_ccm_msg_callback: Processing CCM event=NEW MEMBERSHIP (id=17)
> Mar 8 07:05:52 LAB19 crmd: [3083]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
> Mar 8 07:05:52 LAB19 crmd: [3083]: info: mem_handle_event: no mbr_track info
> Mar 8 07:05:52 LAB19 crmd: [3083]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
> Mar 8 07:05:52 LAB19 crmd: [3083]: info: mem_handle_event: instance=16, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
> :
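A side note on the repeated "Deadtime value may be too small" warnings above: the very large "Late heartbeat" intervals (244130 ms here, 73300 ms in the second example) simply reflect how long the link was down, so during the partition the warning is expected. If it also appears under normal load, the timers in /etc/ha.d/ha.cf are worth a look; an illustrative excerpt with placeholder values, not taken from the reported setup:

    # /etc/ha.d/ha.cf -- illustrative values only
    keepalive 1     # send a heartbeat every second
    warntime  10    # warn when a heartbeat is this many seconds late
    deadtime  30    # only declare the peer dead after 30s of silence
    initdead  60    # allow extra time while nodes are still booting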
>
> *** pacemaker 1.1.6 ***
> Mar 16 01:35:28 oan1 heartbeat: [7556]: CRIT: Cluster node oan3 returning after partition.
> Mar 16 01:35:28 oan1 heartbeat: [7556]: info: For information on cluster partitions, See URL: http://linux-ha.org/wiki/Split_Brain
> Mar 16 01:35:28 oan1 heartbeat: [7556]: WARN: Deadtime value may be too small.
> Mar 16 01:35:28 oan1 heartbeat: [7556]: info: See FAQ for information on tuning deadtime.
> Mar 16 01:35:28 oan1 heartbeat: [7556]: info: URL: http://linux-ha.org/wiki/FAQ#Heavy_Load
> Mar 16 01:35:28 oan1 heartbeat: [7556]: info: Link oan3:eth0 up.
> Mar 16 01:35:28 oan1 heartbeat: [7556]: WARN: Late heartbeat: Node oan3: interval 73300 ms
> Mar 16 01:35:28 oan1 heartbeat: [7556]: info: Status update for node oan3: status active
> Mar 16 01:35:28 oan1 crmd: [7601]: notice: crmd_ha_status_callback: Status update: Node oan3 now has status [active]
> Mar 16 01:35:28 oan1 crmd: [7601]: info: crm_update_peer_proc: oan3.ais is now online
> Mar 16 01:35:28 oan1 crmd: [7601]: WARN: match_down_event: No match for shutdown action on a9dae67b-b98f-498b-ad1b-950fc73c2ab2
> Mar 16 01:35:28 oan1 crmd: [7601]: info: te_update_diff: Stonith/shutdown of a9dae67b-b98f-498b-ad1b-950fc73c2ab2 not matched
> Mar 16 01:35:28 oan1 crmd: [7601]: info: abort_transition_graph: te_update_diff:215 - Triggered transition abort (complete=1, tag=node_state, id=a9dae67b-b98f-498b-ad1b-950fc73c2ab2, magic=NA, cib=0.210.50) : Node failure
> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_pe_invoke: Query 204: Requesting the current CIB: S_POLICY_ENGINE
> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_pe_invoke_callback: Invoking the PE: query=204, ref=pe_calc-dc-1331858128-123, seq=13, quorate=1
> Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave rsc_clustercontrol (Started oan1)
> Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave rscIpaddr2 (Started oan1)
> Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave rsc_asterisk (Started oan1)
> Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave rsc_tomcat5 (Started oan1)
> Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave rsc_ntp_refresh (Started oan1)
> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Mar 16 01:35:28 oan1 crmd: [7601]: info: unpack_graph: Unpacked transition 16: 0 actions in 0 synapses
> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_te_invoke: Processing graph 16 (ref=pe_calc-dc-1331858128-123) derived from /var/lib/pengine/pe-input-201.bz2
> Mar 16 01:35:28 oan1 pengine: [17673]: notice: process_pe_message: Transition 16: PEngine Input stored in: /var/lib/pengine/pe-input-201.bz2
> Mar 16 01:35:28 oan1 crmd: [7601]: info: run_graph: ====================================================
> Mar 16 01:35:28 oan1 crmd: [7601]: notice: run_graph: Transition 16 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-201.bz2): Complete
> Mar 16 01:35:28 oan1 crmd: [7601]: info: te_graph_trigger: Transition 16 is now complete
> Mar 16 01:35:28 oan1 crmd: [7601]: info: notify_crmd: Transition 16 status: done - <null>
> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: Starting PEngine Recheck Timer
> Mar 16 01:35:30 oan1 cib: [7597]: info: cib_process_diff: Diff 0.210.56 -> 0.210.57 not applied to 0.210.50: current "num_updates" is less than required
> Mar 16 01:35:30 oan1 cib: [7597]: WARN: cib_server_process_diff: Not requesting full refresh in R/W mode
> Mar 16 01:35:30 oan1 ccm: [7596]: info: Break tie for 2 nodes cluster
> Mar 16 01:35:30 oan1 cib: [7597]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
> Mar 16 01:35:30 oan1 cib: [7597]: info: mem_handle_event: no mbr_track info
> Mar 16 01:35:30 oan1 cib: [7597]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
> Mar 16 01:35:30 oan1 cib: [7597]: info: mem_handle_event: instance=14, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
> Mar 16 01:35:30 oan1 cib: [7597]: info: cib_ccm_msg_callback: Processing CCM event=NEW MEMBERSHIP (id=14)
> Mar 16 01:35:30 oan1 crmd: [7601]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
> Mar 16 01:35:30 oan1 crmd: [7601]: info: mem_handle_event: no mbr_track info
> Mar 16 01:35:30 oan1 crmd: [7601]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
> Mar 16 01:35:30 oan1 crmd: [7601]: info: mem_handle_event: instance=14, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
> Mar 16 01:35:30 oan1 crmd: [7601]: info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=14)
> Mar 16 01:35:30 oan1 crmd: [7601]: info: ccm_event_detail: NEW MEMBERSHIP: trans=14, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3
> Mar 16 01:35:30 oan1 crmd: [7601]: info: ccm_event_detail: CURRENT: oan1 [nodeid=0, born=14]
> Mar 16 01:35:30 oan1 crmd: [7601]: info: populate_cib_nodes_ha: Requesting the list of configured nodes
> Mar 16 01:35:31 oan1 ccm: [7596]: info: Break tie for 2 nodes cluster
> Mar 16 01:35:31 oan1 crmd: [7601]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
> Mar 16 01:35:31 oan1 crmd: [7601]: info: mem_handle_event: no mbr_track info
> Mar 16 01:35:31 oan1 crmd: [7601]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
> Mar 16 01:35:31 oan1 crmd: [7601]: info: mem_handle_event: instance=15, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
> Mar 16 01:35:31 oan1 cib: [7597]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
> Mar 16 01:35:31 oan1 cib: [7597]: info: mem_handle_event: no mbr_track info
> Mar 16 01:35:31 oan1 cib: [7597]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
> Mar 16 01:35:31 oan1 cib: [7597]: info: mem_handle_event: instance=15, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
> Mar 16 01:35:31 oan1 cib: [7597]: info: cib_ccm_msg_callback: Processing CCM event=NEW MEMBERSHIP (id=15)
> Mar 16 01:35:31 oan1 crmd: [7601]: info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=15)
> Mar 16 01:35:31 oan1 crmd: [7601]: info: ccm_event_detail: NEW MEMBERSHIP: trans=15, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3
> Mar 16 01:35:31 oan1 crmd: [7601]: info: ccm_event_detail: CURRENT: oan1 [nodeid=0, born=15]
> Mar 16 01:35:31 oan1 cib: [7597]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/205, version=0.210.51): ok (rc=0)
> Mar 16 01:35:31 oan1 crmd: [7601]: info: populate_cib_nodes_ha: Requesting the list of configured nodes

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org