Hi all,

No, we have no crossover cable between the nodes. The two nodes are linked by a switched network, and this works really fine except in the case mentioned.
It's rather easy to reproduce:

1. Activate a 2-node cluster.
2. Disconnect the network connection (e.g. by disconnecting the network adapter on the VM).
3. Wait until both nodes are active and each acts as DC.
4. Reconnect the nodes.
5. A new DC is elected.

When step 4 hits an unsteady network, step 5 is sometimes never reached and both nodes stay DC, even though they are still sending and receiving heartbeat status messages. The situation can be simulated by repeatedly disconnecting and reconnecting (a sketch of one way to script this follows below).

Versions: pacemaker 1.1.6 (and 1.0.5), heartbeat 3.0.7 (and 3.0.0)

Are there any ideas? Or does anybody have hints on how I could localize or debug this problem?
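For reference, here is a minimal sketch of one way to script the repeated disconnect/reconnect on one node. The interface name (eth1), the cycle count, and the timings are only placeholders for our lab setup; it assumes root privileges and the iproute2 "ip" tool:

    #!/usr/bin/env python3
    # flap_link.py - toggle the cluster interface down/up at irregular
    # intervals to simulate an unsteady network while the nodes reconnect.
    import random
    import subprocess
    import time

    IFACE = "eth1"   # placeholder: interface carrying the heartbeat traffic
    CYCLES = 20      # number of disconnect/reconnect cycles

    for cycle in range(1, CYCLES + 1):
        # take the link down; if it stays down longer than deadtime,
        # both nodes may end up acting as DC
        subprocess.run(["ip", "link", "set", IFACE, "down"], check=True)
        time.sleep(random.uniform(5, 60))
        # bring it back for only a short, "unsteady" window before the next cut
        subprocess.run(["ip", "link", "set", IFACE, "up"], check=True)
        time.sleep(random.uniform(1, 10))
        print("cycle %d done" % cycle)

Between cycles one can check on both nodes whether each still claims to be the DC, e.g. with "crm_mon -1" or "crmadmin -D".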
Thanks!

Kind regards
Roman

The logs are a little different every time, but here are two examples.

*** pacemaker 1.0.5 ***
Mar 8 07:05:50 LAB19 heartbeat: [2979]: CRIT: Cluster node lab13 returning after partition.
Mar 8 07:05:50 LAB19 heartbeat: [2979]: info: For information on cluster partitions, See URL: http://linux-ha.org/SplitBrain
Mar 8 07:05:50 LAB19 heartbeat: [2979]: WARN: Deadtime value may be too small.
Mar 8 07:05:50 LAB19 heartbeat: [2979]: info: See FAQ for information on tuning deadtime.
Mar 8 07:05:50 LAB19 heartbeat: [2979]: info: URL: http://linux-ha.org/FAQ#heavy_load
Mar 8 07:05:50 LAB19 heartbeat: [2979]: info: Link lab13:eth0 up.
Mar 8 07:05:50 LAB19 heartbeat: [2979]: WARN: Late heartbeat: Node lab13: interval 244130 ms
Mar 8 07:05:50 LAB19 heartbeat: [2979]: info: Status update for node lab13: status active
Mar 8 07:05:50 LAB19 crmd: [3083]: notice: crmd_ha_status_callback: Status update: Node lab13 now has status [active] (DC=true)
Mar 8 07:05:50 LAB19 crmd: [3083]: info: crm_update_peer_proc: lab13.ais is now online
Mar 8 07:05:50 LAB19 cib: [3079]: WARN: cib_peer_callback: Discarding cib_apply_diff message (14a33) from lab13: not in our membership
Mar 8 07:05:50 LAB19 heartbeat: [2979]: info: all clients are now paused
Mar 8 07:05:51 LAB19 ccm: [3078]: info: Break tie for 2 nodes cluster
Mar 8 07:05:51 LAB19 crmd: [3083]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
Mar 8 07:05:51 LAB19 crmd: [3083]: info: mem_handle_event: no mbr_track info
Mar 8 07:05:51 LAB19 crmd: [3083]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
Mar 8 07:05:51 LAB19 crmd: [3083]: info: mem_handle_event: instance=15, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
Mar 8 07:05:51 LAB19 crmd: [3083]: info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=15)
Mar 8 07:05:51 LAB19 crmd: [3083]: info: ccm_event_detail: NEW MEMBERSHIP: trans=15, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3
Mar 8 07:05:51 LAB19 crmd: [3083]: info: ccm_event_detail: CURRENT: lab19 [nodeid=1, born=15]
Mar 8 07:05:51 LAB19 crmd: [3083]: info: populate_cib_nodes_ha: Requesting the list of configured nodes
Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: no mbr_track info
Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: instance=15, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
Mar 8 07:05:51 LAB19 cib: [3079]: info: cib_ccm_msg_callback: Processing CCM event=NEW MEMBERSHIP (id=15)
Mar 8 07:05:51 LAB19 heartbeat: [2979]: info: all clients are now resumed
Mar 8 07:05:51 LAB19 ccm: [3078]: info: Break tie for 2 nodes cluster
Mar 8 07:05:51 LAB19 ccm: [3078]: info: Break tie for 2 nodes cluster
Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: no mbr_track info
Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: instance=16, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
Mar 8 07:05:51 LAB19 cib: [3079]: info: cib_ccm_msg_callback: Processing CCM event=NEW MEMBERSHIP (id=16)
Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: no mbr_track info
Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: instance=17, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
Mar 8 07:05:51 LAB19 cib: [3079]: info: cib_ccm_msg_callback: Processing CCM event=NEW MEMBERSHIP (id=17)
Mar 8 07:05:52 LAB19 crmd: [3083]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
Mar 8 07:05:52 LAB19 crmd: [3083]: info: mem_handle_event: no mbr_track info
Mar 8 07:05:52 LAB19 crmd: [3083]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
Mar 8 07:05:52 LAB19 crmd: [3083]: info: mem_handle_event: instance=16, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
:

*** pacemaker 1.1.6 ***
Mar 16 01:35:28 oan1 heartbeat: [7556]: CRIT: Cluster node oan3 returning after partition.
Mar 16 01:35:28 oan1 heartbeat: [7556]: info: For information on cluster partitions, See URL: http://linux-ha.org/wiki/Split_Brain
Mar 16 01:35:28 oan1 heartbeat: [7556]: WARN: Deadtime value may be too small.
Mar 16 01:35:28 oan1 heartbeat: [7556]: info: See FAQ for information on tuning deadtime.
Mar 16 01:35:28 oan1 heartbeat: [7556]: info: URL: http://linux-ha.org/wiki/FAQ#Heavy_Load
Mar 16 01:35:28 oan1 heartbeat: [7556]: info: Link oan3:eth0 up.
Mar 16 01:35:28 oan1 heartbeat: [7556]: WARN: Late heartbeat: Node oan3: interval 73300 ms
Mar 16 01:35:28 oan1 heartbeat: [7556]: info: Status update for node oan3: status active
Mar 16 01:35:28 oan1 crmd: [7601]: notice: crmd_ha_status_callback: Status update: Node oan3 now has status [active]
Mar 16 01:35:28 oan1 crmd: [7601]: info: crm_update_peer_proc: oan3.ais is now online
Mar 16 01:35:28 oan1 crmd: [7601]: WARN: match_down_event: No match for shutdown action on a9dae67b-b98f-498b-ad1b-950fc73c2ab2
Mar 16 01:35:28 oan1 crmd: [7601]: info: te_update_diff: Stonith/shutdown of a9dae67b-b98f-498b-ad1b-950fc73c2ab2 not matched
Mar 16 01:35:28 oan1 crmd: [7601]: info: abort_transition_graph: te_update_diff:215 - Triggered transition abort (complete=1, tag=node_state, id=a9dae67b-b98f-498b-ad1b-950fc73c2ab2, magic=NA, cib=0.210.50) : Node failure
Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
Mar 16 01:35:28 oan1 crmd: [7601]: info: do_pe_invoke: Query 204: Requesting the current CIB: S_POLICY_ENGINE
Mar 16 01:35:28 oan1 crmd: [7601]: info: do_pe_invoke_callback: Invoking the PE: query=204, ref=pe_calc-dc-1331858128-123, seq=13, quorate=1
Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave rsc_clustercontrol (Started oan1)
Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave rscIpaddr2 (Started oan1)
Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave rsc_asterisk (Started oan1)
Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave rsc_tomcat5 (Started oan1)
Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave rsc_ntp_refresh (Started oan1)
Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Mar 16 01:35:28 oan1 crmd: [7601]: info: unpack_graph: Unpacked transition 16: 0 actions in 0 synapses
Mar 16 01:35:28 oan1 crmd: [7601]: info: do_te_invoke: Processing graph 16 (ref=pe_calc-dc-1331858128-123) derived from /var/lib/pengine/pe-input-201.bz2
Mar 16 01:35:28 oan1 pengine: [17673]: notice: process_pe_message: Transition 16: PEngine Input stored in: /var/lib/pengine/pe-input-201.bz2
Mar 16 01:35:28 oan1 crmd: [7601]: info: run_graph: ====================================================
Mar 16 01:35:28 oan1 crmd: [7601]: notice: run_graph: Transition 16 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-201.bz2): Complete
Mar 16 01:35:28 oan1 crmd: [7601]: info: te_graph_trigger: Transition 16 is now complete
Mar 16 01:35:28 oan1 crmd: [7601]: info: notify_crmd: Transition 16 status: done - <null>
Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: Starting PEngine Recheck Timer
Mar 16 01:35:30 oan1 cib: [7597]: info: cib_process_diff: Diff 0.210.56 -> 0.210.57 not applied to 0.210.50: current "num_updates" is less than required
Mar 16 01:35:30 oan1 cib: [7597]: WARN: cib_server_process_diff: Not requesting full refresh in R/W mode
Mar 16 01:35:30 oan1 ccm: [7596]: info: Break tie for 2 nodes cluster
Mar 16 01:35:30 oan1 cib: [7597]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
Mar 16 01:35:30 oan1 cib: [7597]: info: mem_handle_event: no mbr_track info
Mar 16 01:35:30 oan1 cib: [7597]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
Mar 16 01:35:30 oan1 cib: [7597]: info: mem_handle_event: instance=14, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
Mar 16 01:35:30 oan1 cib: [7597]: info: cib_ccm_msg_callback: Processing CCM event=NEW MEMBERSHIP (id=14)
Mar 16 01:35:30 oan1 crmd: [7601]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
Mar 16 01:35:30 oan1 crmd: [7601]: info: mem_handle_event: no mbr_track info
Mar 16 01:35:30 oan1 crmd: [7601]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
Mar 16 01:35:30 oan1 crmd: [7601]: info: mem_handle_event: instance=14, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
Mar 16 01:35:30 oan1 crmd: [7601]: info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=14)
Mar 16 01:35:30 oan1 crmd: [7601]: info: ccm_event_detail: NEW MEMBERSHIP: trans=14, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3
Mar 16 01:35:30 oan1 crmd: [7601]: info: ccm_event_detail: CURRENT: oan1 [nodeid=0, born=14]
Mar 16 01:35:30 oan1 crmd: [7601]: info: populate_cib_nodes_ha: Requesting the list of configured nodes
Mar 16 01:35:31 oan1 ccm: [7596]: info: Break tie for 2 nodes cluster
Mar 16 01:35:31 oan1 crmd: [7601]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
Mar 16 01:35:31 oan1 crmd: [7601]: info: mem_handle_event: no mbr_track info
Mar 16 01:35:31 oan1 crmd: [7601]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
Mar 16 01:35:31 oan1 crmd: [7601]: info: mem_handle_event: instance=15, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
Mar 16 01:35:31 oan1 cib: [7597]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
Mar 16 01:35:31 oan1 cib: [7597]: info: mem_handle_event: no mbr_track info
Mar 16 01:35:31 oan1 cib: [7597]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
Mar 16 01:35:31 oan1 cib: [7597]: info: mem_handle_event: instance=15, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
Mar 16 01:35:31 oan1 cib: [7597]: info: cib_ccm_msg_callback: Processing CCM event=NEW MEMBERSHIP (id=15)
Mar 16 01:35:31 oan1 crmd: [7601]: info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=15)
Mar 16 01:35:31 oan1 crmd: [7601]: info: ccm_event_detail: NEW MEMBERSHIP: trans=15, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3
Mar 16 01:35:31 oan1 crmd: [7601]: info: ccm_event_detail: CURRENT: oan1 [nodeid=0, born=15]
Mar 16 01:35:31 oan1 cib: [7597]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/205, version=0.210.51): ok (rc=0)
Mar 16 01:35:31 oan1 crmd: [7601]: info: populate_cib_nodes_ha: Requesting the list of configured nodes