On Wed, Mar 21, 2012 at 8:46 AM, Lars Ellenberg <lars.ellenb...@linbit.com> wrote:
> On Sun, Mar 18, 2012 at 11:53:11PM +0000, Roman Sidler wrote:
>> Hi all
>> No, we have no crossover cable between the nodes.
>> The 2 nodes are linked by a switched network, and this works really fine
>> except in the mentioned case.
>>
>> It's rather easy to reproduce.
>> 1. Activate a 2-node cluster
>> 2. Disconnect the network connection (e.g. by disconnecting the network
>>    adapter on a VM)
>> 3. Wait until both nodes are active and each acts as DC
>> 4. Reconnect the nodes
>> 5. A new DC is elected
>>
>> When step 4 encounters an unsteady network, sometimes step 5 is never
>> reached and both nodes stay DC. They are sending and receiving heartbeat
>> status messages.
>>
>> The situation may be simulated by repeatedly disconnecting and reconnecting.
>>
>> Versions:
>> pacemaker 1.1.6 (and 1.0.5)
>> heartbeat 3.0.7 (and 3.0.0)
>
> There is no heartbeat 3.0.7 (yet), so I guess you meant 2.0.7.
> And no doubt that has a few problems.
>
> You are supposed to use the latest heartbeat (now: 3.0.5)
> http://hg.linux-ha.org/heartbeat-STABLE_3_0/
>
> "Unsteady network": I suppose that means at least packet loss.
> There have been long-standing bugs with retransmits and packet loss
> affecting pacemaker, which I only fixed in heartbeat 3.0.5.
>
> There is one more problem in that area I'm aware of, which is much less
> likely to trigger, but if it does, you'll know: heartbeat will start
> spinning with high CPU load, and not even notice that a peer node is dead.
>
> It was introduced in 2005 or so, is a glib mainloop callback
> priority inversion involving timeout-based events, only triggers
> sometimes, may even need special config parameters in ha.cf to trigger
> at all (and needs massive packet loss ...). It is fixed as well,
> but as I only had one complaint that remotely matched these
> symptoms, the fix is still not in the repos.
>
> Hopefully we find some time to clean that up
> and release it with heartbeat 3.0.6.
>
> As we finally noticed that some things still packaged with heartbeat
> actually belong in glue, we seem to have to cut a release anyway, "soon".
>
> BTW, I've seen similar behaviour with corosync as well.
> In fact, for me, this exact scenario (a node returning after having been
> declared dead) typically does NOT work with pacemaker on corosync,
> while it always worked with pacemaker on heartbeat...
I believe things are much improved with more recent releases.
Hence the recommendation of 1.4.x :-)

> Still, Andrew is right: for new clusters, corosync is the way to go.
>
> Not so long ago, I posted a corosync vs. heartbeat summary
> from my current perspective:
> http://permalink.gmane.org/gmane.linux.highavailability.user/36903
>
>> *** pacemaker 1.0.5 ***
>
> That's not the most recent, either.
> IIRC, there have been some fixes in pacemaker as well,
> in the area of rejoining after a partition.
>
> But really: you need to fix your "unsteady" network,
> and you probably should implement stonith.
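On the stonith point: for a two-node setup like the one in the logs below, a
minimal fencing configuration via the crm shell could look roughly like the
sketch that follows. This is only a sketch -- the ssh stonith plugin is for
testing only, the hostnames are simply the ones from the 1.0.5 logs, and a
real deployment would use a proper power/IPMI fencing device instead:

    # enable fencing cluster-wide
    crm configure property stonith-enabled="true"
    # a two-node cluster cannot retain quorum after losing its peer
    crm configure property no-quorum-policy="ignore"
    # testing-only fencing device; replace with a real plugin in production
    crm configure primitive st-ssh stonith:ssh params hostlist="lab13 lab19"
    crm configure clone fencing st-ssh

And to make the "unsteady network" part of the test reproducible rather than
relying on flapping a virtual NIC, netem can inject the packet loss
deterministically on the heartbeat interface (eth0 here is just the interface
named in the logs):

    tc qdisc add dev eth0 root netem loss 30%   # drop ~30% of packets
    tc qdisc del dev eth0 root                  # restore the link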
>> Mar 8 07:05:50 LAB19 heartbeat: [2979]: CRIT: Cluster node lab13 returning after partition.
>> Mar 8 07:05:50 LAB19 heartbeat: [2979]: info: For information on cluster partitions, See URL: http://linux-ha.org/SplitBrain
>> Mar 8 07:05:50 LAB19 heartbeat: [2979]: WARN: Deadtime value may be too small.
>> Mar 8 07:05:50 LAB19 heartbeat: [2979]: info: See FAQ for information on tuning deadtime.
>> Mar 8 07:05:50 LAB19 heartbeat: [2979]: info: URL: http://linux-ha.org/FAQ#heavy_load
>> Mar 8 07:05:50 LAB19 heartbeat: [2979]: info: Link lab13:eth0 up.
>> Mar 8 07:05:50 LAB19 heartbeat: [2979]: WARN: Late heartbeat: Node lab13: interval 244130 ms
>> Mar 8 07:05:50 LAB19 heartbeat: [2979]: info: Status update for node lab13: status active
>> Mar 8 07:05:50 LAB19 crmd: [3083]: notice: crmd_ha_status_callback: Status update: Node lab13 now has status [active] (DC=true)
>> Mar 8 07:05:50 LAB19 crmd: [3083]: info: crm_update_peer_proc: lab13.ais is now online
>> Mar 8 07:05:50 LAB19 cib: [3079]: WARN: cib_peer_callback: Discarding cib_apply_diff message (14a33) from lab13: not in our membership
>> Mar 8 07:05:50 LAB19 heartbeat: [2979]: info: all clients are now paused
>> Mar 8 07:05:51 LAB19 ccm: [3078]: info: Break tie for 2 nodes cluster
>> Mar 8 07:05:51 LAB19 crmd: [3083]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
>> Mar 8 07:05:51 LAB19 crmd: [3083]: info: mem_handle_event: no mbr_track info
>> Mar 8 07:05:51 LAB19 crmd: [3083]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
>> Mar 8 07:05:51 LAB19 crmd: [3083]: info: mem_handle_event: instance=15, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
>> Mar 8 07:05:51 LAB19 crmd: [3083]: info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=15)
>> Mar 8 07:05:51 LAB19 crmd: [3083]: info: ccm_event_detail: NEW MEMBERSHIP: trans=15, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3
>> Mar 8 07:05:51 LAB19 crmd: [3083]: info: ccm_event_detail: CURRENT: lab19 [nodeid=1, born=15]
>> Mar 8 07:05:51 LAB19 crmd: [3083]: info: populate_cib_nodes_ha: Requesting the list of configured nodes
>> Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
>> Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: no mbr_track info
>> Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
>> Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: instance=15, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
>> Mar 8 07:05:51 LAB19 cib: [3079]: info: cib_ccm_msg_callback: Processing CCM event=NEW MEMBERSHIP (id=15)
>> Mar 8 07:05:51 LAB19 heartbeat: [2979]: info: all clients are now resumed
>> Mar 8 07:05:51 LAB19 ccm: [3078]: info: Break tie for 2 nodes cluster
>> Mar 8 07:05:51 LAB19 ccm: [3078]: info: Break tie for 2 nodes cluster
>> Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
>> Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: no mbr_track info
>> Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
>> Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: instance=16, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
>> Mar 8 07:05:51 LAB19 cib: [3079]: info: cib_ccm_msg_callback: Processing CCM event=NEW MEMBERSHIP (id=16)
>> Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
>> Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: no mbr_track info
>> Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
>> Mar 8 07:05:51 LAB19 cib: [3079]: info: mem_handle_event: instance=17, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
>> Mar 8 07:05:51 LAB19 cib: [3079]: info: cib_ccm_msg_callback: Processing CCM event=NEW MEMBERSHIP (id=17)
>> Mar 8 07:05:52 LAB19 crmd: [3083]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
>> Mar 8 07:05:52 LAB19 crmd: [3083]: info: mem_handle_event: no mbr_track info
>> Mar 8 07:05:52 LAB19 crmd: [3083]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
>> Mar 8 07:05:52 LAB19 crmd: [3083]: info: mem_handle_event: instance=16, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
>> :
>>
>> *** pacemaker 1.1.6 ***
>> Mar 16 01:35:28 oan1 heartbeat: [7556]: CRIT: Cluster node oan3 returning after partition.
>> Mar 16 01:35:28 oan1 heartbeat: [7556]: info: For information on cluster partitions, See URL: http://linux-ha.org/wiki/Split_Brain
>> Mar 16 01:35:28 oan1 heartbeat: [7556]: WARN: Deadtime value may be too small.
>> Mar 16 01:35:28 oan1 heartbeat: [7556]: info: See FAQ for information on tuning deadtime.
>> Mar 16 01:35:28 oan1 heartbeat: [7556]: info: URL: http://linux-ha.org/wiki/FAQ#Heavy_Load
>> Mar 16 01:35:28 oan1 heartbeat: [7556]: info: Link oan3:eth0 up.
>> Mar 16 01:35:28 oan1 heartbeat: [7556]: WARN: Late heartbeat: Node oan3: interval 73300 ms
>> Mar 16 01:35:28 oan1 heartbeat: [7556]: info: Status update for node oan3: status active
>> Mar 16 01:35:28 oan1 crmd: [7601]: notice: crmd_ha_status_callback: Status update: Node oan3 now has status [active]
>> Mar 16 01:35:28 oan1 crmd: [7601]: info: crm_update_peer_proc: oan3.ais is now online
>> Mar 16 01:35:28 oan1 crmd: [7601]: WARN: match_down_event: No match for shutdown action on a9dae67b-b98f-498b-ad1b-950fc73c2ab2
>> Mar 16 01:35:28 oan1 crmd: [7601]: info: te_update_diff: Stonith/shutdown of a9dae67b-b98f-498b-ad1b-950fc73c2ab2 not matched
>> Mar 16 01:35:28 oan1 crmd: [7601]: info: abort_transition_graph: te_update_diff:215 - Triggered transition abort (complete=1, tag=node_state, id=a9dae67b-b98f-498b-ad1b-950fc73c2ab2, magic=NA, cib=0.210.50) : Node failure
>> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
>> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
>> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_pe_invoke: Query 204: Requesting the current CIB: S_POLICY_ENGINE
>> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_pe_invoke_callback: Invoking the PE: query=204, ref=pe_calc-dc-1331858128-123, seq=13, quorate=1
>> Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave rsc_clustercontrol (Started oan1)
>> Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave rscIpaddr2 (Started oan1)
>> Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave rsc_asterisk (Started oan1)
>> Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave rsc_tomcat5 (Started oan1)
>> Mar 16 01:35:28 oan1 pengine: [17673]: notice: LogActions: Leave rsc_ntp_refresh (Started oan1)
>> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
>> Mar 16 01:35:28 oan1 crmd: [7601]: info: unpack_graph: Unpacked transition 16: 0 actions in 0 synapses
>> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_te_invoke: Processing graph 16 (ref=pe_calc-dc-1331858128-123) derived from /var/lib/pengine/pe-input-201.bz2
>> Mar 16 01:35:28 oan1 pengine: [17673]: notice: process_pe_message: Transition 16: PEngine Input stored in: /var/lib/pengine/pe-input-201.bz2
>> Mar 16 01:35:28 oan1 crmd: [7601]: info: run_graph: ====================================================
>> Mar 16 01:35:28 oan1 crmd: [7601]: notice: run_graph: Transition 16 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-201.bz2): Complete
>> Mar 16 01:35:28 oan1 crmd: [7601]: info: te_graph_trigger: Transition 16 is now complete
>> Mar 16 01:35:28 oan1 crmd: [7601]: info: notify_crmd: Transition 16 status: done - <null>
>> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
>> Mar 16 01:35:28 oan1 crmd: [7601]: info: do_state_transition: Starting PEngine Recheck Timer
>> Mar 16 01:35:30 oan1 cib: [7597]: info: cib_process_diff: Diff 0.210.56 -> 0.210.57 not applied to 0.210.50: current "num_updates" is less than required
>> Mar 16 01:35:30 oan1 cib: [7597]: WARN: cib_server_process_diff: Not requesting full refresh in R/W mode
>> Mar 16 01:35:30 oan1 ccm: [7596]: info: Break tie for 2 nodes cluster
>> Mar 16 01:35:30 oan1 cib: [7597]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
>> Mar 16 01:35:30 oan1 cib: [7597]: info: mem_handle_event: no mbr_track info
>> Mar 16 01:35:30 oan1 cib: [7597]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
>> Mar 16 01:35:30 oan1 cib: [7597]: info: mem_handle_event: instance=14, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
>> Mar 16 01:35:30 oan1 cib: [7597]: info: cib_ccm_msg_callback: Processing CCM event=NEW MEMBERSHIP (id=14)
>> Mar 16 01:35:30 oan1 crmd: [7601]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
>> Mar 16 01:35:30 oan1 crmd: [7601]: info: mem_handle_event: no mbr_track info
>> Mar 16 01:35:30 oan1 crmd: [7601]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
>> Mar 16 01:35:30 oan1 crmd: [7601]: info: mem_handle_event: instance=14, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
>> Mar 16 01:35:30 oan1 crmd: [7601]: info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=14)
>> Mar 16 01:35:30 oan1 crmd: [7601]: info: ccm_event_detail: NEW MEMBERSHIP: trans=14, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3
>> Mar 16 01:35:30 oan1 crmd: [7601]: info: ccm_event_detail: CURRENT: oan1 [nodeid=0, born=14]
>> Mar 16 01:35:30 oan1 crmd: [7601]: info: populate_cib_nodes_ha: Requesting the list of configured nodes
>> Mar 16 01:35:31 oan1 ccm: [7596]: info: Break tie for 2 nodes cluster
>> Mar 16 01:35:31 oan1 crmd: [7601]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
>> Mar 16 01:35:31 oan1 crmd: [7601]: info: mem_handle_event: no mbr_track info
>> Mar 16 01:35:31 oan1 crmd: [7601]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
>> Mar 16 01:35:31 oan1 crmd: [7601]: info: mem_handle_event: instance=15, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
>> Mar 16 01:35:31 oan1 cib: [7597]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
>> Mar 16 01:35:31 oan1 cib: [7597]: info: mem_handle_event: no mbr_track info
>> Mar 16 01:35:31 oan1 cib: [7597]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
>> Mar 16 01:35:31 oan1 cib: [7597]: info: mem_handle_event: instance=15, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
>> Mar 16 01:35:31 oan1 cib: [7597]: info: cib_ccm_msg_callback: Processing CCM event=NEW MEMBERSHIP (id=15)
>> Mar 16 01:35:31 oan1 crmd: [7601]: info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=15)
>> Mar 16 01:35:31 oan1 crmd: [7601]: info: ccm_event_detail: NEW MEMBERSHIP: trans=15, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3
>> Mar 16 01:35:31 oan1 crmd: [7601]: info: ccm_event_detail: CURRENT: oan1 [nodeid=0, born=15]
>> Mar 16 01:35:31 oan1 cib: [7597]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/205, version=0.210.51): ok (rc=0)
>> Mar 16 01:35:31 oan1 crmd: [7601]: info: populate_cib_nodes_ha: Requesting the list of configured nodes
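One detail worth noting from both logs: heartbeat itself warns "Deadtime value
may be too small" and reports late heartbeats of 244130 ms and 73300 ms. The
real fix is the flaky link, but on a lossy network it may also be worth
revisiting the timer directives in ha.cf. The values below are purely
illustrative, not a recommendation for this particular setup:

    # /etc/ha.d/ha.cf -- timer-related directives only, example values
    keepalive 2       # send a heartbeat every 2 seconds
    warntime 10       # warn about late heartbeats after 10 seconds
    deadtime 30       # declare a peer dead only after 30 seconds of silence
    initdead 120      # allow extra time for the first heartbeat after boot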
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org