On 06/26/2012 07:48 PM, Marcus Bointon wrote:
> I've just realised I have a classic split-brain with my CRM setup. I'm
> running pacemaker 1.1.6-2ubuntu0~ppa2 (installed from
> ubuntu-ha-maintainers-ppa-lucid) and heartbeat 1:3.0.5-3ubuntu0~ppa1 on
> Ubuntu Lucid. I have 3 IPaddr2, 3 SendArp and 3 MailTo resources set up
> on two servers (front ends running haproxy). This was all working fine,
> but I checked crm_mon today and found that each node shows the other as
> offline, and they are both publishing the same floating IPs
> simultaneously! Weirdly, everything still seems to be working!
>
> I can't see any reason for this - it was working fine previously and the
> config has not changed: servers are up and running, firewall ports are
> open (each node allows UDP on port 694 from the other machine). crm_mon
> shows this:
Use STONITH to prevent resources running on both nodes ... have you
configured redundant cluster communication paths?

> ============
> Last updated: Tue Jun 26 16:28:23 2012
> Last change: Tue Mar 27 22:19:17 2012
> Stack: Heartbeat
> Current DC: proxy1.example.com (68890308-615b-4b28-bb8b-5aa00bdbf65c) -
> partition with quorum
> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
> 2 Nodes configured, 1 expected votes
> 10 Resources configured.
> ============
>
> Online: [ proxy1.example.com ]
> OFFLINE: [ proxy2.example.com ]
>
> Resource Group: proxyfloat
>     ip1      (ocf::heartbeat:IPaddr2): Started proxy1.example.com
>     ip1arp   (ocf::heartbeat:SendArp): Started proxy1.example.com
>     ip1email (ocf::heartbeat:MailTo):  Started proxy1.example.com
> Resource Group: proxyfloat2
>     ip2      (ocf::heartbeat:IPaddr2): Started proxy1.example.com
>     ip2arp   (ocf::heartbeat:SendArp): Started proxy1.example.com
>     ip2email (ocf::heartbeat:MailTo):  Started proxy1.example.com
> Resource Group: proxyfloat3
>     ip3      (ocf::heartbeat:IPaddr2): Started proxy1.example.com
>     ip3arp   (ocf::heartbeat:SendArp): Started proxy1.example.com
>     ip3email (ocf::heartbeat:MailTo):  Started proxy1.example.com
>
> ============
> Last updated: Tue Jun 26 16:28:09 2012
> Last change: Tue Mar 27 22:19:17 2012
> Stack: Heartbeat
> Current DC: proxy2.example.com (30a5636b-26f6-4c31-9ea7-d4fb912ee624) -
> partition with quorum
> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
> 2 Nodes configured, 1 expected votes
> 10 Resources configured.
> ============
>
> Online: [ proxy2.example.com ]
> OFFLINE: [ proxy1.example.com ]
>
> Resource Group: proxyfloat
>     ip1      (ocf::heartbeat:IPaddr2): Started proxy2.example.com
>     ip1arp   (ocf::heartbeat:SendArp): Started proxy2.example.com
>     ip1email (ocf::heartbeat:MailTo):  Started proxy2.example.com
> Resource Group: proxyfloat2
>     ip2      (ocf::heartbeat:IPaddr2): Started proxy2.example.com
>     ip2arp   (ocf::heartbeat:SendArp): Started proxy2.example.com
>     ip2email (ocf::heartbeat:MailTo):  Started proxy2.example.com
> Resource Group: proxyfloat3
>     ip3      (ocf::heartbeat:IPaddr2): Started proxy2.example.com
>     ip3arp   (ocf::heartbeat:SendArp): Started proxy2.example.com
>     ip3email (ocf::heartbeat:MailTo):  Started proxy2.example.com
>
> Both servers are logging this sequence every 10 minutes or so:
>
> Jun 26 06:44:18 proxy1 crmd: [3205]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)

That is the cluster-recheck-interval; it pops every 15 minutes by default.

> Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
> Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
> Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_pe_invoke: Query 1746: Requesting the current CIB: S_POLICY_ENGINE
> Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_pe_invoke_callback: Invoking the PE: query=1746, ref=pe_calc-dc-1340693058-1731, seq=3, quorate=1
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: unpack_rsc_op: Operation ip2arp_last_failure_0 found resource ip2arp active on proxy1.example.com
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: unpack_rsc_op: Operation ip1arp_last_failure_0 found resource ip1arp active on proxy1.example.com
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: unpack_rsc_op: Operation ip3_last_failure_0 found resource ip3 active on proxy1.example.com
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: unpack_rsc_op: Operation ip3arp_last_failure_0 found resource ip3arp active on proxy1.example.com
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave email_alert#011(Stopped)
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip1#011(Started proxy1.example.com)
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip1arp#011(Started proxy1.example.com)
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip1email#011(Started proxy1.example.com)
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip2#011(Started proxy1.example.com)
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip2arp#011(Started proxy1.example.com)
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip2email#011(Started proxy1.example.com)
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip3#011(Started proxy1.example.com)
> Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip3arp#011(Started proxy1.example.com)
> Jun 26 06:44:18 proxy1 crmd: [3205]: info: unpack_graph: Unpacked transition 1653: 0 actions in 0 synapses
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip3email#011(Started proxy1.example.com)
> Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_te_invoke: Processing graph 1653 (ref=pe_calc-dc-1340693058-1731) derived from /var/lib/pengine/pe-input-35.bz2
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: process_pe_message: Transition 1653: PEngine Input stored in: /var/lib/pengine/pe-input-35.bz2
> Jun 26 06:44:18 proxy1 crmd: [3205]: info: run_graph: ====================================================
> Jun 26 06:44:18 proxy1 crmd: [3205]: notice: run_graph: Transition 1653 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-35.bz2): Complete
> Jun 26 06:44:18 proxy1 crmd: [3205]: info: te_graph_trigger: Transition 1653 is now complete
> Jun 26 06:44:18 proxy1 crmd: [3205]: info: notify_crmd: Transition 1653 status: done - <null>
> Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_state_transition: Starting PEngine Recheck Timer
>
> How can I diagnose why they are not talking to each other?

With heartbeat you can use the "cl_status" command with its various options
to check Heartbeat's view of the cluster ... and Heartbeat's log messages
from the split-brain event should also give you some hints.

Regards,
Andreas

--
Need help with Pacemaker? http://www.hastexo.com/now

> Marcus
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
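[For context, a minimal diagnostic sketch along the lines Andreas suggests.
The node and interface names are taken from the thread or assumed; run the
mirror-image commands on the other node. These require a live Heartbeat
cluster, so they are illustrative only.]

```shell
# Is the local heartbeat daemon up at all?
cl_status hbstatus

# Which nodes does heartbeat know about, and how does it currently
# see the peer (active / dead)?
cl_status listnodes
cl_status nodestatus proxy2.example.com

# List the heartbeat links configured towards the peer, then check one
# of them ("eth0" here is whatever listhblinks reported):
cl_status listhblinks proxy2.example.com
cl_status hblinkstatus proxy2.example.com eth0

# Independently verify that UDP 694 heartbeat traffic actually arrives
# from the peer (tcpdump assumed to be installed):
tcpdump -ni eth0 udp port 694
```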
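[And a sketch of the two fixes Andreas hints at, assuming heartbeat reads
/etc/ha.d/ha.cf and the crm shell is available; the second interface and
peer address are made-up examples, and the STONITH device configuration
itself is omitted.]

```shell
# /etc/ha.d/ha.cf -- give heartbeat a second, independent path to the
# peer so one failed NIC/switch cannot split the cluster:
#     bcast eth0
#     ucast eth1 10.0.1.2

# Enable fencing, so a node that loses contact with its peer is shot
# rather than left free to start the floating IPs a second time:
crm configure property stonith-enabled=true
```

Note the logs already show "On loss of CCM Quorum: Ignore" - in a 2-node
cluster quorum cannot arbitrate a split, which is exactly why fencing is
the only thing standing between you and duplicate IPs.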
