On 06/26/2012 07:48 PM, Marcus Bointon wrote:
> I've just realised I have a classic split-brain with my CRM setup. I'm
> running pacemaker 1.1.6-2ubuntu0~ppa2 (installed from
> ubuntu-ha-maintainers-ppa-lucid) and heartbeat 1:3.0.5-3ubuntu0~ppa1 on
> Ubuntu Lucid. I have 3 IPaddr2, 3 SendArp and 3 MailTo resources set up
> on two servers (front ends running haproxy). This was all working fine,
> but I checked crm_mon today and found that each node shows the other as
> offline, and they are both publishing the same floating IPs
> simultaneously! Weirdly, everything still seems to be working!
>
> I can't see any reason for this - it was working fine previously and the
> config has not changed: servers are up and running, firewall ports are
> open (each node allows UDP on port 694 from the other machine). crm_mon
> shows this:
Use STONITH to prevent resources running on both nodes ... have you
configured redundant cluster communication paths?

> ============
> Last updated: Tue Jun 26 16:28:23 2012
> Last change: Tue Mar 27 22:19:17 2012
> Stack: Heartbeat
> Current DC: proxy1.example.com (68890308-615b-4b28-bb8b-5aa00bdbf65c) -
> partition with quorum
> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
> 2 Nodes configured, 1 expected votes
> 10 Resources configured.
> ============
>
> Online: [ proxy1.example.com ]
> OFFLINE: [ proxy2.example.com ]
>
> Resource Group: proxyfloat
>     ip1      (ocf::heartbeat:IPaddr2): Started proxy1.example.com
>     ip1arp   (ocf::heartbeat:SendArp): Started proxy1.example.com
>     ip1email (ocf::heartbeat:MailTo):  Started proxy1.example.com
> Resource Group: proxyfloat2
>     ip2      (ocf::heartbeat:IPaddr2): Started proxy1.example.com
>     ip2arp   (ocf::heartbeat:SendArp): Started proxy1.example.com
>     ip2email (ocf::heartbeat:MailTo):  Started proxy1.example.com
> Resource Group: proxyfloat3
>     ip3      (ocf::heartbeat:IPaddr2): Started proxy1.example.com
>     ip3arp   (ocf::heartbeat:SendArp): Started proxy1.example.com
>     ip3email (ocf::heartbeat:MailTo):  Started proxy1.example.com
>
> ============
> Last updated: Tue Jun 26 16:28:09 2012
> Last change: Tue Mar 27 22:19:17 2012
> Stack: Heartbeat
> Current DC: proxy2.example.com (30a5636b-26f6-4c31-9ea7-d4fb912ee624) -
> partition with quorum
> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
> 2 Nodes configured, 1 expected votes
> 10 Resources configured.
> ============
>
> Online: [ proxy2.example.com ]
> OFFLINE: [ proxy1.example.com ]
>
> Resource Group: proxyfloat
>     ip1      (ocf::heartbeat:IPaddr2): Started proxy2.example.com
>     ip1arp   (ocf::heartbeat:SendArp): Started proxy2.example.com
>     ip1email (ocf::heartbeat:MailTo):  Started proxy2.example.com
> Resource Group: proxyfloat2
>     ip2      (ocf::heartbeat:IPaddr2): Started proxy2.example.com
>     ip2arp   (ocf::heartbeat:SendArp): Started proxy2.example.com
>     ip2email (ocf::heartbeat:MailTo):  Started proxy2.example.com
> Resource Group: proxyfloat3
>     ip3      (ocf::heartbeat:IPaddr2): Started proxy2.example.com
>     ip3arp   (ocf::heartbeat:SendArp): Started proxy2.example.com
>     ip3email (ocf::heartbeat:MailTo):  Started proxy2.example.com
>
> Both servers are logging this sequence every 10 minutes or so:
>
> Jun 26 06:44:18 proxy1 crmd: [3205]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)

That is the cluster-recheck-interval; it pops every 15 minutes by default.

> Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
> Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
> Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_pe_invoke: Query 1746: Requesting the current CIB: S_POLICY_ENGINE
> Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_pe_invoke_callback: Invoking the PE: query=1746, ref=pe_calc-dc-1340693058-1731, seq=3, quorate=1
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: unpack_rsc_op: Operation ip2arp_last_failure_0 found resource ip2arp active on proxy1.example.com
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: unpack_rsc_op: Operation ip1arp_last_failure_0 found resource ip1arp active on proxy1.example.com
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: unpack_rsc_op: Operation ip3_last_failure_0 found resource ip3 active on proxy1.example.com
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: unpack_rsc_op: Operation ip3arp_last_failure_0 found resource ip3arp active on proxy1.example.com
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave email_alert#011(Stopped)
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip1#011(Started proxy1.example.com)
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip1arp#011(Started proxy1.example.com)
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip1email#011(Started proxy1.example.com)
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip2#011(Started proxy1.example.com)
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip2arp#011(Started proxy1.example.com)
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip2email#011(Started proxy1.example.com)
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip3#011(Started proxy1.example.com)
> Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip3arp#011(Started proxy1.example.com)
> Jun 26 06:44:18 proxy1 crmd: [3205]: info: unpack_graph: Unpacked transition 1653: 0 actions in 0 synapses
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: LogActions: Leave ip3email#011(Started proxy1.example.com)
> Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_te_invoke: Processing graph 1653 (ref=pe_calc-dc-1340693058-1731) derived from /var/lib/pengine/pe-input-35.bz2
> Jun 26 06:44:18 proxy1 pengine: [3207]: notice: process_pe_message: Transition 1653: PEngine Input stored in: /var/lib/pengine/pe-input-35.bz2
> Jun 26 06:44:18 proxy1 crmd: [3205]: info: run_graph: ====================================================
> Jun 26 06:44:18 proxy1 crmd: [3205]: notice: run_graph: Transition 1653 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-35.bz2): Complete
> Jun 26 06:44:18 proxy1 crmd: [3205]: info: te_graph_trigger: Transition 1653 is now complete
> Jun 26 06:44:18 proxy1 crmd: [3205]: info: notify_crmd: Transition 1653 status: done - <null>
> Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Jun 26 06:44:18 proxy1 crmd: [3205]: info: do_state_transition: Starting PEngine Recheck Timer
>
> How can I diagnose why they are not talking to each other?

With heartbeat you can use the "cl_status" command with its various options
to check Heartbeat's view of the cluster ... and Heartbeat's log messages
from the split-brain event should also give you some hints.

Regards,
Andreas

--
Need help with Pacemaker? http://www.hastexo.com/now

> Marcus
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
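[For context, a minimal diagnostic sketch along the lines Andreas suggests.
The node and interface names are taken from the thread or assumed; run the
mirror-image commands on the other node. These require a live Heartbeat
cluster, so they are illustrative only.]

```shell
# Is the local heartbeat daemon up at all?
cl_status hbstatus

# Which nodes does heartbeat know about, and how does it currently
# see the peer (active / dead)?
cl_status listnodes
cl_status nodestatus proxy2.example.com

# List the heartbeat links configured towards the peer, then check one
# of them ("eth0" here is whatever listhblinks reported):
cl_status listhblinks proxy2.example.com
cl_status hblinkstatus proxy2.example.com eth0

# Independently verify that UDP 694 heartbeat traffic actually arrives
# from the peer (tcpdump assumed to be installed):
tcpdump -ni eth0 udp port 694
```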
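[And a sketch of the two fixes Andreas hints at, assuming heartbeat reads
/etc/ha.d/ha.cf and the crm shell is available; the second interface and
peer address are made-up examples, and the STONITH device configuration
itself is omitted.]

```shell
# /etc/ha.d/ha.cf -- give heartbeat a second, independent path to the
# peer so one failed NIC/switch cannot split the cluster:
#     bcast eth0
#     ucast eth1 10.0.1.2

# Enable fencing, so a node that loses contact with its peer is shot
# rather than left free to start the floating IPs a second time:
crm configure property stonith-enabled=true
```

Note the logs already show "On loss of CCM Quorum: Ignore" - in a 2-node
cluster quorum cannot arbitrate a split, which is exactly why fencing is
the only thing standing between you and duplicate IPs.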
