Andrew: Thanks for your answer!
--
Lic. Gabriel Gomiz - Jefe de Sistemas / Administrador
ggo...@cooperativaobrera.coop
Gerencia de Sistemas - Cooperativa Obrera Ltda.
Tel: (0291) 403-9700

On 03/17/2014 09:39 PM, Andrew Beekhof wrote:
> On 14 Mar 2014, at 10:32 pm, Gabriel Gomiz <ggo...@cooperativaobrera.coop> wrote:
>
>> Hi to all!
>>
>> We have a 4-node cluster and recently experienced a weird issue with Pacemaker that resulted in
>> three database instance resources being duplicated (running simultaneously on 2 nodes) and
>> subsequent data corruption.
>>
>> I've been investigating the logs and could not arrive at a conclusion as to why that happened, so
>> I'm writing to the list with details of the event to see if someone can help me pinpoint whether
>> there was some problem with our operation or whether we hit some bug.
>>
>> OS: CentOS 6.4
>> Pacemaker version: 1.1.8
>
> 6.4 should have 1.1.10 available by now.

Yes, we are planning an upgrade to CentOS 6.5 and 1.1.10, but before that we wanted to investigate
this issue.

> - So, we decided to delete the operation monitor from the resource that caused the timeout (an LVM
> Volume Group) to avoid having that same problem again. We did 'crm configure edit' and deleted the
> two op monitor entries, and then, just after committing that modification to the CRM, we experienced
> the resource duplication problem:
>
> Mar 8 09:13:52 lorien cib[9929]: notice: cib:diff: Diff: --- 0.526.3
> Mar 8 09:13:52 lorien cib[9929]: notice: cib:diff: Diff: +++ 0.527.1 ebf608fdccda78e755a61cdeea18a659
> Mar 8 09:13:52 lorien cib[9929]: notice: cib:diff: -- <op id="vg_ifx_oltp-monitor-240" interval="240" name="monitor" timeout="120" />
> Mar 8 09:13:52 lorien cib[9929]: notice: cib:diff: -- <op id="vg_ifx_dss-monitor-240" interval="240" name="monitor" timeout="120" />
> Mar 8 09:13:52 lorien cib[9929]: notice: cib:diff: ++ <cib epoch="527" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.7" update-origin="mordor.san01.cooperativaobrera.coop" update-client="crmd" cib-last-written="Sat Mar 8 08:58:40 2014" have-quorum="1" dc-uuid="lorien.san01.cooperativaobrera.coop" />
> Mar 8 09:13:52 lorien crmd[9934]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Mar 8 09:13:52 lorien crmd[9934]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
> Mar 8 09:13:52 lorien crmd[9934]: notice: do_dc_takeover: Marking gandalf.san01.cooperativaobrera.coop, target of a previous stonith action, as clean
>
> ^^^ This is significant.
>
> The problem is triggered by crmsh performing a replace operation - which causes an election.
> When the election is completed, the DC checks the stonith_cleanup_list (which evidently contains
> old data) and erases gandalf's status section (causing everyone to think it's safely offline).
>
> Do you see anything like:
>
>> Mar 8 08:43:22 mordor crmd[26216]: notice: tengine_stonith_notify: Target may have been our
>> leader gandalf.san01.cooperativaobrera.coop (recorded: <unset>)
>
> on lorien?

Yes, at 08:43 we have:

Mar 8 08:43:21 lorien corosync[9135]: [QUORUM] Members[3]: 2 3 4
Mar 8 08:43:21 lorien corosync[9135]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar 8 08:43:21 lorien crmd[9934]: notice: crm_update_peer_state: cman_event_callback: Node gandalf.san01.cooperativaobrera.coop[1] - state is now lost
Mar 8 08:43:21 lorien crmd[9934]: warning: check_dead_member: Our DC node (gandalf.san01.cooperativaobrera.coop) left the cluster
Mar 8 08:43:21 lorien kernel: dlm: closing connection to node 1
Mar 8 08:43:21 lorien corosync[9135]: [CPG ] chosen downlist: sender r(0) ip(172.16.1.1) r(1) ip(172.16.2.1) ; members(old:4 left:1)
Mar 8 08:43:21 lorien corosync[9135]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 8 08:43:21 lorien fenced[9186]: fencing deferred to isildur.san01.cooperativaobrera.coop
Mar 8 08:43:22 lorien crmd[9934]: notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=check_dead_member ]
Mar 8 08:43:22 lorien crmd[9934]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
Mar 8 08:43:22 lorien stonith-ng[9930]: notice: remote_op_done: Operation reboot of gandalf.san01.cooperativaobrera.coop by mordor.san01.cooperativaobrera.coop for crmd.31...@gandalf.san01.cooperativaobrera.coop.10d27664: OK
Mar 8 08:43:22 lorien cib[9929]: warning: cib_process_diff: Diff 0.513.82 -> 0.513.83 from mordor.san01.cooperativaobrera.coop not applied to 0.513.84: current "num_updates" is greater than required
Mar 8 08:43:22 lorien cib[9929]: warning: cib_process_diff: Diff 0.513.83 -> 0.513.84 from mordor.san01.cooperativaobrera.coop not applied to 0.513.84: current "num_updates" is greater than required
Mar 8 08:43:22 lorien cib[9929]: notice: cib_process_diff: Diff 0.513.84 -> 0.513.85 from mordor.san01.cooperativaobrera.coop not applied to 0.513.84: Failed application of an update diff
Mar 8 08:43:22 lorien cib[9929]: warning: cib_server_process_diff: Not requesting full refresh in R/W mode
Mar 8 08:43:22 lorien cib[9929]: warning: cib_diff_notify: Update (client: crmd, call:256): 0.513.84 -> 0.513.85 (Application of an update diff failed)
Mar 8 08:43:22 lorien crmd[9934]: notice: tengine_stonith_notify: Peer gandalf.san01.cooperativaobrera.coop was terminated (st_notify_fence) by mordor.san01.cooperativaobrera.coop for gandalf.san01.cooperativaobrera.coop: OK (ref=10d27664-33ed-43e0-a5bd-7d0ef850eb05) by client crmd.31561
Mar 8 08:43:22 lorien crmd[9934]: notice: tengine_stonith_notify: Notified CMAN that 'gandalf.san01.cooperativaobrera.coop' is now fenced
Mar 8 08:43:22 lorien crmd[9934]: notice: tengine_stonith_notify: Target may have been our leader gandalf.san01.cooperativaobrera.coop (recorded: <unset>)

> Strange that there was no DC at the time gandalf was fenced.
> Do you have a log file configured too? Those have more detail than syslog.

Yes, we have a corosync log file with this, for example:

Mar 08 08:43:21 corosync [QUORUM] Members[3]: 2 3 4
Mar 08 08:43:21 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar 08 08:43:21 [9934] lorien.cooperativaobrera.com.ar crmd: info: cman_event_callback: Membership 5732: quorum retained
Mar 08 08:43:21 [9934] lorien.cooperativaobrera.com.ar crmd: notice: crm_update_peer_state: cman_event_callback: Node gandalf.san01.cooperativaobrera.coop[1] - state is now lost
Mar 08 08:43:21 [9934] lorien.cooperativaobrera.com.ar crmd: info: peer_update_callback: gandalf.san01.cooperativaobrera.coop is now lost (was member)
Mar 08 08:43:21 [9934] lorien.cooperativaobrera.com.ar crmd: warning: check_dead_member: Our DC node (gandalf.san01.cooperativaobrera.coop) left the cluster
Mar 08 08:43:21 [9934] lorien.cooperativaobrera.com.ar crmd: info: send_ais_text: Peer overloaded or membership in flux: Re-sending message (Attempt 1 of 20)

Maybe this is significant: 'Our DC node (gandalf.san01.cooperativaobrera.coop) left the cluster' ... ?

Please tell me if you need more details.

> I have also made the following commit which may also help:
> https://github.com/beekhof/pacemaker/commit/cef2446
>
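For reference, the change we made with 'crm configure edit' was to delete the two monitor operations
shown in the CIB diff above. A rough sketch of how those primitives would look in crmsh follows; the
resource agent and parameters here are only illustrative, the only lines we are sure of are the two
"op monitor" ones, which are the lines we removed before committing:

# Sketch, not our literal configuration: presumably two LVM volume group
# primitives named after the op ids in the diff (vg_ifx_oltp, vg_ifx_dss).
primitive vg_ifx_oltp ocf:heartbeat:LVM \
        params volgrpname="vg_ifx_oltp" \
        op monitor interval="240" timeout="120"
primitive vg_ifx_dss ocf:heartbeat:LVM \
        params volgrpname="vg_ifx_dss" \
        op monitor interval="240" timeout="120"

Removing the two "op monitor ..." lines and committing makes crmsh push the edit as a replace
operation, which - as Andrew explains above - is what triggered the DC election on our cluster.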