Andrew: Thanks for your answer!
--
Lic. Gabriel Gomiz - Jefe de Sistemas / Administrador
ggo...@cooperativaobrera.coop
Gerencia de Sistemas - Cooperativa Obrera Ltda.
Tel: (0291) 403-9700

On 03/17/2014 09:39 PM, Andrew Beekhof wrote:
> On 14 Mar 2014, at 10:32 pm, Gabriel Gomiz <ggo...@cooperativaobrera.coop> wrote:
>
>> Hi to all!
>>
>> We have a 4-node cluster and recently experienced a weird issue with Pacemaker that resulted in
>> three database instance resources being duplicated (running simultaneously on 2 nodes) and
>> subsequent data corruption.
>>
>> I've been investigating the logs and could not arrive at a conclusion as to why that happened, so
>> I'm writing to the list with details of the event to see if someone can help me pinpoint whether
>> there was some problem with our operation or whether we hit some bug.
>>
>> OS: CentOS 6.4
>> Pacemaker version: 1.1.8
>
> 6.4 should have 1.1.10 available by now.

Yes, we are planning an upgrade to CentOS 6.5 and 1.1.10, but before that we wanted to investigate
this issue.

> - So, we decided to delete the operation monitor from the resource that caused the timeout (an LVM
> Volume Group) to avoid having that same problem again. We did 'crm configure edit' and deleted the
> two op monitor entries, and then, just after committing that modification to the CRM, we experienced
> the resource duplication problem:
>
> Mar 8 09:13:52 lorien cib[9929]: notice: cib:diff: Diff: --- 0.526.3
> Mar 8 09:13:52 lorien cib[9929]: notice: cib:diff: Diff: +++ 0.527.1 ebf608fdccda78e755a61cdeea18a659
> Mar 8 09:13:52 lorien cib[9929]: notice: cib:diff: -- <op id="vg_ifx_oltp-monitor-240" interval="240" name="monitor" timeout="120" />
> Mar 8 09:13:52 lorien cib[9929]: notice: cib:diff: -- <op id="vg_ifx_dss-monitor-240" interval="240" name="monitor" timeout="120" />
> Mar 8 09:13:52 lorien cib[9929]: notice: cib:diff: ++ <cib epoch="527" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.7" update-origin="mordor.san01.cooperativaobrera.coop" update-client="crmd" cib-last-written="Sat Mar 8 08:58:40 2014" have-quorum="1" dc-uuid="lorien.san01.cooperativaobrera.coop" />
> Mar 8 09:13:52 lorien crmd[9934]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Mar 8 09:13:52 lorien crmd[9934]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
> Mar 8 09:13:52 lorien crmd[9934]: notice: do_dc_takeover: Marking gandalf.san01.cooperativaobrera.coop, target of a previous stonith action, as clean
>
> ^^^ This is significant.
>
> The problem is triggered by crmsh performing a replace operation - which causes an election.
> When the election is completed, the DC checks the stonith_cleanup_list (which evidently contains
> old data) and erases gandalf's status section (causing everyone to think it's safely offline).
>
> Do you see anything like:
>
>> Mar 8 08:43:22 mordor crmd[26216]: notice: tengine_stonith_notify: Target may have been our
>> leader gandalf.san01.cooperativaobrera.coop (recorded: <unset>)
>
> on lorien?

Yes, at 08:43 we have:

Mar 8 08:43:21 lorien corosync[9135]: [QUORUM] Members[3]: 2 3 4
Mar 8 08:43:21 lorien corosync[9135]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar 8 08:43:21 lorien crmd[9934]: notice: crm_update_peer_state: cman_event_callback: Node gandalf.san01.cooperativaobrera.coop[1] - state is now lost
Mar 8 08:43:21 lorien crmd[9934]: warning: check_dead_member: Our DC node (gandalf.san01.cooperativaobrera.coop) left the cluster
Mar 8 08:43:21 lorien kernel: dlm: closing connection to node 1
Mar 8 08:43:21 lorien corosync[9135]: [CPG ] chosen downlist: sender r(0) ip(172.16.1.1) r(1) ip(172.16.2.1) ; members(old:4 left:1)
Mar 8 08:43:21 lorien corosync[9135]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 8 08:43:21 lorien fenced[9186]: fencing deferred to isildur.san01.cooperativaobrera.coop
Mar 8 08:43:22 lorien crmd[9934]: notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=check_dead_member ]
Mar 8 08:43:22 lorien crmd[9934]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
Mar 8 08:43:22 lorien stonith-ng[9930]: notice: remote_op_done: Operation reboot of gandalf.san01.cooperativaobrera.coop by mordor.san01.cooperativaobrera.coop for crmd.31...@gandalf.san01.cooperativaobrera.coop.10d27664: OK
Mar 8 08:43:22 lorien cib[9929]: warning: cib_process_diff: Diff 0.513.82 -> 0.513.83 from mordor.san01.cooperativaobrera.coop not applied to 0.513.84: current "num_updates" is greater than required
Mar 8 08:43:22 lorien cib[9929]: warning: cib_process_diff: Diff 0.513.83 -> 0.513.84 from mordor.san01.cooperativaobrera.coop not applied to 0.513.84: current "num_updates" is greater than required
Mar 8 08:43:22 lorien cib[9929]: notice: cib_process_diff: Diff 0.513.84 -> 0.513.85 from mordor.san01.cooperativaobrera.coop not applied to 0.513.84: Failed application of an update diff
Mar 8 08:43:22 lorien cib[9929]: warning: cib_server_process_diff: Not requesting full refresh in R/W mode
Mar 8 08:43:22 lorien cib[9929]: warning: cib_diff_notify: Update (client: crmd, call:256): 0.513.84 -> 0.513.85 (Application of an update diff failed)
Mar 8 08:43:22 lorien crmd[9934]: notice: tengine_stonith_notify: Peer gandalf.san01.cooperativaobrera.coop was terminated (st_notify_fence) by mordor.san01.cooperativaobrera.coop for gandalf.san01.cooperativaobrera.coop: OK (ref=10d27664-33ed-43e0-a5bd-7d0ef850eb05) by client crmd.31561
Mar 8 08:43:22 lorien crmd[9934]: notice: tengine_stonith_notify: Notified CMAN that 'gandalf.san01.cooperativaobrera.coop' is now fenced
Mar 8 08:43:22 lorien crmd[9934]: notice: tengine_stonith_notify: Target may have been our leader gandalf.san01.cooperativaobrera.coop (recorded: <unset>)

> Strange that there was no DC at the time gandalf was fenced.
> Do you have a log file configured too? Those have more detail than syslog.

Yes, we have a corosync log file with this, for example:

Mar 08 08:43:21 corosync [QUORUM] Members[3]: 2 3 4
Mar 08 08:43:21 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar 08 08:43:21 [9934] lorien.cooperativaobrera.com.ar crmd: info: cman_event_callback: Membership 5732: quorum retained
Mar 08 08:43:21 [9934] lorien.cooperativaobrera.com.ar crmd: notice: crm_update_peer_state: cman_event_callback: Node gandalf.san01.cooperativaobrera.coop[1] - state is now lost
Mar 08 08:43:21 [9934] lorien.cooperativaobrera.com.ar crmd: info: peer_update_callback: gandalf.san01.cooperativaobrera.coop is now lost (was member)
Mar 08 08:43:21 [9934] lorien.cooperativaobrera.com.ar crmd: warning: check_dead_member: Our DC node (gandalf.san01.cooperativaobrera.coop) left the cluster
Mar 08 08:43:21 [9934] lorien.cooperativaobrera.com.ar crmd: info: send_ais_text: Peer overloaded or membership in flux: Re-sending message (Attempt 1 of 20)

Maybe this is significant: 'Our DC node (gandalf.san01.cooperativaobrera.coop) left the cluster' ... ?

Please tell me if you need more details.

> I have also made the following commit which may also help:
> https://github.com/beekhof/pacemaker/commit/cef2446
>
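For reference, the change we made with 'crm configure edit' was to delete the two monitor operations
shown in the CIB diff above. A rough sketch of how those primitives would look in crmsh follows; the
resource agent and parameters here are only illustrative, the only lines we are sure of are the two
"op monitor" ones, which are the lines we removed before committing:

# Sketch, not our literal configuration: presumably two LVM volume group
# primitives named after the op ids in the diff (vg_ifx_oltp, vg_ifx_dss).
primitive vg_ifx_oltp ocf:heartbeat:LVM \
        params volgrpname="vg_ifx_oltp" \
        op monitor interval="240" timeout="120"
primitive vg_ifx_dss ocf:heartbeat:LVM \
        params volgrpname="vg_ifx_dss" \
        op monitor interval="240" timeout="120"

Removing the two "op monitor ..." lines and committing makes crmsh push the edit as a replace
operation, which - as Andrew explains above - is what triggered the DC election on our cluster.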