On 14 Mar 2014, at 10:32 pm, Gabriel Gomiz <ggo...@cooperativaobrera.coop> 
wrote:

> Hi to all!
> 
> We have a 4-node cluster and recently experienced a weird issue with Pacemaker 
> that resulted in three database instance resources being duplicated (running 
> simultaneously on 2 nodes) and subsequent data corruption.
> 
> I've been investigating the logs and could not reach a conclusion as to why 
> that happened, so I'm writing to the list with the details of the event to see 
> if someone can help me pinpoint whether there was some problem with our 
> operation or whether we hit some bug.
> 
> OS: CentOS 6.4
> Pacemaker version: 1.1.8

6.4 should have 1.1.10 available by now.
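A quick way to check what is installed versus what the repos offer (a sketch, 
assuming the stock CentOS repositories):

   rpm -q pacemaker            # currently installed version
   yum list updates pacemaker  # newer version available from enabled repos
   # upgrade one node at a time, with that node in standby first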

> Stack: cman
> Stonith enabled via DELL iDRAC in all 4 nodes
> Nodes: gandalf, isildur, mordor, lorien
> 
> Timeline of events and logs:
> 
> - A resource monitor operation times out and the resources on that node 
> (gandalf) are being stopped
> 
> Mar  8 08:41:09 gandalf crmd[31561]:    error: process_lrm_event: LRM 
> operation
> vg_ifx_oltp_monitor_240000 (594) Timed Out (timeout=120000ms)
> 
> - Stopping the resources on that node (gandalf) also times out and the node is 
> killed by stonith from another node (mordor)

sounds reasonable so far
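Side note: if the VG is just slow to respond rather than actually broken, 
raising the op timeouts is usually a better fix than removing the monitor 
altogether. A sketch of a targeted change (assuming the op id 
vg_ifx_oltp-monitor-240 that shows up in your diff further down):

   cibadmin --modify --xml-text '<op id="vg_ifx_oltp-monitor-240" timeout="300"/>'

This changes just that one element in place.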

> 
> Mar  8 08:42:54 gandalf lrmd[31558]:  warning: child_timeout_callback: 
> vg_ifx_oltp_stop_0 process
> (PID 17816) timed out
> ....
> Mar  8 08:42:55 gandalf pengine[31560]:  warning: unpack_rsc_op: Processing 
> failed op stop for
> vg_ifx_oltp on gandalf.san01.cooperativaobrera.coop: unknown error (1)
> Mar  8 08:42:55 gandalf pengine[31560]:  warning: pe_fence_node: Node
> gandalf.san01.cooperativaobrera.coop will be fenced to recover from resource 
> failure(s)
> 
> Mar  8 08:43:09 mordor corosync[25977]:   [TOTEM ] A processor failed, 
> forming new configuration.
> Mar  8 08:43:09 mordor stonith-ng[26212]:   notice: log_operation: Operation 
> 'reboot' [4612] (call 0
> from crmd.31561) for host 'gandalf.san01.cooperativaobrera.coop' with device 
> 'st-gandalf' returned:
> 0 (OK)
> Mar  8 08:43:21 mordor corosync[25977]:   [QUORUM] Members[3]: 2 3 4
> Mar  8 08:43:21 mordor corosync[25977]:   [TOTEM ] A processor joined or left 
> the membership and a
> new membership was formed.
> Mar  8 08:43:21 mordor crmd[26216]:   notice: crm_update_peer_state: 
> cman_event_callback: Node
> gandalf.san01.cooperativaobrera.coop[1] - state is now lost
> Mar  8 08:43:21 mordor crmd[26216]:  warning: check_dead_member: Our DC node
> (gandalf.san01.cooperativaobrera.coop) left the cluster
> Mar  8 08:43:21 mordor kernel: dlm: closing connection to node 1
> Mar  8 08:43:21 mordor corosync[25977]:   [CPG   ] chosen downlist: sender 
> r(0) ip(172.16.1.1) r(1)
> ip(172.16.2.1) ; members(old:4 left:1)
> Mar  8 08:43:21 mordor corosync[25977]:   [MAIN  ] Completed service 
> synchronization, ready to
> provide service.
> Mar  8 08:43:21 mordor fenced[26045]: fencing deferred to 
> isildur.san01.cooperativaobrera.coop
> Mar  8 08:43:22 mordor crmd[26216]:   notice: do_state_transition: State 
> transition S_NOT_DC ->
> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=check_dead_member ]
> Mar  8 08:43:22 mordor crmd[26216]:   notice: do_state_transition: State 
> transition S_ELECTION ->
> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL 
> origin=do_election_count_vote ]
> Mar  8 08:43:22 mordor stonith-ng[26212]:   notice: remote_op_done: Operation 
> reboot of
> gandalf.san01.cooperativaobrera.coop by mordor.san01.cooperativaobrera.coop 
> for
> crmd.31...@gandalf.san01.cooperativaobrera.coop.10d27664: OK
> Mar  8 08:43:22 mordor crmd[26216]:   notice: tengine_stonith_notify: Peer
> gandalf.san01.cooperativaobrera.coop was terminated (st_notify_fence) by
> mordor.san01.cooperativaobrera.coop for gandalf.san01.cooperativaobrera.coop: 
> OK
> (ref=10d27664-33ed-43e0-a5bd-7d0ef850eb05) by client crmd.31561
> Mar  8 08:43:22 mordor crmd[26216]:   notice: tengine_stonith_notify: 
> Notified CMAN that
> 'gandalf.san01.cooperativaobrera.coop' is now fenced
> Mar  8 08:43:22 mordor crmd[26216]:   notice: tengine_stonith_notify: Target 
> may have been our
> leader gandalf.san01.cooperativaobrera.coop (recorded: <unset>)
> Mar  8 08:43:22 mordor cib[26211]:  warning: cib_process_diff: Diff 0.513.82 
> -> 0.513.83 from
> lorien.san01.cooperativaobrera.coop not applied to 0.513.85: current 
> "num_updates" is greater than
> required
> Mar  8 08:43:22 mordor cib[26211]:  warning: cib_process_diff: Diff 0.513.83 
> -> 0.513.84 from
> lorien.san01.cooperativaobrera.coop not applied to 0.513.85: current 
> "num_updates" is greater than
> required
> Mar  8 08:43:22 mordor cib[26211]:  warning: cib_process_diff: Diff 0.513.84 
> -> 0.513.85 from
> lorien.san01.cooperativaobrera.coop not applied to 0.513.85: current 
> "num_updates" is greater than
> required
> Mar  8 08:43:22 mordor cib[26211]:   notice: cib_process_diff: Diff 0.513.85 
> -> 0.513.86 from
> lorien.san01.cooperativaobrera.coop not applied to 0.513.85: Failed 
> application of an update diff
> Mar  8 08:43:27 mordor attrd[26214]:   notice: attrd_local_callback: Sending 
> full refresh (origin=crmd)
> Mar  8 08:43:27 mordor attrd[26214]:   notice: attrd_trigger_update: Sending 
> flush op to all hosts
> for: last-failure-srv_mysql_dss (1384966716)
> Mar  8 08:43:27 mordor crmd[26216]:   notice: do_state_transition: State 
> transition S_PENDING ->
> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE 
> origin=do_cl_join_finalize_respond ]
> Mar  8 08:43:27 mordor attrd[26214]:   notice: attrd_trigger_update: Sending 
> flush op to all hosts
> for: probe_complete (true)
> Mar  8 08:43:27 mordor attrd[26214]:   notice: attrd_trigger_update: Sending 
> flush op to all hosts
> for: last-failure-vm_suc8300 (1384966659)
> Mar  8 08:43:43 mordor stonith-ng[26212]:   notice: remote_op_done: Operation 
> reboot of
> gandalf.san01.cooperativaobrera.coop by lorien.san01.cooperativaobrera.coop 
> for
> crmd.9...@lorien.san01.cooperativaobrera.coop.3b50e760: OK
> Mar  8 08:43:43 mordor crmd[26216]:   notice: tengine_stonith_notify: Peer
> gandalf.san01.cooperativaobrera.coop was terminated (st_notify_fence) by
> lorien.san01.cooperativaobrera.coop for lorien.san01.cooperativaobrera.coop: 
> OK
> (ref=3b50e760-1142-4d3a-92b8-6936eb9df58c) by client crmd.9934
> Mar  8 08:43:43 mordor crmd[26216]:   notice: tengine_stonith_notify: 
> Notified CMAN that
> 'gandalf.san01.cooperativaobrera.coop' is now fenced
> 
> - Up to this point, gandalf's resources were passed CORRECTLY to other cluster 
> members. We had the cluster with 3 nodes online and 1 node offline. Everything 
> was working OK.
> 
> - After gandalf's reboot completed we checked the node status, then started 
> Pacemaker again manually, and the node reacquired its resources CORRECTLY. So 
> again we had the cluster with all 4 nodes online and all resources running on 
> their preferred nodes. Log of the last resource transition back to gandalf:
> 
> Mar  8 08:51:40 lorien pengine[9933]:   notice: LogActions: Move    
> fs_mysql_hermes#011(Started
> mordor.san01.cooperativaobrera.coop -> gandalf.san01.cooperativaobrera.coop)
> Mar  8 08:51:40 lorien pengine[9933]:   notice: LogActions: Move    
> ip_mysql_hermes#011(Started
> mordor.san01.cooperativaobrera.coop -> gandalf.san01.cooperativaobrera.coop)
> Mar  8 08:51:40 lorien pengine[9933]:   notice: process_pe_message: 
> Calculated Transition 16:
> /var/lib/pacemaker/pengine/pe-input-14.bz2
> Mar  8 08:51:43 lorien crmd[9934]:   notice: run_graph: Transition 16 
> (Complete=9, Pending=0,
> Fired=0, Skipped=0, Incomplete=0, 
> Source=/var/lib/pacemaker/pengine/pe-input-14.bz2): Complete
> Mar  8 08:51:43 lorien crmd[9934]:   notice: do_state_transition: State 
> transition
> S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL 
> origin=notify_crmd ]
> 
> - So, we decided to delete the monitor operation from the resource that caused 
> the timeout (an LVM Volume Group) to avoid having that same problem again. We 
> did 'crm configure edit' and deleted the two op monitor entries, and just 
> after committing that modification to the CIB we experienced the resource 
> duplication problem:
> 
> Mar  8 09:13:52 lorien cib[9929]:   notice: cib:diff: Diff: --- 0.526.3
> Mar  8 09:13:52 lorien cib[9929]:   notice: cib:diff: Diff: +++ 0.527.1 
> ebf608fdccda78e755a61cdeea18a659
> Mar  8 09:13:52 lorien cib[9929]:   notice: cib:diff: --             <op
> id="vg_ifx_oltp-monitor-240" interval="240" name="monitor" timeout="120" />
> Mar  8 09:13:52 lorien cib[9929]:   notice: cib:diff: --             <op 
> id="vg_ifx_dss-monitor-240"
> interval="240" name="monitor" timeout="120" />
> Mar  8 09:13:52 lorien cib[9929]:   notice: cib:diff: ++ <cib epoch="527" 
> num_updates="1"
> admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.7"
> update-origin="mordor.san01.cooperativaobrera.coop" update-client="crmd" 
> cib-last-written="Sat Mar 
> 8 08:58:40 2014" have-quorum="1" 
> dc-uuid="lorien.san01.cooperativaobrera.coop" />
> Mar  8 09:13:52 lorien crmd[9934]:   notice: do_state_transition: State 
> transition S_IDLE ->
> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
> origin=abort_transition_graph ]
> Mar  8 09:13:52 lorien crmd[9934]:   notice: do_state_transition: State 
> transition S_ELECTION ->
> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL 
> origin=do_election_check ]
> Mar  8 09:13:52 lorien crmd[9934]:   notice: do_dc_takeover: Marking
> gandalf.san01.cooperativaobrera.coop, target of a previous stonith action, as 
> clean

^^^ This is significant. 

The problem is triggered by crmsh performing a replace operation - which causes 
an election.
When the election completes, the DC checks the stonith_cleanup_list (which 
evidently contains old data) and erases gandalf's status section (causing 
everyone to think it's safely offline).

Do you see anything like:

> Mar  8 08:43:22 mordor crmd[26216]:   notice: tengine_stonith_notify: Target 
> may have been our
> leader gandalf.san01.cooperativaobrera.coop (recorded: <unset>)

on lorien?

Strange that there was no DC at the time gandalf was fenced.  
Do you have a log file configured too? Those have more detail than syslog.
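If not, one can be enabled; a sketch, assuming the usual locations on 
RHEL/CentOS 6:

   # /etc/sysconfig/pacemaker
   PCMK_logfile=/var/log/pacemaker.log

   # on cman-based clusters a detail log is often already written to:
   #   /var/log/cluster/corosync.log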

I have also made the following commit which may also help:
   https://github.com/beekhof/pacemaker/commit/cef2446
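In the meantime, doing targeted CIB edits instead of a full replace avoids the 
code path above. For example, the change you made could have been expressed as 
(a sketch, using the op ids from your cib:diff output):

   cibadmin --delete --xml-text '<op id="vg_ifx_oltp-monitor-240"/>'
   cibadmin --delete --xml-text '<op id="vg_ifx_dss-monitor-240"/>'

crmsh's 'configure edit' + commit pushes the whole CIB as a replace, which is 
what triggered the election here.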



> Mar  8 09:13:52 lorien crmd[9934]:  warning: match_down_event: No match for 
> shutdown action on
> gandalf.san01.cooperativaobrera.coop
> Mar  8 09:13:52 lorien crmd[9934]:   notice: peer_update_callback: 
> Stonith/shutdown of
> gandalf.san01.cooperativaobrera.coop not matched
> Mar  8 09:13:52 lorien crmd[9934]:   notice: crm_update_peer_state: 
> send_stonith_update: Node
> gandalf.san01.cooperativaobrera.coop[1] - state is now lost
> Mar  8 09:13:53 lorien crmd[9934]:  warning: do_state_transition: 3 cluster 
> nodes failed to respond
> to the join offer.
> Mar  8 09:13:53 lorien crmd[9934]:  warning: do_log: FSA: Input I_ELECTION_DC 
> from
> do_dc_join_finalize() received in state S_FINALIZE_JOIN
> Mar  8 09:13:53 lorien crmd[9934]:    error: crmd_ais_dispatch: Recieving 
> messages from a node we
> think is dead: gandalf.san01.cooperativaobrera.coop[1]
> Mar  8 09:13:53 lorien crmd[9934]:  warning: match_down_event: No match for 
> shutdown action on
> gandalf.san01.cooperativaobrera.coop
> Mar  8 09:13:53 lorien crmd[9934]:   notice: peer_update_callback: 
> Stonith/shutdown of
> gandalf.san01.cooperativaobrera.coop not matched
> Mar  8 09:13:53 lorien crmd[9934]:   notice: do_election_count_vote: Election 
> 6 (current: 6, owner:
> lorien.san01.cooperativaobrera.coop): Processed no-vote from 
> gandalf.san01.cooperativaobrera.coop
> (Peer is not part of our cluster)
> Mar  8 09:14:00 lorien attrd[9932]:   notice: attrd_local_callback: Sending 
> full refresh (origin=crmd)
> 
> - After that, node gandalf was in "pending" state but it was still running all 
> of its preferred resources. The other 3 nodes were online and also running 
> their preferred resources, but then the cluster instructed other nodes to 
> start gandalf's resources without ordering a stonith of gandalf, so => 
> duplication:
> 
> Mar  8 09:14:00 lorien attrd[9932]:   notice: attrd_local_callback: Sending 
> full refresh (origin=crmd)
> Mar  8 09:14:00 lorien attrd[9932]:   notice: attrd_trigger_update: Sending 
> flush op to all hosts
> for: last-failure-srv_mysql_hermes (1394279494)
> Mar  8 09:14:00 lorien attrd[9932]:   notice: attrd_trigger_update: Sending 
> flush op to all hosts
> for: last-failure-srv_mysql_oltp (1394279084)
> Mar  8 09:14:00 lorien attrd[9932]:   notice: attrd_trigger_update: Sending 
> flush op to all hosts
> for: probe_complete (true)
> Mar  8 09:14:00 lorien attrd[9932]:   notice: attrd_trigger_update: Sending 
> flush op to all hosts
> for: last-failure-srv_mysql_it (1384967023)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> st-lorien#011(isildur.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> fs_mysql_hermes#011(lorien.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> ip_mysql_hermes#011(lorien.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> srv_mysql_hermes#011(lorien.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> srv_jabberd#011(mordor.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> srv_conference#011(mordor.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> srv_orangutan#011(mordor.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> fs_jabber#011(mordor.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> ip_jabber#011(mordor.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> fs_bkoltp_ifx#011(mordor.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> ip_oltp#011(mordor.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> fs_mysql_oltp#011(mordor.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> srv_mysql_oltp#011(mordor.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> vg_ifx_oltp#011(mordor.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> srv_ifx_oltp#011(mordor.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> srv_xinetd_oltp#011(mordor.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: LogActions: Start  
> srv_xinetd_hermes#011(lorien.san01.cooperativaobrera.coop)
> Mar  8 09:14:01 lorien pengine[9933]:   notice: process_pe_message: 
> Calculated Transition 53:
> /var/lib/pacemaker/pengine/pe-input-50.bz2
> 
> - These last logs are the DC's orders to start gandalf's resources on other 
> members.
> 
> - Our databases ran as duplicated instances for 7 minutes, but that was enough 
> to cause some pretty wild data corruption and we had to do a complete restore 
> of those databases from backup.
> 
> Can you identify some problem in this operation? Did we do something wrong? Is 
> this a known bug? I'm thinking of something related to the first stonith kill 
> (gandalf was the DC at that time) and some inconsistent state after that. 
> Also, there is an election (?) just at the moment the CIB modification was 
> passed to the other members. Could this be related?
> 
> Many thanks in advance for any insight you can offer.
> 
> Kind regards!
> 
> -- 
> Lic. Gabriel Gomiz - Jefe de Sistemas / Administrador
> ggo...@cooperativaobrera.coop
> Gerencia de Sistemas - Cooperativa Obrera Ltda.
> Tel: (0291) 403-9700
> 

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
