On 14 Mar 2014, at 10:32 pm, Gabriel Gomiz <ggo...@cooperativaobrera.coop> wrote:

> Hi to all!
>
> We have a 4-node cluster and recently experienced a weird issue with Pacemaker that resulted in
> three database instance resources being duplicated (running simultaneously on 2 nodes) and
> subsequent data corruption.
>
> I've been investigating the logs and could not come to a conclusion as to why that happened, so
> I'm writing to the list with the details of the event to see if someone can help me pinpoint
> whether there was some problem with our operation or whether we hit a bug.
>
> OS: CentOS 6.4
> Pacemaker version: 1.1.8

6.4 should have 1.1.10 available by now.
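If upgrading is an option, it is worth checking what the distro currently ships. A quick sketch, assuming the stock CentOS repositories:

    rpm -q pacemaker         # build currently installed
    yum list pacemaker       # installed and available builds
    yum update 'pacemaker*'  # then restart cluster services node by node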
> Stack: cman
> Stonith enabled via DELL iDRAC on all 4 nodes
> Nodes: gandalf, isildur, mordor, lorien
>
> Timeline of events and logs:
>
> - A resource monitor operation times out and the resources on that node (gandalf) are being stopped:
>
> Mar 8 08:41:09 gandalf crmd[31561]: error: process_lrm_event: LRM operation vg_ifx_oltp_monitor_240000 (594) Timed Out (timeout=120000ms)
>
> - Stopping resources on that node (gandalf) also times out and the node is killed by stonith from another node (mordor):

sounds reasonable so far
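As an aside: since it was the VG's monitor and stop that blew through 120s here, bumping those operation timeouts is usually a safer reaction than deleting the monitor altogether (which I see you did further down). A sketch only - your actual primitive definition isn't shown, so the agent (ocf:heartbeat:LVM), the volgrpname value and the timeout numbers below are assumptions to adapt:

    # hypothetical crm configure snippet - adjust agent, params and values to your setup
    primitive vg_ifx_oltp ocf:heartbeat:LVM \
            params volgrpname="vg_ifx_oltp" \
            op monitor interval="240" timeout="300" \
            op start interval="0" timeout="300" \
            op stop interval="0" timeout="300"

With stonith in place, a slow-but-working VG then just gets more time instead of escalating straight to node-level recovery.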
> Mar 8 08:42:54 gandalf lrmd[31558]: warning: child_timeout_callback: vg_ifx_oltp_stop_0 process (PID 17816) timed out
> ....
> Mar 8 08:42:55 gandalf pengine[31560]: warning: unpack_rsc_op: Processing failed op stop for vg_ifx_oltp on gandalf.san01.cooperativaobrera.coop: unknown error (1)
> Mar 8 08:42:55 gandalf pengine[31560]: warning: pe_fence_node: Node gandalf.san01.cooperativaobrera.coop will be fenced to recover from resource failure(s)
>
> Mar 8 08:43:09 mordor corosync[25977]: [TOTEM ] A processor failed, forming new configuration.
> Mar 8 08:43:09 mordor stonith-ng[26212]: notice: log_operation: Operation 'reboot' [4612] (call 0 from crmd.31561) for host 'gandalf.san01.cooperativaobrera.coop' with device 'st-gandalf' returned: 0 (OK)
> Mar 8 08:43:21 mordor corosync[25977]: [QUORUM] Members[3]: 2 3 4
> Mar 8 08:43:21 mordor corosync[25977]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Mar 8 08:43:21 mordor crmd[26216]: notice: crm_update_peer_state: cman_event_callback: Node gandalf.san01.cooperativaobrera.coop[1] - state is now lost
> Mar 8 08:43:21 mordor crmd[26216]: warning: check_dead_member: Our DC node (gandalf.san01.cooperativaobrera.coop) left the cluster
> Mar 8 08:43:21 mordor kernel: dlm: closing connection to node 1
> Mar 8 08:43:21 mordor corosync[25977]: [CPG ] chosen downlist: sender r(0) ip(172.16.1.1) r(1) ip(172.16.2.1) ; members(old:4 left:1)
> Mar 8 08:43:21 mordor corosync[25977]: [MAIN ] Completed service synchronization, ready to provide service.
> Mar 8 08:43:21 mordor fenced[26045]: fencing deferred to isildur.san01.cooperativaobrera.coop
> Mar 8 08:43:22 mordor crmd[26216]: notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=check_dead_member ]
> Mar 8 08:43:22 mordor crmd[26216]: notice: do_state_transition: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
> Mar 8 08:43:22 mordor stonith-ng[26212]: notice: remote_op_done: Operation reboot of gandalf.san01.cooperativaobrera.coop by mordor.san01.cooperativaobrera.coop for crmd.31...@gandalf.san01.cooperativaobrera.coop.10d27664: OK
> Mar 8 08:43:22 mordor crmd[26216]: notice: tengine_stonith_notify: Peer gandalf.san01.cooperativaobrera.coop was terminated (st_notify_fence) by mordor.san01.cooperativaobrera.coop for gandalf.san01.cooperativaobrera.coop: OK (ref=10d27664-33ed-43e0-a5bd-7d0ef850eb05) by client crmd.31561
> Mar 8 08:43:22 mordor crmd[26216]: notice: tengine_stonith_notify: Notified CMAN that 'gandalf.san01.cooperativaobrera.coop' is now fenced
> Mar 8 08:43:22 mordor crmd[26216]: notice: tengine_stonith_notify: Target may have been our leader gandalf.san01.cooperativaobrera.coop (recorded: <unset>)
> Mar 8 08:43:22 mordor cib[26211]: warning: cib_process_diff: Diff 0.513.82 -> 0.513.83 from lorien.san01.cooperativaobrera.coop not applied to 0.513.85: current "num_updates" is greater than required
> Mar 8 08:43:22 mordor cib[26211]: warning: cib_process_diff: Diff 0.513.83 -> 0.513.84 from lorien.san01.cooperativaobrera.coop not applied to 0.513.85: current "num_updates" is greater than required
> Mar 8 08:43:22 mordor cib[26211]: warning: cib_process_diff: Diff 0.513.84 -> 0.513.85 from lorien.san01.cooperativaobrera.coop not applied to 0.513.85: current "num_updates" is greater than required
> Mar 8 08:43:22 mordor cib[26211]: notice: cib_process_diff: Diff 0.513.85 -> 0.513.86 from lorien.san01.cooperativaobrera.coop not applied to 0.513.85: Failed application of an update diff
> Mar 8 08:43:27 mordor attrd[26214]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
> Mar 8 08:43:27 mordor attrd[26214]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-srv_mysql_dss (1384966716)
> Mar 8 08:43:27 mordor crmd[26216]: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
> Mar 8 08:43:27 mordor attrd[26214]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
> Mar 8 08:43:27 mordor attrd[26214]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-vm_suc8300 (1384966659)
> Mar 8 08:43:43 mordor stonith-ng[26212]: notice: remote_op_done: Operation reboot of gandalf.san01.cooperativaobrera.coop by lorien.san01.cooperativaobrera.coop for crmd.9...@lorien.san01.cooperativaobrera.coop.3b50e760: OK
> Mar 8 08:43:43 mordor crmd[26216]: notice: tengine_stonith_notify: Peer gandalf.san01.cooperativaobrera.coop was terminated (st_notify_fence) by lorien.san01.cooperativaobrera.coop for lorien.san01.cooperativaobrera.coop: OK (ref=3b50e760-1142-4d3a-92b8-6936eb9df58c) by client crmd.9934
> Mar 8 08:43:43 mordor crmd[26216]: notice: tengine_stonith_notify: Notified CMAN that 'gandalf.san01.cooperativaobrera.coop' is now fenced
>
> - Up to this point, gandalf's resources were moved CORRECTLY to the other cluster members. We had the cluster with 3 nodes online and 1 node offline. Everything was working ok.
>
> - After gandalf's reboot completed we checked the node status, then started Pacemaker on it again manually, and the node reacquired its resources CORRECTLY. So again we had the cluster with all 4 nodes online and all resources running on their preferred nodes. Log of the last resource transition back to gandalf:
>
> Mar 8 08:51:40 lorien pengine[9933]: notice: LogActions: Move fs_mysql_hermes#011(Started mordor.san01.cooperativaobrera.coop -> gandalf.san01.cooperativaobrera.coop)
> Mar 8 08:51:40 lorien pengine[9933]: notice: LogActions: Move ip_mysql_hermes#011(Started mordor.san01.cooperativaobrera.coop -> gandalf.san01.cooperativaobrera.coop)
> Mar 8 08:51:40 lorien pengine[9933]: notice: process_pe_message: Calculated Transition 16: /var/lib/pacemaker/pengine/pe-input-14.bz2
> Mar 8 08:51:43 lorien crmd[9934]: notice: run_graph: Transition 16 (Complete=9, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-14.bz2): Complete
> Mar 8 08:51:43 lorien crmd[9934]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
>
> - So, we decided to delete the monitor operation from the resource that caused the timeout (an LVM Volume Group) to avoid having that same problem again. We ran 'crm configure edit' and deleted the two op monitor entries, and just after committing that modification to the CIB we experienced the resource duplication:
>
> Mar 8 09:13:52 lorien cib[9929]: notice: cib:diff: Diff: --- 0.526.3
> Mar 8 09:13:52 lorien cib[9929]: notice: cib:diff: Diff: +++ 0.527.1 ebf608fdccda78e755a61cdeea18a659
> Mar 8 09:13:52 lorien cib[9929]: notice: cib:diff: -- <op id="vg_ifx_oltp-monitor-240" interval="240" name="monitor" timeout="120" />
> Mar 8 09:13:52 lorien cib[9929]: notice: cib:diff: -- <op id="vg_ifx_dss-monitor-240" interval="240" name="monitor" timeout="120" />
> Mar 8 09:13:52 lorien cib[9929]: notice: cib:diff: ++ <cib epoch="527" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.7" update-origin="mordor.san01.cooperativaobrera.coop" update-client="crmd" cib-last-written="Sat Mar 8 08:58:40 2014" have-quorum="1" dc-uuid="lorien.san01.cooperativaobrera.coop" />
> Mar 8 09:13:52 lorien crmd[9934]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Mar 8 09:13:52 lorien crmd[9934]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
> Mar 8 09:13:52 lorien crmd[9934]: notice: do_dc_takeover: Marking gandalf.san01.cooperativaobrera.coop, target of a previous stonith action, as clean

^^^ This is significant.

The problem is triggered by crmsh performing a replace operation - which causes an election.
When the election is completed, the DC checks the stonith_cleanup_list (which evidently contains old data) and erases gandalf's status section (causing everyone to think it's safely offline).

Do you see anything like:

> Mar 8 08:43:22 mordor crmd[26216]: notice: tengine_stonith_notify: Target may have been our leader gandalf.san01.cooperativaobrera.coop (recorded: <unset>)

on lorien?
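Until you're on a build with that logic fixed (see the commit below), one way to reduce exposure is to avoid a full CIB replace for small edits like this one. A sketch, untested here, using cibadmin's delete-by-matching-element form with the op ids taken from your cib:diff output:

    # remove the two op definitions directly instead of replacing the whole configuration
    cibadmin --delete --xml-text '<op id="vg_ifx_oltp-monitor-240" name="monitor"/>'
    cibadmin --delete --xml-text '<op id="vg_ifx_dss-monitor-240" name="monitor"/>'

The element is matched by its tag name and the attributes you supply, so the ids above should be enough to identify them, and no replace-triggered election should be involved.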
Strange that there was no DC at the time gandalf was fenced.

Do you have a log file configured too? Those have more detail than syslog.

I have also made the following commit which may also help:

   https://github.com/beekhof/pacemaker/commit/cef2446
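If there is no dedicated log yet, one way to get one on a cman-based install is via /etc/sysconfig/pacemaker. A sketch - the path is just an example, and logging may also be inherited from the <logging> section of cluster.conf:

    # /etc/sysconfig/pacemaker (takes effect after the cluster services restart)
    PCMK_logfile=/var/log/pacemaker.log
    # PCMK_debug=yes   # optionally, much more verbose daemon logging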
> Mar 8 09:13:52 lorien crmd[9934]: warning: match_down_event: No match for shutdown action on gandalf.san01.cooperativaobrera.coop
> Mar 8 09:13:52 lorien crmd[9934]: notice: peer_update_callback: Stonith/shutdown of gandalf.san01.cooperativaobrera.coop not matched
> Mar 8 09:13:52 lorien crmd[9934]: notice: crm_update_peer_state: send_stonith_update: Node gandalf.san01.cooperativaobrera.coop[1] - state is now lost
> Mar 8 09:13:53 lorien crmd[9934]: warning: do_state_transition: 3 cluster nodes failed to respond to the join offer.
> Mar 8 09:13:53 lorien crmd[9934]: warning: do_log: FSA: Input I_ELECTION_DC from do_dc_join_finalize() received in state S_FINALIZE_JOIN
> Mar 8 09:13:53 lorien crmd[9934]: error: crmd_ais_dispatch: Recieving messages from a node we think is dead: gandalf.san01.cooperativaobrera.coop[1]
> Mar 8 09:13:53 lorien crmd[9934]: warning: match_down_event: No match for shutdown action on gandalf.san01.cooperativaobrera.coop
> Mar 8 09:13:53 lorien crmd[9934]: notice: peer_update_callback: Stonith/shutdown of gandalf.san01.cooperativaobrera.coop not matched
> Mar 8 09:13:53 lorien crmd[9934]: notice: do_election_count_vote: Election 6 (current: 6, owner: lorien.san01.cooperativaobrera.coop): Processed no-vote from gandalf.san01.cooperativaobrera.coop (Peer is not part of our cluster)
> Mar 8 09:14:00 lorien attrd[9932]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
>
> - After that, node gandalf was in "pending" state but it was still running all of its preferred resources. The other 3 nodes were online and also running their preferred resources, but then the cluster instructed the other nodes to start gandalf's resources, without ordering gandalf to be fenced, so => duplication:
>
> Mar 8 09:14:00 lorien attrd[9932]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
> Mar 8 09:14:00 lorien attrd[9932]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-srv_mysql_hermes (1394279494)
> Mar 8 09:14:00 lorien attrd[9932]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-srv_mysql_oltp (1394279084)
> Mar 8 09:14:00 lorien attrd[9932]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
> Mar 8 09:14:00 lorien attrd[9932]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-srv_mysql_it (1384967023)
> Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start st-lorien#011(isildur.san01.cooperativaobrera.coop)
> Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start fs_mysql_hermes#011(lorien.san01.cooperativaobrera.coop)
> Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start ip_mysql_hermes#011(lorien.san01.cooperativaobrera.coop)
> Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start srv_mysql_hermes#011(lorien.san01.cooperativaobrera.coop)
> Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start srv_jabberd#011(mordor.san01.cooperativaobrera.coop)
> Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start srv_conference#011(mordor.san01.cooperativaobrera.coop)
> Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start srv_orangutan#011(mordor.san01.cooperativaobrera.coop)
> Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start fs_jabber#011(mordor.san01.cooperativaobrera.coop)
> Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start ip_jabber#011(mordor.san01.cooperativaobrera.coop)
> Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start fs_bkoltp_ifx#011(mordor.san01.cooperativaobrera.coop)
> Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start ip_oltp#011(mordor.san01.cooperativaobrera.coop)
> Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start fs_mysql_oltp#011(mordor.san01.cooperativaobrera.coop)
> Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start srv_mysql_oltp#011(mordor.san01.cooperativaobrera.coop)
> Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start vg_ifx_oltp#011(mordor.san01.cooperativaobrera.coop)
> Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start srv_ifx_oltp#011(mordor.san01.cooperativaobrera.coop)
> Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start srv_xinetd_oltp#011(mordor.san01.cooperativaobrera.coop)
> Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start srv_xinetd_hermes#011(lorien.san01.cooperativaobrera.coop)
> Mar 8 09:14:01 lorien pengine[9933]: notice: process_pe_message: Calculated Transition 53: /var/lib/pacemaker/pengine/pe-input-50.bz2
>
> - These last logs are the DC's orders to start gandalf's resources on the other members.
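If you still have that pe-input file on lorien, replaying it should show exactly what the policy engine believed about gandalf when it scheduled those starts without fencing it. Something along these lines:

    # replay transition 53 from the saved policy engine input (on lorien)
    crm_simulate --xml-file /var/lib/pacemaker/pengine/pe-input-50.bz2 --simulate --show-scores

That file is also the most useful thing to attach if you open a bug for this.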
> - Our databases ran as duplicated instances for 7 minutes, but that was sufficient to cause some pretty wild data corruption and we had to do a complete restore from backup.
>
> Can you identify some problem in this operation? Did we do something wrong? Is this a known bug?
> I'm thinking of something related to the first stonith kill (gandalf was DC at that time) and
> some inconsistent state after that. Also there is an election (?) just at the moment of passing
> the CIB modification to the other members. Could this be related?
>
> Many thanks in advance for any insight you could help us with.
>
> Kind regards!
>
> --
> Lic. Gabriel Gomiz - Jefe de Sistemas / Administrador
> ggo...@cooperativaobrera.coop
> Gerencia de Sistemas - Cooperativa Obrera Ltda.
> Tel: (0291) 403-9700
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org