Hi all! We have a 4-node cluster and recently experienced a strange issue with Pacemaker that resulted in three database instance resources being duplicated (running simultaneously on two nodes) and subsequent data corruption.
I've been investigating the logs and could not reach a conclusion as to why that happened, so I'm writing to the list with details of the event to see if someone can help me pinpoint whether there was a problem with our operation or whether we hit a bug.

OS: CentOS 6.4
Pacemaker version: 1.1.8
Stack: cman
Stonith: enabled via Dell iDRAC on all 4 nodes
Nodes: gandalf, isildur, mordor, lorien

Timeline of events and logs:

- A monitor operation on vg_ifx_oltp times out and the resources on that node (gandalf) start being stopped:

Mar 8 08:41:09 gandalf crmd[31561]: error: process_lrm_event: LRM operation vg_ifx_oltp_monitor_240000 (594) Timed Out (timeout=120000ms)
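For reference, vg_ifx_oltp is an LVM volume group. Its definition, sketched from memory rather than pasted from our CIB (I'm assuming the stock ocf:heartbeat:LVM agent and omitting the volume group name; only the 240s monitor interval and 120s timeout are exact, matching the monitor_240000 / timeout=120000ms above):

    # sketch only: agent and params approximate, monitor interval/timeout exact
    primitive vg_ifx_oltp ocf:heartbeat:LVM \
            params volgrpname="..." \
            op monitor interval="240" timeout="120"

The other volume group, vg_ifx_dss, has the same monitor op (both show up in the CIB diff further down).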
- Stopping the resources on that node (gandalf) also times out, and the node is killed by stonith from another node (mordor):

Mar 8 08:42:54 gandalf lrmd[31558]: warning: child_timeout_callback: vg_ifx_oltp_stop_0 process (PID 17816) timed out
....
Mar 8 08:42:55 gandalf pengine[31560]: warning: unpack_rsc_op: Processing failed op stop for vg_ifx_oltp on gandalf.san01.cooperativaobrera.coop: unknown error (1)
Mar 8 08:42:55 gandalf pengine[31560]: warning: pe_fence_node: Node gandalf.san01.cooperativaobrera.coop will be fenced to recover from resource failure(s)
Mar 8 08:43:09 mordor corosync[25977]: [TOTEM ] A processor failed, forming new configuration.
Mar 8 08:43:09 mordor stonith-ng[26212]: notice: log_operation: Operation 'reboot' [4612] (call 0 from crmd.31561) for host 'gandalf.san01.cooperativaobrera.coop' with device 'st-gandalf' returned: 0 (OK)
Mar 8 08:43:21 mordor corosync[25977]: [QUORUM] Members[3]: 2 3 4
Mar 8 08:43:21 mordor corosync[25977]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar 8 08:43:21 mordor crmd[26216]: notice: crm_update_peer_state: cman_event_callback: Node gandalf.san01.cooperativaobrera.coop[1] - state is now lost
Mar 8 08:43:21 mordor crmd[26216]: warning: check_dead_member: Our DC node (gandalf.san01.cooperativaobrera.coop) left the cluster
Mar 8 08:43:21 mordor kernel: dlm: closing connection to node 1
Mar 8 08:43:21 mordor corosync[25977]: [CPG ] chosen downlist: sender r(0) ip(172.16.1.1) r(1) ip(172.16.2.1) ; members(old:4 left:1)
Mar 8 08:43:21 mordor corosync[25977]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 8 08:43:21 mordor fenced[26045]: fencing deferred to isildur.san01.cooperativaobrera.coop
Mar 8 08:43:22 mordor crmd[26216]: notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=check_dead_member ]
Mar 8 08:43:22 mordor crmd[26216]: notice: do_state_transition: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Mar 8 08:43:22 mordor stonith-ng[26212]: notice: remote_op_done: Operation reboot of gandalf.san01.cooperativaobrera.coop by mordor.san01.cooperativaobrera.coop for crmd.31...@gandalf.san01.cooperativaobrera.coop.10d27664: OK
Mar 8 08:43:22 mordor crmd[26216]: notice: tengine_stonith_notify: Peer gandalf.san01.cooperativaobrera.coop was terminated (st_notify_fence) by mordor.san01.cooperativaobrera.coop for gandalf.san01.cooperativaobrera.coop: OK (ref=10d27664-33ed-43e0-a5bd-7d0ef850eb05) by client crmd.31561
Mar 8 08:43:22 mordor crmd[26216]: notice: tengine_stonith_notify: Notified CMAN that 'gandalf.san01.cooperativaobrera.coop' is now fenced
Mar 8 08:43:22 mordor crmd[26216]: notice: tengine_stonith_notify: Target may have been our leader gandalf.san01.cooperativaobrera.coop (recorded: <unset>)
Mar 8 08:43:22 mordor cib[26211]: warning: cib_process_diff: Diff 0.513.82 -> 0.513.83 from lorien.san01.cooperativaobrera.coop not applied to 0.513.85: current "num_updates" is greater than required
Mar 8 08:43:22 mordor cib[26211]: warning: cib_process_diff: Diff 0.513.83 -> 0.513.84 from lorien.san01.cooperativaobrera.coop not applied to 0.513.85: current "num_updates" is greater than required
Mar 8 08:43:22 mordor cib[26211]: warning: cib_process_diff: Diff 0.513.84 -> 0.513.85 from lorien.san01.cooperativaobrera.coop not applied to 0.513.85: current "num_updates" is greater than required
Mar 8 08:43:22 mordor cib[26211]: notice: cib_process_diff: Diff 0.513.85 -> 0.513.86 from lorien.san01.cooperativaobrera.coop not applied to 0.513.85: Failed application of an update diff
Mar 8 08:43:27 mordor attrd[26214]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Mar 8 08:43:27 mordor attrd[26214]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-srv_mysql_dss (1384966716)
Mar 8 08:43:27 mordor crmd[26216]: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Mar 8 08:43:27 mordor attrd[26214]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Mar 8 08:43:27 mordor attrd[26214]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-vm_suc8300 (1384966659)
Mar 8 08:43:43 mordor stonith-ng[26212]: notice: remote_op_done: Operation reboot of gandalf.san01.cooperativaobrera.coop by lorien.san01.cooperativaobrera.coop for crmd.9...@lorien.san01.cooperativaobrera.coop.3b50e760: OK
Mar 8 08:43:43 mordor crmd[26216]: notice: tengine_stonith_notify: Peer gandalf.san01.cooperativaobrera.coop was terminated (st_notify_fence) by lorien.san01.cooperativaobrera.coop for lorien.san01.cooperativaobrera.coop: OK (ref=3b50e760-1142-4d3a-92b8-6936eb9df58c) by client crmd.9934
Mar 8 08:43:43 mordor crmd[26216]: notice: tengine_stonith_notify: Notified CMAN that 'gandalf.san01.cooperativaobrera.coop' is now fenced

- Up to this point, gandalf's resources were failed over CORRECTLY to the other cluster members. We had the cluster with 3 nodes online and 1 node offline, and everything was working OK.
- After gandalf's reboot completed we checked the node status, then started Pacemaker on it again manually, and the node reacquired its resources CORRECTLY. So once again we had the cluster with all 4 nodes online and all resources running on their preferred nodes. Log of the last resource transition back to gandalf:

Mar 8 08:51:40 lorien pengine[9933]: notice: LogActions: Move fs_mysql_hermes#011(Started mordor.san01.cooperativaobrera.coop -> gandalf.san01.cooperativaobrera.coop)
Mar 8 08:51:40 lorien pengine[9933]: notice: LogActions: Move ip_mysql_hermes#011(Started mordor.san01.cooperativaobrera.coop -> gandalf.san01.cooperativaobrera.coop)
Mar 8 08:51:40 lorien pengine[9933]: notice: process_pe_message: Calculated Transition 16: /var/lib/pacemaker/pengine/pe-input-14.bz2
Mar 8 08:51:43 lorien crmd[9934]: notice: run_graph: Transition 16 (Complete=9, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-14.bz2): Complete
Mar 8 08:51:43 lorien crmd[9934]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]

- So, we decided to delete the monitor operation from the resources that had caused the timeout (the LVM volume groups) to avoid hitting the same problem again. We did 'crm configure edit', deleted the two 'op monitor' lines, and just after committing that modification to the CIB we experienced the resource duplication problem.
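In crm shell terms the edit amounted to this change on both volume group resources (diff-style sketch only, with parameters omitted and trailing backslashes adjusted; the two removed op lines correspond exactly to the two <op/> entries in the CIB diff in the logs below):

      primitive vg_ifx_oltp ocf:heartbeat:LVM \
              params volgrpname="..." \
    -         op monitor interval="240" timeout="120"
      primitive vg_ifx_dss ocf:heartbeat:LVM \
              params volgrpname="..." \
    -         op monitor interval="240" timeout="120"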
These are the logs from the moment the change was committed:

Mar 8 09:13:52 lorien cib[9929]: notice: cib:diff: Diff: --- 0.526.3
Mar 8 09:13:52 lorien cib[9929]: notice: cib:diff: Diff: +++ 0.527.1 ebf608fdccda78e755a61cdeea18a659
Mar 8 09:13:52 lorien cib[9929]: notice: cib:diff: -- <op id="vg_ifx_oltp-monitor-240" interval="240" name="monitor" timeout="120" />
Mar 8 09:13:52 lorien cib[9929]: notice: cib:diff: -- <op id="vg_ifx_dss-monitor-240" interval="240" name="monitor" timeout="120" />
Mar 8 09:13:52 lorien cib[9929]: notice: cib:diff: ++ <cib epoch="527" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.7" update-origin="mordor.san01.cooperativaobrera.coop" update-client="crmd" cib-last-written="Sat Mar 8 08:58:40 2014" have-quorum="1" dc-uuid="lorien.san01.cooperativaobrera.coop" />
Mar 8 09:13:52 lorien crmd[9934]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Mar 8 09:13:52 lorien crmd[9934]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
Mar 8 09:13:52 lorien crmd[9934]: notice: do_dc_takeover: Marking gandalf.san01.cooperativaobrera.coop, target of a previous stonith action, as clean
Mar 8 09:13:52 lorien crmd[9934]: warning: match_down_event: No match for shutdown action on gandalf.san01.cooperativaobrera.coop
Mar 8 09:13:52 lorien crmd[9934]: notice: peer_update_callback: Stonith/shutdown of gandalf.san01.cooperativaobrera.coop not matched
Mar 8 09:13:52 lorien crmd[9934]: notice: crm_update_peer_state: send_stonith_update: Node gandalf.san01.cooperativaobrera.coop[1] - state is now lost
Mar 8 09:13:53 lorien crmd[9934]: warning: do_state_transition: 3 cluster nodes failed to respond to the join offer.
Mar 8 09:13:53 lorien crmd[9934]: warning: do_log: FSA: Input I_ELECTION_DC from do_dc_join_finalize() received in state S_FINALIZE_JOIN
Mar 8 09:13:53 lorien crmd[9934]: error: crmd_ais_dispatch: Recieving messages from a node we think is dead: gandalf.san01.cooperativaobrera.coop[1]
Mar 8 09:13:53 lorien crmd[9934]: warning: match_down_event: No match for shutdown action on gandalf.san01.cooperativaobrera.coop
Mar 8 09:13:53 lorien crmd[9934]: notice: peer_update_callback: Stonith/shutdown of gandalf.san01.cooperativaobrera.coop not matched
Mar 8 09:13:53 lorien crmd[9934]: notice: do_election_count_vote: Election 6 (current: 6, owner: lorien.san01.cooperativaobrera.coop): Processed no-vote from gandalf.san01.cooperativaobrera.coop (Peer is not part of our cluster)
Mar 8 09:14:00 lorien attrd[9932]: notice: attrd_local_callback: Sending full refresh (origin=crmd)

- After that, node gandalf was in "pending" state but it was still running all of its preferred resources. The other 3 nodes were online and also running their preferred resources, but then the cluster instructed the other nodes to start gandalf's resources without ordering a stonith of gandalf first, so => duplication:

Mar 8 09:14:00 lorien attrd[9932]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Mar 8 09:14:00 lorien attrd[9932]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-srv_mysql_hermes (1394279494)
Mar 8 09:14:00 lorien attrd[9932]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-srv_mysql_oltp (1394279084)
Mar 8 09:14:00 lorien attrd[9932]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Mar 8 09:14:00 lorien attrd[9932]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-srv_mysql_it (1384967023)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start st-lorien#011(isildur.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start fs_mysql_hermes#011(lorien.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start ip_mysql_hermes#011(lorien.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start srv_mysql_hermes#011(lorien.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start srv_jabberd#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start srv_conference#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start srv_orangutan#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start fs_jabber#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start ip_jabber#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start fs_bkoltp_ifx#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start ip_oltp#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start fs_mysql_oltp#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start srv_mysql_oltp#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start vg_ifx_oltp#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start srv_ifx_oltp#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start srv_xinetd_oltp#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start srv_xinetd_hermes#011(lorien.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: process_pe_message: Calculated Transition 53: /var/lib/pacemaker/pengine/pe-input-50.bz2

- These last logs are the DC's orders to start gandalf's resources on other members.

- Our databases ran in duplicate instances for 7 minutes, but that was enough to cause some pretty wild data corruption, and we had to do a full restore of those databases from backup.

Can you identify any problem in this operation? Did we do something wrong? Is this a known bug? I'm thinking of something related to the first stonith kill (gandalf was DC at that time) and some inconsistent state after that. There is also an election (?) right at the moment the CIB modification was passed to the other members. Could this be related?

Many thanks in advance for any insight you can give us.

Kind regards!

--
Lic. Gabriel Gomiz - Jefe de Sistemas / Administrador
ggo...@cooperativaobrera.coop
Gerencia de Sistemas - Cooperativa Obrera Ltda.
Tel: (0291) 403-9700