Hi all! We have a 4-node cluster and recently experienced a strange issue with Pacemaker that resulted in three database instance resources being duplicated (running simultaneously on two nodes) and subsequent data corruption.
I've been investigating the logs and could not reach a conclusion as to why that happened, so I'm writing to the list with details of the event to see if someone can help me pinpoint whether there was a problem with our operation or whether we hit a bug.

OS: CentOS 6.4
Pacemaker version: 1.1.8
Stack: cman
Stonith: enabled via Dell iDRAC on all 4 nodes
Nodes: gandalf, isildur, mordor, lorien

Timeline of events and logs:

- A monitor operation on vg_ifx_oltp times out and the resources on that node (gandalf) start being stopped:

Mar 8 08:41:09 gandalf crmd[31561]: error: process_lrm_event: LRM operation vg_ifx_oltp_monitor_240000 (594) Timed Out (timeout=120000ms)
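For reference, vg_ifx_oltp is an LVM volume group. Its definition, sketched from memory rather than pasted from our CIB (I'm assuming the stock ocf:heartbeat:LVM agent and omitting the volume group name; only the 240s monitor interval and 120s timeout are exact, matching the monitor_240000 / timeout=120000ms above):

    # sketch only: agent and params approximate, monitor interval/timeout exact
    primitive vg_ifx_oltp ocf:heartbeat:LVM \
            params volgrpname="..." \
            op monitor interval="240" timeout="120"

The other volume group, vg_ifx_dss, has the same monitor op (both show up in the CIB diff further down).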
- Stopping the resources on that node (gandalf) also times out, and the node is killed by stonith from another node (mordor):

Mar 8 08:42:54 gandalf lrmd[31558]: warning: child_timeout_callback: vg_ifx_oltp_stop_0 process (PID 17816) timed out
....
Mar 8 08:42:55 gandalf pengine[31560]: warning: unpack_rsc_op: Processing failed op stop for vg_ifx_oltp on gandalf.san01.cooperativaobrera.coop: unknown error (1)
Mar 8 08:42:55 gandalf pengine[31560]: warning: pe_fence_node: Node gandalf.san01.cooperativaobrera.coop will be fenced to recover from resource failure(s)
Mar 8 08:43:09 mordor corosync[25977]: [TOTEM ] A processor failed, forming new configuration.
Mar 8 08:43:09 mordor stonith-ng[26212]: notice: log_operation: Operation 'reboot' [4612] (call 0 from crmd.31561) for host 'gandalf.san01.cooperativaobrera.coop' with device 'st-gandalf' returned: 0 (OK)
Mar 8 08:43:21 mordor corosync[25977]: [QUORUM] Members[3]: 2 3 4
Mar 8 08:43:21 mordor corosync[25977]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar 8 08:43:21 mordor crmd[26216]: notice: crm_update_peer_state: cman_event_callback: Node gandalf.san01.cooperativaobrera.coop[1] - state is now lost
Mar 8 08:43:21 mordor crmd[26216]: warning: check_dead_member: Our DC node (gandalf.san01.cooperativaobrera.coop) left the cluster
Mar 8 08:43:21 mordor kernel: dlm: closing connection to node 1
Mar 8 08:43:21 mordor corosync[25977]: [CPG ] chosen downlist: sender r(0) ip(172.16.1.1) r(1) ip(172.16.2.1) ; members(old:4 left:1)
Mar 8 08:43:21 mordor corosync[25977]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 8 08:43:21 mordor fenced[26045]: fencing deferred to isildur.san01.cooperativaobrera.coop
Mar 8 08:43:22 mordor crmd[26216]: notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=check_dead_member ]
Mar 8 08:43:22 mordor crmd[26216]: notice: do_state_transition: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Mar 8 08:43:22 mordor stonith-ng[26212]: notice: remote_op_done: Operation reboot of gandalf.san01.cooperativaobrera.coop by mordor.san01.cooperativaobrera.coop for crmd.31...@gandalf.san01.cooperativaobrera.coop.10d27664: OK
Mar 8 08:43:22 mordor crmd[26216]: notice: tengine_stonith_notify: Peer gandalf.san01.cooperativaobrera.coop was terminated (st_notify_fence) by mordor.san01.cooperativaobrera.coop for gandalf.san01.cooperativaobrera.coop: OK (ref=10d27664-33ed-43e0-a5bd-7d0ef850eb05) by client crmd.31561
Mar 8 08:43:22 mordor crmd[26216]: notice: tengine_stonith_notify: Notified CMAN that 'gandalf.san01.cooperativaobrera.coop' is now fenced
Mar 8 08:43:22 mordor crmd[26216]: notice: tengine_stonith_notify: Target may have been our leader gandalf.san01.cooperativaobrera.coop (recorded: <unset>)
Mar 8 08:43:22 mordor cib[26211]: warning: cib_process_diff: Diff 0.513.82 -> 0.513.83 from lorien.san01.cooperativaobrera.coop not applied to 0.513.85: current "num_updates" is greater than required
Mar 8 08:43:22 mordor cib[26211]: warning: cib_process_diff: Diff 0.513.83 -> 0.513.84 from lorien.san01.cooperativaobrera.coop not applied to 0.513.85: current "num_updates" is greater than required
Mar 8 08:43:22 mordor cib[26211]: warning: cib_process_diff: Diff 0.513.84 -> 0.513.85 from lorien.san01.cooperativaobrera.coop not applied to 0.513.85: current "num_updates" is greater than required
Mar 8 08:43:22 mordor cib[26211]: notice: cib_process_diff: Diff 0.513.85 -> 0.513.86 from lorien.san01.cooperativaobrera.coop not applied to 0.513.85: Failed application of an update diff
Mar 8 08:43:27 mordor attrd[26214]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Mar 8 08:43:27 mordor attrd[26214]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-srv_mysql_dss (1384966716)
Mar 8 08:43:27 mordor crmd[26216]: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Mar 8 08:43:27 mordor attrd[26214]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Mar 8 08:43:27 mordor attrd[26214]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-vm_suc8300 (1384966659)
Mar 8 08:43:43 mordor stonith-ng[26212]: notice: remote_op_done: Operation reboot of gandalf.san01.cooperativaobrera.coop by lorien.san01.cooperativaobrera.coop for crmd.9...@lorien.san01.cooperativaobrera.coop.3b50e760: OK
Mar 8 08:43:43 mordor crmd[26216]: notice: tengine_stonith_notify: Peer gandalf.san01.cooperativaobrera.coop was terminated (st_notify_fence) by lorien.san01.cooperativaobrera.coop for lorien.san01.cooperativaobrera.coop: OK (ref=3b50e760-1142-4d3a-92b8-6936eb9df58c) by client crmd.9934
Mar 8 08:43:43 mordor crmd[26216]: notice: tengine_stonith_notify: Notified CMAN that 'gandalf.san01.cooperativaobrera.coop' is now fenced

- Up to this point, gandalf's resources were failed over CORRECTLY to the other cluster members. We had the cluster with 3 nodes online and 1 node offline, and everything was working OK.
- After gandalf's reboot completed we checked the node status, then started Pacemaker on it again manually, and the node reacquired its resources CORRECTLY. So once again we had the cluster with all 4 nodes online and all resources running on their preferred nodes. Log of the last resource transition back to gandalf:

Mar 8 08:51:40 lorien pengine[9933]: notice: LogActions: Move fs_mysql_hermes#011(Started mordor.san01.cooperativaobrera.coop -> gandalf.san01.cooperativaobrera.coop)
Mar 8 08:51:40 lorien pengine[9933]: notice: LogActions: Move ip_mysql_hermes#011(Started mordor.san01.cooperativaobrera.coop -> gandalf.san01.cooperativaobrera.coop)
Mar 8 08:51:40 lorien pengine[9933]: notice: process_pe_message: Calculated Transition 16: /var/lib/pacemaker/pengine/pe-input-14.bz2
Mar 8 08:51:43 lorien crmd[9934]: notice: run_graph: Transition 16 (Complete=9, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-14.bz2): Complete
Mar 8 08:51:43 lorien crmd[9934]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]

- So, we decided to delete the monitor operation from the resources that had caused the timeout (the LVM volume groups) to avoid hitting the same problem again. We did 'crm configure edit', deleted the two 'op monitor' lines, and just after committing that modification to the CIB we experienced the resource duplication problem.
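In crm shell terms the edit amounted to this change on both volume group resources (diff-style sketch only, with parameters omitted and trailing backslashes adjusted; the two removed op lines correspond exactly to the two <op/> entries in the CIB diff in the logs below):

      primitive vg_ifx_oltp ocf:heartbeat:LVM \
              params volgrpname="..." \
    -         op monitor interval="240" timeout="120"
      primitive vg_ifx_dss ocf:heartbeat:LVM \
              params volgrpname="..." \
    -         op monitor interval="240" timeout="120"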
These are the logs from the moment the change was committed:

Mar 8 09:13:52 lorien cib[9929]: notice: cib:diff: Diff: --- 0.526.3
Mar 8 09:13:52 lorien cib[9929]: notice: cib:diff: Diff: +++ 0.527.1 ebf608fdccda78e755a61cdeea18a659
Mar 8 09:13:52 lorien cib[9929]: notice: cib:diff: -- <op id="vg_ifx_oltp-monitor-240" interval="240" name="monitor" timeout="120" />
Mar 8 09:13:52 lorien cib[9929]: notice: cib:diff: -- <op id="vg_ifx_dss-monitor-240" interval="240" name="monitor" timeout="120" />
Mar 8 09:13:52 lorien cib[9929]: notice: cib:diff: ++ <cib epoch="527" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.7" update-origin="mordor.san01.cooperativaobrera.coop" update-client="crmd" cib-last-written="Sat Mar 8 08:58:40 2014" have-quorum="1" dc-uuid="lorien.san01.cooperativaobrera.coop" />
Mar 8 09:13:52 lorien crmd[9934]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Mar 8 09:13:52 lorien crmd[9934]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
Mar 8 09:13:52 lorien crmd[9934]: notice: do_dc_takeover: Marking gandalf.san01.cooperativaobrera.coop, target of a previous stonith action, as clean
Mar 8 09:13:52 lorien crmd[9934]: warning: match_down_event: No match for shutdown action on gandalf.san01.cooperativaobrera.coop
Mar 8 09:13:52 lorien crmd[9934]: notice: peer_update_callback: Stonith/shutdown of gandalf.san01.cooperativaobrera.coop not matched
Mar 8 09:13:52 lorien crmd[9934]: notice: crm_update_peer_state: send_stonith_update: Node gandalf.san01.cooperativaobrera.coop[1] - state is now lost
Mar 8 09:13:53 lorien crmd[9934]: warning: do_state_transition: 3 cluster nodes failed to respond to the join offer.
Mar 8 09:13:53 lorien crmd[9934]: warning: do_log: FSA: Input I_ELECTION_DC from do_dc_join_finalize() received in state S_FINALIZE_JOIN
Mar 8 09:13:53 lorien crmd[9934]: error: crmd_ais_dispatch: Recieving messages from a node we think is dead: gandalf.san01.cooperativaobrera.coop[1]
Mar 8 09:13:53 lorien crmd[9934]: warning: match_down_event: No match for shutdown action on gandalf.san01.cooperativaobrera.coop
Mar 8 09:13:53 lorien crmd[9934]: notice: peer_update_callback: Stonith/shutdown of gandalf.san01.cooperativaobrera.coop not matched
Mar 8 09:13:53 lorien crmd[9934]: notice: do_election_count_vote: Election 6 (current: 6, owner: lorien.san01.cooperativaobrera.coop): Processed no-vote from gandalf.san01.cooperativaobrera.coop (Peer is not part of our cluster)
Mar 8 09:14:00 lorien attrd[9932]: notice: attrd_local_callback: Sending full refresh (origin=crmd)

- After that, node gandalf was in "pending" state but it was still running all of its preferred resources. The other 3 nodes were online and also running their preferred resources, but then the cluster instructed the other nodes to start gandalf's resources without ordering a stonith of gandalf first, so => duplication:

Mar 8 09:14:00 lorien attrd[9932]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Mar 8 09:14:00 lorien attrd[9932]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-srv_mysql_hermes (1394279494)
Mar 8 09:14:00 lorien attrd[9932]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-srv_mysql_oltp (1394279084)
Mar 8 09:14:00 lorien attrd[9932]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Mar 8 09:14:00 lorien attrd[9932]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-srv_mysql_it (1384967023)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start st-lorien#011(isildur.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start fs_mysql_hermes#011(lorien.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start ip_mysql_hermes#011(lorien.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start srv_mysql_hermes#011(lorien.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start srv_jabberd#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start srv_conference#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start srv_orangutan#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start fs_jabber#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start ip_jabber#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start fs_bkoltp_ifx#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start ip_oltp#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start fs_mysql_oltp#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start srv_mysql_oltp#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start vg_ifx_oltp#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start srv_ifx_oltp#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start srv_xinetd_oltp#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start srv_xinetd_hermes#011(lorien.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: process_pe_message: Calculated Transition 53: /var/lib/pacemaker/pengine/pe-input-50.bz2

- These last logs are the DC's orders to start gandalf's resources on other members.

- Our databases ran in duplicate instances for 7 minutes, but that was enough to cause some pretty wild data corruption, and we had to do a full restore of those databases from backup.

Can you identify any problem in this operation? Did we do something wrong? Is this a known bug? I'm thinking of something related to the first stonith kill (gandalf was DC at that time) and some inconsistent state after that. There is also an election (?) right at the moment the CIB modification was passed to the other members. Could this be related?

Many thanks in advance for any insight you can give us.

Kind regards!

--
Lic. Gabriel Gomiz - Jefe de Sistemas / Administrador
ggo...@cooperativaobrera.coop
Gerencia de Sistemas - Cooperativa Obrera Ltda.
Tel: (0291) 403-9700