[Pacemaker] On recovery of failed node, pengine fails to correctly monitor 'dirty' resources

Jarred Griggles Mon, 11 Aug 2014 11:13:37 -0700

Greetings, 

We are using pacemaker and cman in a two-node cluster with no-quorum-policy: 
ignore and stonith-enabled: false on a CentOS 6 system (pacemaker-related RPM 
versions are listed below).  We are seeing some bizarre (to us) behavior when a 
node is fully lost (e.g. reboot -nf).  Here's the scenario we have:


1) Fail a resource named "some-resource" started with the 
ocf:heartbeat:anything script (or others) on node01 (in our case, it's a 
master/slave resource we're pulling observations from, but it can happen on 
normal ones).
2) Wait for the resource to recover.
3) Fail node02 (reboot -nf, or power loss)
4) When node02 recovers, we see in /var/log/messages:
  - Quorum is recovered
  - Sending flush op to all hosts for master-some-resource, 
last-failure-some-resource, probe_complete(true), fail-count-some-resource(1) 
  - pengine Processing failed op monitor for some-resource on node01: unknown 
error (1)
    * After adding a simple "`date` called with $@ >> /tmp/log.rsc" line to the 
resource agent, we do not see the agent being called at this time, on either 
node.
    * Sometimes we see other operations reported, including stop/start, that 
are also never sent to the RA.
    * The resource is actually happily running on node01 throughout this 
whole process, so there's no reason we should be seeing this failure here. 
    * This issue is only seen on resources that had not yet been cleaned up.  
Resources that were 'clean' when both nodes were last online do not have this 
issue. 
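
For anyone trying to reproduce this, the logging line mentioned above can be 
sketched roughly as below. This is only an illustration, not the exact text we 
added: the helper name log_call and the RSC_LOG override are hypothetical, and 
only the /tmp/log.rsc path comes from our setup.

```shell
# Hypothetical helper: append one line per agent invocation to a trace file.
# RSC_LOG is an assumed override; /tmp/log.rsc is the path used in this report.
log_call() {
    echo "$(date) called with $*" >> "${RSC_LOG:-/tmp/log.rsc}"
}

# Called near the top of the agent script, before the action is dispatched,
# so that every start/stop/monitor request leaves a timestamped line.
log_call "$@"
```

If the cluster really invoked the agent when node02 rejoined, a new line should 
appear in the trace file at that moment; in our case none does.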

We originally noticed this because we are using the ClusterMon RA to report on 
different types of errors, and this is giving us false positives. Any thoughts 
on configuration issues we could be having, or on whether this sounds like a 
bug in pacemaker somewhere? 

Thanks!

----
Versions:
ccs-0.16.2-69.el6_5.1.x86_64
clusterlib-3.0.12.1-59.el6_5.2.x86_64
cman-3.0.12.1-59.el6_5.2.x86_64
corosync-1.4.1-17.el6_5.1.x86_64
corosynclib-1.4.1-17.el6_5.1.x86_64
fence-virt-0.2.3-15.el6.x86_64
libqb-0.16.0-2.el6.x86_64
modcluster-0.16.2-28.el6.x86_64
openais-1.1.1-7.el6.x86_64
openaislib-1.1.1-7.el6.x86_64
pacemaker-1.1.10-14.el6_5.3.x86_64
pacemaker-cli-1.1.10-14.el6_5.3.x86_64
pacemaker-cluster-libs-1.1.10-14.el6_5.3.x86_64
pacemaker-libs-1.1.10-14.el6_5.3.x86_64
pcs-0.9.90-2.el6.centos.3.noarch
resource-agents-3.9.2-40.el6_5.7.x86_64
ricci-0.16.2-69.el6_5.1.x86_64

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
