Greetings,
We are using Pacemaker and CMAN in a two-node cluster with no-quorum-policy:
ignore and stonith-enabled: false on a CentOS 6 system (Pacemaker-related RPM
versions are listed below). We are seeing some bizarre (to us) behavior when a
node is fully lost (e.g. reboot -nf). Here's the scenario we have:
1) Fail a resource named "some-resource" started with the
ocf:heartbeat:anything script (or others) on node01 (in our case, it's a
master/slave resource we're pulling observations from, but it can happen on
normal ones).
2) Wait for the resource to recover.
3) Fail node02 (reboot -nf, or power loss).
4) When node02 recovers, we see in /var/log/messages:
- Quorum is recovered
- Sending flush op to all hosts for master-some-resource,
last-failure-some-resource, probe_complete(true), fail-count-some-resource(1)
- pengine Processing failed op monitor for some-resource on node01: unknown
error (1)
* After adding a simple logging line ("`date` called with $@ >> /tmp/log.rsc")
to the resource agent, we do not see the agent being called at this time, on
either node.
* Sometimes we also see other operations logged, including stop/start, that
are never sent to the RA.
* The resource is actually happily running on node01 throughout this
whole process, so there's no reason we should be seeing this failure here.
* This issue is only seen on resources that had not yet been cleaned up.
Resources that were 'clean' when both nodes were last online do not have this
issue.
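
For reference, here is a minimal sketch of the kind of call logging described
above, assuming the agent is a POSIX shell script. The function name
log_ra_call and the RA_CALL_LOG variable are hypothetical and only there for
illustration; the message format and the /tmp/log.rsc path are the ones
mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: append one timestamped line per agent invocation.
# RA_CALL_LOG is an illustrative override; /tmp/log.rsc is the default path.
RA_CALL_LOG="${RA_CALL_LOG:-/tmp/log.rsc}"

log_ra_call() {
    # "$*" joins the arguments (e.g. start, stop, monitor) into one string.
    echo "$(date) called with $*" >> "$RA_CALL_LOG"
}

# Near the top of the agent, before the requested action is dispatched:
log_ra_call "$@"
```

With something like this in place, every start/stop/monitor that actually
reaches the agent should leave a line in the log, so a reported operation with
no matching line was presumably never executed by the agent.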
We noticed this originally because we are using the ClusterMon RA to report on
different types of errors, and this is giving us false positives. Any thoughts
on configuration issues we could be having, or does this sound like a bug in
Pacemaker somewhere?
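
To compare the pengine messages from step 4 against what the agent really
executed, a small filter along these lines could be used. This is only a
sketch: failed_ops_for is a hypothetical helper name, and it assumes the log
lines contain the "Processing failed op ... for <resource> on <node>" wording
quoted above.

```shell
#!/bin/sh
# Hypothetical helper: print pengine "failed op" lines for one resource.
#   $1 = resource name
#   $2 = log file to search (assumed default: /var/log/messages)
failed_ops_for() {
    grep -F "Processing failed op" "${2:-/var/log/messages}" \
        | grep -F " for $1 on "
}
```

For example, "failed_ops_for some-resource" would list the reported failures,
whose timestamps can then be checked against the agent's own call log.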
Thanks!
----
Versions:
ccs-0.16.2-69.el6_5.1.x86_64
clusterlib-3.0.12.1-59.el6_5.2.x86_64
cman-3.0.12.1-59.el6_5.2.x86_64
corosync-1.4.1-17.el6_5.1.x86_64
corosynclib-1.4.1-17.el6_5.1.x86_64
fence-virt-0.2.3-15.el6.x86_64
libqb-0.16.0-2.el6.x86_64
modcluster-0.16.2-28.el6.x86_64
openais-1.1.1-7.el6.x86_64
openaislib-1.1.1-7.el6.x86_64
pacemaker-1.1.10-14.el6_5.3.x86_64
pacemaker-cli-1.1.10-14.el6_5.3.x86_64
pacemaker-cluster-libs-1.1.10-14.el6_5.3.x86_64
pacemaker-libs-1.1.10-14.el6_5.3.x86_64
pcs-0.9.90-2.el6.centos.3.noarch
resource-agents-3.9.2-40.el6_5.7.x86_64
ricci-0.16.2-69.el6_5.1.x86_64