Re: [Pacemaker] clear failcount when monitor is successful?

Johan Huysmans Wed, 24 Apr 2013 04:44:08 -0700


On 24-04-13 13:24, Lars Marowsky-Bree wrote:

On 2013-04-24T10:37:24, Johan Huysmans <johan.huysm...@inuits.be> wrote:

--> start situation
* scope=status  name=fail-count-d_tomcat value=0
* depending resource group running on node
* crm_mon shows everything ok

--> a failure occurs
* scope=status  name=fail-count-d_tomcat value=1
* depending resource group stopping on node
* crm_mon shows failure

--> After 30s (= failure-timeout)
* scope=status  name=fail-count-d_tomcat value=1
* depending resource group not running on node
* crm_mon shows NO failure !!!!!

This, by itself, is not necessarily surprising. The property
"cluster-recheck-interval" defines how often the PE gets re-run, and
defaults to 15 minutes.

This is not dynamically adjusted based on failure-timeouts, and if this
feature becomes more widely used, there probably should be a "better"
way to handle/trigger these while still avoiding swamping the cluster
with empty transitions etc.

In short: right now, if you want a failure-timeout of 30s to be
meaningful, you need to set cluster-recheck-interval to something
shorter.
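
A rough sketch of the arithmetic behind that advice, not Pacemaker code. It assumes, as a simplification, that nothing else triggers a PE run and that rechecks fire at exact multiples of the recheck interval counted from the moment of the failure. The function name and parameters are made up for illustration.

```python
import math


def seconds_until_expiry_is_noticed(failure_timeout, recheck_interval):
    """Seconds after a failure until the first recheck that sees it as expired.

    Simplifying assumption (hypothetical model): the PE runs only on recheck
    ticks, and those ticks fall at multiples of recheck_interval after the
    failure. The failure expires after failure_timeout seconds, but the
    expiry is only acted on at the next tick at or after that point.
    """
    ticks = math.ceil(failure_timeout / recheck_interval)
    return max(ticks, 1) * recheck_interval


# Default recheck interval of 15 minutes: a 30s failure-timeout only
# takes effect after 900s.
print(seconds_until_expiry_is_noticed(30, 900))  # 900

# Recheck interval of 20s: the expiry is picked up on the second tick.
print(seconds_until_expiry_is_noticed(30, 20))   # 40
```

In a real cluster any other event that triggers a transition also causes the expiry to be evaluated, so this is an upper bound under the stated assumption.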

--> After something changes in the cluster or the recheck interval
* scope=status  name=fail-count-d_tomcat value=0
* depending resource group can run on node
* crm_mon shows no failure
* BUT my resource is still monitored and failing!

I'm not sure I perfectly get what you're saying here with the last
sentence. Did the cluster try to restart it, and it failed again, yet
the failure was ignored this time around?

The cluster didn't stop or restart my cloned resource, but it is still
monitoring it, which is expected as I configured on-fail to block.

I see that the monitor action of my OCF agent is executed every 15s
(= monitor interval), and that it is still failing (returning
$OCF_ERR_GENERIC).

I find it disturbing that a resource with a failing monitor has a
failcount of 0, shows as ok in crm_mon, and allows the dependent
resources to run.

Yes, if I got that right, that would be a problem - please create a
hb_/crm_report and open a bug.


Ok, will create a crm_report containing my tests.




Regards,
     Lars



_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
