Re: [Pacemaker] clear failcount when monitor is successful?

Lars Marowsky-Bree Wed, 24 Apr 2013 04:29:36 -0700

On 2013-04-24T10:37:24, Johan Huysmans <johan.huysm...@inuits.be> wrote:


> --> start situation
> * scope=status  name=fail-count-d_tomcat value=0
> * depending resource group running on node
> * crm_mon shows everything ok
> 
> --> a failure occurs
> * scope=status  name=fail-count-d_tomcat value=1
> * depending resource group stopping on node
> * crm_mon shows failure
> 
> --> After 30s (= failure-timeout)
> * scope=status  name=fail-count-d_tomcat value=1
> * depending resource group not running on node
> * crm_mon shows NO failure !!!!!

This, by itself, is not necessarily surprising. The property
"cluster-recheck-interval" defines how often the PE gets re-run, and
defaults to 15 minutes.

This is not dynamically adjusted based on failure-timeouts, and if this
feature becomes more widely used, there probably should be a "better"
way to handle/trigger these while still avoiding swamping the cluster
with empty transitions etc.

In short: right now, if you want a failure-timeout of 30s to be
meaningful, you need to set cluster-recheck-interval to something
shorter.
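
To make the timing concrete, here is a rough sketch of the reasoning above.
It is a simplified model, not Pacemaker code: it assumes the PE only runs on
the recheck timer, that the timer ticks at multiples of the interval, and
that nothing else changes in the cluster in between. The function name and
parameters are made up for illustration.

```python
def effective_clear_time(failure_at, failure_timeout, recheck_interval):
    """Return when (in seconds) an expired failure is first acted upon.

    Simplified model (assumption): the PE runs only at multiples of
    recheck_interval, and no other cluster event triggers a run.
    """
    expires_at = failure_at + failure_timeout
    # Round up to the next timer tick at or after the expiry time.
    ticks = -(-expires_at // recheck_interval)
    return ticks * recheck_interval


# failure-timeout=30s with the default 15 minute recheck interval
print(effective_clear_time(0, 30, 15 * 60))  # 900
# same failure-timeout with a hypothetical 20s recheck interval
print(effective_clear_time(0, 30, 20))       # 40
```

Under these assumptions, a 30s failure-timeout with the default interval is
only acted on after up to 15 minutes, unless something else in the cluster
triggers a PE run first.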

> --> After something changes in the cluster or the recheck interval
> * scope=status  name=fail-count-d_tomcat value=0
> * depending resource group can run on node
> * crm_mon shows no failure
> * BUT my resource is still monitored and failing!

I'm not sure I perfectly get what you're saying here with the last
sentence. Did the cluster try to restart it, and it failed again, yet
the failure was ignored this time around?

> I find it disturbing that a resource with a failing monitor has a 0
> failcount, shows ok in crm_mon and allows to run the depending
> resources.

Yes, if I got that right, that would be a problem - please create an
hb_report or crm_report and open a bug.



Regards,
    Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
