On 20 Jun 2014, at 11:29 pm, Gianluca Cecchi <gianluca.cec...@gmail.com> wrote:

> Hello,
> when the monitor action for a resource times out, I think its failcount is 
> incremented by 1, correct?
> If so, and the next monitor action succeeds, does the failcount value 
> automatically reset to zero, or does it stay at 1?
> In the latter case, is there any way to configure the cluster to automatically 
> reset it when the next scheduled monitor completes OK? Or is it a job for the 
> administrator to watch the failcount (e.g. with crm_mon output) and then 
> clean up the resource after checking all is OK, thereby resetting the 
> failcount value?
> 
> I ask because on a SLES 11 SP2 cluster, from which I only have the logs, I 
> see these kinds of messages:
> 
> Jun 15 00:01:18 node2 pengine: [4330]: notice: common_apply_stickiness: 
> my_resource can fail 1 more times on node2 before being forced off
> ...
> Jun 15 03:38:42 node2 lrmd: [4328]: WARN: my_resource:monitor process (PID 
> 27120) timed out (try 1).  Killing with signal SIGTERM (15).
> Jun 15 03:38:42 node2 lrmd: [4328]: WARN: operation monitor[29] on 
> my_resource for client 4331: pid 27120 timed out
> Jun 15 03:38:42 node2 crmd: [4331]: ERROR: process_lrm_event: LRM operation 
> my_resource_monitor_30000 (29) Timed Out (timeout=60000ms)
> Jun 15 03:38:42 node2 crmd: [4331]: info: process_graph_event: Detected 
> action my_resource_monitor_30000 from a different transition: 40696 vs. 51755
> Jun 15 03:38:42 node2 crmd: [4331]: WARN: update_failcount: Updating 
> failcount for my_resource on node2 after failed monitor: rc=-2 
> (update=value++, time=1402796322)
> ...
> Jun 15 03:38:42 node2 attrd: [4329]: notice: attrd_trigger_update: Sending 
> flush op to all hosts for: fail-count-my_resource (3)
> ..
> Jun 15 03:38:42 node2 attrd: [4329]: notice: attrd_perform_update: Sent 
> update 52: fail-count-my_resource=3
> ..
> Jun 15 03:38:42 node2 pengine: [4330]: WARN: common_apply_stickiness: Forcing 
> my_resource away from node2 after 3 failures (max=3)
> 
> 
> So it seems that at midnight the resource already had a failcount of 2 
> (perhaps caused by problems that happened weeks ago?), and then at 03:38 it 
> got a timeout on its monitor operation and was relocated...
> 
> pacemaker is at 1.1.6-1.27.26

I don't think the automatic reset was part of 1.1.6.
The documentation you're referring to is probably SLES12 specific.
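
Until then, the usual route is to inspect and clear the failcount by hand once
you've confirmed the resource really is healthy again. A rough sketch with the
tools shipped on SLES 11 (option spellings may differ slightly on your build,
so check the man pages first):

  # one-shot cluster status; -f adds resource fail counts
  crm_mon -1 -f

  # query the failcount for my_resource on node2 via the crm shell
  crm resource failcount my_resource show node2

  # after verifying the resource is OK, clear its failure history so the
  # "Forcing my_resource away from node2 after 3 failures" logic no longer applies
  crm resource cleanup my_resource node2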

> and I see this list message that seems related:
> http://oss.clusterlabs.org/pipermail/pacemaker/2012-August/015076.html
> 
> Is it perhaps only a matter of setting the meta parameter
> failure-timeout
> as explained in the High Availability Guide:
> https://www.suse.com/documentation/sle_ha/singlehtml/book_sleha/book_sleha.html#sec.ha.config.hawk.rsc
> 
> in particular
> 5.3.6. Specifying Resource Failover Nodes
> ...
> 4. If you want to automatically expire the failcount for a resource, add the 
> failure-timeout meta attribute to the resource as described in Procedure 5.4: 
> Adding Primitive Resources, Step 7 and enter a Value for the failure-timeout.
> ..
> ?
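
For reference, failure-timeout is indeed the knob that documentation is
describing, together with migration-threshold (which is the "max=3" you hit in
the log). I can't promise how reliably the expiry behaves on your 1.1.6 build,
but if you want to try it, a minimal crm shell sketch (illustrative values,
adjust the resource name and timeout to your setup) would be:

  # check the current definition first
  crm configure show my_resource

  # let the failcount expire 10 minutes after the last failure, and keep
  # the existing "move away after 3 failures" threshold explicit
  crm resource meta my_resource set failure-timeout 10min
  crm resource meta my_resource set migration-threshold 3

Note that the expiry is only evaluated the next time the policy engine runs
(for example at the cluster-recheck-interval), not at the exact moment the
timeout elapses.
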
> 
> Thanks in advance,
> Gianluca

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
