On 20 Jun 2014, at 11:29 pm, Gianluca Cecchi <gianluca.cec...@gmail.com> wrote:
> Hello,
> when the monitor action for a resource times out I think its failcount is
> incremented by 1, correct?
> If so, suppose the next monitor action succeeds, does the failcount value
> automatically reset to zero, or does it stay at 1?
> In the latter case, is there any way to configure the cluster to
> automatically reset it when the following scheduled monitor completes OK,
> or is it a job for the administrator to watch the failcount (e.g. in
> crm_mon output) and then clean up the resource after checking that all is
> OK, thereby resetting the failcount value?
>
> I ask because on a SLES 11 SP2 cluster, from which I only got the logs, I
> see these kinds of messages:
>
> Jun 15 00:01:18 node2 pengine: [4330]: notice: common_apply_stickiness:
> my_resource can fail 1 more times on node2 before being forced off
> ...
> Jun 15 03:38:42 node2 lrmd: [4328]: WARN: my_resource:monitor process (PID
> 27120) timed out (try 1). Killing with signal SIGTERM (15).
> Jun 15 03:38:42 node2 lrmd: [4328]: WARN: operation monitor[29] on
> my_resource for client 4331: pid 27120 timed out
> Jun 15 03:38:42 node2 crmd: [4331]: ERROR: process_lrm_event: LRM operation
> my_resource_monitor_30000 (29) Timed Out (timeout=60000ms)
> Jun 15 03:38:42 node2 crmd: [4331]: info: process_graph_event: Detected
> action my_resource_monitor_30000 from a different transition: 40696 vs. 51755
> Jun 15 03:38:42 node2 crmd: [4331]: WARN: update_failcount: Updating
> failcount for my_resource on node2 after failed monitor: rc=-2
> (update=value++, time=1402796322)
> ...
> Jun 15 03:38:42 node2 attrd: [4329]: notice: attrd_trigger_update: Sending
> flush op to all hosts for: fail-count-my_resource (3)
> ..
> Jun 15 03:38:42 node2 attrd: [4329]: notice: attrd_perform_update: Sent
> update 52: fail-count-my_resource=3
> ..
> Jun 15 03:38:42 node2 pengine: [4330]: WARN: common_apply_stickiness: Forcing
> my_resource away from node2 after 3 failures (max=3)
>
> So it seems that at midnight the resource already had a failcount of 2
> (perhaps caused by problems that happened weeks ago?) and then at 03:38 it
> got a timeout on its monitor and was relocated...
>
> pacemaker is at 1.1.6-1.27.26

I don't think the automatic reset was part of 1.1.6. The documentation you're
referring to is probably SLES12 specific.

> and I see this list message that seems related:
> http://oss.clusterlabs.org/pipermail/pacemaker/2012-August/015076.html
>
> Is it perhaps only a matter of setting the meta parameter
> failure-timeout
> as explained in the High Availability Guide:
> https://www.suse.com/documentation/sle_ha/singlehtml/book_sleha/book_sleha.html#sec.ha.config.hawk.rsc
>
> in particular
> 5.3.6. Specifying Resource Failover Nodes
> ...
> 4. If you want to automatically expire the failcount for a resource, add the
> failure-timeout meta attribute to the resource as described in Procedure 5.4:
> Adding Primitive Resources, Step 7 and enter a Value for the failure-timeout.
> ..
> ?
>
> Thanks in advance,
> Gianluca
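In case it helps, this is roughly how I would inspect and manage the failcount
from the crm shell on a SLES box. A minimal sketch only: the resource and node
names are taken from your logs, the 10-minute timeout is just an example value,
and whether failure-timeout is actually honoured on 1.1.6 is exactly the open
question above.

  # show current fail counts for all resources (one-shot crm_mon)
  crm_mon -1 --failcounts

  # show, and if needed manually clear, the failcount for one resource on one node
  crm resource failcount my_resource show node2
  crm resource cleanup my_resource node2

  # example: let the failcount expire after 10 minutes without new failures
  crm resource meta my_resource set failure-timeout 600
  crm configure show my_resource    # confirm the meta attribute is set

Note that, as far as I know, expired failcounts are only recalculated when the
policy engine next runs (e.g. at the cluster-recheck-interval), not immediately
after the next successful monitor, so there can be some delay before the node
is considered eligible again.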
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org