I've solved the problem. When I set cluster-recheck-interval to a value lower than the failure-timeout, it works.
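For reference, a minimal sketch of how the property can be set with pcs (pcs 0.9.x syntax assumed, as on this cluster; the 60s value is only an example, chosen so that it stays below the 120s failure-timeout):

# pcs property set cluster-recheck-interval=60s
# pcs property list

With that in place the policy engine re-evaluates the cluster at least every 60 seconds, so a failure older than the failure-timeout is expired before the next failure arrives.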
Is this expected behavior? It is not documented anywhere, neither here
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Clusters_from_Scratch/index.html
nor here
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Pacemaker_Explained/index.html

Regards
Frank

On 28.01.2014 14:44, Frank Brendel wrote:
> No one with an idea?
> Or can someone tell me whether it is even possible?
>
>
> Thanks
> Frank
>
>
> On 23.01.2014 10:50, Frank Brendel wrote:
>> Hi list,
>>
>> I'm having trouble configuring a resource that is allowed to fail
>> once within two minutes.
>> The documentation states that I have to configure migration-threshold
>> and failure-timeout to achieve this.
>> Here is the configuration for the resource.
>>
>> # pcs config
>> Cluster Name: mycluster
>> Corosync Nodes:
>>
>> Pacemaker Nodes:
>>  Node1 Node2 Node3
>>
>> Resources:
>>  Clone: resClamd-clone
>>   Meta Attrs: clone-max=3 clone-node-max=1 interleave=true
>>   Resource: resClamd (class=lsb type=clamd)
>>    Meta Attrs: failure-timeout=120s migration-threshold=2
>>    Operations: monitor on-fail=restart interval=60s (resClamd-monitor-on-fail-restart)
>>
>> Stonith Devices:
>> Fencing Levels:
>>
>> Location Constraints:
>> Ordering Constraints:
>> Colocation Constraints:
>>
>> Cluster Properties:
>>  cluster-infrastructure: cman
>>  dc-version: 1.1.10-14.el6_5.1-368c726
>>  last-lrm-refresh: 1390468150
>>  stonith-enabled: false
>>
>> # pcs resource defaults
>> resource-stickiness: INFINITY
>>
>> # pcs status
>> Cluster name: mycluster
>> Last updated: Thu Jan 23 10:12:49 2014
>> Last change: Thu Jan 23 10:11:40 2014 via cibadmin on Node2
>> Stack: cman
>> Current DC: Node2 - partition with quorum
>> Version: 1.1.10-14.el6_5.1-368c726
>> 3 Nodes configured
>> 3 Resources configured
>>
>>
>> Online: [ Node1 Node2 Node3 ]
>>
>> Full list of resources:
>>
>>  Clone Set: resClamd-clone [resClamd]
>>      Started: [ Node1 Node2 Node3 ]
>>
>>
>> Stopping the clamd daemon sets the failcount to 1, and the daemon is
>> started again. OK.
>>
>> # service clamd stop
>> Stopping Clam AntiVirus Daemon:                            [  OK  ]
>>
>> /var/log/messages
>> Jan 23 10:15:20 Node1 crmd[6075]: notice: process_lrm_event: Node1-resClamd_monitor_60000:305 [ clamd is stopped\n ]
>> Jan 23 10:15:20 Node1 attrd[6073]: notice: attrd_cs_dispatch: Update relayed from Node2
>> Jan 23 10:15:20 Node1 attrd[6073]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-resClamd (1)
>> Jan 23 10:15:20 Node1 attrd[6073]: notice: attrd_perform_update: Sent update 177: fail-count-resClamd=1
>> Jan 23 10:15:20 Node1 attrd[6073]: notice: attrd_cs_dispatch: Update relayed from Node2
>> Jan 23 10:15:20 Node1 attrd[6073]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-resClamd (1390468520)
>> Jan 23 10:15:20 Node1 attrd[6073]: notice: attrd_perform_update: Sent update 179: last-failure-resClamd=1390468520
>> Jan 23 10:15:20 Node1 crmd[6075]: notice: process_lrm_event: Node1-resClamd_monitor_60000:305 [ clamd is stopped\n ]
>> Jan 23 10:15:21 Node1 crmd[6075]: notice: process_lrm_event: LRM operation resClamd_stop_0 (call=310, rc=0, cib-update=110, confirmed=true) ok
>> Jan 23 10:15:30 elmailtst1 crmd[6075]: notice: process_lrm_event: LRM operation resClamd_start_0 (call=314, rc=0, cib-update=111, confirmed=true) ok
>> Jan 23 10:15:30 elmailtst1 crmd[6075]: notice: process_lrm_event: LRM operation resClamd_monitor_60000 (call=317, rc=0, cib-update=112, confirmed=false) ok
>>
>> # pcs status
>> Cluster name: mycluster
>> Last updated: Thu Jan 23 10:16:48 2014
>> Last change: Thu Jan 23 10:11:40 2014 via cibadmin on Node1
>> Stack: cman
>> Current DC: Node2 - partition with quorum
>> Version: 1.1.10-14.el6_5.1-368c726
>> 3 Nodes configured
>> 3 Resources configured
>>
>>
>> Online: [ Node1 Node2 Node3 ]
>>
>> Full list of resources:
>>
>>  Clone Set: resClamd-clone [resClamd]
>>      Started: [ Node1 Node2 Node3 ]
>>
>> Failed actions:
>>     resClamd_monitor_60000 on Node1 'not running' (7): call=305, status=complete, last-rc-change='Thu Jan 23 10:15:20 2014', queued=0ms, exec=0ms
>>
>> # pcs resource failcount show resClamd
>> Failcounts for resClamd
>>  Node1: 1
>>
>>
>> After 7 minutes I let it fail again, and as I understood it, the resource should simply be started again. But it isn't.
>>
>> # service clamd stop
>> Stopping Clam AntiVirus Daemon:                            [  OK  ]
>>
>> Jan 23 10:22:30 Node1 crmd[6075]: notice: process_lrm_event: LRM operation resClamd_monitor_60000 (call=317, rc=7, cib-update=113, confirmed=false) not running
>> Jan 23 10:22:30 Node1 crmd[6075]: notice: process_lrm_event: Node1-resClamd_monitor_60000:317 [ clamd is stopped\n ]
>> Jan 23 10:22:30 Node1 attrd[6073]: notice: attrd_cs_dispatch: Update relayed from Node2
>> Jan 23 10:22:30 Node1 attrd[6073]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-resClamd (2)
>> Jan 23 10:22:30 Node1 attrd[6073]: notice: attrd_perform_update: Sent update 181: fail-count-resClamd=2
>> Jan 23 10:22:30 Node1 attrd[6073]: notice: attrd_cs_dispatch: Update relayed from Node2
>> Jan 23 10:22:30 Node1 attrd[6073]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-resClamd (1390468950)
>> Jan 23 10:22:30 Node1 attrd[6073]: notice: attrd_perform_update: Sent update 183: last-failure-resClamd=1390468950
>> Jan 23 10:22:30 Node1 crmd[6075]: notice: process_lrm_event: Node1-resClamd_monitor_60000:317 [ clamd is stopped\n ]
>> Jan 23 10:22:30 Node1 crmd[6075]: notice: process_lrm_event: LRM operation resClamd_stop_0 (call=322, rc=0, cib-update=114, confirmed=true) ok
>>
>> # pcs status
>> Cluster name: mycluster
>> Last updated: Thu Jan 23 10:22:41 2014
>> Last change: Thu Jan 23 10:11:40 2014 via cibadmin on Node1
>> Stack: cman
>> Current DC: Node2 - partition with quorum
>> Version: 1.1.10-14.el6_5.1-368c726
>> 3 Nodes configured
>> 3 Resources configured
>>
>>
>> Online: [ Node1 Node2 Node3 ]
>>
>> Full list of resources:
>>
>>  Clone Set: resClamd-clone [resClamd]
>>      Started: [ Node2 Node3 ]
>>      Stopped: [ Node1 ]
>>
>> Failed actions:
>>     resClamd_monitor_60000 on Node1 'not running' (7): call=317, status=complete, last-rc-change='Thu Jan 23 10:22:30 2014', queued=0ms, exec=0ms
>>
>>
>> What's wrong with my configuration?
>>
>>
>> Thanks in advance
>> Frank
>>

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
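A rough sketch of the workaround until the recheck interval is lowered: since an old failure still counts toward migration-threshold until the cluster re-evaluates (on the next event or recheck), the failcount can be inspected and, if needed, cleared by hand (pcs 0.9.x syntax assumed, resource name as in this thread):

# pcs resource failcount show resClamd
# pcs resource cleanup resClamd

The cleanup resets the failcount and removes the failed action from the status, after which the clone is started on Node1 again.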