On 3 Feb 2014, at 9:40 pm, Frank Brendel <frank.bren...@eurolog.com> wrote:

> I've solved the problem.
> 
> When I set cluster-recheck-interval to a value less than failure-timeout 
> it works.
> 
> Is this expected behavior?

Yes.

> 
> This is not documented anywhere.

It's somewhat implied by the description of cluster-recheck-interval:

       cluster-recheck-interval = time [15min]
           Polling interval for time-based changes to options, resource
           parameters and constraints.

           The Cluster is primarily event-driven, however the configuration
           can have elements that change based on time. To ensure these
           changes take effect, we can optionally poll the cluster's status
           for changes. Allowed values: Zero disables polling. Positive
           values are an interval in seconds (unless other SI units are
           specified, e.g. 5min).

The failure-timeout doesn't generate any events on its own, so the expired
failure is only reprocessed the next time the policy engine (PE) gets kicked
by the recheck timer.
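
In practice the recheck interval puts an upper bound on how promptly an
expired failure-timeout is noticed: with the default of 15 minutes, a 120s
failure-timeout can go unprocessed for up to ~15 minutes after it expires.
A minimal sketch of the workaround you describe, assuming a pcs-based setup
(the 60s value is only an example, and pcs syntax may vary slightly between
versions):

   # pcs property set cluster-recheck-interval=60s   # lower the recheck interval below the 120s failure-timeout
   # pcs property list                               # confirm the property is now set

With that in place the PE re-runs at least once a minute, so a fail-count
whose failure-timeout has expired is cleared within roughly
failure-timeout + cluster-recheck-interval of the last failure.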

> Neither here 
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Clusters_from_Scratch/index.html
> nor here 
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Pacemaker_Explained/index.html
> 
> 
> Regards
> Frank
> 
> 
> Am 28.01.2014 14:44, schrieb Frank Brendel:
>> No one with an idea?
>> Or can someone tell me if it is even possible?
>> 
>> 
>> Thanks
>> Frank
>> 
>> 
>> Am 23.01.2014 10:50, schrieb Frank Brendel:
>>> Hi list,
>>> 
>>> I'm having trouble configuring a resource so that it is allowed to fail
>>> once within a two-minute window.
>>> The documentation states that I have to configure migration-threshold
>>> and failure-timeout to achieve this.
>>> Here is the configuration for the resource.
>>> 
>>> # pcs config
>>> Cluster Name: mycluster
>>> Corosync Nodes:
>>> 
>>> Pacemaker Nodes:
>>>  Node1 Node2 Node3
>>> 
>>> Resources:
>>>  Clone: resClamd-clone
>>>   Meta Attrs: clone-max=3 clone-node-max=1 interleave=true
>>>   Resource: resClamd (class=lsb type=clamd)
>>>    Meta Attrs: failure-timeout=120s migration-threshold=2
>>>    Operations: monitor on-fail=restart interval=60s
>>> (resClamd-monitor-on-fail-restart)
>>> 
>>> Stonith Devices:
>>> Fencing Levels:
>>> 
>>> Location Constraints:
>>> Ordering Constraints:
>>> Colocation Constraints:
>>> 
>>> Cluster Properties:
>>>  cluster-infrastructure: cman
>>>  dc-version: 1.1.10-14.el6_5.1-368c726
>>>  last-lrm-refresh: 1390468150
>>>  stonith-enabled: false
>>> 
>>> # pcs resource defaults
>>> resource-stickiness: INFINITY
>>> 
>>> # pcs status
>>> Cluster name: mycluster
>>> Last updated: Thu Jan 23 10:12:49 2014
>>> Last change: Thu Jan 23 10:11:40 2014 via cibadmin on Node2
>>> Stack: cman
>>> Current DC: Node2 - partition with quorum
>>> Version: 1.1.10-14.el6_5.1-368c726
>>> 3 Nodes configured
>>> 3 Resources configured
>>> 
>>> 
>>> Online: [ Node1 Node2 Node3 ]
>>> 
>>> Full list of resources:
>>> 
>>>  Clone Set: resClamd-clone [resClamd]
>>>      Started: [ Node1 Node2 Node3 ]
>>> 
>>> 
>>> Stopping the clamd daemon sets the failcount to 1 and the daemon is
>>> started again. Ok.
>>> 
>>> 
>>> # service clamd stop
>>> Stopping Clam AntiVirus Daemon:                            [  OK ]
>>> 
>>> /var/log/messages
>>> Jan 23 10:15:20 Node1 crmd[6075]:   notice: process_lrm_event:
>>> Node1-resClamd_monitor_60000:305 [ clamd is stopped\n ]
>>> Jan 23 10:15:20 Node1 attrd[6073]:   notice: attrd_cs_dispatch: Update
>>> relayed from Node2
>>> Jan 23 10:15:20 Node1 attrd[6073]:   notice: attrd_trigger_update:
>>> Sending flush op to all hosts for: fail-count-resClamd (1)
>>> Jan 23 10:15:20 Node1 attrd[6073]:   notice: attrd_perform_update:
>>> Sent update 177: fail-count-resClamd=1
>>> Jan 23 10:15:20 Node1 attrd[6073]:   notice: attrd_cs_dispatch: Update
>>> relayed from Node2
>>> Jan 23 10:15:20 Node1 attrd[6073]:   notice: attrd_trigger_update:
>>> Sending flush op to all hosts for: last-failure-resClamd (1390468520)
>>> Jan 23 10:15:20 Node1 attrd[6073]:   notice: attrd_perform_update:
>>> Sent update 179: last-failure-resClamd=1390468520
>>> Jan 23 10:15:20 Node1 crmd[6075]:   notice: process_lrm_event:
>>> Node1-resClamd_monitor_60000:305 [ clamd is stopped\n ]
>>> Jan 23 10:15:21 Node1 crmd[6075]:   notice: process_lrm_event: LRM
>>> operation resClamd_stop_0 (call=310, rc=0, cib-update=110,
>>> confirmed=true) ok
>>> Jan 23 10:15:30 Node1 crmd[6075]:   notice: process_lrm_event:
>>> LRM operation resClamd_start_0 (call=314, rc=0, cib-update=111,
>>> confirmed=true) ok
>>> Jan 23 10:15:30 Node1 crmd[6075]:   notice: process_lrm_event:
>>> LRM operation resClamd_monitor_60000 (call=317, rc=0, cib-update=112,
>>> confirmed=false) ok
>>> 
>>> # pcs status
>>> Cluster name: mycluster
>>> Last updated: Thu Jan 23 10:16:48 2014
>>> Last change: Thu Jan 23 10:11:40 2014 via cibadmin on Node1
>>> Stack: cman
>>> Current DC: Node2 - partition with quorum
>>> Version: 1.1.10-14.el6_5.1-368c726
>>> 3 Nodes configured
>>> 3 Resources configured
>>> 
>>> 
>>> Online: [ Node1 Node2 Node3 ]
>>> 
>>> Full list of resources:
>>> 
>>>  Clone Set: resClamd-clone [resClamd]
>>>      Started: [ Node1 Node2 Node3 ]
>>> 
>>> Failed actions:
>>>     resClamd_monitor_60000 on Node1 'not running' (7): call=305,
>>> status=complete, last-rc-change='Thu Jan 23 10:15:20 2014',
>>> queued=0ms, exec=0ms
>>> 
>>> # pcs resource failcount show resClamd
>>> Failcounts for resClamd
>>>  Node1: 1
>>> 
>>> 
>>> After 7 minutes I made it fail again, and as I understood it, it should
>>> have been restarted again as well. But it wasn't.
>>> 
>>> 
>>> # service clamd stop
>>> Stopping Clam AntiVirus Daemon:                            [  OK ]
>>> 
>>> Jan 23 10:22:30 Node1 crmd[6075]:   notice: process_lrm_event: LRM
>>> operation resClamd_monitor_60000 (call=317, rc=7, cib-update=113,
>>> confirmed=false) not running
>>> Jan 23 10:22:30 Node1 crmd[6075]:   notice: process_lrm_event:
>>> Node1-resClamd_monitor_60000:317 [ clamd is stopped\n ]
>>> Jan 23 10:22:30 Node1 attrd[6073]:   notice: attrd_cs_dispatch: Update
>>> relayed from Node2
>>> Jan 23 10:22:30 Node1 attrd[6073]:   notice: attrd_trigger_update:
>>> Sending flush op to all hosts for: fail-count-resClamd (2)
>>> Jan 23 10:22:30 Node1 attrd[6073]:   notice: attrd_perform_update:
>>> Sent update 181: fail-count-resClamd=2
>>> Jan 23 10:22:30 Node1 attrd[6073]:   notice: attrd_cs_dispatch: Update
>>> relayed from Node2
>>> Jan 23 10:22:30 Node1 attrd[6073]:   notice: attrd_trigger_update:
>>> Sending flush op to all hosts for: last-failure-resClamd (1390468950)
>>> Jan 23 10:22:30 Node1 attrd[6073]:   notice: attrd_perform_update:
>>> Sent update 183: last-failure-resClamd=1390468950
>>> Jan 23 10:22:30 Node1 crmd[6075]:   notice: process_lrm_event:
>>> Node1-resClamd_monitor_60000:317 [ clamd is stopped\n ]
>>> Jan 23 10:22:30 Node1 crmd[6075]:   notice: process_lrm_event: LRM
>>> operation resClamd_stop_0 (call=322, rc=0, cib-update=114,
>>> confirmed=true) ok
>>> 
>>> # pcs status
>>> Cluster name: mycluster
>>> Last updated: Thu Jan 23 10:22:41 2014
>>> Last change: Thu Jan 23 10:11:40 2014 via cibadmin on Node1
>>> Stack: cman
>>> Current DC: Node2 - partition with quorum
>>> Version: 1.1.10-14.el6_5.1-368c726
>>> 3 Nodes configured
>>> 3 Resources configured
>>> 
>>> 
>>> Online: [ Node1 Node2 Node3 ]
>>> 
>>> Full list of resources:
>>> 
>>>  Clone Set: resClamd-clone [resClamd]
>>>      Started: [ Node2 Node3 ]
>>>      Stopped: [ Node1 ]
>>> 
>>> Failed actions:
>>>     resClamd_monitor_60000 on Node1 'not running' (7): call=317,
>>> status=complete, last-rc-change='Thu Jan 23 10:22:30 2014',
>>> queued=0ms, exec=0ms
>>> 
>>> 
>>> What's wrong with my configuration?
>>> 
>>> 
>>> Thanks in advance
>>> Frank
>>> 
>> 
> 
> 


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
