Fixed! https://github.com/beekhof/pacemaker/commit/d87de1b
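If anyone wants to double-check against their own cluster: replaying the attached pe-input through the policy engine with a patched build should now end with the failure treated as cleared and the svc-cse group started. Something like the following should do it (crm_simulate invocation untested here):

# crm_simulate -S -x pe-input-1.bz2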
On 10/05/2013, at 11:59 AM, Andrew Beekhof <and...@beekhof.net> wrote:

> 
> On 07/05/2013, at 5:15 PM, Johan Huysmans <johan.huysm...@inuits.be> wrote:
> 
>> Hi,
>> 
>> I only keep a couple of pe-input files, and that pe-input-1 version was
>> already overwritten.
>> I redid my tests as described in my previous mails.
>> 
>> At the end of the test it was again written to pe-input-1, which is
>> included as an attachment.
> 
> Perfect.
> Basically the PE doesn't know how to correctly recognise that
> d_tomcat_monitor_15000 needs to be processed after d_tomcat_last_failure_0:
> 
> <lrm_rsc_op id="d_tomcat_monitor_15000"
>             operation_key="d_tomcat_monitor_15000" operation="monitor"
>             crm-debug-origin="do_update_resource" crm_feature_set="3.0.7"
>             transition-key="18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89"
>             transition-magic="0:0;18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89"
>             call-id="44" rc-code="0" op-status="0" interval="15000"
>             last-rc-change="1367910303" exec-time="0" queue-time="0"
>             op-digest="0c738dfc69f09a62b7ebf32344fddcf6"/>
> <lrm_rsc_op id="d_tomcat_last_failure_0"
>             operation_key="d_tomcat_monitor_15000" operation="monitor"
>             crm-debug-origin="do_update_resource" crm_feature_set="3.0.7"
>             transition-key="18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89"
>             transition-magic="0:1;18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89"
>             call-id="44" rc-code="1" op-status="0" interval="15000"
>             last-rc-change="1367909258" exec-time="0" queue-time="0"
>             op-digest="0c738dfc69f09a62b7ebf32344fddcf6"/>
> 
> which would allow it to recognise that the resource is healthy once again.
> 
> I'll see what I can do...
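For reference: both history entries above carry the same call-id ("44"), interval and transition key, and differ only in rc-code and last-rc-change:

  d_tomcat_last_failure_0   rc-code="1"   last-rc-change="1367909258"   (the old failure)
  d_tomcat_monitor_15000    rc-code="0"   last-rc-change="1367910303"   (the newer, clean monitor)

So an ordering that stops at call-id treats them as equal and can end up replaying the stale failure last. To pull both entries out of a live CIB for comparison, something like this should work:

# cibadmin --query --xpath "//lrm_rsc_op[@call-id='44']"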
> 
>> 
>> gr.
>> Johan
>> 
>> On 2013-05-07 04:08, Andrew Beekhof wrote:
>>> I have a much clearer idea of the problem you're seeing now, thank you.
>>> 
>>> Could you attach /var/lib/pacemaker/pengine/pe-input-1.bz2 from CSE-1?
>>> 
>>> On 03/05/2013, at 10:40 PM, Johan Huysmans <johan.huysm...@inuits.be> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> Below you can see my setup and my test; this shows that my cloned
>>>> resource with on-fail=block does not recover automatically.
>>>> 
>>>> My Setup:
>>>> 
>>>> # rpm -aq | grep -i pacemaker
>>>> pacemaker-libs-1.1.9-1512.el6.i686
>>>> pacemaker-cluster-libs-1.1.9-1512.el6.i686
>>>> pacemaker-cli-1.1.9-1512.el6.i686
>>>> pacemaker-1.1.9-1512.el6.i686
>>>> 
>>>> # crm configure show
>>>> node CSE-1
>>>> node CSE-2
>>>> primitive d_tomcat ocf:ntc:tomcat \
>>>>         op monitor interval="15s" timeout="510s" on-fail="block" \
>>>>         op start interval="0" timeout="510s" \
>>>>         params instance_name="NMS" monitor_use_ssl="no" monitor_urls="/cse/health" monitor_timeout="120" \
>>>>         meta migration-threshold="1"
>>>> primitive ip_11 ocf:heartbeat:IPaddr2 \
>>>>         op monitor interval="10s" \
>>>>         params broadcast="172.16.11.31" ip="172.16.11.31" nic="bond0.111" iflabel="ha" \
>>>>         meta migration-threshold="1" failure-timeout="10"
>>>> primitive ip_19 ocf:heartbeat:IPaddr2 \
>>>>         op monitor interval="10s" \
>>>>         params broadcast="172.18.19.31" ip="172.18.19.31" nic="bond0.119" iflabel="ha" \
>>>>         meta migration-threshold="1" failure-timeout="10"
>>>> group svc-cse ip_19 ip_11
>>>> clone cl_tomcat d_tomcat
>>>> colocation colo_tomcat inf: svc-cse cl_tomcat
>>>> order order_tomcat inf: cl_tomcat svc-cse
>>>> property $id="cib-bootstrap-options" \
>>>>         dc-version="1.1.9-1512.el6-2a917dd" \
>>>>         cluster-infrastructure="cman" \
>>>>         pe-warn-series-max="9" \
>>>>         no-quorum-policy="ignore" \
>>>>         stonith-enabled="false" \
>>>>         pe-input-series-max="9" \
>>>>         pe-error-series-max="9" \
>>>>         last-lrm-refresh="1367582088"
>>>> 
>>>> Currently only one node is available, CSE-1.
>>>> 
>>>> This is how I am currently testing my setup:
>>>> 
>>>> => Starting point: everything up and running
>>>> 
>>>> # crm resource status
>>>>  Resource Group: svc-cse
>>>>      ip_19      (ocf::heartbeat:IPaddr2):       Started
>>>>      ip_11      (ocf::heartbeat:IPaddr2):       Started
>>>>  Clone Set: cl_tomcat [d_tomcat]
>>>>      Started: [ CSE-1 ]
>>>>      Stopped: [ d_tomcat:1 ]
>>>> 
>>>> => Causing failure: change the system so tomcat is running but has a
>>>> failure (see attachment step_2.log)
>>>> 
>>>> # crm resource status
>>>>  Resource Group: svc-cse
>>>>      ip_19      (ocf::heartbeat:IPaddr2):       Stopped
>>>>      ip_11      (ocf::heartbeat:IPaddr2):       Stopped
>>>>  Clone Set: cl_tomcat [d_tomcat]
>>>>      d_tomcat:0 (ocf::ntc:tomcat):      Started (unmanaged) FAILED
>>>>      Stopped: [ d_tomcat:1 ]
>>>> 
>>>> => Fixing failure: revert the system so tomcat is running without a
>>>> failure (see attachment step_3.log)
>>>> 
>>>> # crm resource status
>>>>  Resource Group: svc-cse
>>>>      ip_19      (ocf::heartbeat:IPaddr2):       Stopped
>>>>      ip_11      (ocf::heartbeat:IPaddr2):       Stopped
>>>>  Clone Set: cl_tomcat [d_tomcat]
>>>>      d_tomcat:0 (ocf::ntc:tomcat):      Started (unmanaged) FAILED
>>>>      Stopped: [ d_tomcat:1 ]
>>>> 
>>>> As you can see in the logs, the OCF script no longer returns any
>>>> failure. Pacemaker notices this, however it isn't reflected in crm_mon
>>>> and the dependent resources aren't started.
>>>> 
>>>> Gr.
>>>> Johan
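Side note: since on-fail="block" leaves the failed instance unmanaged, the stuck state shown above can be cleared by hand until a build with the fix is available, assuming the crm shell used throughout this thread:

# crm resource cleanup d_tomcat

The follow-up probe should then record a clean monitor result, after which the colocated svc-cse group becomes eligible to start again.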
>>>> 
>>>> On 2013-05-03 03:04, Andrew Beekhof wrote:
>>>>> On 02/05/2013, at 5:45 PM, Johan Huysmans <johan.huysm...@inuits.be> wrote:
>>>>> 
>>>>>> On 2013-05-01 05:48, Andrew Beekhof wrote:
>>>>>>> On 17/04/2013, at 9:54 PM, Johan Huysmans <johan.huysm...@inuits.be> wrote:
>>>>>>> 
>>>>>>>> Hi All,
>>>>>>>> 
>>>>>>>> I'm trying to set up a specific configuration in our cluster,
>>>>>>>> however I'm struggling with my configuration.
>>>>>>>> 
>>>>>>>> This is what I'm trying to achieve:
>>>>>>>> On both nodes of the cluster a daemon must be running (tomcat).
>>>>>>>> Some failover addresses are configured and must be running on the
>>>>>>>> node with a correctly running tomcat.
>>>>>>>> 
>>>>>>>> I achieved this with a cloned tomcat resource and a colocation
>>>>>>>> between the cloned tomcat and the failover addresses.
>>>>>>>> When I cause a failure in the tomcat on the node running the
>>>>>>>> failover addresses, the failover addresses fail over to the other
>>>>>>>> node as expected.
>>>>>>>> crm_mon shows that this tomcat has a failure.
>>>>>>>> When I configure the tomcat resource with failure-timeout=0, the
>>>>>>>> failure alarm in crm_mon isn't cleared when the tomcat failure is
>>>>>>>> fixed.
>>>>>>> 
>>>>>>> All sounds right so far.
>>>>>> 
>>>>>> If my broken tomcat is automatically fixed, I expect this to be
>>>>>> noticed by pacemaker and that the node will again be able to run my
>>>>>> failover addresses; however, I don't see this happening.
>>>>> 
>>>>> This is very hard to discuss without seeing logs.
>>>>> 
>>>>> So you created a tomcat error, waited for pacemaker to notice, fixed
>>>>> the error and observed that pacemaker did not re-notice?
>>>>> How long did you wait? More than the 15s repeat interval I assume?
>>>>> Did at least the resource agent notice?
>>>>> 
>>>>>>>> When I configure the tomcat resource with failure-timeout=30, the
>>>>>>>> failure alarm in crm_mon is cleared after 30 seconds, however the
>>>>>>>> tomcat is still having a failure.
>>>>>>> 
>>>>>>> Can you define "still having a failure"?
>>>>>>> You mean it still shows up in crm_mon?
>>>>>>> Have you read this link?
>>>>>>> 
>>>>>>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/s-rules-recheck.html
>>>>>> 
>>>>>> "Still having a failure" means that the tomcat is still broken and
>>>>>> my OCF script reports it as a failure.
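One detail worth adding to the failure-timeout discussion: as the linked section explains, an expired failure is only re-evaluated when some cluster event triggers a new transition, or when the cluster-recheck-interval timer fires (15 minutes by default). If failure-timeout appears to be ignored, lowering that interval is a quick test:

# crm configure property cluster-recheck-interval="5min"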
>>>>>>>> What I expect is that pacemaker reports the failure for as long as
>>>>>>>> it exists, and reports that everything is ok once everything is
>>>>>>>> back ok.
>>>>>>>> 
>>>>>>>> Am I doing something wrong with my configuration?
>>>>>>>> Or how can I achieve my wanted setup?
>>>>>>>> 
>>>>>>>> Here is my configuration:
>>>>>>>> 
>>>>>>>> node CSE-1
>>>>>>>> node CSE-2
>>>>>>>> primitive d_tomcat ocf:custom:tomcat \
>>>>>>>>         op monitor interval="15s" timeout="510s" on-fail="block" \
>>>>>>>>         op start interval="0" timeout="510s" \
>>>>>>>>         params instance_name="NMS" monitor_use_ssl="no" monitor_urls="/cse/health" monitor_timeout="120" \
>>>>>>>>         meta migration-threshold="1" failure-timeout="0"
>>>>>>>> primitive ip_1 ocf:heartbeat:IPaddr2 \
>>>>>>>>         op monitor interval="10s" \
>>>>>>>>         params nic="bond0" broadcast="10.1.1.1" iflabel="ha" ip="10.1.1.1"
>>>>>>>> primitive ip_2 ocf:heartbeat:IPaddr2 \
>>>>>>>>         op monitor interval="10s" \
>>>>>>>>         params nic="bond0" broadcast="10.1.1.2" iflabel="ha" ip="10.1.1.2"
>>>>>>>> group svc-cse ip_1 ip_2
>>>>>>>> clone cl_tomcat d_tomcat
>>>>>>>> colocation colo_tomcat inf: svc-cse cl_tomcat
>>>>>>>> order order_tomcat inf: cl_tomcat svc-cse
>>>>>>>> property $id="cib-bootstrap-options" \
>>>>>>>>         dc-version="1.1.8-7.el6-394e906" \
>>>>>>>>         cluster-infrastructure="cman" \
>>>>>>>>         no-quorum-policy="ignore" \
>>>>>>>>         stonith-enabled="false"
>>>>>>>> 
>>>>>>>> Thanks!
>>>>>>>> 
>>>>>>>> Greetings,
>>>>>>>> Johan Huysmans
>>>> 
>>>> <step_2.log><step_3.log>
>> 
>> <pe-input-1.bz2>