On Fri, Sep 14, 2012 at 7:26 PM, Kazunori INOUE <inouek...@intellilink.co.jp> wrote:
> Hi Andrew,
>
> I confirmed that this problem had been resolved.
> - ClusterLabs/pacemaker : 7a9bf21cfc
>
> However, I found two problems.
Ah, I see what you mean. I believe
https://github.com/beekhof/pacemaker/commit/7ecc279 should fix both
problems. Can you confirm please?

> (1) The resource is shown as an orphan in the crm_mon output.
>
> # crm_mon -rf1
> :
> Full list of resources:
>
>  Master/Slave Set: msAP [prmAP]
>      Stopped: [ prmAP:0 prmAP:1 ]
>
> Migration summary:
> * Node vm5:
>    prmAP: orphan
> * Node vm6:
>    prmAP: orphan
>
> Failed actions:
>     prmAP_monitor_10000 (node=vm5, call=15, rc=1, status=complete): unknown error
>     prmAP_monitor_10000 (node=vm6, call=21, rc=1, status=complete): unknown error
>
> (2) And the failure status cannot be cleared.
>
> The CIB is not updated even when I execute 'crm_resource -C'.
>
> # crm_resource -C -r msAP
> Cleaning up prmAP:0 on vm5
> Cleaning up prmAP:0 on vm6
> Cleaning up prmAP:1 on vm5
> Cleaning up prmAP:1 on vm6
> Waiting for 1 replies from the CRMd. OK
>
> # cibadmin -Q -o status
> <status>
>   <node_state id="2439358656" uname="vm5" in_ccm="true" crmd="online"
>       join="member" expected="member" crm-debug-origin="do_update_resource">
>     <transient_attributes id="2439358656">
>       <instance_attributes id="status-2439358656">
>         <nvpair id="status-2439358656-probe_complete" name="probe_complete" value="true"/>
>         <nvpair id="status-2439358656-fail-count-prmAP" name="fail-count-prmAP" value="1"/>
>         <nvpair id="status-2439358656-last-failure-prmAP" name="last-failure-prmAP" value="1347598951"/>
>       </instance_attributes>
>     </transient_attributes>
>     <lrm id="2439358656">
>       <lrm_resources>
>         <lrm_resource id="prmAP" type="Stateful" class="ocf" provider="pacemaker">
>           <lrm_rsc_op id="prmAP_last_0" operation_key="prmAP_stop_0" operation="stop"
>               crm-debug-origin="do_update_resource" crm_feature_set="3.0.6"
>               transition-key="1:5:0:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>               transition-magic="0:0;1:5:0:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>               call-id="24" rc-code="0" op-status="0" interval="0" last-run="1347598936"
>               last-rc-change="0" exec-time="205" queue-time="0"
>               op-digest="f2317cad3d54cec5d7d7aa7d0bf35cf8"/>
>           <lrm_rsc_op id="prmAP_monitor_10000" operation_key="prmAP_monitor_10000"
>               operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.6"
>               transition-key="10:3:8:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>               transition-magic="0:8;10:3:8:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>               call-id="15" rc-code="8" op-status="0" interval="10000"
>               last-rc-change="1347598916" exec-time="40" queue-time="0"
>               op-digest="4811cef7f7f94e3a35a70be7916cb2fd"/>
>           <lrm_rsc_op id="prmAP_last_failure_0" operation_key="prmAP_monitor_10000"
>               operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.6"
>               transition-key="10:3:8:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>               transition-magic="0:1;10:3:8:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>               call-id="15" rc-code="1" op-status="0" interval="10000"
>               last-rc-change="1347598936" exec-time="0" queue-time="0"
>               op-digest="4811cef7f7f94e3a35a70be7916cb2fd"/>
>         </lrm_resource>
>       </lrm_resources>
>     </lrm>
>   </node_state>
>   <node_state id="2456135872" uname="vm6" in_ccm="true" crmd="online"
>       join="member" expected="member" crm-debug-origin="do_update_resource">
>     <transient_attributes id="2456135872">
>       <instance_attributes id="status-2456135872">
>         <nvpair id="status-2456135872-probe_complete" name="probe_complete" value="true"/>
>         <nvpair id="status-2456135872-fail-count-prmAP" name="fail-count-prmAP" value="1"/>
>         <nvpair id="status-2456135872-last-failure-prmAP" name="last-failure-prmAP" value="1347598962"/>
>       </instance_attributes>
>     </transient_attributes>
>     <lrm id="2456135872">
>       <lrm_resources>
>         <lrm_resource id="prmAP" type="Stateful" class="ocf" provider="pacemaker">
>           <lrm_rsc_op id="prmAP_last_0" operation_key="prmAP_stop_0" operation="stop"
>               crm-debug-origin="do_update_resource" crm_feature_set="3.0.6"
>               transition-key="1:9:0:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>               transition-magic="0:0;1:9:0:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>               call-id="30" rc-code="0" op-status="0" interval="0" last-run="1347598962"
>               last-rc-change="0" exec-time="230" queue-time="0"
>               op-digest="f2317cad3d54cec5d7d7aa7d0bf35cf8"/>
>           <lrm_rsc_op id="prmAP_monitor_10000" operation_key="prmAP_monitor_10000"
>               operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.6"
>               transition-key="9:7:8:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>               transition-magic="0:8;9:7:8:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>               call-id="21" rc-code="8" op-status="0" interval="10000"
>               last-rc-change="1347598952" exec-time="43" queue-time="0"
>               op-digest="4811cef7f7f94e3a35a70be7916cb2fd"/>
>           <lrm_rsc_op id="prmAP_last_failure_0" operation_key="prmAP_monitor_10000"
>               operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.6"
>               transition-key="9:7:8:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>               transition-magic="0:1;9:7:8:2935833e-7e6f-4931-9da8-f13f7de7aafc"
>               call-id="21" rc-code="1" op-status="0" interval="10000"
>               last-rc-change="1347598962" exec-time="0" queue-time="0"
>               op-digest="4811cef7f7f94e3a35a70be7916cb2fd"/>
>         </lrm_resource>
>       </lrm_resources>
>     </lrm>
>   </node_state>
> </status>
>
>
> I wrote a patch for crm_mon and crm_resource.
> (I have not checked whether other commands have a similar problem.)
>
> https://github.com/inouekazu/pacemaker/commit/36cf730751080de197438cfaa34163150059d89c
>
> - When searching for a resource_s structure, the resource id with the
>   instance number attached (e.g. ":0") is used as the key where needed.
> - The resource id with the instance number removed is used for the
>   update request to the CIB.
>
> Is the approach of this patch right?
>
> Best Regards,
> Kazunori INOUE
>
>
> (12.09.11 20:17), Andrew Beekhof wrote:
>>
>> On Tue, Sep 11, 2012 at 9:13 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>>>
>>> Yikes!
>>>
>>> Fixed in:
>>> https://github.com/beekhof/pacemaker/commit/7d098ce
>>
>> That link should have been:
>>
>> https://github.com/beekhof/pacemaker/commit/c1f409baaaf388d03f6124ec0d9da440445c4a23
>>
>>>
>>> On Fri, Sep 7, 2012 at 7:49 PM, Kazunori INOUE
>>> <inouek...@intellilink.co.jp> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am using Pacemaker-1.1.
>>>> - ClusterLabs/pacemaker : 872a2f1af1 (Sep 07)
>>>>
>>>> When the monitor of a master resource fails and there is no node left
>>>> on which the master/slave resource can run, the master/slave resource
>>>> does not stop.
>>>>
>>>> [test case]
>>>> 1. Use the Stateful RA with on-fail="restart" on the monitor and
>>>>    migration-threshold=1.
>>>>
>>>> # crm_mon
>>>>
>>>> Online: [ vm5 vm6 ]
>>>>
>>>>  Master/Slave Set: msAP [prmAP]
>>>>      Masters: [ vm5 ]
>>>>      Slaves: [ vm6 ]
>>>>
>>>> 2. Make the master resource on vm5 fail, so that it moves to vm6.
>>>>
>>>> Online: [ vm5 vm6 ]
>>>>
>>>>  Master/Slave Set: msAP [prmAP]
>>>>      Masters: [ vm6 ]
>>>>      Stopped: [ prmAP:1 ]
>>>>
>>>> Failed actions:
>>>>     prmAP_monitor_10000 (node=vm5, call=14, rc=1, status=complete): unknown error
>>>>
>>>> 3. Make the master resource on vm6 fail again; the master/slave
>>>>    resource then tries to start repeatedly, alternating between
>>>>    states (a) and (b) below.
>>>>
>>>> (a)
>>>> Online: [ vm5 vm6 ]
>>>>
>>>> Failed actions:
>>>>     prmAP_monitor_10000 (node=vm5, call=14, rc=1, status=complete): unknown error
>>>>     prmAP_monitor_10000 (node=vm6, call=20, rc=1, status=complete): unknown error
>>>>
>>>> (b)
>>>> Online: [ vm5 vm6 ]
>>>>
>>>>  Master/Slave Set: msAP [prmAP]
>>>>      Slaves: [ vm5 vm6 ]
>>>>
>>>> Failed actions:
>>>>     prmAP_monitor_10000 (node=vm5, call=14, rc=1, status=complete): unknown error
>>>>     prmAP_monitor_10000 (node=vm6, call=20, rc=1, status=complete): unknown error
>>>>
>>>> # grep -e run_graph: -e common_apply_stickiness: -e LogActions: ha-log
>>>>
>>>>>> after the master resource on vm5 failed
>>>>
>>>> Sep 7 16:06:03 vm5 pengine[23199]: notice: LogActions: Recover prmAP:0 (Master vm5)
>>>> Sep 7 16:06:03 vm5 crmd[23200]: notice: run_graph: Transition 4 (Complete=3, Pending=0, Fired=0, Skipped=8, Incomplete=3, Source=/var/lib/pacemaker/pengine/pe-input-4.bz2): Stopped
>>>> Sep 7 16:06:03 vm5 pengine[23199]: warning: common_apply_stickiness: Forcing msAP away from vm5 after 1 failures (max=1)
>>>> Sep 7 16:06:03 vm5 pengine[23199]: warning: common_apply_stickiness: Forcing msAP away from vm5 after 1 failures (max=1)
>>>> Sep 7 16:06:03 vm5 pengine[23199]: notice: LogActions: Stop prmAP:0 (vm5)
>>>> Sep 7 16:06:03 vm5 pengine[23199]: notice: LogActions: Promote prmAP:1 (Slave -> Master vm6)
>>>> Sep 7 16:06:03 vm5 crmd[23200]: notice: run_graph: Transition 5 (Complete=4, Pending=0, Fired=0, Skipped=4, Incomplete=1, Source=/var/lib/pacemaker/pengine/pe-input-5.bz2): Stopped
>>>> Sep 7 16:06:03 vm5 pengine[23199]: warning: common_apply_stickiness: Forcing msAP away from vm5 after 1 failures (max=1)
>>>> Sep 7 16:06:03 vm5 pengine[23199]: notice: LogActions: Promote prmAP:0 (Slave -> Master vm6)
>>>> Sep 7 16:06:03 vm5 crmd[23200]: notice: run_graph: Transition 6 (Complete=3, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-6.bz2): Stopped
>>>> Sep 7 16:06:03 vm5 pengine[23199]: warning: common_apply_stickiness: Forcing msAP away from vm5 after 1 failures (max=1)
>>>> Sep 7 16:06:03 vm5 crmd[23200]: notice: run_graph: Transition 7 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-7.bz2): Complete
>>>>
>>>>>> after the master resource on vm6 failed
>>>>
>>>> Sep 7 16:06:33 vm5 pengine[23199]: warning: common_apply_stickiness: Forcing msAP away from vm5 after 1 failures (max=1)
>>>> Sep 7 16:06:33 vm5 pengine[23199]: notice: LogActions: Recover prmAP:0 (Master vm6)
>>>> Sep 7 16:06:34 vm5 crmd[23200]: notice: run_graph: Transition 8 (Complete=3, Pending=0, Fired=0, Skipped=8, Incomplete=3, Source=/var/lib/pacemaker/pengine/pe-input-8.bz2): Stopped
>>>> Sep 7 16:06:34 vm5 pengine[23199]: warning: common_apply_stickiness: Forcing msAP away from vm5 after 1 failures (max=1)
>>>> Sep 7 16:06:34 vm5 pengine[23199]: warning: common_apply_stickiness: Forcing msAP away from vm6 after 1 failures (max=1)
>>>> Sep 7 16:06:34 vm5 pengine[23199]: notice: LogActions: Stop prmAP:0 (vm6)
>>>> Sep 7 16:06:34 vm5 crmd[23200]: notice: run_graph: Transition 9 (Complete=3, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-9.bz2): Stopped
>>>> Sep 7 16:06:34 vm5 pengine[23199]: notice: LogActions: Start prmAP:0 (vm5)
>>>> Sep 7 16:06:34 vm5 pengine[23199]: notice: LogActions: Promote prmAP:0 (Stopped -> Master vm5)
>>>> Sep 7 16:06:34 vm5 pengine[23199]: notice: LogActions: Start prmAP:1 (vm6)
>>>> Sep 7 16:06:35 vm5 crmd[23200]: notice: run_graph: Transition 10 (Complete=4, Pending=0, Fired=0, Skipped=4, Incomplete=1, Source=/var/lib/pacemaker/pengine/pe-input-10.bz2): Stopped
>>>> Sep 7 16:06:35 vm5 pengine[23199]: warning: common_apply_stickiness: Forcing msAP away from vm5 after 1 failures (max=1)
>>>> Sep 7 16:06:35 vm5 pengine[23199]: warning: common_apply_stickiness: Forcing msAP away from vm5 after 1 failures (max=1)
>>>> Sep 7 16:06:35 vm5 pengine[23199]: warning: common_apply_stickiness: Forcing msAP away from vm6 after 1 failures (max=1)
>>>> Sep 7 16:06:35 vm5 pengine[23199]: warning: common_apply_stickiness: Forcing msAP away from vm6 after 1 failures (max=1)
>>>> Sep 7 16:06:35 vm5 pengine[23199]: notice: LogActions: Stop prmAP:0 (vm5)
>>>> Sep 7 16:06:35 vm5 pengine[23199]: notice: LogActions: Stop prmAP:1 (vm6)
>>>> Sep 7 16:06:35 vm5 crmd[23200]: notice: run_graph: Transition 11 (Complete=4, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-11.bz2): Stopped
>>>> Sep 7 16:06:35 vm5 pengine[23199]: notice: LogActions: Start prmAP:0 (vm5)
>>>> Sep 7 16:06:35 vm5 pengine[23199]: notice: LogActions: Promote prmAP:0 (Stopped -> Master vm5)
>>>> Sep 7 16:06:35 vm5 pengine[23199]: notice: LogActions: Start prmAP:1 (vm6)
>>>> Sep 7 16:06:35 vm5 crmd[23200]: notice: run_graph: Transition 12 (Complete=4, Pending=0, Fired=0, Skipped=4, Incomplete=1, Source=/var/lib/pacemaker/pengine/pe-input-12.bz2): Stopped
>>>> :
>>>>
>>>> Is it a known issue?
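[Editor's note: the test case in step 1 above corresponds roughly to the following crm shell configuration. This is a hedged sketch: only the Stateful RA, the 10-second monitor with on-fail="restart" (matching prmAP_monitor_10000 in the logs), and migration-threshold=1 are stated in the thread; the second monitor interval and the ms meta attributes are assumptions.]

```
primitive prmAP ocf:pacemaker:Stateful \
        op monitor interval="10s" role="Master" on-fail="restart" \
        op monitor interval="11s" role="Slave" on-fail="restart"
ms msAP prmAP \
        meta master-max="1" clone-max="2" clone-node-max="1" \
             migration-threshold="1"
```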
>>>>
>>>> Best Regards,
>>>> Kazunori INOUE
>>>>
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org