On 5 Sep 2014, at 2:22 pm, renayama19661...@ybb.ne.jp wrote:

> Hi All,
>
> We confirmed that lrmd causes a monitor timeout when the system time is changed.
> This is a serious problem when the time is adjusted, for example by ntpd.
>
> The problem can be reproduced with the following procedure.
>
> Step 1) Start Pacemaker on a single node.
>
> [root@snmp1 ~]# start pacemaker.combined
> pacemaker.combined start/running, process 11382
>
> Step 2) Load a simple crm configuration.
>
> --------trac2915-3.crm------------
> primitive prmDummyA ocf:pacemaker:Dummy1 \
>     op start interval="0s" timeout="60s" on-fail="restart" \
>     op monitor interval="10s" timeout="30s" on-fail="restart" \
>     op stop interval="0s" timeout="60s" on-fail="block"
> group grpA prmDummyA
> location rsc_location-grpA-1 grpA \
>     rule $id="rsc_location-grpA-1-rule" 200: #uname eq snmp1 \
>     rule $id="rsc_location-grpA-1-rule-0" 100: #uname eq snmp2
>
> property $id="cib-bootstrap-options" \
>     no-quorum-policy="ignore" \
>     stonith-enabled="false" \
>     crmd-transition-delay="2s"
> rsc_defaults $id="rsc-options" \
>     resource-stickiness="INFINITY" \
>     migration-threshold="1"
> ----------------------------------
>
> [root@snmp1 ~]# crm configure load update trac2915-3.crm
> WARNING: rsc_location-grpA-1: referenced node snmp2 does not exist
>
> [root@snmp1 ~]# crm_mon -1 -Af
> Last updated: Fri Sep 5 13:09:45 2014
> Last change: Fri Sep 5 13:09:13 2014
> Stack: corosync
> Current DC: snmp1 (3232238180) - partition WITHOUT quorum
> Version: 1.1.12-561c4cf
> 1 Nodes configured
> 1 Resources configured
>
>
> Online: [ snmp1 ]
>
> Resource Group: grpA
>     prmDummyA  (ocf::pacemaker:Dummy1):  Started snmp1
>
> Node Attributes:
> * Node snmp1:
>
> Migration summary:
> * Node snmp1:
>
> Step 3) Just after a monitor of the resource has started, advance the system time by more than the monitor timeout (timeout="30s").
>
> [root@snmp1 ~]# date -s +40sec
> Fri Sep 5 13:11:04 JST 2014
>
> Step 4) The monitor times out.
>
> [root@snmp1 ~]# crm_mon -1 -Af
> Last updated: Fri Sep 5 13:11:24 2014
> Last change: Fri Sep 5 13:09:13 2014
> Stack: corosync
> Current DC: snmp1 (3232238180) - partition WITHOUT quorum
> Version: 1.1.12-561c4cf
> 1 Nodes configured
> 1 Resources configured
>
>
> Online: [ snmp1 ]
>
>
> Node Attributes:
> * Node snmp1:
>
> Migration summary:
> * Node snmp1:
>    prmDummyA: migration-threshold=1 fail-count=1 last-failure='Fri Sep 5 13:11:04 2014'
>
> Failed actions:
>     prmDummyA_monitor_10000 on snmp1 'unknown error' (1): call=7, status=Timed Out, last-rc-change='Fri Sep 5 13:11:04 2014', queued=0ms, exec=0ms
>
>
> I investigated this, and the problem seems to be caused by the timer event in lrmd's g_main_loop somehow firing after a period shorter than the monitor timeout.
So if you create a trivial program with g_main_loop and a timer, and then change the system time, does the timer expire early?

> For some reason, this problem does not seem to happen with the lrmd of PM1.0.

cluster-glue was probably using custom timeout code.
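To make that experiment concrete, here is a minimal sketch of such a trivial program, assuming nothing beyond the standard GLib main-loop API: a single g_timeout_add_seconds() source (30 s, mirroring the monitor's timeout="30s") inside a g_main_loop. The interval and the printed elapsed time are only illustrative. While it runs, step the clock with "date -s +40sec" and see whether the callback fires early.

/* Build (sketch): gcc glib_timer_test.c $(pkg-config --cflags --libs glib-2.0) */
#include <glib.h>
#include <stdio.h>

static gint64 start_us;  /* monotonic timestamp taken when the timer was armed */

static gboolean
timer_expired(gpointer data)
{
    GMainLoop *loop = data;
    gint64 elapsed_us = g_get_monotonic_time() - start_us;

    /* If this prints well under 30 s after the clock was stepped forward,
     * the GLib timeout itself is wall-clock driven on this system. */
    printf("timer fired after %.1f s (monotonic)\n", elapsed_us / 1e6);
    g_main_loop_quit(loop);
    return FALSE;  /* one-shot: remove the source */
}

int
main(void)
{
    GMainLoop *loop = g_main_loop_new(NULL, FALSE);

    start_us = g_get_monotonic_time();

    /* 30-second timeout, matching the monitor timeout in the report.
     * While this is pending, run: date -s +40sec */
    g_timeout_add_seconds(30, timer_expired, loop);

    g_main_loop_run(loop);
    g_main_loop_unref(loop);
    return 0;
}

If the callback fires roughly on schedule despite the clock step (recent GLib documents its timeout sources as being based on monotonic time), the early expiry would have to come from lrmd's own timeout handling rather than from the GLib timer itself.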
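And, purely for comparison, a rough sketch of what "custom timeout code" that is immune to clock steps can look like. This is not the actual cluster-glue implementation; the op_timer_t type and the op_timer_arm()/op_timer_expired() helpers are made up for illustration. The idea is just to record a CLOCK_MONOTONIC deadline when an operation starts and compare against it later, so that date -s or an ntpd step cannot make the operation appear to have timed out.

#include <stdbool.h>
#include <time.h>

/* Hypothetical per-operation timer: only stores a monotonic deadline. */
typedef struct {
    struct timespec deadline;
} op_timer_t;

static void
op_timer_arm(op_timer_t *t, long timeout_ms)
{
    clock_gettime(CLOCK_MONOTONIC, &t->deadline);
    t->deadline.tv_sec  += timeout_ms / 1000;
    t->deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (t->deadline.tv_nsec >= 1000000000L) {
        t->deadline.tv_sec++;
        t->deadline.tv_nsec -= 1000000000L;
    }
}

static bool
op_timer_expired(const op_timer_t *t)
{
    struct timespec now;

    /* CLOCK_MONOTONIC is unaffected by settimeofday()/ntpd steps,
     * so stepping the wall clock forward cannot trigger a false timeout. */
    clock_gettime(CLOCK_MONOTONIC, &now);
    return now.tv_sec > t->deadline.tv_sec
        || (now.tv_sec == t->deadline.tv_sec
            && now.tv_nsec >= t->deadline.tv_nsec);
}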