On 5 Sep 2014, at 2:22 pm, renayama19661...@ybb.ne.jp wrote:

> Hi All,
>
> We confirmed that lrmd causes a monitor timeout when the system time is changed.
> This is a serious problem when the time is adjusted, for example by ntpd.
>
> The problem can be reproduced with the following procedure.
>
> Step 1) Start Pacemaker on a single node.
>
> [root@snmp1 ~]# start pacemaker.combined
> pacemaker.combined start/running, process 11382
>
> Step 2) Load a simple crm configuration.
>
> --------trac2915-3.crm------------
> primitive prmDummyA ocf:pacemaker:Dummy1 \
>     op start interval="0s" timeout="60s" on-fail="restart" \
>     op monitor interval="10s" timeout="30s" on-fail="restart" \
>     op stop interval="0s" timeout="60s" on-fail="block"
> group grpA prmDummyA
> location rsc_location-grpA-1 grpA \
>     rule $id="rsc_location-grpA-1-rule" 200: #uname eq snmp1 \
>     rule $id="rsc_location-grpA-1-rule-0" 100: #uname eq snmp2
>
> property $id="cib-bootstrap-options" \
>     no-quorum-policy="ignore" \
>     stonith-enabled="false" \
>     crmd-transition-delay="2s"
> rsc_defaults $id="rsc-options" \
>     resource-stickiness="INFINITY" \
>     migration-threshold="1"
> ----------------------------------
>
> [root@snmp1 ~]# crm configure load update trac2915-3.crm
> WARNING: rsc_location-grpA-1: referenced node snmp2 does not exist
>
> [root@snmp1 ~]# crm_mon -1 -Af
> Last updated: Fri Sep 5 13:09:45 2014
> Last change: Fri Sep 5 13:09:13 2014
> Stack: corosync
> Current DC: snmp1 (3232238180) - partition WITHOUT quorum
> Version: 1.1.12-561c4cf
> 1 Nodes configured
> 1 Resources configured
>
>
> Online: [ snmp1 ]
>
> Resource Group: grpA
>     prmDummyA  (ocf::pacemaker:Dummy1):  Started snmp1
>
> Node Attributes:
> * Node snmp1:
>
> Migration summary:
> * Node snmp1:
>
> Step 3) Just after a monitor of the resource has started, advance the system time by more than the monitor timeout (timeout="30s").
>
> [root@snmp1 ~]# date -s +40sec
> Fri Sep 5 13:11:04 JST 2014
>
> Step 4) The monitor times out.
>
> [root@snmp1 ~]# crm_mon -1 -Af
> Last updated: Fri Sep 5 13:11:24 2014
> Last change: Fri Sep 5 13:09:13 2014
> Stack: corosync
> Current DC: snmp1 (3232238180) - partition WITHOUT quorum
> Version: 1.1.12-561c4cf
> 1 Nodes configured
> 1 Resources configured
>
>
> Online: [ snmp1 ]
>
>
> Node Attributes:
> * Node snmp1:
>
> Migration summary:
> * Node snmp1:
>    prmDummyA: migration-threshold=1 fail-count=1 last-failure='Fri Sep 5 13:11:04 2014'
>
> Failed actions:
>     prmDummyA_monitor_10000 on snmp1 'unknown error' (1): call=7, status=Timed Out, last-rc-change='Fri Sep 5 13:11:04 2014', queued=0ms, exec=0ms
>
>
> I investigated this, and the problem seems to be caused by the timer event in lrmd's g_main_loop somehow firing after a period shorter than the monitor timeout.
So if you create a trivial program with g_main_loop and a timer, and then change the system time, does the timer expire early?

> For some reason, this problem does not seem to happen with the lrmd of PM1.0.

cluster-glue was probably using custom timeout code.
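To make that experiment concrete, here is a minimal sketch of such a trivial program, assuming nothing beyond the standard GLib main-loop API: a single g_timeout_add_seconds() source (30 s, mirroring the monitor's timeout="30s") inside a g_main_loop. The interval and the printed elapsed time are only illustrative. While it runs, step the clock with "date -s +40sec" and see whether the callback fires early.

/* Build (sketch): gcc glib_timer_test.c $(pkg-config --cflags --libs glib-2.0) */
#include <glib.h>
#include <stdio.h>

static gint64 start_us;  /* monotonic timestamp taken when the timer was armed */

static gboolean
timer_expired(gpointer data)
{
    GMainLoop *loop = data;
    gint64 elapsed_us = g_get_monotonic_time() - start_us;

    /* If this prints well under 30 s after the clock was stepped forward,
     * the GLib timeout itself is wall-clock driven on this system. */
    printf("timer fired after %.1f s (monotonic)\n", elapsed_us / 1e6);
    g_main_loop_quit(loop);
    return FALSE;  /* one-shot: remove the source */
}

int
main(void)
{
    GMainLoop *loop = g_main_loop_new(NULL, FALSE);

    start_us = g_get_monotonic_time();

    /* 30-second timeout, matching the monitor timeout in the report.
     * While this is pending, run: date -s +40sec */
    g_timeout_add_seconds(30, timer_expired, loop);

    g_main_loop_run(loop);
    g_main_loop_unref(loop);
    return 0;
}

If the callback fires roughly on schedule despite the clock step (recent GLib documents its timeout sources as being based on monotonic time), the early expiry would have to come from lrmd's own timeout handling rather than from the GLib timer itself.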
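And, purely for comparison, a rough sketch of what "custom timeout code" that is immune to clock steps can look like. This is not the actual cluster-glue implementation; the op_timer_t type and the op_timer_arm()/op_timer_expired() helpers are made up for illustration. The idea is just to record a CLOCK_MONOTONIC deadline when an operation starts and compare against it later, so that date -s or an ntpd step cannot make the operation appear to have timed out.

#include <stdbool.h>
#include <time.h>

/* Hypothetical per-operation timer: only stores a monotonic deadline. */
typedef struct {
    struct timespec deadline;
} op_timer_t;

static void
op_timer_arm(op_timer_t *t, long timeout_ms)
{
    clock_gettime(CLOCK_MONOTONIC, &t->deadline);
    t->deadline.tv_sec  += timeout_ms / 1000;
    t->deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (t->deadline.tv_nsec >= 1000000000L) {
        t->deadline.tv_sec++;
        t->deadline.tv_nsec -= 1000000000L;
    }
}

static bool
op_timer_expired(const op_timer_t *t)
{
    struct timespec now;

    /* CLOCK_MONOTONIC is unaffected by settimeofday()/ntpd steps,
     * so stepping the wall clock forward cannot trigger a false timeout. */
    clock_gettime(CLOCK_MONOTONIC, &now);
    return now.tv_sec > t->deadline.tv_sec
        || (now.tv_sec == t->deadline.tv_sec
            && now.tv_nsec >= t->deadline.tv_nsec);
}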