On 09/07/2014 08:12 PM, Andrew Beekhof wrote:
> On 5 Sep 2014, at 2:22 pm, renayama19661...@ybb.ne.jp wrote:
>
>> Hi All,
>>
>> We confirmed that lrmd caused the time-out of the monitor when the time of
>> the system was revised.
>> When a system considers revision of the time when I used ntpd, it is a
>> problem very much.
>>
>> We can confirm this problem in the next procedure.
>>
>> Step1) Start Pacemaker in a single node.
>> [root@snmp1 ~]# start pacemaker.combined
>> pacemaker.combined start/running, process 11382
>>
>> Step2) Send simple crm.
>>
>> --------trac2915-3.crm------------
>> primitive prmDummyA ocf:pacemaker:Dummy1 \
>>     op start interval="0s" timeout="60s" on-fail="restart" \
>>     op monitor interval="10s" timeout="30s" on-fail="restart" \
>>     op stop interval="0s" timeout="60s" on-fail="block"
>> group grpA prmDummyA
>> location rsc_location-grpA-1 grpA \
>>     rule $id="rsc_location-grpA-1-rule" 200: #uname eq snmp1 \
>>     rule $id="rsc_location-grpA-1-rule-0" 100: #uname eq snmp2
>>
>> property $id="cib-bootstrap-options" \
>>     no-quorum-policy="ignore" \
>>     stonith-enabled="false" \
>>     crmd-transition-delay="2s"
>> rsc_defaults $id="rsc-options" \
>>     resource-stickiness="INFINITY" \
>>     migration-threshold="1"
>> ----------------------------------
>>
>> [root@snmp1 ~]# crm configure load update trac2915-3.crm
>> WARNING: rsc_location-grpA-1: referenced node snmp2 does not exist
>>
>> [root@snmp1 ~]# crm_mon -1 -Af
>> Last updated: Fri Sep 5 13:09:45 2014
>> Last change: Fri Sep 5 13:09:13 2014
>> Stack: corosync
>> Current DC: snmp1 (3232238180) - partition WITHOUT quorum
>> Version: 1.1.12-561c4cf
>> 1 Nodes configured
>> 1 Resources configured
>>
>>
>> Online: [ snmp1 ]
>>
>> Resource Group: grpA
>>     prmDummyA (ocf::pacemaker:Dummy1): Started snmp1
>>
>> Node Attributes:
>> * Node snmp1:
>>
>> Migration summary:
>> * Node snmp1:
>>
>> Step3) After the monitor of the resource just began, we push forward time
>> than the timeout(timeout="30s") of the monitor.
>> [root@snmp1 ~]# date -s +40sec
>> Fri Sep 5 13:11:04 JST 2014
>>
>> Step4) The time-out of the monitor occurs.
>>
>> [root@snmp1 ~]# crm_mon -1 -Af
>> Last updated: Fri Sep 5 13:11:24 2014
>> Last change: Fri Sep 5 13:09:13 2014
>> Stack: corosync
>> Current DC: snmp1 (3232238180) - partition WITHOUT quorum
>> Version: 1.1.12-561c4cf
>> 1 Nodes configured
>> 1 Resources configured
>>
>>
>> Online: [ snmp1 ]
>>
>>
>> Node Attributes:
>> * Node snmp1:
>>
>> Migration summary:
>> * Node snmp1:
>>     prmDummyA: migration-threshold=1 fail-count=1 last-failure='Fri Sep 5
>> 13:11:04 2014'
>>
>> Failed actions:
>>     prmDummyA_monitor_10000 on snmp1 'unknown error' (1): call=7,
>> status=Timed Out, last-rc-change='Fri Sep 5 13:11:04 2014', queued=0ms,
>> exec=0ms
>>
>>
>> I confirmed some problems, but seem to be caused by the fact that an event
>> occurs somehow or other in g_main_loop of lrmd in the period when it is
>> shorter than a monitor.
> So if you create a trivial program with g_main_loop and a timer, and then
> change the system time, does the timer expire early?
>
>> This problem does not seem to happen somehow or other in lrmd of PM1.0.
> cluster-glue was probably using custom timeout code.

Yes it was. Exactly because of this well-known problem. I think recent versions of the glib code have fixed that.

I can't tell you how many different bugs we ran into that related to timing - like this -- or time wraparound. There were probably a dozen time-related bugs. Most of them weren't in the LRM code, but the rest of the universe -- like this one.
I filed the bug against glib probably 8-10 years ago. It takes a while for things to get fixed, then it takes even longer for them to get fixed in the various distros. Some of them are 5+ years behind current code.

FreeBSD had a similar problem - even with our custom code because they weren't following POSIX. But I filed the bug against them, and they fixed it (eventually).

--
Alan Robertson
al...@unix.sh

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
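
For anyone following the thread, here is a minimal sketch of the failure mode being discussed. It assumes the early time-out comes from a deadline computed against wall-clock time, as suggested above. It is not lrmd, glib, or cluster-glue code. The names `Deadline`, `FakeClocks` and `run_scenario` are hypothetical, and the clocks are simulated so nothing touches the real system time. The 30s timeout and the 40s step mirror Step3 of the report.

```python
import time


class Deadline:
    """Timeout tracker driven by an injectable clock function (hypothetical helper)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + timeout_s

    def expired(self):
        return self._clock() >= self._expires_at


class FakeClocks:
    """Simulated wall and monotonic clocks, so a time step can be reproduced safely."""

    def __init__(self):
        self.wall = 1_409_890_200.0  # arbitrary epoch seconds, illustration only
        self.mono = 1000.0           # arbitrary monotonic starting point

    def elapse(self, seconds):
        # Real time passing advances both clocks.
        self.wall += seconds
        self.mono += seconds

    def step_wall(self, seconds):
        # A manual or ntpd time step moves only the wall clock.
        self.wall += seconds


def run_scenario():
    clocks = FakeClocks()
    wall_deadline = Deadline(30, clock=lambda: clocks.wall)
    mono_deadline = Deadline(30, clock=lambda: clocks.mono)
    clocks.elapse(1)      # the monitor has just started
    clocks.step_wall(40)  # stands in for `date -s +40sec`
    return wall_deadline.expired(), mono_deadline.expired()


if __name__ == "__main__":
    wall_expired, mono_expired = run_scenario()
    print("wall-clock deadline expired:", wall_expired)
    print("monotonic deadline expired:", mono_expired)
```

In this simulation the wall-clock deadline reports expiry one second into a 30-second timeout, while the monotonic one does not. Outside the simulation, `Deadline(30)` defaults to `time.monotonic`, which is not affected by changes to the system time.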