On Mon, Oct 3, 2011 at 5:50 PM, Proskurin Kirill
<k.prosku...@corp.mail.ru> wrote:
> On 10/03/2011 05:32 AM, Andrew Beekhof wrote:
>>>
>>> corosync-1.4.1
>>> pacemaker-1.1.5
>>> pacemaker runs with "ver: 1"
>
>>> 2)
>>> This one is scary.
>>> Twice I have run into a situation where pacemaker thinks a resource
>>> is started, but it is not.
>>
>> The RA is misbehaving. Pacemaker will only consider a resource running
>> if the RA tells us it is (running or in a failed state).
>
> But as you can see below, the agent returns "7".
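[For context: return code 7 is OCF_NOT_RUNNING in the OCF resource agent
API. Below is a minimal, self-contained sketch of the pidfile-based
monitor pattern that "anything"-style agents typically use; the pidfile
path and function name are illustrative, not taken from the mail.ru agent
discussed in this thread.]

```shell
#!/bin/sh
# Standard OCF return codes (subset).
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

# Illustrative pidfile location; a real agent would take this from
# OCF_RESKEY_* parameters.
pidfile="${TMPDIR:-/tmp}/generic-demo.$$.pid"

generic_monitor() {
    # No pidfile: the resource was never started (or was cleanly
    # stopped), so report OCF_NOT_RUNNING (7).
    [ -f "$pidfile" ] || return $OCF_NOT_RUNNING
    pid=$(cat "$pidfile")
    # Pidfile exists but the process is gone: also "not running".
    if kill -0 "$pid" 2>/dev/null; then
        return $OCF_SUCCESS
    else
        return $OCF_NOT_RUNNING
    fi
}

rc=0
generic_monitor || rc=$?
echo "monitor exit code: $rc"
```

With no process started this prints exit code 7, which is exactly what
the console run of the agent reported further down in the thread.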
It's still broken. Not one stop action succeeds:

Sep 30 13:58:41 mysender34.mail.ru lrmd: [26299]: WARN: tranprocessor:stop
process (PID 4082) timed out (try 1). Killing with signal SIGTERM (15).
Sep 30 14:09:34 mysender34.mail.ru lrmd: [26299]: WARN: tranprocessor:stop
process (PID 21859) timed out (try 1). Killing with signal SIGTERM (15).
Sep 30 20:04:17 mysender34.mail.ru lrmd: [26299]: WARN: tranprocessor:stop
process (PID 24576) timed out (try 1). Killing with signal SIGTERM (15).

/That/ is why pacemaker thinks it's still running.

>
>>> We use a slightly modified version of the "anything" agent for our
>>> scripts, but it is aware of OCF return codes and other such things.
>>>
>>> I ran the monitor action of our agent from the console:
>>> # env -i ; OCF_ROOT=/usr/lib/ocf \
>>>   OCF_RESKEY_binfile=/usr/local/mpop/bin/my/dialogues_notify.pl \
>>>   /usr/lib/ocf/resource.d/mail.ru/generic monitor
>>> # generic[14992]: DEBUG: default monitor : 7
>>>
>>> So our agent says it is not running, but pacemaker still thinks it is.
>>> This went on for 2 days, until I was forced to clean it up -- and then
>>> it discovered within seconds that the resource is not running.
>>
>> Did you configure a recurring monitor operation?
>
> Of course. My primitive configuration is in the original letter; it has:
> op monitor interval="30" timeout="300" on-fail="restart" \
>
> I have hit this a third time, and this time I found in the logs:
> Oct 01 02:00:12 mysender34.mail.ru pengine: [26301]: notice:
> unpack_rsc_op: Ignoring expired failure tranprocessor_stop_0 (rc=-2,
> magic=2:-2;121:690:0:4c16dc39-1fd3-41f2-b582-0236f6b6eccc) on
> mysender34.mail.ru
>
> The resource name is different because these logs are from the third
> occurrence, but the problem is the same.
>
>
>>> 3)
>>> This one is confusing and dangerous.
>>>
>>> I use failure-timeout on most resources to wipe out transient warning
>>> messages from crm_verify -LV, which I use for monitoring the cluster.
>>> All works well, but I found this:
>>>
>>> 1) A resource can't start on a node and migrates to the next one.
>>> 2) It can't start there either, nor on any other node.
>>> 3) It gives up and stops. There are many errors about all this in
>>> crm_verify -LV -- and that is good.
>>> 4) The failure-timeout expires and... wipes out all the errors.
>>> 5) We have a stopped resource and all errors are wiped, so we don't
>>> know whether it was stopped by the hands of an admin or because of
>>> errors.
>
>>> I think failure-timeout should not apply to a stopped resource.
>>> Any chance to avoid this?

No.

>> Not sure why you think this is dangerous; the cluster is doing exactly
>> what you told it to.
>> If you want resources to stay stopped, either set failure-timeout=0
>> (disabled) or set the target-role to Stopped.
>
> No, I want to use failure-timeout, but not have it wipe out errors when
> the resource has already been stopped by pacemaker because of errors,
> not by an admin's hands.
>
> --
> Best regards,
> Proskurin Kirill
>
> _______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
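[Editor's note: Andrew's two suggestions map onto crm shell commands
roughly as below. The resource name "tranprocessor" is taken from the
log lines above; the full primitive definition is in the original letter
and is assumed here, so treat this as a sketch against a live cluster,
not something runnable standalone.]

```shell
# Option 1: never expire failures for this resource, so old errors stay
# visible in crm_verify -LV (failure-timeout=0 disables expiry).
crm resource meta tranprocessor set failure-timeout 0

# Option 2: keep a failure-timeout, but pin the resource down explicitly
# when it must stay stopped; "crm resource stop" sets target-role=Stopped,
# which is not affected by failure expiry.
crm resource stop tranprocessor
```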