On 06/09/2013, at 1:23 AM, David Coulson <da...@davidcoulson.net> wrote:
> We patched and rebooted one of our clusters this morning - I verified that
> pacemaker is the same as previous, plus it matches another similar cluster.
>
> There is a resource in the cluster defined as:
>
> primitive re-named-reload ocf:heartbeat:anything \
>     params binfile="/usr/sbin/rndc" cmdline_options="reload"
>
> This is the last resource in a group after the named:lsb and an ipaddr
> resource, so named binds to the VIP
>
> After the reboot the re-named-reload resource is all screwed up. The start
> seems to work, but the monitor is failing and the stop doesn't work:
>
> Sep 5 11:14:14 dresproddns02 lrmd[82091]: notice: operation_finished:
> re-named-reload_stop_0:582081 [ /usr/lib/ocf/resource.d/heartbeat/anything:
> line 60: kill: (580334) - No such process ]
> Sep 5 11:14:14 dresproddns02 crmd[82092]: notice: process_lrm_event: LRM
> operation re-named-reload_stop_0 (call=33446, rc=0, cib-update=11044,
> confirmed=true) ok
> Sep 5 11:14:15 dresproddns02 crmd[82092]: notice: process_lrm_event: LRM
> operation re-named-reload_start_0 (call=33450, rc=0, cib-update=11045,
> confirmed=true) ok
> Sep 5 11:14:15 dresproddns02 lrmd[82091]: notice: operation_finished:
> re-named-reload_monitor_60000:582121 [
> /usr/lib/ocf/resource.d/heartbeat/anything: line 60: kill: (582109) - No such
> process ]
> Sep 5 11:14:15 dresproddns02 crmd[82092]: notice: process_lrm_event: LRM
> operation re-named-reload_monitor_60000 (call=33453, rc=1, cib-update=11046,
> confirmed=false) unknown error
>
> The ocf-tester fails on both clusters
>
> ocf-tester -n reload -o binfile="/usr/sbin/rndc" -o cmdline_options="reload"
> /usr/lib/ocf/resource.d/heartbeat/anything
> Beginning tests for /usr/lib/ocf/resource.d/heartbeat/anything...
> * rc=1: Monitoring an active resource should return 0
> * rc=1: Probing an active resource should return 0
> * Your agent does not support the notify action (optional)
> * Your agent does not support the demote action (optional)
> * Your agent does not support the promote action (optional)
> * Your agent does not support master/slave (optional)
> * rc=1: Monitoring an active resource should return 0
> * rc=1: Monitoring an active resource should return 0
> * Your agent does not support the reload action (optional)
> Tests failed: /usr/lib/ocf/resource.d/heartbeat/anything failed 4 tests
>
> So, I guess the question is really - Why is it working at all on the cluster
> it is working on? The rndc process doesn't hang around for more than a few
> seconds, so the monitor should never really see it running.
>
> I did copy over the heartbeat/anything script from the working environment to
> the broken one, and we have the same issue.
>
> Short of writing a resource that does a start and forces a rc=0 for
> stop/monitor, any ideas why this is behaving the way it is?

I'm guessing there is a stale pid file lying around, or the way the pid of
binfile is determined is not smart enough.

>
> David
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
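
To illustrate the stale pid file guess, here is a minimal shell sketch. It is not taken from the anything agent. It assumes the monitor boils down to signalling the pid recorded at start, which is what the "kill: ... No such process" messages in the log suggest. The pid file path and the pid_is_alive helper are hypothetical names made up for the example, and a trivial command stands in for "rndc reload".

```shell
#!/bin/sh
# Hypothetical pid file location, for illustration only; the real
# agent's pid file path is not shown in this thread.
pidfile="${TMPDIR:-/tmp}/stale-pid-demo.$$"

# Stand-in for a short-lived command such as "rndc reload".
sh -c 'exit 0' &
echo $! > "$pidfile"
wait    # the command has already finished, so the recorded pid is stale

# Assumed shape of a pid-based liveness check (signal 0 only probes).
pid_is_alive() {
    kill -0 "$1" 2>/dev/null
}

if pid_is_alive "$(cat "$pidfile")"; then
    status="running"
else
    status="not running"
fi
echo "pid $(cat "$pidfile"): $status"
rm -f "$pidfile"
```

If the agent behaves roughly like this, any monitor that runs after the command has exited would report the resource as not running, which would fit the rc=1 results from both the cluster and ocf-tester.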