On 06/09/2013, at 1:23 AM, David Coulson <da...@davidcoulson.net> wrote:

> We patched and rebooted one of our clusters this morning. I verified that the
> pacemaker version is the same as before, and that it matches another, similar cluster.
> 
> There is a resource in the cluster defined as:
> 
> primitive re-named-reload ocf:heartbeat:anything \
>        params binfile="/usr/sbin/rndc" cmdline_options="reload"
> 
> It is the last resource in a group, after the named (LSB) resource and an
> ipaddr resource, so that named binds to the VIP.
> 
> After the reboot the re-named-reload resource is misbehaving. The start
> seems to work, but the monitor fails and the stop doesn't work:
> 
> Sep  5 11:14:14 dresproddns02 lrmd[82091]:   notice: operation_finished: re-named-reload_stop_0:582081 [ /usr/lib/ocf/resource.d/heartbeat/anything: line 60: kill: (580334) - No such process ]
> Sep  5 11:14:14 dresproddns02 crmd[82092]:   notice: process_lrm_event: LRM operation re-named-reload_stop_0 (call=33446, rc=0, cib-update=11044, confirmed=true) ok
> Sep  5 11:14:15 dresproddns02 crmd[82092]:   notice: process_lrm_event: LRM operation re-named-reload_start_0 (call=33450, rc=0, cib-update=11045, confirmed=true) ok
> Sep  5 11:14:15 dresproddns02 lrmd[82091]:   notice: operation_finished: re-named-reload_monitor_60000:582121 [ /usr/lib/ocf/resource.d/heartbeat/anything: line 60: kill: (582109) - No such process ]
> Sep  5 11:14:15 dresproddns02 crmd[82092]:   notice: process_lrm_event: LRM operation re-named-reload_monitor_60000 (call=33453, rc=1, cib-update=11046, confirmed=false) unknown error
> 
> ocf-tester fails on both clusters:
> 
> ocf-tester -n reload -o binfile="/usr/sbin/rndc" -o cmdline_options="reload" \
>     /usr/lib/ocf/resource.d/heartbeat/anything
> Beginning tests for /usr/lib/ocf/resource.d/heartbeat/anything...
> * rc=1: Monitoring an active resource should return 0
> * rc=1: Probing an active resource should return 0
> * Your agent does not support the notify action (optional)
> * Your agent does not support the demote action (optional)
> * Your agent does not support the promote action (optional)
> * Your agent does not support master/slave (optional)
> * rc=1: Monitoring an active resource should return 0
> * rc=1: Monitoring an active resource should return 0
> * Your agent does not support the reload action (optional)
> Tests failed: /usr/lib/ocf/resource.d/heartbeat/anything failed 4 tests
> 
> So I guess the real question is: why does it work at all on the cluster where
> it does work? The rndc process doesn't stay around for more than a few
> seconds, so the monitor should never actually see it running.
> 
> I copied the heartbeat/anything script from the working environment to the
> broken one, and we still have the same issue.
> 
> Short of writing a resource agent that runs the command on start and forces
> rc=0 for stop and monitor, does anyone have an idea why it behaves this way?

My guess is that there is a stale pid file lying around, or that the way the
agent works out the pid of binfile is not smart enough.
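
For a command like `rndc reload` that exits within seconds, a PID-based monitor can only report "not running" once the process is gone, which is consistent with the `kill: (...) - No such process` lines in the logs above. One common alternative is a one-shot agent that tracks state with a marker file instead of a PID. The following is a minimal sketch of that idea, not a drop-in agent: `CMD`, `STATE_FILE` and the `oneshot_*` functions are hypothetical names, not parameters of ocf:heartbeat:anything, and it omits the meta-data and validate-all actions a real OCF agent needs, so ocf-tester would still reject it as written.

```shell
#!/bin/sh
# Hypothetical one-shot agent sketch: run a short-lived command on start and
# track "started" state with a marker file rather than a process ID.
# Not ocf:heartbeat:anything; meta-data/validate-all are deliberately omitted.

# OCF return codes (normally sourced from the resource-agents shell library).
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_ERR_UNIMPLEMENTED=3
OCF_NOT_RUNNING=7

# Illustrative defaults (assumptions); override via the environment.
STATE_FILE="${STATE_FILE:-/var/run/oneshot-reload.state}"
CMD="${CMD:-/usr/sbin/rndc reload}"

oneshot_start() {
    # Run the command once (simple word splitting); non-zero exit means start failed.
    $CMD || return $OCF_ERR_GENERIC
    touch "$STATE_FILE" || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}

oneshot_stop() {
    # Nothing to kill: the command has already exited. Just clear the marker.
    rm -f "$STATE_FILE"
    return $OCF_SUCCESS
}

oneshot_monitor() {
    # "Running" means start succeeded and stop has not run since.
    if [ -f "$STATE_FILE" ]; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

oneshot_dispatch() {
    case "$1" in
        start)   oneshot_start ;;
        stop)    oneshot_stop ;;
        monitor) oneshot_monitor ;;
        *)       return $OCF_ERR_UNIMPLEMENTED ;;
    esac
}

# Only dispatch when invoked with an action, e.g. "oneshot.sh monitor".
if [ $# -gt 0 ]; then
    oneshot_dispatch "$1"
    exit $?
fi
```

If you go this way, the marker file should live somewhere that is typically cleared at boot (such as /var/run), so that a probe after a reboot reports the resource as stopped instead of trusting a stale marker, which is the same failure mode as a stale pid file.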


> 
> David
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
