Re: [Linux-HA] debugging resource configuration

Dejan Muhamedagic Tue, 02 Nov 2010 03:10:53 -0700

Hi,

On Fri, Oct 29, 2010 at 10:32:07AM -0600, Greg Woods wrote:
> On Thu, 2010-10-28 at 18:38 -0600, Eric Schoeller wrote:
> > Just a shot in the dark here kind of ... but I know that when I had this 
> > type of problem with a stonith device it was timeout related. You could 
> > try boosting your timeouts all around, or even check what
> > 
> > # time /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf start
> > 
> > reports back.
> 
> [r...@vmx1 log]# time /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf start
> 
> real    0m0.261s
> user    0m0.188s
> sys     0m0.068s
> 
> (after which it is working fine; I can connect to the virtual service
> and get properly redirected to the real server).
> 
> I am convinced now that either there is a bug in the resource agent so
> that the monitor process just doesn't work right,  there is something
> obvious and stupid that I just don't see in my configuration that is
> wrong, or else the ldirectord script that I have (which came from the
> CentOS heartbeat-ldirectord package) is incompatible with what the
> resource agent is expecting.


If you're using resource-agents, the package should be named
ldirectord not heartbeat-ldirectord. The two packages should also
have the same release numbers, probably something like 1.0.3-x.

Thanks,

Dejan

> > If timeouts aren't it, I would start breaking out parts of the cluster 
> > config and trying it again until it works
> 
> I still haven't been able to make it work, but I have eliminated a
> number of variables. I got rid of all the IPAddr resources, order
> statements, and colocation statements. All that is there now that is
> relevant to ldirectord is:
> 
> primitive ldirectord ocf:heartbeat:ldirectord \
>         op start interval="20s" timeout="15s" \
>         op stop interval="20s" timeout="15s" \
>         op monitor interval="20s" timeout="20s" 
> 
> (I have actually tried different interval and timeout numbers but the
> result is always the same).
> 
> That's it. Then I configured the eth1:0 interface manually to correspond
> with the IP address of the virtual server configured in ldirectord.cf,
> and ran "crm resource ldirectord start". The result is the same start,
> stop, FAILED scenario repeated. The logs appear to show that it is
> running the status check every 2 seconds or so, despite my interval and
> timeout settings:
> 
> [Fri Oct 29 10:11:06 2010|ldirectord.cf|19214] Starting Linux Director
> v1.186-ha-2.1.3 as daemon
> [Fri Oct 29 10:11:06 2010|ldirectord.cf|19216] Added virtual server:
> 128.117.64.127:25
>  [...]
> 
> [Fri Oct 29 10:11:06 2010|ldirectord.cf|19216] Quiescent real server:
> 128.117.64.123:25 (128.117.64.127:25
> ) (Weight set to 0)
> 
> [...]
> [Fri Oct 29 10:11:06 2010|ldirectord.cf|19216] Restored real server:
> 128.117.64.123:25 (128.117.64.127:25)
>  (Weight set to 1)
> 
> (there are similar pairs of entries for all the declared real servers)
> 
> So far so good, now comes the problem:
> 
> [Fri Oct 29 10:11:06 2010|ldirectord.cf|19221] Invoking ldirectord
> invoked as: /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf status 
> [Fri Oct 29 10:11:06 2010|ldirectord.cf|19221] ldirectord
> for /etc/ha.d/ldirectord.cf is running with pid: 19216
> [Fri Oct 29 10:11:06 2010|ldirectord.cf|19221] Exiting from ldirectord
> status
> [Fri Oct 29 10:11:08 2010|ldirectord.cf|19405] Invoking ldirectord
> invoked as: /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf status 
> [Fri Oct 29 10:11:08 2010|ldirectord.cf|19405] ldirectord
> for /etc/ha.d/ldirectord.cf is running with pid: 19216
> [Fri Oct 29 10:11:08 2010|ldirectord.cf|19405] Exiting from ldirectord
> status
> [Fri Oct 29 10:11:08 2010|ldirectord.cf|19410] Invoking ldirectord
> invoked as: /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf status 
> [Fri Oct 29 10:11:08 2010|ldirectord.cf|19410] ldirectord
> for /etc/ha.d/ldirectord.cf is running with pid: 19216
> [Fri Oct 29 10:11:08 2010|ldirectord.cf|19410] Exiting from ldirectord
> status
> [Fri Oct 29 10:11:08 2010|ldirectord.cf|19416] Invoking ldirectord
> invoked as: /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf stop 
> 
> The status check should have succeeded, but the monitor process thinks
> it failed. Also as can be seen, the status check is repeated only 2
> seconds later. The corresponding log for lrmd shows:
> 
> Oct 29 10:11:05 vmx1.ucar.edu lrmd: [4842]: info: rsc:ldirectord:5526:
> start
> Oct 29 10:11:06 vmx1.ucar.edu lrmd: [4842]: info: RA output:
> (ldirectord:start:stdout) /usr/sbin/ldirector
> d /etc/ha.d/ldirectord.cf start
> Oct 29 10:11:06 vmx1.ucar.edu lrmd: [4842]: info: Managed
> ldirectord:start process 19203 exited with retur
> n code 0.
> Oct 29 10:11:07 vmx1.ucar.edu lrmd: [4842]: info: rsc:ldirectord:5527:
> start
> Oct 29 10:11:07 vmx1.ucar.edu lrmd: [4842]: info: perform_op:2906:
> operation start[5527] on ocf::ldirector
> d::ldirectord for client 4845, its parameters: CRM_meta_interval=[20000]
> CRM_meta_timeout=[15000] crm_feature_set=[3.0.1] CRM_meta_name=[start]
> for rsc is already running.
> Oct 29 10:11:07 vmx1.ucar.edu lrmd: [4842]: info: perform_op:2916:
> postponing all ops on resource ldirectord by 1000 ms
> Oct 29 10:11:07 vmx1.ucar.edu lrmd: [4842]: info: perform_op:2906:
> operation start[5527] on ocf::ldirectord::ldirectord for client 4845,
> its parameters: CRM_meta_interval=[20000] CRM_meta_timeout=[15000]
> crm_feature_set=[3.0.1] CRM_meta_name=[start]  for rsc is already
> running.
> Oct 29 10:11:07 vmx1.ucar.edu lrmd: [4842]: info: perform_op:2910:
> operations on resource ldirectord already delayed
> Oct 29 10:11:08 vmx1.ucar.edu lrmd: [4842]: info: Managed
> ldirectord:start process 19401 exited with return code 0.
> Oct 29 10:11:08 vmx1.ucar.edu lrmd: [4842]: info: rsc:ldirectord:5528:
> stop
> Oct 29 10:11:08 vmx1.ucar.edu lrmd: [4842]: info: Managed
> ldirectord:stop process 19406 exited with return code 0.
> Oct 29 10:11:08 vmx1.ucar.edu lrmd: [4842]: info: rsc:ldirectord:5529:
> monitor
> Oct 29 10:11:08 vmx1.ucar.edu lrmd: [4842]: WARN: Managed
> ldirectord:monitor process 19418 exited with return code 7.
> Oct 29 10:11:10 vmx1.ucar.edu lrmd: [4842]: info: cancel_op: operation
> monitor[5529] on ocf::ldirectord::ldirectord for client 4845, its
> parameters: CRM_meta_interval=[20000] CRM_meta_timeout=[20000]
> crm_feature_set=[3.0.1] CRM_meta_name=[monitor]  cancelled
> Oct 29 10:11:10 vmx1.ucar.edu lrmd: [4842]: info: rsc:ldirectord:5530:
> stop
> Oct 29 10:11:10 vmx1.ucar.edu lrmd: [4842]: info: Managed
> ldirectord:stop process 19432 exited with return code 0.
> 
> 
> Can someone more expert than I interpret this for me? Why does it say
> that it is already running? I checked the obvious and there was
> no /var/run/ldirectord.ldirectord.cf.pid file prior to all this (there
> is now that I have started it manually).
> 
> Thanks,
> --Greg
> 
> 
> 
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] debugging resource configuration

Reply via email to