Hi, On Fri, Oct 29, 2010 at 10:32:07AM -0600, Greg Woods wrote: > On Thu, 2010-10-28 at 18:38 -0600, Eric Schoeller wrote: > > Just a shot in the dark here kind of ... but I know that when I had this > > type of problem with a stonith device it was timeout related. You could > > try boosting your timeouts all around, or even check what > > > > # time /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf start > > > > reports back. > > [r...@vmx1 log]# time /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf start > > real 0m0.261s > user 0m0.188s > sys 0m0.068s > > (after which it is working fine; I can connect to the virtual service > and get properly redirected to the real server). > > I am convinced now that either there is a bug in the resource agent so > that the monitor process just doesn't work right, there is something > obvious and stupid that I just don't see in my configuration that is > wrong, or else the ldirectord script that I have (which came from the > CentOS heartbeat-ldirectord package) is incompatible with what the > resource agent is expecting.
If you're using resource-agents, the package should be named ldirectord not heartbeat-ldirectord. The two packages should also have the same release numbers, probably something like 1.0.3-x. Thanks, Dejan > > If timeouts aren't it, I would start breaking out parts of the cluster > > config and trying it again until it works > > I still haven't been able to make it work, but I have eliminated a > number of variables. I got rid of all the IPAddr resources, order > statements, and colocation statements. All that is there now that is > relevant to ldirectord is: > > primitive ldirectord ocf:heartbeat:ldirectord \ > op start interval="20s" timeout="15s" \ > op stop interval="20s" timeout="15s" \ > op monitor interval="20s" timeout="20s" > > (I have actually tried different interval and timeout numbers but the > result is always the same). > > That's it. Then I configured the eth1:0 interface manually to correspond > with the IP address of the virtual server configured in ldirectord.cf, > and ran "crm resource ldirectord start". The result is the same start, > stop, FAILED scenario repeated. The logs appear to show that it is > running the status check every 2 seconds or so, despite my interval and > timeout settings: > > [Fri Oct 29 10:11:06 2010|ldirectord.cf|19214] Starting Linux Director > v1.186-ha-2.1.3 as daemon > [Fri Oct 29 10:11:06 2010|ldirectord.cf|19216] Added virtual server: > 128.117.64.127:25 > [...] > > [Fri Oct 29 10:11:06 2010|ldirectord.cf|19216] Quiescent real server: > 128.117.64.123:25 (128.117.64.127:25 > ) (Weight set to 0) > > [...] > [Fri Oct 29 10:11:06 2010|ldirectord.cf|19216] Restored real server: > 128.117.64.123:25 (128.117.64.127:25) > (Weight set to 1) > > (there are similar pairs of entries for all the declared real servers) > > So far so good, now comes the problem: > > [Fri Oct 29 10:11:06 2010|ldirectord.cf|19221] Invoking ldirectord > invoked as: /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf status > [Fri Oct 29 10:11:06 2010|ldirectord.cf|19221] ldirectord > for /etc/ha.d/ldirectord.cf is running with pid: 19216 > [Fri Oct 29 10:11:06 2010|ldirectord.cf|19221] Exiting from ldirectord > status > [Fri Oct 29 10:11:08 2010|ldirectord.cf|19405] Invoking ldirectord > invoked as: /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf status > [Fri Oct 29 10:11:08 2010|ldirectord.cf|19405] ldirectord > for /etc/ha.d/ldirectord.cf is running with pid: 19216 > [Fri Oct 29 10:11:08 2010|ldirectord.cf|19405] Exiting from ldirectord > status > [Fri Oct 29 10:11:08 2010|ldirectord.cf|19410] Invoking ldirectord > invoked as: /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf status > [Fri Oct 29 10:11:08 2010|ldirectord.cf|19410] ldirectord > for /etc/ha.d/ldirectord.cf is running with pid: 19216 > [Fri Oct 29 10:11:08 2010|ldirectord.cf|19410] Exiting from ldirectord > status > [Fri Oct 29 10:11:08 2010|ldirectord.cf|19416] Invoking ldirectord > invoked as: /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf stop > > The status check should have succeeded, but the monitor process thinks > it failed. Also as can be seen, the status check is repeated only 2 > seconds later. The corresponding log for lrmd shows: > > Oct 29 10:11:05 vmx1.ucar.edu lrmd: [4842]: info: rsc:ldirectord:5526: > start > Oct 29 10:11:06 vmx1.ucar.edu lrmd: [4842]: info: RA output: > (ldirectord:start:stdout) /usr/sbin/ldirector > d /etc/ha.d/ldirectord.cf start > Oct 29 10:11:06 vmx1.ucar.edu lrmd: [4842]: info: Managed > ldirectord:start process 19203 exited with retur > n code 0. > Oct 29 10:11:07 vmx1.ucar.edu lrmd: [4842]: info: rsc:ldirectord:5527: > start > Oct 29 10:11:07 vmx1.ucar.edu lrmd: [4842]: info: perform_op:2906: > operation start[5527] on ocf::ldirector > d::ldirectord for client 4845, its parameters: CRM_meta_interval=[20000] > CRM_meta_timeout=[15000] crm_feature_set=[3.0.1] CRM_meta_name=[start] > for rsc is already running. > Oct 29 10:11:07 vmx1.ucar.edu lrmd: [4842]: info: perform_op:2916: > postponing all ops on resource ldirectord by 1000 ms > Oct 29 10:11:07 vmx1.ucar.edu lrmd: [4842]: info: perform_op:2906: > operation start[5527] on ocf::ldirectord::ldirectord for client 4845, > its parameters: CRM_meta_interval=[20000] CRM_meta_timeout=[15000] > crm_feature_set=[3.0.1] CRM_meta_name=[start] for rsc is already > running. > Oct 29 10:11:07 vmx1.ucar.edu lrmd: [4842]: info: perform_op:2910: > operations on resource ldirectord already delayed > Oct 29 10:11:08 vmx1.ucar.edu lrmd: [4842]: info: Managed > ldirectord:start process 19401 exited with return code 0. > Oct 29 10:11:08 vmx1.ucar.edu lrmd: [4842]: info: rsc:ldirectord:5528: > stop > Oct 29 10:11:08 vmx1.ucar.edu lrmd: [4842]: info: Managed > ldirectord:stop process 19406 exited with return code 0. > Oct 29 10:11:08 vmx1.ucar.edu lrmd: [4842]: info: rsc:ldirectord:5529: > monitor > Oct 29 10:11:08 vmx1.ucar.edu lrmd: [4842]: WARN: Managed > ldirectord:monitor process 19418 exited with return code 7. > Oct 29 10:11:10 vmx1.ucar.edu lrmd: [4842]: info: cancel_op: operation > monitor[5529] on ocf::ldirectord::ldirectord for client 4845, its > parameters: CRM_meta_interval=[20000] CRM_meta_timeout=[20000] > crm_feature_set=[3.0.1] CRM_meta_name=[monitor] cancelled > Oct 29 10:11:10 vmx1.ucar.edu lrmd: [4842]: info: rsc:ldirectord:5530: > stop > Oct 29 10:11:10 vmx1.ucar.edu lrmd: [4842]: info: Managed > ldirectord:stop process 19432 exited with return code 0. > > > Can someone more expert than I interpret this for me? Why does it say > that it is already running? I checked the obvious and there was > no /var/run/ldirectord.ldirectord.cf.pid file prior to all this (there > is now that I have started it manually). > > Thanks, > --Greg > > > > _______________________________________________ > Linux-HA mailing list > [email protected] > http://lists.linux-ha.org/mailman/listinfo/linux-ha > See also: http://linux-ha.org/ReportingProblems _______________________________________________ Linux-HA mailing list [email protected] http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
