14.01.2014, 07:47, "Andrew Beekhof" <and...@beekhof.net>: > Ok, here's what happens: > > 1. node2 is lost > 2. fencing of node2 starts > 3. node2 reboots (and cluster starts) > 4. node2 returns to the membership > 5. node2 is marked as a cluster member > 6. DC tries to bring it into the cluster, but needs to cancel the active > transition first. > Which is a problem since the node2 fencing operation is part of that > 7. node2 is in a transition (pending) state until fencing passes or fails > 8a. fencing fails: transition completes and the node joins the cluster > > Thats in theory, except we automatically try again. Which isn't appropriate. > This should be relatively easy to fix. > > 8b. fencing passes: the node is incorrectly marked as offline > > This I have no idea how to fix yet. > > On another note, it doesn't look like this agent works at all. > The node has been back online for a long time and the agent is still timing > out after 10 minutes. > So "Once the script makes sure that the victim will rebooted and again > available via ssh - it exit with 0." does not seem true.
Damn. Looks like you're right. At some point I broke my agent and didn't notice.
I'll have to figure out what went wrong.

> On 14 Jan 2014, at 1:19 pm, Andrew Beekhof <and...@beekhof.net> wrote:
>
>> Apart from anything else, your timeout needs to be bigger:
>>
>> Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: ( commands.c:1321 )
>> error: log_operation: Operation 'reboot' [11331] (call 2 from crmd.17227) for host
>> 'dev-cluster2-node2.unix.tensor.ru' with device 'st1' returned: -62 (Timer expired)
>>
>> On 14 Jan 2014, at 7:18 am, Andrew Beekhof <and...@beekhof.net> wrote:
>>> On 13 Jan 2014, at 8:31 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>> 13.01.2014, 02:51, "Andrew Beekhof" <and...@beekhof.net>:
>>>>> On 10 Jan 2014, at 9:55 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>> 10.01.2014, 14:31, "Andrey Groshev" <gre...@yandex.ru>:
>>>>>>> 10.01.2014, 14:01, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>> On 10 Jan 2014, at 5:03 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>> 10.01.2014, 05:29, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>> On 9 Jan 2014, at 11:11 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>> 08.01.2014, 06:22, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>> Hi, ALL.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm still trying to cope with the fact that after the fence the
>>>>>>>>>>>>> node hangs in "pending".
>>>>>>>>>>>> Please define "pending". Where did you see this?
>>>>>>>>>>> In crm_mon:
>>>>>>>>>>> ......
>>>>>>>>>>> Node dev-cluster2-node2 (172793105): pending
>>>>>>>>>>> ......
>>>>>>>>>>>
>>>>>>>>>>> The experiment was like this:
>>>>>>>>>>> Four nodes in the cluster.
>>>>>>>>>>> On one of them, kill corosync or pacemakerd (signal 4, 6 or 11).
>>>>>>>>>>> After that, the remaining nodes keep rebooting it under various
>>>>>>>>>>> pretexts: "softly whistling", "flying low", "not a cluster member!" ...
>>>>>>>>>>> Then "Too many failures ...." appears in the log.
>>>>>>>>>>> All this time the node's status in crm_mon is "pending",
>>>>>>>>>>> changing to "UNCLEAN" depending on the wind direction.
>>>>>>>>>>> Much time has passed and I can no longer describe the behaviour accurately...
>>>>>>>>>>>
>>>>>>>>>>> Here is where I am now:
>>>>>>>>>>> I tried to locate the problem and ended up with this.
>>>>>>>>>>> I set a large value for the property stonith-timeout="600s"
>>>>>>>>>>> and got the following behaviour:
>>>>>>>>>>> 1. pkill -4 corosync
>>>>>>>>>>> 2. The DC node calls my fence agent "sshbykey".
>>>>>>>>>>> 3. It reboots the victim and waits until it comes back to life.
>>>>>>>>>> Hmmm.... what version of pacemaker?
>>>>>>>>>> This sounds like a timing issue that we fixed a while back.
>>>>>>>>> It was version 1.1.11 from December 3.
>>>>>>>>> I'll now do a full update and retest.
>>>>>>>> That should be recent enough. Can you create a crm_report the next
>>>>>>>> time you reproduce?
>>>>>>> Of course, yes. A little delay.... :)
>>>>>>>
>>>>>>> ......
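A side note on the timeout discussion quoted above: the reboot operation was killed with -62 (Timer expired), presumably because the fencing timeout was shorter than the agent's reboot-and-wait-for-ssh cycle. Checking and raising the cluster-wide property would look something along these lines with crm_attribute (600s is simply the value used in this thread; pick whatever the agent realistically needs):

    # Show the current cluster-wide fencing timeout, if one is set:
    crm_attribute --type crm_config --name stonith-timeout --query

    # Raise it so a slow reboot-and-wait agent is not killed prematurely:
    crm_attribute --type crm_config --name stonith-timeout --update 600s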
>>>>>>> cc1: warnings being treated as errors
>>>>>>> upstart.c: In function ‘upstart_job_property’:
>>>>>>> upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’
>>>>>>> upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
>>>>>>> upstart.c:264: error: assignment makes pointer from integer without a cast
>>>>>>> gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>>>>>>> gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>>>>>>> make[1]: *** [all-recursive] Error 1
>>>>>>> make[1]: Leaving directory `/root/ha/pacemaker/lib'
>>>>>>> make: *** [core] Error 1
>>>>>>>
>>>>>>> I'm trying to solve this problem.
>>>>>> It is not getting solved quickly...
>>>>>>
>>>>>> https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
>>>>>> g_variant_lookup_value ()    Since 2.28
>>>>>>
>>>>>> # yum list installed glib2
>>>>>> Loaded plugins: fastestmirror, rhnplugin, security
>>>>>> This system is receiving updates from RHN Classic or Red Hat Satellite.
>>>>>> Loading mirror speeds from cached hostfile
>>>>>> Installed Packages
>>>>>> glib2.x86_64        2.26.1-3.el6        installed
>>>>>>
>>>>>> # cat /etc/issue
>>>>>> CentOS release 6.5 (Final)
>>>>>> Kernel \r on an \m
>>>>> Can you try this patch?
>>>>> Upstart jobs won't work, but the code will compile.
>>>>>
>>>>> diff --git a/lib/services/upstart.c b/lib/services/upstart.c
>>>>> index 831e7cf..195c3a4 100644
>>>>> --- a/lib/services/upstart.c
>>>>> +++ b/lib/services/upstart.c
>>>>> @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
>>>>>  static char *
>>>>>  upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>  {
>>>>> +    char *output = NULL;
>>>>> +
>>>>> +#if !GLIB_CHECK_VERSION(2,28,0)
>>>>> +    static bool err = TRUE;
>>>>> +
>>>>> +    if(err) {
>>>>> +        crm_err("This version of glib is too old to support upstart jobs");
>>>>> +        err = FALSE;
>>>>> +    }
>>>>> +#else
>>>>>      GError *error = NULL;
>>>>>      GDBusProxy *proxy;
>>>>>      GVariant *asv = NULL;
>>>>>      GVariant *value = NULL;
>>>>>      GVariant *_ret = NULL;
>>>>> -    char *output = NULL;
>>>>>
>>>>>      crm_info("Calling GetAll on %s", obj);
>>>>>      proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
>>>>> @@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>
>>>>>      g_object_unref(proxy);
>>>>>      g_variant_unref(_ret);
>>>>> +#endif
>>>>>      return output;
>>>>>  }
>>>> Ok :) I patched the source.
>>>> Typed "make rc" - the same error.
>>> Because it's not building your local changes.
>>>> Made a new copy via "fetch" - the same error.
>>>> It seems that if ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz does not
>>>> exist, a new one is created; otherwise the existing archive is used.
>>>> Trimmed log .......
>>>>
>>>> # make rc
>>>> make TAG=Pacemaker-1.1.11-rc3 rpm
>>>> make[1]: Entering directory `/root/ha/pacemaker'
>>>> rm -f pacemaker-dirty.tar.* pacemaker-tip.tar.* pacemaker-HEAD.tar.*
>>>> if [ ! -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz ]; then            \
>>>>     rm -f pacemaker.tar.*;                                                     \
>>>>     if [ Pacemaker-1.1.11-rc3 = dirty ]; then                                  \
>>>>         git commit -m "DO-NOT-PUSH" -a;                                        \
>>>>         git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ HEAD \
>>>>             | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;        \
>>>>         git reset --mixed HEAD^;                                               \
>>>>     else                                                                       \
>>>>         git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ Pacemaker-1.1.11-rc3 \
>>>>             | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;        \
>>>>     fi;                                                                        \
>>>>     echo `date`: Rebuilt ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;    \
>>>> else                                                                           \
>>>>     echo `date`: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>> fi
>>>> Mon Jan 13 13:23:21 MSK 2014: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
>>>> .......
>>>>
>>>> Well, "make rpm" built the RPMs and I created a cluster.
>>>> I ran the same tests and confirmed the behaviour.
>>>> The crm_report log is here - http://send2me.ru/crmrep.tar.bz2
>>> Thanks!
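On the stale-tarball point Andrew raised above ("it's not building your local changes"): going purely by the rule quoted in the build output, the archive is only regenerated when it is missing, and a plain tag build archives the tag rather than the working tree. A rough, hedged sketch of how a local patch could be forced into the build, using the names from this thread and assuming the rpm target accepts TAG= the same way "make rc" passes it:

    cd /root/ha/pacemaker

    # Option 1: force the rc3 tarball to be regenerated
    # (note: this still archives the Pacemaker-1.1.11-rc3 tag, not uncommitted edits):
    rm -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
    make rc

    # Option 2: use the "dirty" branch of the quoted rule, which temporarily
    # commits the working tree and archives HEAD, so local patches are included:
    make TAG=dirty rpm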