14.01.2014, 10:00, "Andrey Groshev" <gre...@yandex.ru>:
> 14.01.2014, 07:47, "Andrew Beekhof" <and...@beekhof.net>:
>
>> Ok, here's what happens:
>>
>> 1. node2 is lost
>> 2. fencing of node2 starts
>> 3. node2 reboots (and the cluster starts)
>> 4. node2 returns to the membership
>> 5. node2 is marked as a cluster member
>> 6. the DC tries to bring it into the cluster, but needs to cancel the
>>    active transition first. Which is a problem, since the node2 fencing
>>    operation is part of that transition.
>> 7. node2 stays in a transition (pending) state until fencing passes or fails
>> 8a. fencing fails: the transition completes and the node joins the cluster
>>
>> That's the theory, except that we automatically try again, which isn't
>> appropriate. This should be relatively easy to fix.
>>
>> 8b. fencing passes: the node is incorrectly marked as offline
>>
>> This I have no idea how to fix yet.
>>
>> On another note, it doesn't look like this agent works at all.
>> The node has been back online for a long time and the agent is still
>> timing out after 10 minutes.
>> So "Once the script makes sure that the victim has rebooted and is again
>> available via ssh, it exits with 0" does not seem to be true.
>
> Damn. Looks like you're right. At some point I broke my agent and had not
> noticed it. I will figure it out.

I repaired my agent: after sending the reboot it was left waiting on STDIN.
The "normal" behavior has returned - the node hangs in "pending" until I
manually send a reboot. :)
New logs: http://send2me.ru/crmrep1.tar.bz2
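For clarity, this is roughly the logic the agent is supposed to implement:
send a reboot over ssh, then poll until sshd answers again, and only then
exit 0. A simplified sketch in C, not my actual script - the host name and
timeout are placeholders:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define VICTIM  "dev-cluster2-node2.unix.tensor.ru"
#define TIMEOUT 600          /* seconds; must fit inside stonith-timeout */

/* Returns 1 once the victim answers over ssh again. BatchMode makes a
 * broken key fail immediately instead of hanging on a password prompt. */
static int ssh_ok(const char *host)
{
    char cmd[256];
    snprintf(cmd, sizeof(cmd),
             "ssh -o BatchMode=yes -o ConnectTimeout=5 %s true", host);
    return system(cmd) == 0;
}

int main(void)
{
    char cmd[256];
    int waited = 0;

    /* The </dev/null is the kind of detail my script got wrong:
     * without it, the agent sat there waiting on STDIN after the reboot. */
    snprintf(cmd, sizeof(cmd),
             "ssh -o BatchMode=yes %s reboot </dev/null", VICTIM);
    system(cmd);             /* the connection may die as the node drops */

    sleep(15);               /* give the victim time to actually go down */
    while (!ssh_ok(VICTIM)) {
        if ((waited += 10) >= TIMEOUT)
            return 1;        /* give up: report fencing as failed */
        sleep(10);
    }
    return 0;                /* victim is back up: fencing succeeded */
}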
>
>> On 14 Jan 2014, at 1:19 pm, Andrew Beekhof <and...@beekhof.net> wrote:
>>> Apart from anything else, your timeout needs to be bigger:
>>>
>>> Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng:
>>> ( commands.c:1321 ) error: log_operation: Operation 'reboot' [11331]
>>> (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru'
>>> with device 'st1' returned: -62 (Timer expired)
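(By the way, the -62 above is just a negated Linux errno, ETIME. A tiny
check program - my own illustration, nothing from the pacemaker tree:

#include <errno.h>   /* ETIME */
#include <stdio.h>
#include <string.h>  /* strerror */

int main(void)
{
    /* stonith-ng reports a failed operation as a negative errno */
    printf("-%d: %s\n", ETIME, strerror(ETIME)); /* "-62: Timer expired" */
    return 0;
}

So "returned: -62 (Timer expired)" means the operation simply ran out of
time, consistent with the timeout being too small.)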
>>> On 14 Jan 2014, at 7:18 am, Andrew Beekhof <and...@beekhof.net> wrote:
>>>> On 13 Jan 2014, at 8:31 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>> 13.01.2014, 02:51, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>> On 10 Jan 2014, at 9:55 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>> 10.01.2014, 14:31, "Andrey Groshev" <gre...@yandex.ru>:
>>>>>>>> 10.01.2014, 14:01, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>> On 10 Jan 2014, at 5:03 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>> 10.01.2014, 05:29, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>> On 9 Jan 2014, at 11:11 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>> 08.01.2014, 06:22, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>>> Hi, ALL.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm still trying to cope with the fact that after the fence
>>>>>>>>>>>>>> the node hangs in "pending".
>>>>>>>>>>>>> Please define "pending". Where did you see this?
>>>>>>>>>>>> In crm_mon:
>>>>>>>>>>>> ......
>>>>>>>>>>>> Node dev-cluster2-node2 (172793105): pending
>>>>>>>>>>>> ......
>>>>>>>>>>>>
>>>>>>>>>>>> The experiment went like this:
>>>>>>>>>>>> Four nodes in the cluster.
>>>>>>>>>>>> On one of them, kill corosync or pacemakerd (signal 4, 6 or 11).
>>>>>>>>>>>> After that, the remaining nodes constantly reboot it, under
>>>>>>>>>>>> various pretexts: "softly whistling", "fly low", "not a cluster
>>>>>>>>>>>> member!" ...
>>>>>>>>>>>> Then "Too many failures ...." fell out in the log.
>>>>>>>>>>>> All this time the status in crm_mon was "pending".
>>>>>>>>>>>> Depending on the wind direction, it changed to "UNCLEAN".
>>>>>>>>>>>> Much time has passed and I cannot accurately describe the behavior...
>>>>>>>>>>>>
>>>>>>>>>>>> Now I am in the following state:
>>>>>>>>>>>> I tried to locate the problem, and came here with this.
>>>>>>>>>>>> I set a big value in the property stonith-timeout="600s".
>>>>>>>>>>>> And got the following behavior:
>>>>>>>>>>>> 1. pkill -4 corosync
>>>>>>>>>>>> 2. the DC node calls my fence agent "sshbykey"
>>>>>>>>>>>> 3. It sends a reboot to the victim and waits until it comes
>>>>>>>>>>>>    back to life.
>>>>>>>>>>> Hmmm.... what version of pacemaker?
>>>>>>>>>>> This sounds like a timing issue that we fixed a while back
>>>>>>>>>> It was version 1.1.11 from December 3.
>>>>>>>>>> Now I will do a full update and retest.
>>>>>>>>> That should be recent enough. Can you create a crm_report the next
>>>>>>>>> time you reproduce?
>>>>>>>> Of course, yes. A little delay.... :)
>>>>>>>>
>>>>>>>> ......
>>>>>>>> cc1: warnings being treated as errors
>>>>>>>> upstart.c: In function ‘upstart_job_property’:
>>>>>>>> upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’
>>>>>>>> upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
>>>>>>>> upstart.c:264: error: assignment makes pointer from integer without a cast
>>>>>>>> gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>>>>>>>> gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>>>>>>>> make[1]: *** [all-recursive] Error 1
>>>>>>>> make[1]: Leaving directory `/root/ha/pacemaker/lib'
>>>>>>>> make: *** [core] Error 1
>>>>>>>>
>>>>>>>> I'm trying to solve this problem.
>>>>>>> It is not getting solved quickly...
>>>>>>>
>>>>>>> https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
>>>>>>> g_variant_lookup_value () Since 2.28
>>>>>>>
>>>>>>> # yum list installed glib2
>>>>>>> Loaded plugins: fastestmirror, rhnplugin, security
>>>>>>> This system is receiving updates from RHN Classic or Red Hat Satellite.
>>>>>>> Loading mirror speeds from cached hostfile
>>>>>>> Installed Packages
>>>>>>> glib2.x86_64        2.26.1-3.el6        installed
>>>>>>>
>>>>>>> # cat /etc/issue
>>>>>>> CentOS release 6.5 (Final)
>>>>>>> Kernel \r on an \m
>>>>>> Can you try this patch?
>>>>>> Upstart jobs won't work, but the code will compile.
>>>>>>
>>>>>> diff --git a/lib/services/upstart.c b/lib/services/upstart.c
>>>>>> index 831e7cf..195c3a4 100644
>>>>>> --- a/lib/services/upstart.c
>>>>>> +++ b/lib/services/upstart.c
>>>>>> @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
>>>>>>  static char *
>>>>>>  upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>  {
>>>>>> +    char *output = NULL;
>>>>>> +
>>>>>> +#if !GLIB_CHECK_VERSION(2,28,0)
>>>>>> +    static bool err = TRUE;
>>>>>> +
>>>>>> +    if(err) {
>>>>>> +        crm_err("This version of glib is too old to support upstart jobs");
>>>>>> +        err = FALSE;
>>>>>> +    }
>>>>>> +#else
>>>>>>      GError *error = NULL;
>>>>>>      GDBusProxy *proxy;
>>>>>>      GVariant *asv = NULL;
>>>>>>      GVariant *value = NULL;
>>>>>>      GVariant *_ret = NULL;
>>>>>> -    char *output = NULL;
>>>>>>
>>>>>>      crm_info("Calling GetAll on %s", obj);
>>>>>>      proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
>>>>>> @@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>
>>>>>>      g_object_unref(proxy);
>>>>>>      g_variant_unref(_ret);
>>>>>> +#endif
>>>>>>      return output;
>>>>>>  }
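For anyone hitting the same build failure: the patch gates the whole body on
GLIB_CHECK_VERSION, which is evaluated at compile time against the glib
headers, so on glib 2.26 the missing symbol is never referenced. A standalone
sketch of the same idea (my own illustration, not pacemaker code; build with
`pkg-config --cflags --libs glib-2.0`):

#include <glib.h>
#include <stdio.h>

int main(void)
{
#if GLIB_CHECK_VERSION(2,28,0)
    /* Safe: g_variant_lookup_value() exists since glib 2.28.
     * For an a{sv} dictionary it also unboxes the variant value. */
    GVariant *dict = g_variant_ref_sink(g_variant_new_parsed("{'key': <'value'>}"));
    GVariant *v = g_variant_lookup_value(dict, "key", G_VARIANT_TYPE_STRING);
    printf("found: %s\n", g_variant_get_string(v, NULL));
    g_variant_unref(v);
    g_variant_unref(dict);
#else
    /* On glib 2.26 (CentOS 6) this branch compiles instead. */
    printf("glib %d.%d is too old for g_variant_lookup_value()\n",
           GLIB_MAJOR_VERSION, GLIB_MINOR_VERSION);
#endif
    return 0;
}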
>>>>> Ok :) I patched the source.
>>>>> Typed "make rc" - the same error.
>>>> Because it's not building your local changes.
>>>>> Made a new copy via "fetch" - the same error.
>>>>> It seems that if ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz does
>>>>> not exist, it is created; otherwise the existing archive is used.
>>>>> Trimmed log .......
>>>>>
>>>>> # make rc
>>>>> make TAG=Pacemaker-1.1.11-rc3 rpm
>>>>> make[1]: Entering directory `/root/ha/pacemaker'
>>>>> rm -f pacemaker-dirty.tar.* pacemaker-tip.tar.* pacemaker-HEAD.tar.*
>>>>> if [ ! -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz ]; then \
>>>>>     rm -f pacemaker.tar.*; \
>>>>>     if [ Pacemaker-1.1.11-rc3 = dirty ]; then \
>>>>>         git commit -m "DO-NOT-PUSH" -a; \
>>>>>         git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ HEAD | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>         git reset --mixed HEAD^; \
>>>>>     else \
>>>>>         git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ Pacemaker-1.1.11-rc3 | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>     fi; \
>>>>>     echo `date`: Rebuilt ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>> else \
>>>>>     echo `date`: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>> fi
>>>>> Mon Jan 13 13:23:21 MSK 2014: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
>>>>> .......
>>>>>
>>>>> Well, "make rpm" built the rpms and I created a cluster.
>>>>> I ran the same tests and confirmed the behavior.
>>>>> The crm_report log is here - http://send2me.ru/crmrep.tar.bz2
>>>> Thanks!

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org