On 15 Jan 2014, at 12:15 am, Andrey Groshev <gre...@yandex.ru> wrote:
>
> 14.01.2014, 10:00, "Andrey Groshev" <gre...@yandex.ru>:
>> 14.01.2014, 07:47, "Andrew Beekhof" <and...@beekhof.net>:
>>
>>> Ok, here's what happens:
>>>
>>> 1. node2 is lost
>>> 2. fencing of node2 starts
>>> 3. node2 reboots (and cluster starts)
>>> 4. node2 returns to the membership
>>> 5. node2 is marked as a cluster member
>>> 6. DC tries to bring it into the cluster, but needs to cancel the active
>>>    transition first.
>>>    Which is a problem since the node2 fencing operation is part of that
>>> 7. node2 is in a transition (pending) state until fencing passes or fails
>>> 8a. fencing fails: transition completes and the node joins the cluster
>>>
>>> That's in theory, except we automatically try again. Which isn't
>>> appropriate.
>>> This should be relatively easy to fix.
>>>
>>> 8b. fencing passes: the node is incorrectly marked as offline
>>>
>>> This I have no idea how to fix yet.
>>>
>>> On another note, it doesn't look like this agent works at all.
>>> The node has been back online for a long time and the agent is still
>>> timing out after 10 minutes.
>>> So "Once the script makes sure that the victim will rebooted and again
>>> available via ssh - it exit with 0." does not seem true.
>>
>> Damn. Looks like you're right. At some point I broke my agent and did not
>> notice it. Who will understand.
>
> I repaired my agent: after sending the reboot it was waiting on STDIN.
> The "normal" behavior is back: the node hangs in "pending" until I send a
> reboot manually.
> :)

Right. Now you're in case 8b.
Can you try this patch: http://paste.fedoraproject.org/68450/38973966

> New logs: http://send2me.ru/crmrep1.tar.bz2
>
>>
>>> On 14 Jan 2014, at 1:19 pm, Andrew Beekhof <and...@beekhof.net> wrote:
>>>> Apart from anything else, your timeout needs to be bigger:
>>>>
>>>> Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: (
>>>> commands.c:1321 ) error: log_operation: Operation 'reboot' [11331]
>>>> (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with
>>>> device 'st1' returned: -62 (Timer expired)
>>>>
>>>> On 14 Jan 2014, at 7:18 am, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>> On 13 Jan 2014, at 8:31 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>> 13.01.2014, 02:51, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>> On 10 Jan 2014, at 9:55 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>> 10.01.2014, 14:31, "Andrey Groshev" <gre...@yandex.ru>:
>>>>>>>>> 10.01.2014, 14:01, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>> On 10 Jan 2014, at 5:03 pm, Andrey Groshev <gre...@yandex.ru>
>>>>>>>>>> wrote:
>>>>>>>>>>> 10.01.2014, 05:29, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>> On 9 Jan 2014, at 11:11 pm, Andrey Groshev <gre...@yandex.ru>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 08.01.2014, 06:22, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev
>>>>>>>>>>>>>> <gre...@yandex.ru> wrote:
>>>>>>>>>>>>>>> Hi, ALL.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm still trying to cope with the fact that after the
>>>>>>>>>>>>>>> fence - node hangs in "pending".
>>>>>>>>>>>>>> Please define "pending". Where did you see this?
>>>>>>>>>>>>> In crm_mon:
>>>>>>>>>>>>> ......
>>>>>>>>>>>>> Node dev-cluster2-node2 (172793105): pending
>>>>>>>>>>>>> ......
>>>>>>>>>>>>>
>>>>>>>>>>>>> The experiment was like this:
>>>>>>>>>>>>> Four nodes in cluster.
>>>>>>>>>>>>> On one of them kill corosync or pacemakerd (signal 4 or 6 or
>>>>>>>>>>>>> 11).
>>>>>>>>>>>>> Thereafter, the remaining nodes constantly reboot it, under
>>>>>>>>>>>>> various pretexts: "softly whistling", "fly low", "not a cluster
>>>>>>>>>>>>> member!" ...
>>>>>>>>>>>>> Then "Too many failures ...." showed up in the log.
>>>>>>>>>>>>> All this time the status in crm_mon is "pending".
>>>>>>>>>>>>> Depending on the wind direction it changed to "UNCLEAN".
>>>>>>>>>>>>> Much time has passed and I cannot accurately describe the
>>>>>>>>>>>>> behavior...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now I am in the following state:
>>>>>>>>>>>>> I tried to locate the problem and came here with this.
>>>>>>>>>>>>> I set a big value in the property stonith-timeout="600s".
>>>>>>>>>>>>> And got the following behavior:
>>>>>>>>>>>>> 1. pkill -4 corosync
>>>>>>>>>>>>> 2. My fence agent "sshbykey" is called from the node with the DC.
>>>>>>>>>>>>> 3. It sends a reboot to the victim and waits until she comes to
>>>>>>>>>>>>>    life again.
>>>>>>>>>>>> Hmmm.... what version of pacemaker?
>>>>>>>>>>>> This sounds like a timing issue that we fixed a while back
>>>>>>>>>>> It was version 1.1.11 from December 3.
>>>>>>>>>>> Now I will try a full update and retest.
>>>>>>>>>> That should be recent enough. Can you create a crm_report the
>>>>>>>>>> next time you reproduce?
>>>>>>>>> Of course, yes. A little delay.... :)
>>>>>>>>>
>>>>>>>>> ......
>>>>>>>>> cc1: warnings being treated as errors
>>>>>>>>> upstart.c: In function ‘upstart_job_property’:
>>>>>>>>> upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’
>>>>>>>>> upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
>>>>>>>>> upstart.c:264: error: assignment makes pointer from integer without a cast
>>>>>>>>> gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>>>>>>>>> gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>>>>>>>>> make[1]: *** [all-recursive] Error 1
>>>>>>>>> make[1]: Leaving directory `/root/ha/pacemaker/lib'
>>>>>>>>> make: *** [core] Error 1
>>>>>>>>>
>>>>>>>>> I'm trying to solve this problem.
>>>>>>>> It does not get solved quickly...
>>>>>>>>
>>>>>>>>
>>>>>>>> https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
>>>>>>>> g_variant_lookup_value () Since 2.28
>>>>>>>>
>>>>>>>> # yum list installed glib2
>>>>>>>> Loaded plugins: fastestmirror, rhnplugin, security
>>>>>>>> This system is receiving updates from RHN Classic or Red Hat
>>>>>>>> Satellite.
>>>>>>>> Loading mirror speeds from cached hostfile
>>>>>>>> Installed Packages
>>>>>>>> glib2.x86_64            2.26.1-3.el6            installed
>>>>>>>>
>>>>>>>> # cat /etc/issue
>>>>>>>> CentOS release 6.5 (Final)
>>>>>>>> Kernel \r on an \m
>>>>>>> Can you try this patch?
>>>>>>> Upstart jobs won't work, but the code will compile
>>>>>>>
>>>>>>> diff --git a/lib/services/upstart.c b/lib/services/upstart.c
>>>>>>> index 831e7cf..195c3a4 100644
>>>>>>> --- a/lib/services/upstart.c
>>>>>>> +++ b/lib/services/upstart.c
>>>>>>> @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
>>>>>>>  static char *
>>>>>>>  upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>>  {
>>>>>>> +    char *output = NULL;
>>>>>>> +
>>>>>>> +#if !GLIB_CHECK_VERSION(2,28,0)
>>>>>>> +    static bool err = TRUE;
>>>>>>> +
>>>>>>> +    if(err) {
>>>>>>> +        crm_err("This version of glib is too old to support upstart jobs");
>>>>>>> +        err = FALSE;
>>>>>>> +    }
>>>>>>> +#else
>>>>>>>      GError *error = NULL;
>>>>>>>      GDBusProxy *proxy;
>>>>>>>      GVariant *asv = NULL;
>>>>>>>      GVariant *value = NULL;
>>>>>>>      GVariant *_ret = NULL;
>>>>>>> -    char *output = NULL;
>>>>>>>
>>>>>>>      crm_info("Calling GetAll on %s", obj);
>>>>>>>      proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
>>>>>>> @@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>>
>>>>>>>      g_object_unref(proxy);
>>>>>>>      g_variant_unref(_ret);
>>>>>>> +#endif
>>>>>>>      return output;
>>>>>>>  }
>>>>>> Ok :) I patched the source.
>>>>>> Typed "make rc" - the same error.
>>>>> Because it's not building your local changes
>>>>>> Made a new copy via "fetch" - the same error.
>>>>>> It seems that if ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
>>>>>> does not exist, it is downloaded. Otherwise the existing archive is used.
>>>>>> Trimmed log .......
>>>>>>
>>>>>> # make rc
>>>>>> make TAG=Pacemaker-1.1.11-rc3 rpm
>>>>>> make[1]: Entering directory `/root/ha/pacemaker'
>>>>>> rm -f pacemaker-dirty.tar.* pacemaker-tip.tar.* pacemaker-HEAD.tar.*
>>>>>> if [ ! -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz ]; then \
>>>>>>     rm -f pacemaker.tar.*; \
>>>>>>     if [ Pacemaker-1.1.11-rc3 = dirty ]; then \
>>>>>>         git commit -m "DO-NOT-PUSH" -a; \
>>>>>>         git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ HEAD | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>>         git reset --mixed HEAD^; \
>>>>>>     else \
>>>>>>         git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ Pacemaker-1.1.11-rc3 | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>>     fi; \
>>>>>>     echo `date`: Rebuilt ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>> else \
>>>>>>     echo `date`: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>> fi
>>>>>> Mon Jan 13 13:23:21 MSK 2014: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
>>>>>> .......
>>>>>>
>>>>>> Well, "make rpm" built the rpms and I created the cluster.
>>>>>> I ran the same tests and confirmed the behavior.
>>>>>> The crm_report log is here - http://send2me.ru/crmrep.tar.bz2
>>>>> Thanks!
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
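
For anyone reading this thread later: the upstart.c patch quoted above relies on a compile-time version guard, so that code needing a newer library API is simply not compiled on older systems. Below is a minimal standalone sketch of that pattern, not the Pacemaker code itself. The DEMO_* macros, demo_job_property() and demo_errors_logged are hypothetical names made up for illustration; DEMO_CHECK_VERSION only mirrors the comparison that GLIB_CHECK_VERSION performs, and 2.26.1 is the glib2 version reported in the thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for a library's version macros (assumption:
 * 2.26.1, the glib2 version reported in the thread). */
#define DEMO_MAJOR_VERSION 2
#define DEMO_MINOR_VERSION 26
#define DEMO_MICRO_VERSION 1

/* True when the "installed" version is at least major.minor.micro. */
#define DEMO_CHECK_VERSION(major, minor, micro)                          \
    (DEMO_MAJOR_VERSION > (major) ||                                     \
     (DEMO_MAJOR_VERSION == (major) && DEMO_MINOR_VERSION > (minor)) ||  \
     (DEMO_MAJOR_VERSION == (major) && DEMO_MINOR_VERSION == (minor) &&  \
      DEMO_MICRO_VERSION >= (micro)))

/* Counts how often the "too old" error was reported (for illustration). */
static int demo_errors_logged = 0;

/* Hypothetical lookup: returns a value, or NULL when the feature is
 * unavailable because the library is older than 2.28.0. */
static const char *
demo_job_property(const char *name)
{
    const char *output = NULL;

#if !DEMO_CHECK_VERSION(2, 28, 0)
    static bool warn = true;

    (void) name;                /* unused on the fallback path */
    if (warn) {
        fprintf(stderr, "library is too old to support this lookup\n");
        demo_errors_logged++;
        warn = false;           /* report the problem only once */
    }
#else
    /* A real implementation would call the newer API here. */
    output = name;
#endif
    return output;
}
```

With the version set to 2.26.1, every call to demo_job_property() returns NULL and the error is printed only on the first call, which is the "compiles, but the feature does not work" trade-off described for the patch.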