Hi, Andrew and all! Andrew, we have not buried this topic, have we?
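Before the quoted history below, a quick recap of what the "sshbykey" agent is supposed to do, since it keeps coming up: it sends a reboot to the victim over ssh and only exits 0 once the node answers ssh again. A rough sketch of that logic follows (this is NOT the real agent; the default hostname, the root ssh user, the 600s limit and the "name=value" stdin parsing are illustrative assumptions only):

#!/usr/bin/env python
# Rough sketch only - NOT the real "sshbykey" agent. It illustrates the
# reboot-and-wait-for-ssh behaviour discussed in this thread. The default
# hostname, the root ssh user and the 600s limit are assumptions; a real
# fence agent also has to implement status/monitor/metadata actions.

import socket
import subprocess
import sys
import time


def read_stdin_options():
    # Fence agents are commonly handed their options as "name=value" lines
    # on stdin; read them only if something was actually piped in, so the
    # script does not block forever waiting on STDIN.
    options = {}
    if sys.stdin.isatty():
        return options
    for line in sys.stdin:
        line = line.strip()
        if "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def ssh_alive(host, port=22, timeout=5):
    # The victim counts as "back" once its sshd accepts TCP connections again.
    try:
        s = socket.create_connection((host, port), timeout=timeout)
        s.close()
        return True
    except (socket.error, socket.timeout):
        return False


def reboot_and_wait(host, wait_limit=600):
    # Fire the reboot over ssh; the connection dying mid-reboot is expected,
    # so the exit status of this command is deliberately ignored.
    subprocess.call(["ssh", "-o", "BatchMode=yes", "root@" + host, "reboot"])

    deadline = time.time() + wait_limit
    # First wait for sshd to disappear, then wait for it to come back.
    while ssh_alive(host) and time.time() < deadline:
        time.sleep(5)
    while not ssh_alive(host) and time.time() < deadline:
        time.sleep(5)

    return 0 if ssh_alive(host) else 1


if __name__ == "__main__":
    opts = read_stdin_options()
    # "hostname" as the option name is an assumption for this sketch.
    victim = opts.get("hostname", "dev-cluster2-node2.unix.tensor.ru")
    sys.exit(reboot_and_wait(victim))

The point is that the agent must return 0 only after the victim is reachable again; if it blocks on stdin it never gets that far, which is what kept the node in "pending" earlier in the thread.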
16.01.2014, 12:32, "Andrey Groshev" <gre...@yandex.ru>:
> 16.01.2014, 01:30, "Andrew Beekhof" <and...@beekhof.net>:
>> On 16 Jan 2014, at 12:41 am, Andrey Groshev <gre...@yandex.ru> wrote:
>>> 15.01.2014, 02:53, "Andrew Beekhof" <and...@beekhof.net>:
>>>> On 15 Jan 2014, at 12:15 am, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>> 14.01.2014, 10:00, "Andrey Groshev" <gre...@yandex.ru>:
>>>>>> 14.01.2014, 07:47, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>> Ok, here's what happens:
>>>>>>>
>>>>>>> 1. node2 is lost
>>>>>>> 2. fencing of node2 starts
>>>>>>> 3. node2 reboots (and cluster starts)
>>>>>>> 4. node2 returns to the membership
>>>>>>> 5. node2 is marked as a cluster member
>>>>>>> 6. DC tries to bring it into the cluster, but needs to cancel the active transition first.
>>>>>>>    Which is a problem since the node2 fencing operation is part of that
>>>>>>> 7. node2 is in a transition (pending) state until fencing passes or fails
>>>>>>> 8a. fencing fails: transition completes and the node joins the cluster
>>>>>>>
>>>>>>> That's in theory, except we automatically try again. Which isn't appropriate.
>>>>>>> This should be relatively easy to fix.
>>>>>>>
>>>>>>> 8b. fencing passes: the node is incorrectly marked as offline
>>>>>>>
>>>>>>> This I have no idea how to fix yet.
>>>>>>>
>>>>>>> On another note, it doesn't look like this agent works at all.
>>>>>>> The node has been back online for a long time and the agent is still timing out after 10 minutes.
>>>>>>> So "Once the script makes sure that the victim will rebooted and again available via ssh - it exit with 0." does not seem true.
>>>>>> Damn. Looks like you're right. At some point I broke my agent and did not notice. I will figure it out.
>>>>> I fixed my agent - after sending the reboot it was waiting on STDIN.
>>>>> The "normal" behavior is back - it hangs in "pending" until I manually send a reboot. :)
>>>> Right. Now you're in case 8b.
>>>>
>>>> Can you try this patch: http://paste.fedoraproject.org/68450/38973966
>>> I spent the whole day on experiments.
>>> It turns out like this:
>>> 1. Built the cluster.
>>> 2. On node-2, sent signal (-4) - killed corosync.
>>> 3. From node-1 (the DC) - stonith sent reboot.
>>> 4. Node-2 rebooted and resources started.
>>> 5. Again: on node-2, sent signal (-4) - killed corosync.
>>> 6. Again: from node-1 (the DC) - stonith sent reboot.
>>> 7. Node-2 rebooted and hangs in "pending".
>>> 8. Waiting, waiting..... then rebooted it manually.
>>> 9. Node-2 rebooted and resources started.
>>> 10. GOTO step 2.
>> Logs?
>
> Yesterday I wrote another mail explaining why I did not attach the logs.
> Please read it; it contains a few more questions.
> Today it started hanging again and keeps going through the same cycle.
> Logs here: http://send2me.ru/crmrep2.tar.bz2
>
>>>>> New logs: http://send2me.ru/crmrep1.tar.bz2
>>>>>>> On 14 Jan 2014, at 1:19 pm, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>>>>> Apart from anything else, your timeout needs to be bigger:
>>>>>>>>
>>>>>>>> Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: ( commands.c:1321 ) error: log_operation: Operation 'reboot' [11331] (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 'st1' returned: -62 (Timer expired)
>>>>>>>>
>>>>>>>> On 14 Jan 2014, at 7:18 am, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>>>>>> On 13 Jan 2014, at 8:31 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>> 13.01.2014, 02:51, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>> On 10 Jan 2014, at 9:55 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>> 10.01.2014, 14:31, "Andrey Groshev" <gre...@yandex.ru>:
>>>>>>>>>>>>> 10.01.2014, 14:01, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>>> On 10 Jan 2014, at 5:03 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>>>> 10.01.2014, 05:29, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>>>>> On 9 Jan 2014, at 11:11 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>>>>>> 08.01.2014, 06:22, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>>>>>>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>>>>>>>> Hi, ALL.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'm still trying to cope with the fact that after the fence the node hangs in "pending".
>>>>>>>>>>>>>>>>>> Please define "pending". Where did you see this?
>>>>>>>>>>>>>>>>> In crm_mon:
>>>>>>>>>>>>>>>>> ......
>>>>>>>>>>>>>>>>> Node dev-cluster2-node2 (172793105): pending
>>>>>>>>>>>>>>>>> ......
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The experiment was like this:
>>>>>>>>>>>>>>>>> Four nodes in the cluster.
>>>>>>>>>>>>>>>>> On one of them, kill corosync or pacemakerd (signal 4 or 6 or 11).
>>>>>>>>>>>>>>>>> After that, the remaining nodes constantly reboot it, under various pretexts: "softly whistling", "fly low", "not a cluster member!" ...
>>>>>>>>>>>>>>>>> Then "Too many failures ...." fell out in the log.
>>>>>>>>>>>>>>>>> All this time the status in crm_mon is "pending".
>>>>>>>>>>>>>>>>> Depending on the wind direction, it changed to "UNCLEAN".
>>>>>>>>>>>>>>>>> Much time has passed and I cannot accurately describe the behavior...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Now I am in the following state:
>>>>>>>>>>>>>>>>> I tried to locate the problem and ended up here with this.
>>>>>>>>>>>>>>>>> I set a big value in the property stonith-timeout="600s".
>>>>>>>>>>>>>>>>> And got the following behavior:
>>>>>>>>>>>>>>>>> 1. pkill -4 corosync
>>>>>>>>>>>>>>>>> 2. The node with the DC calls my fence agent "sshbykey".
>>>>>>>>>>>>>>>>> 3. It sends reboot to the victim and waits until it comes to life again.
>>>>>>>>>>>>>>>> Hmmm.... what version of pacemaker?
>>>>>>>>>>>>>>>> This sounds like a timing issue that we fixed a while back
>>>>>>>>>>>>>>> It was version 1.1.11 from December 3.
>>>>>>>>>>>>>>> Now doing a full update and retest.
>>>>>>>>>>>>>> That should be recent enough. Can you create a crm_report the next time you reproduce?
>>>>>>>>>>>>> Of course, yes. A little delay.... :)
>>>>>>>>>>>>>
>>>>>>>>>>>>> ......
>>>>>>>>>>>>> cc1: warnings being treated as errors
>>>>>>>>>>>>> upstart.c: In function ‘upstart_job_property’:
>>>>>>>>>>>>> upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’
>>>>>>>>>>>>> upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
>>>>>>>>>>>>> upstart.c:264: error: assignment makes pointer from integer without a cast
>>>>>>>>>>>>> gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>>>>>>>>>>>>> gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>>>>>>>>>>>>> make[1]: *** [all-recursive] Error 1
>>>>>>>>>>>>> make[1]: Leaving directory `/root/ha/pacemaker/lib'
>>>>>>>>>>>>> make: *** [core] Error 1
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm trying to solve this problem.
>>>>>>>>>>>> It is not getting solved quickly...
>>>>>>>>>>>>
>>>>>>>>>>>> https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
>>>>>>>>>>>> g_variant_lookup_value () Since 2.28
>>>>>>>>>>>>
>>>>>>>>>>>> # yum list installed glib2
>>>>>>>>>>>> Loaded plugins: fastestmirror, rhnplugin, security
>>>>>>>>>>>> This system is receiving updates from RHN Classic or Red Hat Satellite.
>>>>>>>>>>>> Loading mirror speeds from cached hostfile
>>>>>>>>>>>> Installed Packages
>>>>>>>>>>>> glib2.x86_64    2.26.1-3.el6    installed
>>>>>>>>>>>>
>>>>>>>>>>>> # cat /etc/issue
>>>>>>>>>>>> CentOS release 6.5 (Final)
>>>>>>>>>>>> Kernel \r on an \m
>>>>>>>>>>> Can you try this patch?
>>>>>>>>>>> Upstart jobs won't work, but the code will compile
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/lib/services/upstart.c b/lib/services/upstart.c
>>>>>>>>>>> index 831e7cf..195c3a4 100644
>>>>>>>>>>> --- a/lib/services/upstart.c
>>>>>>>>>>> +++ b/lib/services/upstart.c
>>>>>>>>>>> @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
>>>>>>>>>>>  static char *
>>>>>>>>>>>  upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>>>>>>  {
>>>>>>>>>>> +    char *output = NULL;
>>>>>>>>>>> +
>>>>>>>>>>> +#if !GLIB_CHECK_VERSION(2,28,0)
>>>>>>>>>>> +    static bool err = TRUE;
>>>>>>>>>>> +
>>>>>>>>>>> +    if(err) {
>>>>>>>>>>> +        crm_err("This version of glib is too old to support upstart jobs");
>>>>>>>>>>> +        err = FALSE;
>>>>>>>>>>> +    }
>>>>>>>>>>> +#else
>>>>>>>>>>>      GError *error = NULL;
>>>>>>>>>>>      GDBusProxy *proxy;
>>>>>>>>>>>      GVariant *asv = NULL;
>>>>>>>>>>>      GVariant *value = NULL;
>>>>>>>>>>>      GVariant *_ret = NULL;
>>>>>>>>>>> -    char *output = NULL;
>>>>>>>>>>>
>>>>>>>>>>>      crm_info("Calling GetAll on %s", obj);
>>>>>>>>>>>      proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
>>>>>>>>>>> @@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>>>>>>
>>>>>>>>>>>      g_object_unref(proxy);
>>>>>>>>>>>      g_variant_unref(_ret);
>>>>>>>>>>> +#endif
>>>>>>>>>>>      return output;
>>>>>>>>>>>  }
>>>>>>>>>> Ok :) I patched the source.
>>>>>>>>>> Typed "make rc" - the same error.
>>>>>>>>> Because it's not building your local changes
>>>>>>>>>> Made a new copy via "fetch" - the same error.
>>>>>>>>>> It seems that if ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz does not exist, it gets downloaded; otherwise the existing archive is used.
>>>>>>>>>> Cut log .......
>>>>>>>>>>
>>>>>>>>>> # make rc
>>>>>>>>>> make TAG=Pacemaker-1.1.11-rc3 rpm
>>>>>>>>>> make[1]: Entering directory `/root/ha/pacemaker'
>>>>>>>>>> rm -f pacemaker-dirty.tar.* pacemaker-tip.tar.* pacemaker-HEAD.tar.*
>>>>>>>>>> if [ ! -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz ]; then \
>>>>>>>>>>     rm -f pacemaker.tar.*; \
>>>>>>>>>>     if [ Pacemaker-1.1.11-rc3 = dirty ]; then \
>>>>>>>>>>         git commit -m "DO-NOT-PUSH" -a; \
>>>>>>>>>>         git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ HEAD | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>>>>>>         git reset --mixed HEAD^; \
>>>>>>>>>>     else \
>>>>>>>>>>         git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ Pacemaker-1.1.11-rc3 | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>>>>>>     fi; \
>>>>>>>>>>     echo `date`: Rebuilt ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>>>>>> else \
>>>>>>>>>>     echo `date`: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>>>>>> fi
>>>>>>>>>> Mon Jan 13 13:23:21 MSK 2014: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
>>>>>>>>>> .......
>>>>>>>>>>
>>>>>>>>>> Well, "make rpm" - built the rpms and I created the cluster.
>>>>>>>>>> I ran the same tests and confirmed the behavior.
>>>>>>>>>> crm_report log here - http://send2me.ru/crmrep.tar.bz2
>>>>>>>>> Thanks!
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org