Apart from anything else, your timeout needs to be bigger:

Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: ( commands.c:1321 ) error: log_operation: Operation 'reboot' [11331] (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 'st1' returned: -62 (Timer expired)
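
A minimal sketch of where a fencing timeout can be raised, assuming the stock Pacemaker command-line tools are installed. The 600s value is purely illustrative, and the log line above does not say whether the cluster-wide property or the per-device option is the one that expired, so treat both commands as candidates rather than as the fix.

```shell
# Cluster-wide fencing timeout (the stonith-timeout property discussed in this thread).
crm_attribute --type crm_config --name stonith-timeout --update 600s

# Per-device timeout for the reboot action on the fencing device 'st1'.
# pcmk_reboot_timeout is a standard fencing-device option; 600s is an illustrative value.
crm_resource --resource st1 --set-parameter pcmk_reboot_timeout --parameter-value 600s

# Read the cluster-wide value back to confirm what is now configured.
crm_attribute --type crm_config --name stonith-timeout --query
```

Since the "sshbykey" agent waits for the victim to come back up before returning, the value chosen has to cover a full reboot of the node, not just the time needed to issue the command.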

On 14 Jan 2014, at 7:18 am, Andrew Beekhof <and...@beekhof.net> wrote:

>
> On 13 Jan 2014, at 8:31 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>
>>
>>
>> 13.01.2014, 02:51, "Andrew Beekhof" <and...@beekhof.net>:
>>> On 10 Jan 2014, at 9:55 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>
>>>> 10.01.2014, 14:31, "Andrey Groshev" <gre...@yandex.ru>:
>>>>> 10.01.2014, 14:01, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>> On 10 Jan 2014, at 5:03 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>> 10.01.2014, 05:29, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>> On 9 Jan 2014, at 11:11 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>> 08.01.2014, 06:22, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>> Hi, ALL.
>>>>>>>>>>>
>>>>>>>>>>> I'm still trying to cope with the fact that after the fence -
>>>>>>>>>>> node hangs in "pending".
>>>>>>>>>> Please define "pending". Where did you see this?
>>>>>>>>> In crm_mon:
>>>>>>>>> ......
>>>>>>>>> Node dev-cluster2-node2 (172793105): pending
>>>>>>>>> ......
>>>>>>>>>
>>>>>>>>> The experiment was like this:
>>>>>>>>> Four nodes in cluster.
>>>>>>>>> On one of them kill corosync or pacemakerd (signal 4 or 6 or 11).
>>>>>>>>> Thereafter, the remaining start it constantly reboot, under
>>>>>>>>> various pretexts, "softly whistling", "fly low", "not a cluster
>>>>>>>>> member!" ...
>>>>>>>>> Then in the log fell out "Too many failures ...."
>>>>>>>>> All this time in the status in crm_mon is "pending".
>>>>>>>>> Depending on the wind direction changed to "UNCLEAN"
>>>>>>>>> Much time has passed and I can not accurately describe the
>>>>>>>>> behavior...
>>>>>>>>>
>>>>>>>>> Now I am in the following state:
>>>>>>>>> I tried locate the problem. Came here with this.
>>>>>>>>> I set big value in property stonith-timeout="600s".
>>>>>>>>> And got the following behavior:
>>>>>>>>> 1. pkill -4 corosync
>>>>>>>>> 2. from node with DC call my fence agent "sshbykey"
>>>>>>>>> 3. It sends reboot victim and waits until she comes to life again.
>>>>>>>> Hmmm.... what version of pacemaker?
>>>>>>>> This sounds like a timing issue that we fixed a while back
>>>>>>> Was a version 1.1.11 from December 3.
>>>>>>> Now try full update and retest.
>>>>>> That should be recent enough. Can you create a crm_report the next
>>>>>> time you reproduce?
>>>>> Of course yes. Little delay.... :)
>>>>>
>>>>> ......
>>>>> cc1: warnings being treated as errors
>>>>> upstart.c: In function ‘upstart_job_property’:
>>>>> upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’
>>>>> upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
>>>>> upstart.c:264: error: assignment makes pointer from integer without a cast
>>>>> gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>>>>> gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>>>>> make[1]: *** [all-recursive] Error 1
>>>>> make[1]: Leaving directory `/root/ha/pacemaker/lib'
>>>>> make: *** [core] Error 1
>>>>>
>>>>> I'm trying to solve this a problem.
>>>> Do not get solved quickly...
>>>>
>>>> https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
>>>> g_variant_lookup_value () Since 2.28
>>>>
>>>> # yum list installed glib2
>>>> Loaded plugins: fastestmirror, rhnplugin, security
>>>> This system is receiving updates from RHN Classic or Red Hat Satellite.
>>>> Loading mirror speeds from cached hostfile
>>>> Installed Packages
>>>> glib2.x86_64    2.26.1-3.el6    installed
>>>>
>>>> # cat /etc/issue
>>>> CentOS release 6.5 (Final)
>>>> Kernel \r on an \m
>>>
>>> Can you try this patch?
>>> Upstart jobs won't work, but the code will compile
>>>
>>> diff --git a/lib/services/upstart.c b/lib/services/upstart.c
>>> index 831e7cf..195c3a4 100644
>>> --- a/lib/services/upstart.c
>>> +++ b/lib/services/upstart.c
>>> @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
>>>  static char *
>>>  upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>  {
>>> +    char *output = NULL;
>>> +
>>> +#if !GLIB_CHECK_VERSION(2,28,0)
>>> +    static bool err = TRUE;
>>> +
>>> +    if(err) {
>>> +        crm_err("This version of glib is too old to support upstart jobs");
>>> +        err = FALSE;
>>> +    }
>>> +#else
>>>      GError *error = NULL;
>>>      GDBusProxy *proxy;
>>>      GVariant *asv = NULL;
>>>      GVariant *value = NULL;
>>>      GVariant *_ret = NULL;
>>> -    char *output = NULL;
>>>
>>>      crm_info("Calling GetAll on %s", obj);
>>>      proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
>>> @@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>
>>>      g_object_unref(proxy);
>>>      g_variant_unref(_ret);
>>> +#endif
>>>      return output;
>>>  }
>>>
>>
>> Ok :) I patch source.
>> Type "make rc" - the same error.
>
> Because it's not building your local changes
>
>> Make new copy via "fetch" - the same error.
>> It seems that if not exist
>> ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz, then download it.
>> Otherwise use exist archive.
>> Cutted log .......
>>
>> # make rc
>> make TAG=Pacemaker-1.1.11-rc3 rpm
>> make[1]: Entering directory `/root/ha/pacemaker'
>> rm -f pacemaker-dirty.tar.* pacemaker-tip.tar.* pacemaker-HEAD.tar.*
>> if [ ! -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz ]; then \
>>     rm -f pacemaker.tar.*; \
>>     if [ Pacemaker-1.1.11-rc3 = dirty ]; then \
>>         git commit -m "DO-NOT-PUSH" -a; \
>>         git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ HEAD | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>         git reset --mixed HEAD^; \
>>     else \
>>         git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ Pacemaker-1.1.11-rc3 | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>     fi; \
>>     echo `date`: Rebuilt ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>> else \
>>     echo `date`: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>> fi
>> Mon Jan 13 13:23:21 MSK 2014: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
>> .......
>>
>> Well, "make rpm" - build rpms and I create cluster.
>> I spent the same tests and confirmed the behavior.
>> crm_report log here - http://send2me.ru/crmrep.tar.bz2
>
> Thanks!

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org