14.01.2014, 06:25, "Andrew Beekhof" <and...@beekhof.net>:
> Apart from anything else, your timeout needs to be bigger:
>
> Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: ( commands.c:1321 ) error: log_operation: Operation 'reboot' [11331] (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 'st1' returned: -62 (Timer expired)
>
Bigger than that? By 12:21 node2 had long since rebooted and was (almost) back at work.

# cat /var/log/cluster/mystonith.log
.....
Mon Jan 13 11:48:43 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () STONITH DEBUG(): getinfo-devdescr
Mon Jan 13 11:48:43 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () STONITH DEBUG(): getinfo-devid
Mon Jan 13 11:48:43 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () STONITH DEBUG(): getinfo-xml
Mon Jan 13 11:48:46 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () STONITH DEBUG(/opt/cluster_tools_2/keys/r...@dev-cluster2-master.unix.tensor.ru): getconfignames
Mon Jan 13 11:48:46 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () STONITH DEBUG(/opt/cluster_tools_2/keys/r...@dev-cluster2-master.unix.tensor.ru): status
Mon Jan 13 12:11:37 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () STONITH DEBUG(/opt/cluster_tools_2/keys/r...@dev-cluster2-master.unix.tensor.ru): getconfignames
Mon Jan 13 12:11:37 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () STONITH DEBUG(/opt/cluster_tools_2/keys/r...@dev-cluster2-master.unix.tensor.ru): reset dev-cluster2-node2.unix.tensor.ru
Mon Jan 13 12:11:37 MSK 2014 Now boot time 1389256739, send reboot
.......
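[Editor's note: for readers following along, the cluster-wide stonith-timeout property discussed in this exchange can be raised as sketched below; this assumes either the pcs or crmsh administration tool is available, and the 600s value is the one quoted in the thread.]

```shell
# Raise the cluster-wide fencing timeout (sketch; pick the tool your distro ships).
# With pcs:
pcs property set stonith-timeout=600s
# With crmsh:
crm configure property stonith-timeout=600s
```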
> On 14 Jan 2014, at 7:18 am, Andrew Beekhof <and...@beekhof.net> wrote:
>
>> On 13 Jan 2014, at 8:31 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>
>>> 13.01.2014, 02:51, "Andrew Beekhof" <and...@beekhof.net>:
>>>> On 10 Jan 2014, at 9:55 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>> 10.01.2014, 14:31, "Andrey Groshev" <gre...@yandex.ru>:
>>>>>> 10.01.2014, 14:01, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>> On 10 Jan 2014, at 5:03 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>> 10.01.2014, 05:29, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>> On 9 Jan 2014, at 11:11 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>> 08.01.2014, 06:22, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>> Hi, ALL.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm still trying to cope with the fact that after the fence the node hangs in "pending".
>>>>>>>>>>> Please define "pending". Where did you see this?
>>>>>>>>>> In crm_mon:
>>>>>>>>>> ......
>>>>>>>>>> Node dev-cluster2-node2 (172793105): pending
>>>>>>>>>> ......
>>>>>>>>>>
>>>>>>>>>> The experiment was like this:
>>>>>>>>>> Four nodes in the cluster.
>>>>>>>>>> On one of them, kill corosync or pacemakerd (signal 4, 6 or 11).
>>>>>>>>>> After that, the remaining nodes kept rebooting it, under various pretexts: "softly whistling", "flying low", "not a cluster member!" ...
>>>>>>>>>> Then "Too many failures ...." fell out in the log.
>>>>>>>>>> All this time the status in crm_mon was "pending".
>>>>>>>>>> Depending on the wind direction it changed to "UNCLEAN".
>>>>>>>>>> Much time has passed and I cannot describe the behavior accurately any more...
>>>>>>>>>>
>>>>>>>>>> Now I am in the following state:
>>>>>>>>>> I tried to locate the problem and came here with this.
>>>>>>>>>> I set a big value in the property stonith-timeout="600s".
>>>>>>>>>> And got the following behavior:
>>>>>>>>>> 1. pkill -4 corosync
>>>>>>>>>> 2. The node with DC calls my fence agent "sshbykey".
>>>>>>>>>> 3. It sends a reboot to the victim and waits until it comes back to life.
>>>>>>>>> Hmmm.... what version of pacemaker?
>>>>>>>>> This sounds like a timing issue that we fixed a while back
>>>>>>>> It was version 1.1.11 from December 3.
>>>>>>>> Now I will do a full update and retest.
>>>>>>> That should be recent enough. Can you create a crm_report the next time you reproduce?
>>>>>> Of course yes. A little delay.... :)
>>>>>>
>>>>>> ......
>>>>>> cc1: warnings being treated as errors
>>>>>> upstart.c: In function ‘upstart_job_property’:
>>>>>> upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’
>>>>>> upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
>>>>>> upstart.c:264: error: assignment makes pointer from integer without a cast
>>>>>> gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>>>>>> gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>>>>>> make[1]: *** [all-recursive] Error 1
>>>>>> make[1]: Leaving directory `/root/ha/pacemaker/lib'
>>>>>> make: *** [core] Error 1
>>>>>>
>>>>>> I'm trying to solve this problem.
>>>>> It is not getting solved quickly...
>>>>>
>>>>> https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
>>>>> g_variant_lookup_value () - Since 2.28
>>>>>
>>>>> # yum list installed glib2
>>>>> Loaded plugins: fastestmirror, rhnplugin, security
>>>>> This system is receiving updates from RHN Classic or Red Hat Satellite.
>>>>> Loading mirror speeds from cached hostfile
>>>>> Installed Packages
>>>>> glib2.x86_64          2.26.1-3.el6          installed
>>>>>
>>>>> # cat /etc/issue
>>>>> CentOS release 6.5 (Final)
>>>>> Kernel \r on an \m
>>>> Can you try this patch?
>>>> Upstart jobs won't work, but the code will compile
>>>>
>>>> diff --git a/lib/services/upstart.c b/lib/services/upstart.c
>>>> index 831e7cf..195c3a4 100644
>>>> --- a/lib/services/upstart.c
>>>> +++ b/lib/services/upstart.c
>>>> @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
>>>>  static char *
>>>>  upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>  {
>>>> +    char *output = NULL;
>>>> +
>>>> +#if !GLIB_CHECK_VERSION(2,28,0)
>>>> +    static bool err = TRUE;
>>>> +
>>>> +    if(err) {
>>>> +        crm_err("This version of glib is too old to support upstart jobs");
>>>> +        err = FALSE;
>>>> +    }
>>>> +#else
>>>>      GError *error = NULL;
>>>>      GDBusProxy *proxy;
>>>>      GVariant *asv = NULL;
>>>>      GVariant *value = NULL;
>>>>      GVariant *_ret = NULL;
>>>> -    char *output = NULL;
>>>>
>>>>      crm_info("Calling GetAll on %s", obj);
>>>>      proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
>>>> @@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>
>>>>      g_object_unref(proxy);
>>>>      g_variant_unref(_ret);
>>>> +#endif
>>>>      return output;
>>>>  }
>>> Ok :) I patched the source.
>>> Typed "make rc" - the same error.
>> Because it's not building your local changes
>>> Made a new copy via "fetch" - the same error.
>>> It seems that if ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz does not exist, it is rebuilt; otherwise the existing archive is used.
>>> Trimmed log .......
>>>
>>> # make rc
>>> make TAG=Pacemaker-1.1.11-rc3 rpm
>>> make[1]: Entering directory `/root/ha/pacemaker'
>>> rm -f pacemaker-dirty.tar.* pacemaker-tip.tar.* pacemaker-HEAD.tar.*
>>> if [ ! -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz ]; then \
>>>     rm -f pacemaker.tar.*; \
>>>     if [ Pacemaker-1.1.11-rc3 = dirty ]; then \
>>>         git commit -m "DO-NOT-PUSH" -a; \
>>>         git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ HEAD | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>         git reset --mixed HEAD^; \
>>>     else \
>>>         git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ Pacemaker-1.1.11-rc3 | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>     fi; \
>>>     echo `date`: Rebuilt ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>> else \
>>>     echo `date`: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>> fi
>>> Mon Jan 13 13:23:21 MSK 2014: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
>>> .......
>>>
>>> Well, "make rpm" built the rpms and I created a cluster.
>>> I ran the same tests and confirmed the behavior.
>>> crm_report log here - http://send2me.ru/crmrep.tar.bz2
>> Thanks!

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
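[Editor's note: the tarball caching that bit here can be mirrored in a few lines of shell. This is a sketch of the conditional from the quoted Makefile output, not the actual target: the archive is regenerated only when the file is absent, so a stale tarball silently masks local commits.]

```shell
# Mirror of the "make rc" tarball check (names taken from the quoted output).
TAG=Pacemaker-1.1.11-rc3
TARBALL="ClusterLabs-pacemaker-${TAG}.tar.gz"

if [ -f "$TARBALL" ]; then
    msg="Using existing tarball: $TARBALL"   # local commits are NOT re-archived
else
    msg="Rebuilding $TARBALL"                # "git archive | gzip" would run here
fi
echo "$msg"
```

Deleting the stale tarball (`rm -f "$TARBALL"`) before running `make rc` forces the archive, and therefore the rpms, to pick up local changes.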