15.01.2014, 02:53, "Andrew Beekhof" <and...@beekhof.net>:
> On 15 Jan 2014, at 12:15 am, Andrey Groshev <gre...@yandex.ru> wrote:
>
>> 14.01.2014, 10:00, "Andrey Groshev" <gre...@yandex.ru>:
>>> 14.01.2014, 07:47, "Andrew Beekhof" <and...@beekhof.net>:
>>>> Ok, here's what happens:
>>>>
>>>> 1. node2 is lost
>>>> 2. fencing of node2 starts
>>>> 3. node2 reboots (and the cluster starts)
>>>> 4. node2 returns to the membership
>>>> 5. node2 is marked as a cluster member
>>>> 6. the DC tries to bring it into the cluster, but needs to cancel the active transition first.
>>>>    Which is a problem, since the node2 fencing operation is part of that transition.
>>>> 7. node2 is in a transition (pending) state until fencing passes or fails
>>>> 8a. fencing fails: the transition completes and the node joins the cluster
>>>>
>>>> That's the theory, except we automatically try again, which isn't appropriate.
>>>> This should be relatively easy to fix.
>>>>
>>>> 8b. fencing passes: the node is incorrectly marked as offline
>>>>
>>>> This I have no idea how to fix yet.
>>>>
>>>> On another note, it doesn't look like this agent works at all.
>>>> The node has been back online for a long time and the agent is still timing out after 10 minutes.
>>>> So "Once the script makes sure that the victim will rebooted and again available via ssh - it exit with 0." does not seem true.
>>>
>>> Damn. Looks like you're right. At some point I broke my agent and did not notice it. I will figure out what happened.
>>
>> I repaired my agent - after sending the reboot it was waiting on STDIN.
>> The "normal" behavior is back - the node hangs in "pending" until I manually send a reboot. :)
>
> Right. Now you're in case 8b.
>
> Can you try this patch: http://paste.fedoraproject.org/68450/38973966
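
The agent contract being discussed above is: send the victim a reboot, then wait until it is reachable over ssh again, and only then exit 0. Below is a minimal sketch of that reboot path, assuming the common fence-agent convention of key=value parameters on stdin; the variable names, host handling and timeouts are invented for illustration, and this is not the actual "sshbykey" script.

#!/bin/sh
# Sketch of an ssh-based fence agent's reboot path (illustrative only).
# Fence agents conventionally receive key=value parameters on stdin.
action=""
victim=""
while read line; do
    case "$line" in
        action=*)   action="${line#action=}" ;;
        nodename=*) victim="${line#nodename=}" ;;
    esac
done

if [ "$action" = "reboot" ]; then
    # Tell the victim to reboot. Note the </dev/null: without it ssh
    # sits reading the agent's stdin - the exact hang described above.
    ssh -o ConnectTimeout=5 "root@$victim" reboot </dev/null
    # Poll until the victim answers over ssh again (up to ~10 minutes),
    # and only then report success.
    tries=0
    while [ "$tries" -lt 60 ]; do
        sleep 10
        if ssh -o ConnectTimeout=5 "root@$victim" true </dev/null 2>/dev/null; then
            exit 0
        fi
        tries=$((tries + 1))
    done
fi
exit 1

Note that waiting for the victim to come back before exiting is exactly what produces case 8b: by the time fencing is reported successful, node2 has already rejoined the membership.
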
Addition to the previous letter (I had to leave for work). I would add this:

1. Your patch makes "te_utils.c" about 20 lines bigger.

2. crm_mon -Anfc behaves strangely during an election/re-election. It displays the node names and their statuses, but for one moment it does not display the list of resources and their statuses, and on the next tick everything is shown as normal. Something like this:

....
node1 - online
      pgsql: master
node2 - pending
      pgsql: started
node3 - online
node4 - online
      pgsql: started
....

I.e. it looks as if the pgsql resource was restarted on node3. In reality nothing happened.

3. More about crm_mon output and statuses. It showed nodes as unclean, but the resources on those nodes had the status "started". This is misleading.

4. crm_report... I am not attaching one, because before noon there were a lot of unnecessary, unsystematic tests. And after dinner crm_report ate all the memory, pacemaker stopped replying and the node was killed via stonith. Tomorrow (already today) I will run a series of only the necessary tests.

>> New logs: http://send2me.ru/crmrep1.tar.bz2
>
>>>> On 14 Jan 2014, at 1:19 pm, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>> Apart from anything else, your timeout needs to be bigger:
>>>>>
>>>>> Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: ( commands.c:1321 ) error: log_operation: Operation 'reboot' [11331] (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 'st1' returned: -62 (Timer expired)
>>>>>
>>>>> On 14 Jan 2014, at 7:18 am, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>>> On 13 Jan 2014, at 8:31 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>> 13.01.2014, 02:51, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>> On 10 Jan 2014, at 9:55 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>> 10.01.2014, 14:31, "Andrey Groshev" <gre...@yandex.ru>:
>>>>>>>>>> 10.01.2014, 14:01, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>> On 10 Jan 2014, at 5:03 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>> 10.01.2014, 05:29, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>> On 9 Jan 2014, at 11:11 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>>> 08.01.2014, 06:22, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>>>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>>>>> Hi, ALL.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm still trying to cope with the fact that after the fence the node hangs in "pending".
>>>>>>>>>>>>>>> Please define "pending". Where did you see this?
>>>>>>>>>>>>>> In crm_mon:
>>>>>>>>>>>>>> ......
>>>>>>>>>>>>>> Node dev-cluster2-node2 (172793105): pending
>>>>>>>>>>>>>> ......
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The experiment was like this:
>>>>>>>>>>>>>> Four nodes in the cluster.
>>>>>>>>>>>>>> On one of them, kill corosync or pacemakerd (signal 4, 6 or 11).
>>>>>>>>>>>>>> Thereafter the remaining nodes constantly reboot it, under various pretexts: "softly whistling", "fly low", "not a cluster member!"...
>>>>>>>>>>>>>> Then "Too many failures ...." fell out in the log.
>>>>>>>>>>>>>> All this time the status in crm_mon was "pending".
>>>>>>>>>>>>>> Depending on the wind direction it changed to "UNCLEAN".
>>>>>>>>>>>>>> Much time has passed and I cannot accurately describe the behavior...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Now I am in the following state:
>>>>>>>>>>>>>> I tried to locate the problem, and here is what I came up with.
>>>>>>>>>>>>>> I set a big value in the property stonith-timeout="600s".
>>>>>>>>>>>>>> And got the following behavior:
>>>>>>>>>>>>>> 1. pkill -4 corosync
>>>>>>>>>>>>>> 2. the node with the DC calls my fence agent "sshbykey"
>>>>>>>>>>>>>> 3. It sends reboot to the victim and waits until she comes back to life.
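
As an aside for anyone reproducing the quoted steps: stonith-timeout is an ordinary cluster property, so the 600s value above can be set in one line. A minimal illustration, assuming crmsh is in use; pcs-based setups have an equivalent command:

# crmsh: set a cluster-wide fencing timeout of 600 seconds
crm configure property stonith-timeout=600s
# or the pcs equivalent:
pcs property set stonith-timeout=600s
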
>>>>>>>>>>>>> Hmmm.... what version of pacemaker?
>>>>>>>>>>>>> This sounds like a timing issue that we fixed a while back
>>>>>>>>>>>> It was version 1.1.11 from December 3.
>>>>>>>>>>>> Now I will try a full update and retest.
>>>>>>>>>>> That should be recent enough. Can you create a crm_report the next time you reproduce?
>>>>>>>>>> Of course, yes. A little delay.... :)
>>>>>>>>>>
>>>>>>>>>> ......
>>>>>>>>>> cc1: warnings being treated as errors
>>>>>>>>>> upstart.c: In function ‘upstart_job_property’:
>>>>>>>>>> upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’
>>>>>>>>>> upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
>>>>>>>>>> upstart.c:264: error: assignment makes pointer from integer without a cast
>>>>>>>>>> gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>>>>>>>>>> gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>>>>>>>>>> make[1]: *** [all-recursive] Error 1
>>>>>>>>>> make[1]: Leaving directory `/root/ha/pacemaker/lib'
>>>>>>>>>> make: *** [core] Error 1
>>>>>>>>>>
>>>>>>>>>> I'm trying to solve this problem.
>>>>>>>>> It is not getting solved quickly...
>>>>>>>>>
>>>>>>>>> https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
>>>>>>>>> g_variant_lookup_value ()    Since 2.28
>>>>>>>>>
>>>>>>>>> # yum list installed glib2
>>>>>>>>> Loaded plugins: fastestmirror, rhnplugin, security
>>>>>>>>> This system is receiving updates from RHN Classic or Red Hat Satellite.
>>>>>>>>> Loading mirror speeds from cached hostfile
>>>>>>>>> Installed Packages
>>>>>>>>> glib2.x86_64        2.26.1-3.el6        installed
>>>>>>>>>
>>>>>>>>> # cat /etc/issue
>>>>>>>>> CentOS release 6.5 (Final)
>>>>>>>>> Kernel \r on an \m
>>>>>>>> Can you try this patch?
>>>>>>>> Upstart jobs won't work, but the code will compile
>>>>>>>>
>>>>>>>> diff --git a/lib/services/upstart.c b/lib/services/upstart.c
>>>>>>>> index 831e7cf..195c3a4 100644
>>>>>>>> --- a/lib/services/upstart.c
>>>>>>>> +++ b/lib/services/upstart.c
>>>>>>>> @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
>>>>>>>>  static char *
>>>>>>>>  upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>>>  {
>>>>>>>> +    char *output = NULL;
>>>>>>>> +
>>>>>>>> +#if !GLIB_CHECK_VERSION(2,28,0)
>>>>>>>> +    static bool err = TRUE;
>>>>>>>> +
>>>>>>>> +    if(err) {
>>>>>>>> +        crm_err("This version of glib is too old to support upstart jobs");
>>>>>>>> +        err = FALSE;
>>>>>>>> +    }
>>>>>>>> +#else
>>>>>>>>      GError *error = NULL;
>>>>>>>>      GDBusProxy *proxy;
>>>>>>>>      GVariant *asv = NULL;
>>>>>>>>      GVariant *value = NULL;
>>>>>>>>      GVariant *_ret = NULL;
>>>>>>>> -    char *output = NULL;
>>>>>>>>
>>>>>>>>      crm_info("Calling GetAll on %s", obj);
>>>>>>>>      proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
>>>>>>>> @@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>>>
>>>>>>>>      g_object_unref(proxy);
>>>>>>>>      g_variant_unref(_ret);
>>>>>>>> +#endif
>>>>>>>>      return output;
>>>>>>>>  }
>>>>>>> Ok :) I patched the source.
>>>>>>> Typed "make rc" - the same error.
>>>>>> Because its not building your local changes
>>>>>>> I made a new copy via "fetch" - the same error.
>>>>>>> It seems that if ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz does not exist, it is downloaded; otherwise the existing archive is used.
>>>>>>> Cut log .......
>>>>>>>
>>>>>>> # make rc
>>>>>>> make TAG=Pacemaker-1.1.11-rc3 rpm
>>>>>>> make[1]: Entering directory `/root/ha/pacemaker'
>>>>>>> rm -f pacemaker-dirty.tar.* pacemaker-tip.tar.* pacemaker-HEAD.tar.*
>>>>>>> if [ ! -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz ]; then \
>>>>>>>     rm -f pacemaker.tar.*; \
>>>>>>>     if [ Pacemaker-1.1.11-rc3 = dirty ]; then \
>>>>>>>         git commit -m "DO-NOT-PUSH" -a; \
>>>>>>>         git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ HEAD | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>>>         git reset --mixed HEAD^; \
>>>>>>>     else \
>>>>>>>         git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ Pacemaker-1.1.11-rc3 | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>>>     fi; \
>>>>>>>     echo `date`: Rebuilt ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>>> else \
>>>>>>>     echo `date`: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>>> fi
>>>>>>> Mon Jan 13 13:23:21 MSK 2014: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
>>>>>>> .......
>>>>>>>
>>>>>>> Well, "make rpm" - it built the rpms and I created a cluster.
>>>>>>> I ran the same tests and confirmed the behavior.
>>>>>>> crm_report log here - http://send2me.ru/crmrep.tar.bz2
>>>>>> Thanks!
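
A note on the "Because its not building your local changes" diagnosis above. Judging from the Makefile fragment visible in the quoted "make rc" log, the cached tarball short-circuits the archive step, and only the "dirty" TAG path picks up uncommitted local changes (it temporarily commits the working tree and archives HEAD). A sketch of the implied workaround, inferred from that fragment only:

# Remove the cached tarball so the archive step is not skipped:
rm -f ClusterLabs-pacemaker-*.tar.gz
# Then build through the "dirty" path so local patches are archived
# (per the visible Makefile logic; plain "make rpm", which eventually
# worked above, appears to take a similar route):
make TAG=dirty rpm
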
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org