On 10 Jan 2014, at 9:55 pm, Andrey Groshev <gre...@yandex.ru> wrote:
> 
> 
> 10.01.2014, 14:31, "Andrey Groshev" <gre...@yandex.ru>:
>> 10.01.2014, 14:01, "Andrew Beekhof" <and...@beekhof.net>:
>>
>>> On 10 Jan 2014, at 5:03 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>
>>>> 10.01.2014, 05:29, "Andrew Beekhof" <and...@beekhof.net>:
>>>>
>>>>> On 9 Jan 2014, at 11:11 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>
>>>>>> 08.01.2014, 06:22, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>
>>>>>>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>
>>>>>>>> Hi, ALL.
>>>>>>>>
>>>>>>>> I'm still trying to cope with the fact that after the fence the
>>>>>>>> node hangs in "pending".
>>>>>>>
>>>>>>> Please define "pending". Where did you see this?
>>>>>>
>>>>>> In crm_mon:
>>>>>> ......
>>>>>> Node dev-cluster2-node2 (172793105): pending
>>>>>> ......
>>>>>>
>>>>>> The experiment was like this:
>>>>>> Four nodes in the cluster.
>>>>>> On one of them, kill corosync or pacemakerd (signal 4, 6 or 11).
>>>>>> After that, the remaining nodes constantly reboot it, under various
>>>>>> pretexts: "softly whistling", "fly low", "not a cluster member!" ...
>>>>>> Then "Too many failures ...." falls out in the log.
>>>>>> All this time the status in crm_mon is "pending".
>>>>>> Depending on the wind direction it changed to "UNCLEAN".
>>>>>> Much time has passed and I cannot accurately describe the
>>>>>> behavior...
>>>>>>
>>>>>> Now I am in the following state:
>>>>>> I tried to locate the problem and came here with this.
>>>>>> I set a big value in the property stonith-timeout="600s"
>>>>>> and got the following behavior:
>>>>>> 1. pkill -4 corosync
>>>>>> 2. The node with the DC calls my fence agent "sshbykey".
>>>>>> 3. It sends a reboot to the victim and waits until it comes back to life.
>>>>>
>>>>> Hmmm.... what version of pacemaker?
>>>>> This sounds like a timing issue that we fixed a while back
>>>>
>>>> It was version 1.1.11 from December 3.
>>>> Now I will do a full update and retest.
>>>
>>> That should be recent enough. Can you create a crm_report the next time
>>> you reproduce?
>>
>> Of course yes. Little delay.... :)
>>
>> ......
>> cc1: warnings being treated as errors
>> upstart.c: In function ‘upstart_job_property’:
>> upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’
>> upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
>> upstart.c:264: error: assignment makes pointer from integer without a cast
>> gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>> gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>> make[1]: *** [all-recursive] Error 1
>> make[1]: Leaving directory `/root/ha/pacemaker/lib'
>> make: *** [core] Error 1
>>
>> I'm trying to solve this problem.
>
> It will not get solved quickly...
>
> https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
> g_variant_lookup_value ()    Since 2.28
>
> # yum list installed glib2
> Loaded plugins: fastestmirror, rhnplugin, security
> This system is receiving updates from RHN Classic or Red Hat Satellite.
> Loading mirror speeds from cached hostfile
> Installed Packages
> glib2.x86_64        2.26.1-3.el6        installed
>
> # cat /etc/issue
> CentOS release 6.5 (Final)
> Kernel \r on an \m

Can you try this patch?
Upstart jobs won't work, but the code will compile:

diff --git a/lib/services/upstart.c b/lib/services/upstart.c
index 831e7cf..195c3a4 100644
--- a/lib/services/upstart.c
+++ b/lib/services/upstart.c
@@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
 static char *
 upstart_job_property(const char *obj, const gchar * iface, const char *name)
 {
+    char *output = NULL;
+
+#if !GLIB_CHECK_VERSION(2,28,0)
+    static bool err = TRUE;
+
+    if(err) {
+        crm_err("This version of glib is too old to support upstart jobs");
+        err = FALSE;
+    }
+#else
     GError *error = NULL;
     GDBusProxy *proxy;
     GVariant *asv = NULL;
     GVariant *value = NULL;
     GVariant *_ret = NULL;
-    char *output = NULL;
 
     crm_info("Calling GetAll on %s", obj);
     proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
@@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * iface, const char *name)
 
     g_object_unref(proxy);
     g_variant_unref(_ret);
+#endif
 
     return output;
 }

>
>
>>>>>> Once the script makes sure that the victim has rebooted and is again
>>>>>> available via ssh, it exits with 0.
>>>>>> All commands are logged on both the victim and the killer - all right.
>>>>>> 4. A little later, the status of the (victim) node in crm_mon
>>>>>> changes to online.
>>>>>> 5. BUT... not one resource starts! Despite the fact that
>>>>>> "crm_simulate -sL" shows the correct resource to start:
>>>>>>  * Start pingCheck:3 (dev-cluster2-node2)
>>>>>> 6. In this state we spend the next 600 seconds.
>>>>>> After this timeout expires, another node (not the DC) decides
>>>>>> to kill our victim again.
>>>>>> All commands are again logged on both the victim and the killer - all
>>>>>> documented :)
>>>>>> 7. NOW all resources start in the right sequence.
>>>>>>
>>>>>> I am almost happy, but I do not like it: two reboots and 10 minutes of
>>>>>> waiting ;)
>>>>>> And if something happens on another node, this behavior is
>>>>>> superimposed on the old one, and no resources start until the last node
>>>>>> has been rebooted twice.
>>>>>>
>>>>>> I tried to understand this behavior.
>>>>>> As I understand it:
>>>>>> 1. Ultimately, ./lib/fencing/st_client.c calls
>>>>>> internal_stonith_action_execute().
>>>>>> 2. It forks and creates a pipe from the child.
>>>>>> 3. It asynchronously calls mainloop_child_add with a callback to
>>>>>> stonith_action_async_done.
>>>>>> 4. It adds a timeout with g_timeout_add for the TERM and KILL signals.
>>>>>>
>>>>>> If all goes right, stonith_action_async_done is called and the timeout
>>>>>> is removed.
>>>>>> For some reason this does not happen. I sit and think ....
>>>>>>
>>>>>>>> At this time, there are constant re-elections.
>>>>>>>> Also, I noticed a difference in how pacemaker starts.
>>>>>>>> At normal startup:
>>>>>>>> * corosync
>>>>>>>> * pacemakerd
>>>>>>>> * attrd
>>>>>>>> * pengine
>>>>>>>> * lrmd
>>>>>>>> * crmd
>>>>>>>> * cib
>>>>>>>>
>>>>>>>> When it hangs at start:
>>>>>>>> * corosync
>>>>>>>> * pacemakerd
>>>>>>>> * attrd
>>>>>>>> * pengine
>>>>>>>> * crmd
>>>>>>>> * lrmd
>>>>>>>> * cib
>>>>>>>
>>>>>>> Are you referring to the order of the daemons here?
>>>>>>> The cib should not be at the bottom in either case.
>>>>>>>
>>>>>>>> Who knows who runs lrmd?
>>>>>>>
>>>>>>> Pacemakerd.
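
As an aside, not part of the patch above: if upstart support were actually wanted on glib 2.26, the missing g_variant_lookup_value() could probably be approximated by walking the a{sv} dictionary by hand with GVariantIter APIs that have existed since glib 2.24. A rough, untested sketch; lookup_value_compat is only an illustrative name:

#include <string.h>
#include <glib.h>

/* Hypothetical stand-in for g_variant_lookup_value() on GLib < 2.28:
 * walk an "a{sv}" dictionary and return a new reference to the value
 * stored under 'key', or NULL if it is not present. */
static GVariant *
lookup_value_compat(GVariant *dict, const char *key)
{
    GVariantIter iter;
    gchar *k = NULL;
    GVariant *v = NULL;

    g_variant_iter_init(&iter, dict);
    while (g_variant_iter_next(&iter, "{sv}", &k, &v)) {
        if (strcmp(k, key) == 0) {
            g_free(k);
            return v;            /* caller owns this reference */
        }
        g_free(k);
        g_variant_unref(v);
    }
    return NULL;
}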
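
For anyone following the st_client.c analysis above: the fork / child-watch / timeout dance boils down to roughly the pattern below. This is only a minimal sketch in plain GLib rather than Pacemaker's own mainloop wrappers, and the "sleep 2" child and 10-second timeout are placeholders for the real fence agent and stonith-timeout:

#include <glib.h>
#include <unistd.h>
#include <signal.h>

static GMainLoop *loop = NULL;
static guint timer_id = 0;

/* Called by the main loop when the child exits; cancels the pending timeout. */
static void
child_done(GPid pid, gint status, gpointer user_data)
{
    if (timer_id) {
        g_source_remove(timer_id);        /* the "remove timeout" step */
        timer_id = 0;
    }
    g_message("fence agent %d finished, status %d", (int) pid, status);
    g_spawn_close_pid(pid);
    g_main_loop_quit(loop);
}

/* Fires only if the child did not finish in time: escalate with SIGTERM. */
static gboolean
child_timeout(gpointer user_data)
{
    pid_t pid = GPOINTER_TO_INT(user_data);

    g_warning("fence agent %d timed out, sending SIGTERM", (int) pid);
    kill(pid, SIGTERM);
    timer_id = 0;
    return FALSE;                         /* one-shot timeout */
}

int
main(void)
{
    pid_t pid = fork();

    if (pid == 0) {                       /* child: stand-in for the fence agent */
        execlp("sleep", "sleep", "2", (char *) NULL);
        _exit(127);
    }

    loop = g_main_loop_new(NULL, FALSE);
    g_child_watch_add(pid, child_done, NULL);               /* async completion callback */
    timer_id = g_timeout_add(10000, child_timeout, GINT_TO_POINTER(pid));
    g_main_loop_run(loop);
    return 0;
}

If the completion callback is never invoked, the timeout is the only thing left to fire, which would match the "nothing happens until the next fencing escalation" behaviour described above.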
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org