10.01.2014, 14:31, "Andrey Groshev" <gre...@yandex.ru>:
> 10.01.2014, 14:01, "Andrew Beekhof" <and...@beekhof.net>:
>
>> On 10 Jan 2014, at 5:03 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>> 10.01.2014, 05:29, "Andrew Beekhof" <and...@beekhof.net>:
>>>> On 9 Jan 2014, at 11:11 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>> 08.01.2014, 06:22, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>> Hi, ALL.
>>>>>>>
>>>>>>> I'm still trying to cope with the fact that after a fence, the node
>>>>>>> hangs in "pending".
>>>>>> Please define "pending". Where did you see this?
>>>>> In crm_mon:
>>>>> ......
>>>>> Node dev-cluster2-node2 (172793105): pending
>>>>> ......
>>>>>
>>>>> The experiment went like this:
>>>>> Four nodes in the cluster.
>>>>> On one of them, kill corosync or pacemakerd (signal 4, 6, or 11).
>>>>> After that, the remaining nodes constantly reboot it, under various
>>>>> pretexts: "softly whistling", "flying low", "not a cluster member!" ...
>>>>> Then "Too many failures ...." fell out in the log.
>>>>> All this time the status in crm_mon is "pending".
>>>>> Depending on the wind direction, it changed to "UNCLEAN".
>>>>> Much time has passed and I can no longer describe the behavior accurately...
>>>>>
>>>>> Now I am in the following state:
>>>>> I tried to locate the problem and came here with this.
>>>>> I set a large value in the property stonith-timeout="600s"
>>>>> and got the following behavior:
>>>>> 1. pkill -4 corosync
>>>>> 2. The node with the DC calls my fence agent "sshbykey".
>>>>> 3. It sends a reboot to the victim and waits until it comes back to life.
>>>> Hmmm.... what version of pacemaker?
>>>> This sounds like a timing issue that we fixed a while back
>>> It was version 1.1.11 from December 3.
>>> I'll now do a full update and retest.
>> That should be recent enough. Can you create a crm_report the next time
>> you reproduce?
>
> Of course, yes. A little delay.... :)
>
> ......
> cc1: warnings being treated as errors
> upstart.c: In function ‘upstart_job_property’:
> upstart.c:264: error: implicit declaration of function
> ‘g_variant_lookup_value’
> upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
> upstart.c:264: error: assignment makes pointer from integer without a cast
> gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
> gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory `/root/ha/pacemaker/lib'
> make: *** [core] Error 1
>
> I'm trying to solve this problem.
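On the build failure above: g_variant_lookup_value() was only added in glib
2.28, and the yum output below shows EL6 ships 2.26, so the compiler treats
the call as an implicit declaration. A version-guarded local fallback along
the following lines might unblock the build; this is only a sketch, assuming
the variant being searched is an a{sv} dictionary (the usual shape of a D-Bus
property map), and not necessarily the fix that went upstream:

#if !GLIB_CHECK_VERSION(2, 28, 0)
/* Local fallback for glib < 2.28, which lacks g_variant_lookup_value().
 * Assumes 'dictionary' is an a{sv} GVariant.  Returns a new reference
 * to the matching value, or NULL if the key is not found. */
static GVariant *
g_variant_lookup_value(GVariant *dictionary, const gchar *key,
                       const GVariantType *expected_type)
{
    GVariantIter iter;
    gchar *name = NULL;
    GVariant *value = NULL;

    g_variant_iter_init(&iter, dictionary);
    while (g_variant_iter_next(&iter, "{sv}", &name, &value)) {
        if (g_strcmp0(name, key) == 0
            && (expected_type == NULL
                || g_variant_is_of_type(value, expected_type))) {
            g_free(name);
            return value;               /* caller owns this reference */
        }
        g_free(name);
        g_variant_unref(value);
    }
    return NULL;
}
#endif

Dropped near the top of upstart.c after the glib includes, this compiles
away on 2.28+ because GLIB_CHECK_VERSION is evaluated against the installed
headers.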
It's not getting solved quickly...

https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
g_variant_lookup_value ()
Since 2.28

# yum list installed glib2
Loaded plugins: fastestmirror, rhnplugin, security
This system is receiving updates from RHN Classic or Red Hat Satellite.
Loading mirror speeds from cached hostfile
Installed Packages
glib2.x86_64    2.26.1-3.el6    installed

# cat /etc/issue
CentOS release 6.5 (Final)
Kernel \r on an \m

>>>>> Once the script has made sure that the victim has rebooted and is
>>>>> reachable again via ssh, it exits with 0.
>>>>> Every command is logged on both the victim and the killer - all right.
>>>>> 4. A little later, the status of the victim node in crm_mon
>>>>> changes to online.
>>>>> 5. BUT... not a single resource starts! Even though "crm_simulate -sL"
>>>>> shows the correct resource to start:
>>>>> * Start pingCheck:3 (dev-cluster2-node2)
>>>>> 6. We spend the next 600 seconds in this state.
>>>>> When this timeout expires, another node (not the DC) decides
>>>>> to kill our victim again.
>>>>> Every command is again logged on both the victim and the killer - all
>>>>> documented :)
>>>>> 7. NOW all resources start in the right sequence.
>>>>>
>>>>> I'm almost happy, but I don't like it: two reboots and 10 minutes of
>>>>> waiting ;)
>>>>> And if something happens on another node, this behavior is
>>>>> superimposed on the old one, and no resources start until the last node
>>>>> has been rebooted twice.
>>>>>
>>>>> I tried to understand this behavior.
>>>>> As I understand it:
>>>>> 1. Ultimately, internal_stonith_action_execute() in
>>>>> ./lib/fencing/st_client.c is called.
>>>>> 2. It forks and sets up a pipe to the child.
>>>>> 3. It asynchronously registers the child via mainloop_child_add with a
>>>>> callback to stonith_action_async_done.
>>>>> 4. It adds a timeout with g_timeout_add for the TERM and KILL signals.
>>>>>
>>>>> If everything goes right, stonith_action_async_done must be called and
>>>>> the timeout removed. For some reason this does not happen. I sit and
>>>>> think ....
>>>>>>> At this time, there are constant re-elections.
>>>>>>> Also, I noticed a difference in how pacemaker starts.
>>>>>>> At a normal startup:
>>>>>>> * corosync
>>>>>>> * pacemakerd
>>>>>>> * attrd
>>>>>>> * pengine
>>>>>>> * lrmd
>>>>>>> * crmd
>>>>>>> * cib
>>>>>>>
>>>>>>> When it hangs at startup:
>>>>>>> * corosync
>>>>>>> * pacemakerd
>>>>>>> * attrd
>>>>>>> * pengine
>>>>>>> * crmd
>>>>>>> * lrmd
>>>>>>> * cib
>>>>>> Are you referring to the order of the daemons here?
>>>>>> The cib should not be at the bottom in either case.
>>>>>>> Who knows who runs lrmd?
>>>>>> Pacemakerd.
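To see the pattern from steps 1-4 in isolation, here is a minimal,
self-contained sketch of the same completion-callback-vs-timeout race using
plain glib (g_child_watch_add()/g_timeout_add()) rather than pacemaker's
mainloop_child_add() wrapper; the names action_t, child_done and
action_timed_out are illustrative, not pacemaker's:

#include <glib.h>

typedef struct {
    guint timer_id;      /* escalation timer, disarmed on completion */
    GMainLoop *loop;
} action_t;

/* Completion path; cf. stonith_action_async_done. */
static void
child_done(GPid pid, gint status, gpointer data)
{
    action_t *a = data;

    g_print("child %d exited, status %d\n", (int) pid, status);
    if (a->timer_id) {
        g_source_remove(a->timer_id);   /* disarm the escalation timer */
    }
    g_spawn_close_pid(pid);
    g_main_loop_quit(a->loop);
}

/* Timeout path; cf. the TERM/KILL escalation in st_client.c. */
static gboolean
action_timed_out(gpointer data)
{
    action_t *a = data;

    a->timer_id = 0;
    g_print("action timed out; would send SIGTERM, then SIGKILL\n");
    g_main_loop_quit(a->loop);
    return FALSE;                       /* one-shot timer */
}

int
main(void)
{
    action_t a = { 0, g_main_loop_new(NULL, FALSE) };
    gchar *argv[] = { "/bin/sleep", "2", NULL };
    GPid pid;

    /* DO_NOT_REAP_CHILD is required for g_child_watch_add to work. */
    g_spawn_async(NULL, argv, NULL, G_SPAWN_DO_NOT_REAP_CHILD,
                  NULL, NULL, &pid, NULL);
    g_child_watch_add(pid, child_done, &a);                  /* completion */
    a.timer_id = g_timeout_add(5000, action_timed_out, &a);  /* timeout */
    g_main_loop_run(a.loop);
    return 0;
}

With sleep 2 and a 5-second timer the completion path wins; swap the two
durations and the timeout path fires instead.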
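If the completion callback is never invoked, the timeout path is the only
thing left to fire, which would match the observed behavior: nothing happens
for the full stonith-timeout (600 s here), and only the second fencing
attempt gets the resources started.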
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org