10.01.2014, 14:01, "Andrew Beekhof" <and...@beekhof.net>:
> On 10 Jan 2014, at 5:03 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>
>> 10.01.2014, 05:29, "Andrew Beekhof" <and...@beekhof.net>:
>>> On 9 Jan 2014, at 11:11 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>> 08.01.2014, 06:22, "Andrew Beekhof" <and...@beekhof.net>:
>>>>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>> Hi, ALL.
>>>>>>
>>>>>> I'm still trying to cope with the fact that after a fence the node
>>>>>> hangs in "pending".
>>>>> Please define "pending". Where did you see this?
>>>> In crm_mon:
>>>> ......
>>>> Node dev-cluster2-node2 (172793105): pending
>>>> ......
>>>>
>>>> The experiment was like this:
>>>> Four nodes in the cluster.
>>>> On one of them, kill corosync or pacemakerd (signal 4, 6 or 11).
>>>> After that, the remaining nodes constantly reboot it, under various
>>>> pretexts: "softly whistling", "flying low", "not a cluster member!" ...
>>>> Then "Too many failures ...." appeared in the log.
>>>> All this time the status in crm_mon was "pending".
>>>> Depending on the wind direction it changed to "UNCLEAN".
>>>> Much time has passed and I can no longer describe the behaviour accurately...
>>>>
>>>> Now I am in the following state:
>>>> I tried to locate the problem and came here with this.
>>>> I set a big value in the property stonith-timeout="600s"
>>>> and got the following behaviour:
>>>> 1. pkill -4 corosync
>>>> 2. The DC node calls my fence agent "sshbykey".
>>>> 3. It sends a reboot to the victim and waits until it comes back to life.
>>> Hmmm.... what version of pacemaker?
>>> This sounds like a timing issue that we fixed a while back
>> It was version 1.1.11 from December 3.
>> I am now trying a full update and retest.
>
> That should be recent enough. Can you create a crm_report the next time you
> reproduce?
>
Of course, yes. A little delay.... :)

......
cc1: warnings being treated as errors
upstart.c: In function ‘upstart_job_property’:
upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’
upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
upstart.c:264: error: assignment makes pointer from integer without a cast
gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/root/ha/pacemaker/lib'
make: *** [core] Error 1

I'm trying to solve this problem.

>>>> Once the script makes sure that the victim has rebooted and is again
>>>> reachable via ssh, it exits with 0.
>>>> Every command is logged on both the victim and the killer - all right.
>>>> 4. A little later, the status of the victim node in crm_mon changes
>>>> to online.
>>>> 5. BUT... not a single resource starts! Even though "crm_simulate -sL"
>>>> shows the correct resource to start:
>>>> * Start pingCheck:3 (dev-cluster2-node2)
>>>> 6. In this state we spend the next 600 seconds.
>>>> When this timeout expires, another node (not the DC) decides to
>>>> kill our victim again.
>>>> Every command is again logged on both the victim and the killer - all
>>>> documented :)
>>>> 7. NOW all resources start in the right sequence.
>>>>
>>>> I am almost happy, but I do not like it: two reboots and 10 minutes of
>>>> waiting ;)
>>>> And if something happens on another node, this behaviour is
>>>> superimposed on the old one, and no resources start until the last node
>>>> has been rebooted twice.
>>>>
>>>> I tried to understand this behaviour.
>>>> As I understand it:
>>>> 1. Ultimately, ./lib/fencing/st_client.c calls
>>>> internal_stonith_action_execute().
>>>> 2. It forks and creates a pipe from the child.
>>>> 3. It asynchronously calls mainloop_child_add with a callback to
>>>> stonith_action_async_done.
>>>> 4. It adds a timeout with g_timeout_add for the TERM and KILL signals.
>>>>
>>>> If everything goes right, stonith_action_async_done is called and the
>>>> timeout is removed. For some reason this does not happen. I sit and think ....
>>>>>> At this time, there are constant re-elections.
>>>>>> Also, I noticed a difference in how pacemaker starts.
>>>>>> At normal startup:
>>>>>> * corosync
>>>>>> * pacemakerd
>>>>>> * attrd
>>>>>> * pengine
>>>>>> * lrmd
>>>>>> * crmd
>>>>>> * cib
>>>>>>
>>>>>> When it hangs at startup:
>>>>>> * corosync
>>>>>> * pacemakerd
>>>>>> * attrd
>>>>>> * pengine
>>>>>> * crmd
>>>>>> * lrmd
>>>>>> * cib
>>>>> Are you referring to the order of the daemons here?
>>>>> The cib should not be at the bottom in either case.
>>>>>> Who knows who runs lrmd?
>>>>> Pacemakerd.
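For what it's worth, g_variant_lookup_value() only appeared in GLib 2.28, so
the implicit-declaration error above is what the compiler reports when
building against an older GLib. A minimal sketch of the kind of version guard
that avoids it - the helper name lookup_prop() and the surrounding code are
only illustrative, not the actual upstart.c:

#include <glib.h>

/* Look up "name" in an a{sv} dictionary, returning a new reference or NULL. */
static GVariant *
lookup_prop(GVariant *asv, const char *name)
{
#if GLIB_CHECK_VERSION(2, 28, 0)
    /* Available since GLib 2.28. */
    return g_variant_lookup_value(asv, name, NULL);
#else
    /* Older GLib: iterate the dictionary by hand. */
    GVariantIter iter;
    gchar *key = NULL;
    GVariant *value = NULL;

    g_variant_iter_init(&iter, asv);
    while (g_variant_iter_next(&iter, "{sv}", &key, &value)) {
        gboolean match = (g_strcmp0(key, name) == 0);

        g_free(key);
        if (match) {
            return value;   /* caller unrefs */
        }
        g_variant_unref(value);
    }
    return NULL;
#endif
}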
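To make the code path in items 1-4 above easier to follow, here is a minimal,
self-contained sketch of the same fork + child-watch + timeout pattern using
plain GLib (g_child_watch_add() stands in for pacemaker's mainloop_child_add()
wrapper, and names like agent_exited / agent_timed_out are made up for
illustration). The point is that if the exit callback never fires, the only
path left is the timeout:

#include <glib.h>
#include <stdlib.h>

typedef struct {
    GMainLoop *loop;
    guint      timer_id;
} agent_ctx_t;

/* Called when the forked agent exits - the "good" path. */
static void
agent_exited(GPid pid, gint status, gpointer data)
{
    agent_ctx_t *ctx = data;

    g_spawn_close_pid(pid);
    if (ctx->timer_id) {
        g_source_remove(ctx->timer_id);   /* disarm the stonith timeout */
        ctx->timer_id = 0;
    }
    g_print("agent exited with status %d\n", status);
    g_main_loop_quit(ctx->loop);
}

/* Fires only if the exit callback above never ran in time. */
static gboolean
agent_timed_out(gpointer data)
{
    agent_ctx_t *ctx = data;

    ctx->timer_id = 0;
    g_print("agent timed out - escalating (TERM/KILL)\n");
    g_main_loop_quit(ctx->loop);
    return FALSE;
}

int
main(void)
{
    agent_ctx_t ctx = { 0 };
    gchar *argv[] = { "/bin/sleep", "2", NULL };   /* stands in for the fence agent */
    GPid pid;

    ctx.loop = g_main_loop_new(NULL, FALSE);

    /* Fork the "agent" without blocking the mainloop. */
    if (!g_spawn_async(NULL, argv, NULL, G_SPAWN_DO_NOT_REAP_CHILD,
                       NULL, NULL, &pid, NULL)) {
        return EXIT_FAILURE;
    }

    /* Watch for the child's exit ... */
    g_child_watch_add(pid, agent_exited, &ctx);
    /* ... and arm a timeout (600s in my setup; 5s here). */
    ctx.timer_id = g_timeout_add_seconds(5, agent_timed_out, &ctx);

    g_main_loop_run(ctx.loop);
    g_main_loop_free(ctx.loop);
    return EXIT_SUCCESS;
}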
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org