On 10 Jan 2014, at 5:03 pm, Andrey Groshev <gre...@yandex.ru> wrote:
> 10.01.2014, 05:29, "Andrew Beekhof" <and...@beekhof.net>:
>
>> On 9 Jan 2014, at 11:11 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>
>>> 08.01.2014, 06:22, "Andrew Beekhof" <and...@beekhof.net>:
>>>
>>>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>
>>>>> Hi, ALL.
>>>>>
>>>>> I'm still trying to work out why, after being fenced, a node hangs in
>>>>> "pending".
>>>>
>>>> Please define "pending". Where did you see this?
>>>
>>> In crm_mon:
>>> ......
>>> Node dev-cluster2-node2 (172793105): pending
>>> ......
>>>
>>> The experiment was like this:
>>> Four nodes in the cluster.
>>> On one of them I kill corosync or pacemakerd (signal 4, 6 or 11).
>>> After that, the remaining nodes keep rebooting it under various
>>> pretexts ("not a cluster member!" and so on).
>>> Then "Too many failures ...." appears in the log.
>>> All this time the node's status in crm_mon is "pending".
>>> Depending on the wind direction, it changes to "UNCLEAN".
>>> Much time has passed and I cannot describe that behaviour precisely any more...
>>>
>>> Here is where I am now:
>>> I tried to locate the problem, and this is what I found.
>>> I set a large value for the stonith-timeout property ("600s").
>>> And got the following behaviour:
>>> 1. pkill -4 corosync
>>> 2. The DC node calls my fence agent "sshbykey".
>>> 3. It reboots the victim and waits until it comes back to life.
>>
>> Hmmm.... what version of pacemaker?
>> This sounds like a timing issue that we fixed a while back
>
> It was version 1.1.11 from December 3.
> I will now do a full update and retest.

That should be recent enough. Can you create a crm_report the next time you reproduce?

>>> Once the script has made sure that the victim has rebooted and is
>>> reachable again via ssh, it exits with 0.
>>> All commands are logged on both the victim and the killer - everything looks right.
>>> 4. A little later, the status of the victim node in crm_mon changes to
>>> online.
>>> 5. BUT... not a single resource starts! Even though "crm_simulate -sL"
>>> shows the correct resource to start:
>>> * Start pingCheck:3 (dev-cluster2-node2)
>>> 6. We spend the next 600 seconds in this state.
>>> When this timeout expires, another node (not the DC) decides to kill
>>> our victim again.
>>> All commands are again logged on both the victim and the killer - all
>>> documented :)
>>> 7. NOW all resources start in the right sequence.
>>>
>>> I am almost happy, but I don't like the two reboots and the 10 minutes of
>>> waiting ;)
>>> And if something happens on another node, this behaviour is superimposed
>>> on the old one, and no resources start until the last node has rebooted
>>> twice.
>>>
>>> I tried to understand this behaviour.
>>> As I understand it:
>>> 1. Ultimately, ./lib/fencing/st_client.c calls
>>> internal_stonith_action_execute().
>>> 2. It forks and creates a pipe to the child.
>>> 3. In the async case it calls mainloop_child_add with
>>> stonith_action_async_done as the callback.
>>> 4. It adds a timeout with g_timeout_add for the TERM and KILL signals.
>>>
>>> If everything goes right, stonith_action_async_done should be called and
>>> the timeout removed. For some reason this does not happen. I sit and
>>> think ....
>>>>> At this time, there are constant re-elections.
>>>>> Also, I noticed a difference in how pacemaker starts up.
>>>>> At a normal startup:
>>>>> * corosync
>>>>> * pacemakerd
>>>>> * attrd
>>>>> * pengine
>>>>> * lrmd
>>>>> * crmd
>>>>> * cib
>>>>>
>>>>> When it hangs at startup:
>>>>> * corosync
>>>>> * pacemakerd
>>>>> * attrd
>>>>> * pengine
>>>>> * crmd
>>>>> * lrmd
>>>>> * cib.
>>>> Are you referring to the order of the daemons here?
>>>> The cib should not be at the bottom in either case.
>>>>
>>>>> Who knows who runs lrmd?
>>>>
>>>> Pacemakerd.
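
For anyone following along, below is a minimal, self-contained GLib sketch of the pattern described in the quoted steps above: spawn a child, watch it asynchronously, and arm a guard timeout that is removed only when the completion callback actually runs. It deliberately uses plain g_child_watch_add()/g_timeout_add() instead of pacemaker's internal mainloop_child_add(), omits the output pipe, and all names in it are illustrative - it is not the actual st_client.c implementation.

/* Sketch only: async child execution with a guard timeout (GLib).
 * Not pacemaker code; names and values are made up for illustration. */
#include <glib.h>
#include <stdio.h>

typedef struct {
    GMainLoop *loop;
    guint      timer_id;   /* guard timeout; removed on completion */
} action_ctx_t;

/* Runs when the child exits - the analogue of stonith_action_async_done() */
static void
child_done(GPid pid, gint status, gpointer user_data)
{
    action_ctx_t *ctx = user_data;

    printf("child %d finished, status %d\n", (int) pid, status);
    if (ctx->timer_id) {
        g_source_remove(ctx->timer_id);   /* cancel the pending timeout */
        ctx->timer_id = 0;
    }
    g_spawn_close_pid(pid);
    g_main_loop_quit(ctx->loop);
}

/* Fires only if child_done() was never called in time */
static gboolean
action_timed_out(gpointer user_data)
{
    action_ctx_t *ctx = user_data;

    fprintf(stderr, "action timed out - this is where TERM/KILL would be sent\n");
    ctx->timer_id = 0;
    g_main_loop_quit(ctx->loop);
    return G_SOURCE_REMOVE;
}

int
main(void)
{
    action_ctx_t ctx = { 0 };
    GPid pid;
    gchar *argv[] = { "/bin/sleep", "2", NULL };   /* stand-in for the fence agent */

    ctx.loop = g_main_loop_new(NULL, FALSE);

    if (!g_spawn_async(NULL, argv, NULL, G_SPAWN_DO_NOT_REAP_CHILD,
                       NULL, NULL, &pid, NULL)) {
        fprintf(stderr, "failed to spawn child\n");
        return 1;
    }

    g_child_watch_add(pid, child_done, &ctx);                        /* async completion */
    ctx.timer_id = g_timeout_add(10 * 1000, action_timed_out, &ctx); /* 10s guard */

    g_main_loop_run(ctx.loop);
    g_main_loop_free(ctx.loop);
    return 0;
}

If child_done() is never invoked (for example because the child watch was never registered or the child was reaped elsewhere), only action_timed_out() ever fires - which would look a lot like the 600-second wait described in the quoted report.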
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org