On 9 Jan 2014, at 11:11 pm, Andrey Groshev <gre...@yandex.ru> wrote:

> 08.01.2014, 06:22, "Andrew Beekhof" <and...@beekhof.net>:
>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>
>>> Hi, ALL.
>>>
>>> I'm still trying to cope with the fact that after the fence, the node hangs
>>> in "pending".
>>
>> Please define "pending". Where did you see this?
> In crm_mon:
> ......
> Node dev-cluster2-node2 (172793105): pending
> ......
>
> The experiment was like this:
> Four nodes in the cluster.
> On one of them, kill corosync or pacemakerd (signal 4 or 6 or 11).
> Thereafter, the remaining nodes start to reboot it constantly, under various pretexts,
> "softly whistling", "fly low", "not a cluster member!" ...
> Then "Too many failures ...." fell out in the log.
> All this time the status in crm_mon is "pending".
> Depending on the wind direction, it changed to "UNCLEAN".
> Much time has passed and I cannot accurately describe the behavior...
>
> Now I am in the following state:
> I tried to locate the problem. Came here with this.
> I set a big value in the property stonith-timeout="600s".
> And got the following behavior:
> 1. pkill -4 corosync
> 2. The node with the DC calls my fence agent "sshbykey".
> 3. It sends a reboot to the victim and waits until she comes to life again.

Hmmm.... what version of pacemaker?
This sounds like a timing issue that we fixed a while back

> Once the script makes sure that the victim has rebooted and is again
> available via ssh, it exits with 0.
> All commands are logged on both the victim and the killer - all right.
> 4. A little later, the status of the (victim) node in crm_mon changes to
> online.
> 5. BUT... not one resource starts! Despite the fact that "crm_simulate
> -sL" shows the correct resource to start:
> * Start pingCheck:3 (dev-cluster2-node2)
> 6. In this state, we spend the next 600 seconds.
> After this timeout expires, another node (not the DC) decides to kill
> our victim again.
> All commands are again logged on both the victim and the killer - all documented
> :)
> 7. NOW all resources started in the right sequence.
>
> I am almost happy, but I do not like: two reboots and 10 minutes of waiting ;)
> And if something happens on another node, this behavior is superimposed
> on the old one and no resources start until the last node has rebooted
> twice.
>
> I tried to understand this behavior.
> As I understand it:
> 1. Ultimately, ./lib/fencing/st_client.c calls
> internal_stonith_action_execute().
> 2. It makes a fork and a pipe from it.
> 3. It asynchronously calls mainloop_child_add with a callback to stonith_action_async_done.
> 4. It adds a timeout with g_timeout_add for the TERM and KILL signals.
>
> If all goes right, it must call stonith_action_async_done and remove the timeout.
> For some reason this does not happen. I sit and think ....
>
>>> At this time, there are constant re-elections.
>>> Also, I noticed a difference when pacemaker starts.
>>> At normal startup:
>>> * corosync
>>> * pacemakerd
>>> * attrd
>>> * pengine
>>> * lrmd
>>> * crmd
>>> * cib
>>>
>>> When it hangs at start:
>>> * corosync
>>> * pacemakerd
>>> * attrd
>>> * pengine
>>> * crmd
>>> * lrmd
>>> * cib
>>
>> Are you referring to the order of the daemons here?
>> The cib should not be at the bottom in either case.
>>
>>> Who knows who runs lrmd?
>>
>> Pacemakerd.
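
For readers following the thread: the fence-action flow described in the quoted message (fork a child, watch it, arm a timeout that sends TERM and then KILL, and cancel that timeout once the completion callback runs) can be sketched roughly as below. This is only an illustration in Python, not Pacemaker's code. The names run_action, on_done, kill_grace_s and demo are hypothetical, and a blocking wait with a timeout stands in for the GLib main loop calls (mainloop_child_add, g_timeout_add) named above.

```python
import subprocess
import sys


def run_action(argv, timeout_s, on_done, kill_grace_s=1.0):
    """Run argv as a child process and call on_done(rc, timed_out) exactly once.

    Hypothetical helper: it mirrors the pattern from the thread, not any
    real Pacemaker API.
    """
    child = subprocess.Popen(argv)
    timed_out = False
    try:
        # Normal path: the child exits before the timeout, so the
        # TERM/KILL escalation below never runs.
        rc = child.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        timed_out = True
        child.terminate()  # polite request first (SIGTERM on POSIX)
        try:
            rc = child.wait(timeout=kill_grace_s)
        except subprocess.TimeoutExpired:
            child.kill()  # forceful stop (SIGKILL on POSIX)
            rc = child.wait()
    # The completion callback runs on both paths, so the caller always
    # learns the outcome of the action.
    on_done(rc, timed_out)
    return rc


def demo():
    """Run one fast action and one that exceeds its timeout."""
    results = []

    def record(rc, timed_out):
        results.append((rc, timed_out))

    run_action([sys.executable, "-c", "raise SystemExit(0)"], 5.0, record)
    run_action([sys.executable, "-c", "import time; time.sleep(30)"], 0.5, record)
    return results


if __name__ == "__main__":
    for rc, timed_out in demo():
        print("rc=%s timed_out=%s" % (rc, timed_out))
```

The point of the sketch is that the completion callback fires whether the child exits on its own or is stopped by the timeout. The symptom reported above (a successful agent exit followed by a 600 second wait) would be consistent with the completion path not being reached, though that is a guess from the description and not a confirmed diagnosis.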
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org