On Mon, Jan 16, 2012 at 5:45 PM, Vladislav Bogdanov <bub...@hoster-ok.com> wrote:
> 16.01.2012 09:20, Andrew Beekhof wrote:
> [snip]
>>>> At the same time, stonith_admin -B succeeds.
>>>> The main difference I see is st_opt_sync_call in the latter case.
>>>> Will try to experiment with it.
>>>
>>> Yeeeesssss!!!
>>>
>>> Now I see the following:
>>> Dec 19 11:53:34 vd01-a cluster-dlm: [2474]: info:
>>> pacemaker_terminate_member: Requesting that node 1090782474/vd01-b be fenced
>>
>> So the important question... what did you change?
>
> Nice you're back ;)
>
> + rc = st->cmds->fence(st, *st_opt_sync_call*, node_uname, "reboot", 120);
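[For context, a minimal, self-contained sketch of the kind of synchronous fencing request being discussed, assuming the stonith-ng client API of that era (stonith_api_new(), connect(), fence(), disconnect(), stonith_api_delete()). The helper name fence_node_sync_sketch and the client name are illustrative only and are not taken from the attached pacemaker.c:]

#include <crm/stonith-ng.h>   /* stonith-ng client API header; path may differ by release */

/* Illustrative helper (not from the attached pacemaker.c): synchronously ask
 * stonith-ng to reboot a node, mirroring the st_opt_sync_call change quoted above. */
static int
fence_node_sync_sketch(const char *node_uname)
{
    int rc = -1;
    stonith_t *st = stonith_api_new();

    if (st == NULL) {
        return rc;
    }

    /* Connect as an ad-hoc client; the client name is arbitrary. */
    rc = st->cmds->connect(st, "fence-sketch", NULL);
    if (rc == 0) {
        /* With st_opt_sync_call the call blocks until the fencing operation
         * finishes (or the 120s timeout expires) instead of returning as soon
         * as the request has been queued. */
        rc = st->cmds->fence(st, st_opt_sync_call, node_uname, "reboot", 120);
        st->cmds->disconnect(st);
    }

    stonith_api_delete(st);
    return rc;
}

[The only functional difference from the asynchronous variant is the st_opt_sync_call flag, which makes fence() wait for the operation's result.]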
I'm really struggling to see how changing anything here can affect whether
the log message /before/ it gets printed.

> Attaching my resulting version of pacemaker.c (which is still messy
> because of the different approaches I tried and needs a cleanup). The
> function to look at is pacemaker_terminate_member(), which is an almost
> one-to-one copy of crm_terminate_member_no_mainloop(), except for a
> variable rename to compile without warnings and a change of the
> ->fence() arguments.
>
>>
>>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info:
>>> initiate_remote_stonith_op: Initiating remote operation reboot for
>>> vd01-b: 21425fc0-4311-40fa-9647-525c3f258471
>>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: crm_get_peer: Node
>>> vd01-c now has id: 1107559690
>>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: stonith_command:
>>> Processed st_query from vd01-c: rc=0
>>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: crm_get_peer: Node
>>> vd01-d now has id: 1124336906
>>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: stonith_command:
>>> Processed st_query from vd01-d: rc=0
>>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: stonith_command:
>>> Processed st_query from vd01-a: rc=0
>>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: call_remote_stonith:
>>> Requesting that vd01-c perform op reboot vd01-b
>>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: crm_get_peer: Node
>>> vd01-b now has id: 1090782474
>>> ...
>>> Dec 19 11:53:40 vd01-a stonith-ng: [1905]: info: stonith_command:
>>> Processed st_fence_history from cluster-dlm: rc=0
>>> Dec 19 11:53:40 vd01-a crmd: [1910]: info: tengine_stonith_notify: Peer
>>> vd01-b was terminated (reboot) by vd01-c for vd01-a
>>> (ref=21425fc0-4311-40fa-9647-525c3f258471): OK
>>>
>>> But then I see a minor issue: the node is marked to be fenced again:
>>> Dec 19 11:53:40 vd01-a pengine: [1909]: WARN: pe_fence_node: Node vd01-b
>>> will be fenced because it is un-expectedly down
>>
>> Do you have logs for that?
>> tengine_stonith_notify() got called, that should have been enough to
>> get the node cleaned up in the cib.
>
> Ugh, it seems so, but they are archived already. I will restore them to
> the nodes and try to compose an hb_report for them (but the pe inputs
> are already lost; do you still need the logs without them?)
>
>>
>>> ...
>>> Dec 19 11:53:40 vd01-a pengine: [1909]: WARN: stage6: Scheduling Node
>>> vd01-b for STONITH
>>> ...
>>> Dec 19 11:53:40 vd01-a crmd: [1910]: info: te_fence_node: Executing
>>> reboot fencing operation (249) on vd01-b (timeout=60000)
>>> ...
>>> Dec 19 11:53:40 vd01-a stonith-ng: [1905]: info: call_remote_stonith:
>>> Requesting that vd01-c perform op reboot vd01-b
>>>
>>> And so on.
>>>
>>> I can't investigate this one in more depth, because I use fence_xvm in
>>> this testing cluster, and it has issues when running more than one
>>> stonith resource on a node. Also, my RA (in the cluster where this
>>> testing cluster runs) undefines the VM after a failure, so fence_xvm
>>> does not see the fencing victim in qpid and is unable to fence it again.
>>>
>>> Maybe it is possible to check whether the node was just fenced and skip
>>> the redundant fencing?
>>
>> If the callbacks are being used correctly, it shouldn't be required
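[For context on the callback remark above: in the crmd of that era, stonith-ng pushes fencing results back through a registered notification handler, the tengine_stonith_notify() seen in the logs, and that handler is what gets the fenced node cleaned up in the CIB so the PE does not schedule it for fencing again. A rough sketch of that wiring follows, assuming the 1.1.x-era register_notification() API and its (stonith_t *, const char *, xmlNode *) handler signature; later releases changed the handler to take a stonith_event_t, and the names fence_result_sketch and register_fence_notifications are illustrative only:]

#include <libxml/tree.h>      /* xmlNode */
#include <crm/stonith-ng.h>   /* stonith-ng client API; path may differ by release */

/* Illustrative handler: stonith-ng invokes it for every completed fencing
 * operation.  In crmd this role is played by tengine_stonith_notify(), which
 * records the result in the CIB so the PE does not re-schedule fencing. */
static void
fence_result_sketch(stonith_t *st, const char *event, xmlNode *msg)
{
    /* Parse msg for the target, origin and result of the operation and act
     * on it (crmd updates the node's state in the CIB at this point). */
}

/* Assumed 1.1.x-era registration call; T_STONITH_NOTIFY_FENCE is the event
 * emitted when a fencing operation completes. */
static int
register_fence_notifications(stonith_t *st)
{
    return st->cmds->register_notification(st, T_STONITH_NOTIFY_FENCE,
                                           fence_result_sketch);
}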