17.01.2012 04:01, Andrew Beekhof wrote:
> On Mon, Jan 16, 2012 at 5:45 PM, Vladislav Bogdanov
> <bub...@hoster-ok.com> wrote:
>> 16.01.2012 09:20, Andrew Beekhof wrote:
>> [snip]
>>>>> At the same time, stonith_admin -B succeeds.
>>>>> The main difference I see is st_opt_sync_call in the latter case.
>>>>> Will try to experiment with it.
>>>>
>>>> Yeeeesssss!!!
>>>>
>>>> Now I see the following:
>>>> Dec 19 11:53:34 vd01-a cluster-dlm: [2474]: info: pacemaker_terminate_member: Requesting that node 1090782474/vd01-b be fenced
>>>
>>> So the important question... what did you change?
>>
>> Nice you're back ;)
>>
>> + rc = st->cmds->fence(st, *st_opt_sync_call*, node_uname, "reboot", 120);
>
> Really struggling to see how changing anything here can impact whether
> the log message /before/ it gets printed.
Did I say it? ;)

The line of interest here is not

Dec 19 11:53:34 vd01-a cluster-dlm: [2474]: info: pacemaker_terminate_member: Requesting that node 1090782474/vd01-b be fenced

which I added in that function, but the next one:

Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: initiate_remote_stonith_op: Initiating remote operation reboot for vd01-b: 21425fc0-4311-40fa-9647-525c3f258471

which indicates that fencing actually fires (and the rest follows).

>
>>
>> Attaching my resulting version of pacemaker.c (which is still a bit of a
>> mess from the different approaches I tried and needs a cleanup). The
>> function to look at is pacemaker_terminate_member(), which is an almost
>> one-to-one copy of crm_terminate_member_no_mainloop(), except for a
>> variable rename (to compile without warnings) and the changed ->fence()
>> arguments.
>>
>>>
>>>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: initiate_remote_stonith_op: Initiating remote operation reboot for vd01-b: 21425fc0-4311-40fa-9647-525c3f258471
>>>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: crm_get_peer: Node vd01-c now has id: 1107559690
>>>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: stonith_command: Processed st_query from vd01-c: rc=0
>>>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: crm_get_peer: Node vd01-d now has id: 1124336906
>>>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: stonith_command: Processed st_query from vd01-d: rc=0
>>>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: stonith_command: Processed st_query from vd01-a: rc=0
>>>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: call_remote_stonith: Requesting that vd01-c perform op reboot vd01-b
>>>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: crm_get_peer: Node vd01-b now has id: 1090782474
>>>> ...
>>>> Dec 19 11:53:40 vd01-a stonith-ng: [1905]: info: stonith_command: Processed st_fence_history from cluster-dlm: rc=0
>>>> Dec 19 11:53:40 vd01-a crmd: [1910]: info: tengine_stonith_notify: Peer vd01-b was terminated (reboot) by vd01-c for vd01-a (ref=21425fc0-4311-40fa-9647-525c3f258471): OK
>>>>
>>>> But then I see a minor issue: the node is marked to be fenced again:
>>>> Dec 19 11:53:40 vd01-a pengine: [1909]: WARN: pe_fence_node: Node vd01-b will be fenced because it is un-expectedly down
>>>
>>> Do you have logs for that?
>>> tengine_stonith_notify() got called, that should have been enough to
>>> get the node cleaned up in the cib.
>>
>> Ugh, it seems so, but they are archived already. I will get them back to
>> the nodes and try to compose an hb_report for them (but the pe inputs are
>> already lost, do you still need the logs without them?)
>>
>>>
>>>> ...
>>>> Dec 19 11:53:40 vd01-a pengine: [1909]: WARN: stage6: Scheduling Node vd01-b for STONITH
>>>> ...
>>>> Dec 19 11:53:40 vd01-a crmd: [1910]: info: te_fence_node: Executing reboot fencing operation (249) on vd01-b (timeout=60000)
>>>> ...
>>>> Dec 19 11:53:40 vd01-a stonith-ng: [1905]: info: call_remote_stonith: Requesting that vd01-c perform op reboot vd01-b
>>>>
>>>> And so on.
>>>>
>>>> I can't investigate this one in more depth, because I use fence_xvm in
>>>> this testing cluster, and it has issues when running more than one
>>>> stonith resource on a node. Also, my RA (in the cluster where this
>>>> testing cluster runs) undefines the VM after a failure, so fence_xvm
>>>> does not see the fencing victim in qpid and is unable to fence it again.
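For reference, the change boils down to something like the sketch below. This
is not the attached pacemaker.c: the helper name request_sync_fence() and the
client name passed to connect() are purely illustrative, and error handling is
trimmed; it just shows the synchronous ->fence() call against the 1.1.x
stonith-ng client API.

/* Minimal sketch: a synchronous "reboot" fencing request through the
 * stonith-ng client API (pacemaker 1.1.x headers assumed). */
#include <crm/stonith-ng.h>

static int request_sync_fence(const char *node_uname)
{
    int rc;
    stonith_t *st = stonith_api_new();

    if (st == NULL) {
        return -1;
    }

    /* "fence-test" is an arbitrary client name for this sketch. */
    rc = st->cmds->connect(st, "fence-test", NULL);
    if (rc == stonith_ok) {
        /* st_opt_sync_call makes ->fence() block until stonith-ng reports
         * the result, instead of returning right after queuing the request. */
        rc = st->cmds->fence(st, st_opt_sync_call, node_uname, "reboot", 120);
        st->cmds->disconnect(st);
    }

    stonith_api_delete(st);
    return rc;
}

My guess (untested) is that without a mainloop running the asynchronous
request never actually gets processed, which would explain why only the
synchronous variant makes initiate_remote_stonith_op show up on
stonith-ng's side.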
>>>>
>>>> Maybe it is possible to check whether the node was just fenced and skip
>>>> the redundant fencing?
>>>
>>> If the callbacks are being used correctly, it shouldn't be required

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org