SOLVED! [more below]
On 01/08/14 06:30, Andrew Beekhof wrote:
> On 1 Aug 2014, at 2:04 pm, Andrew Beekhof <and...@beekhof.net> wrote:
>
>> On 1 Aug 2014, at 7:47 am, Andrew Beekhof <and...@beekhof.net> wrote:
>>
>>> On 31 Jul 2014, at 4:46 pm, Cédric Dufour - Idiap Research Institute
>>> <cedric.duf...@idiap.ch> wrote:
>>>
>>>> On 31/07/14 00:17, Andrew Beekhof wrote:
>>>>> On 31 Jul 2014, at 2:48 am, Cédric Dufour - Idiap Research Institute
>>>>> <cedric.duf...@idiap.ch> wrote:
>>>>>
>>>>>> After packaging pacemaker 1.1.12 for Debian/Wheezy (along corosync 1.4.6
>>>>>> and libqb 0.17.0), I have successfully initialized a new cluster.
>>>>>>
>>>>>> Back to a very simple test cluster, the only problem I have is with
>>>>>> fencing, which fails altogether with "route_ais_message: Sending message
>>>>>> to local.stonith-ng failed: ipc delivery failed (rc=-2)" messages:
>>>>>>
>>>>>> root@bc1hs22a01:~ # tail /var/log/corosync.rsyslog
>>>>>> Jul 30 18:41:41 bc1hs22a01 stonith_admin[5411]: notice: crm_log_args:
>>>>>> Invoked: stonith_admin -F bc1hs22a02
>>>>>> Jul 30 18:41:41 bc1hs22a01 stonithd[4754]: notice: handle_request:
>>>>>> Client stonith_admin.5411.fe1388ed wants to fence (off) 'bc1hs22a02'
>>>>>> with device '(any)'
>>>>>> Jul 30 18:41:41 bc1hs22a01 stonithd[4754]: notice:
>>>>>> initiate_remote_stonith_op: Initiating remote operation off for
>>>>>> bc1hs22a02: 48b69f82-29ad-4c9a-af57-0e60ae5242e4 (0)
>>>>>> Jul 30 18:41:41 bc1hs22a01 corosync[4686]: [pcmk ] WARN:
>>>>>> route_ais_message: Sending message to local.stonith-ng failed: ipc
>>>>>> delivery failed (rc=-2)
>>>>> rc=-2 is coming from send_client_ipc(void *conn, const AIS_Message *
>>>>> ais_msg)
>>>>>
>>>>> specifically:
>>>>>
>>>>>     if (conn == NULL) {
>>>>>         rc = -2;
>>>>>
>>>>> So the plugin thinks that stonith-ng isn't connected.
>>>>> More logs?
>>>>>
>>>> I have completed a full restart of the cluster in order to provide the
>>>> logs at each step; see attached log files:
>>>> (from node_1/DC)
>>>> - node_1-corosync-start.log
>>>> - node_1-pacemaker-start.log
>>>> - node_1-corosync-node_2_join.log
>>>> - node_1-pacemaker-node_2_join.log
>>>> (from node_2)
>>>> - node_2-corosync-start.log
>>>> - node_2-pacemaker-start.log
>>>>
>>>> The problem manifests itself already in DC start log - because of previous
>>>> fencing attempt - at 08:19:21 and 08:19:42:
>>>>
>>>> root@bc1hs22a01:~ # fgrep 'ipc delivery failed' node_1-corosync-start.log
>>>> Jul 31 08:19:21 bc1hs22a01 corosync[31057]: [pcmk ] WARN:
>>>> route_ais_message: Sending message to local.stonith-ng failed: ipc
>>>> delivery failed (rc=-2)
>>>> Jul 31 08:19:42 bc1hs22a01 corosync[31057]: [pcmk ] WARN:
>>>> route_ais_message: Sending message to local.stonith-ng failed: ipc
>>>> delivery failed (rc=-2)
>>>>
>>>> While it would seem (to me) that the stonith plugin successfully connected
>>>> to the CIB:
>>> It's not the CIB that's the issue:
>>>
>>>>>> Jul 30 18:41:41 bc1hs22a01 corosync[4686]: [pcmk ] WARN:
>>>>>> route_ais_message: Sending message to local.stonith-ng failed: ipc
>>>>>> delivery failed (rc=-2)
>>> That's the pacemaker plugin inside corosync (which uses a completely
>>> different IPC mechanism).
>> It looks like there is a name mismatch:
>>
>> Jul 31 08:19:20 bc1hs22a01 corosync[31057]: [pcmk ] info: pcmk_ipc:
>> Recorded connection 0x2543e30 for stonithd/0
>> Jul 31 08:19:20 bc1hs22a01 corosync[31057]: [pcmk ] debug:
>> process_ais_message: Msg[1] (dest=local:ais, from=bc1hs22a01:stonithd.31092,
>> remote=true, size=6): 31092
>> ...
>> Jul 31 08:19:21 bc1hs22a01 corosync[31057]: [pcmk ] WARN:
>> route_ais_message: Sending message to local.stonith-ng failed: ipc delivery
>> failed (rc=-2)
>> Jul 31 08:19:42 bc1hs22a01 corosync[31057]: [pcmk ] WARN:
>> route_ais_message: Sending message to local.stonith-ng failed: ipc delivery
>> failed (rc=-2)
>>
>> Could you try the following patch?
> Actually, try this one instead:
> https://github.com/beekhof/pacemaker/commit/21830a0

This one-line patch did it:

Aug 1 09:48:26 bc1hs22a01 corosync[15681]: [pcmk ] info: pcmk_ipc: Recorded connection 0x1a926c0 for stonith-ng/0
Aug 1 09:48:26 bc1hs22a01 corosync[15681]: [pcmk ] info: pcmk_ipc: Sending membership update 120 to stonith-ng

And the (previously attempted/recorded) fencing command worked as soon as the DC started.

Thank you very much for your quick response!
(I can now enjoy Swiss National Day with total peace of mind :-) )

PS: I'll carry out further cluster/fencing tests next week (should you want a more thorough confirmation before pushing your patch to master).

>>> FWIW, the plugin is extremely deprecated, you're encouraged to use
>>> pacemaker+cman or begin working towards corosync2 + pacemakerd.

I'll keep this in mind (but it is not so easy to achieve when one does not want to stray too far from Debian "stable").

Best and thanks again,

Cédric

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
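
For readers who land on this thread with the same "ipc delivery failed (rc=-2)" warning: the name mismatch discussed above (the daemon registering as "stonithd" while messages are routed to "stonith-ng") can be spotted mechanically in the corosync log. The following is only a rough sketch and is not part of the patch or of pacemaker itself. The function name and both regular expressions are hypothetical, modelled solely on the log lines quoted in this thread, and it assumes each log entry sits on a single line (as in the log file, not as wrapped by the mail client).

```python
import re

# Assumption: patterns are derived only from the log lines quoted in this
# thread; other pacemaker/corosync versions may word these messages differently.
RECORDED = re.compile(r"pcmk_ipc:\s+Recorded connection \S+ for ([\w.-]+)/\d+")
FAILED = re.compile(r"route_ais_message:\s+Sending message to local\.([\w.-]+) failed")


def find_unregistered_destinations(lines):
    """Return, sorted, every name that a failed local delivery was addressed
    to but that no IPC client was ever recorded under (hypothetical helper)."""
    recorded, failed = set(), set()
    for line in lines:
        match = RECORDED.search(line)
        if match:
            recorded.add(match.group(1))
        match = FAILED.search(line)
        if match:
            failed.add(match.group(1))
    return sorted(failed - recorded)


# Two entries from the DC start log quoted above, unwrapped to one line each.
sample = [
    "Jul 31 08:19:20 bc1hs22a01 corosync[31057]: [pcmk ] info: pcmk_ipc: "
    "Recorded connection 0x2543e30 for stonithd/0",
    "Jul 31 08:19:21 bc1hs22a01 corosync[31057]: [pcmk ] WARN: route_ais_message: "
    "Sending message to local.stonith-ng failed: ipc delivery failed (rc=-2)",
]

if __name__ == "__main__":
    print(find_unregistered_destinations(sample))
```

On a real node one would pass the lines of the log file instead of `sample`, e.g. `find_unregistered_destinations(open("node_1-corosync-start.log"))`. A non-empty result only hints at a registration/routing name mismatch; it does not prove one, since a client may simply not have connected yet.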