On Fri, Jan 14, 2011 at 4:59 PM, Bob Haxo <bh...@sgi.com> wrote:
>
>> Were there (m)any logs containing the text "crm_abort" ...
> Sorry Andrew,
>
> Since I'm testing installations, all of the nodes in the cluster have
> been installed several times since I solved this issue, and the original
> log files are gone.
>
> I did not see "crm_abort" logged; otherwise I would have captured the
> messages in my notes.
>
> I searched my notes (to be certain), and I searched the history of all
> of the windows in which I had been tailing the messages files, without
> finding a single instance of the string "crm_abort". Some logging also
> goes to the head node of these HA clusters, but there is no "crm_abort"
> there either.

Very strange. If you ever see the symptoms again, please see if you can
figure out which processes opened the file descriptors and look for any
logging from them.

>
> Are there (by default) any logs other than in /var/log?

No, that should be it.

>
> Bob Haxo
>
>
> On Fri, 2011-01-14 at 13:50 +0100, Andrew Beekhof wrote:
>> On Thu, Jan 13, 2011 at 9:31 PM, Bob Haxo <bh...@sgi.com> wrote:
>> > Hi Tom (and Andrew),
>> >
>> > I figured out an easy fix for the problem that I encountered. However,
>> > there would seem to be a problem lurking in the code.
>>
>> Were there (m)any logs containing the text "crm_abort" from the PE in
>> your history (on the bad node)?
>> That's the only way I can imagine so many copies of that file being open.
>>
>> >
>> > Here is what I found. On one of the servers that was online and hosting
>> > resources:
>> >
>> > r2lead1:~ # netstat -a | grep crm
>> > Proto RefCnt Flags       Type       State         I-Node  Path
>> > unix  2      [ ACC ]     STREAM     LISTENING     18659   /var/run/crm/st_command
>> > unix  2      [ ACC ]     STREAM     LISTENING     18826   /var/run/crm/cib_rw
>> > unix  2      [ ACC ]     STREAM     LISTENING     19373   /var/run/crm/crmd
>> > unix  2      [ ACC ]     STREAM     LISTENING     18675   /var/run/crm/attrd
>> > unix  2      [ ACC ]     STREAM     LISTENING     18694   /var/run/crm/pengine
>> > unix  2      [ ACC ]     STREAM     LISTENING     18824   /var/run/crm/cib_callback
>> > unix  2      [ ACC ]     STREAM     LISTENING     18825   /var/run/crm/cib_ro
>> > unix  2      [ ACC ]     STREAM     LISTENING     18662   /var/run/crm/st_callback
>> > unix  3      [ ]         STREAM     CONNECTED     20659   /var/run/crm/cib_callback
>> > unix  3      [ ]         STREAM     CONNECTED     20656   /var/run/crm/cib_rw
>> > unix  3      [ ]         STREAM     CONNECTED     19952   /var/run/crm/attrd
>> > unix  3      [ ]         STREAM     CONNECTED     19944   /var/run/crm/st_callback
>> > unix  3      [ ]         STREAM     CONNECTED     19941   /var/run/crm/st_command
>> > unix  3      [ ]         STREAM     CONNECTED     19359   /var/run/crm/cib_callback
>> > unix  3      [ ]         STREAM     CONNECTED     19356   /var/run/crm/cib_rw
>> > unix  3      [ ]         STREAM     CONNECTED     19353   /var/run/crm/cib_callback
>> > unix  3      [ ]         STREAM     CONNECTED     19350   /var/run/crm/cib_rw
>> >
>> > On the node that was failing to join the HA cluster, this command
>> > returned nothing.
>> >
>> > However, on one of the functioning servers the above stream information
>> > was returned, but it also included an additional ** 941 ** instances of
>> > the following (with different I-Node numbers):
>> >
>> > unix  3      [ ]         STREAM     CONNECTED     1238243 /var/run/crm/pengine
>> > unix  3      [ ]         STREAM     CONNECTED     1237524 /var/run/crm/pengine
>> > unix  3      [ ]         STREAM     CONNECTED     1236698 /var/run/crm/pengine
>> > unix  3      [ ]         STREAM     CONNECTED     1235930 /var/run/crm/pengine
>> > unix  3      [ ]         STREAM     CONNECTED     1235094 /var/run/crm/pengine
>> >
>> > Here is how I corrected the situation:
>> >
>> > "service openais stop" on the system with the 941 pengine streams;
>> > "service openais restart" on the server that was failing to join the HA
>> > cluster.
>> >
>> > Results:
>> >
>> > The previously failing server joined the HA cluster and supports
>> > migration of resources to that server.
>> >
>> > Then "service openais start" on the server that had had the 941 pengine
>> > streams, and that too came online.
>> >
>> > Regards,
>> > Bob Haxo
>> >
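
(For anyone else chasing this: a minimal sketch of one way to count the
leaked pengine connections and trace them back to the processes holding
them open. The socket path and the example I-Node number are taken from
the netstat output quoted above; run it as root.)

    # count connections to the pengine socket
    netstat -a | grep -c '/var/run/crm/pengine'

    # take one I-Node from the netstat output (e.g. 1238243) and find the
    # process whose fd table references that socket inode
    for pid in /proc/[0-9]*; do
        ls -l "$pid/fd" 2>/dev/null | grep -q 'socket:\[1238243\]' && echo "$pid"
    done

The process reported there is the one keeping the connection open; if
lsof is installed, "lsof -p <pid>" then shows everything else that
process has open.
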
>> > On Thu, 2011-01-13 at 11:15 -0800, Bob Haxo wrote:
>> >> So, Tom ... how do you get the failed node online?
>> >>
>> >> I've re-installed with the same image that is running on three other
>> >> nodes, but it still fails. This node was quite happy for the past 3
>> >> months. As I'm testing installs, this and other nodes have been
>> >> installed a significant number of times without this sort of failure.
>> >> I'd whack the whole HA cluster ... except that I don't want to run into
>> >> this failure again without a better solution than "reinstall the
>> >> system" ;-)
>> >>
>> >> I'm looking at the information returned with corosync debug enabled.
>> >> After startup, everything looks fine to me until hitting this apparent
>> >> local IPC delivery failure:
>> >>
>> >> Jan 13 10:09:10 corosync [TOTEM ] Delivering 2 to 3
>> >> Jan 13 10:09:10 corosync [TOTEM ] Delivering MCAST message with seq 3 to
>> >> pending delivery queue
>> >> Jan 13 10:09:10 corosync [pcmk ] WARN: route_ais_message: Sending
>> >> message to local.crmd failed: ipc delivery failed (rc=-2)
>> >> Jan 13 10:09:10 corosync [pcmk ] Msg[6486] (dest=local:crmd,
>> >> from=r1lead1:crmd.11229, remote=true, size=181): <create_request_adv
>> >> origin="post_cache_update" t="crmd" version="3.0.2" subt="request" ref
>> >> Jan 13 10:09:10 corosync [TOTEM ] mcasted message added to pending queue
>> >>
>> >> Guess that I'll have to renew my acquaintance with IPC.
>> >>
>> >> Bob Haxo
>> >>
>> >>
>> >> On Thu, 2011-01-13 at 19:17 +0100, Tom Tux wrote:
>> >> > I don't know. I still have this issue (and it seems that I'm not the
>> >> > only one...). I'll have a look to see whether there are Pacemaker
>> >> > updates available through the zypper update channel (SLES 11 SP1).
>> >> >
>> >> > Regards,
>> >> > Tom
>> >> >
>> >> >
>> >> > 2011/1/13 Bob Haxo <bh...@sgi.com>:
>> >> > > Tom, others,
>> >> > >
>> >> > > Please, what was the solution to this issue?
>> >> > >
>> >> > > Thanks,
>> >> > > Bob Haxo
>> >> > >
>> >> > > On Mon, 2010-09-06 at 09:50 +0200, Tom Tux wrote:
>> >> > >
>> >> > > Yes, corosync is running after the reboot. It comes up with the
>> >> > > regular init procedure (runlevel 3 in my case).
>> >> > >
>> >> > > 2010/9/6 Andrew Beekhof <and...@beekhof.net>:
>> >> > >> On Mon, Sep 6, 2010 at 7:57 AM, Tom Tux <tomtu...@gmail.com> wrote:
>> >> > >>> No, I don't have such failed messages. In my case, the "Connection
>> >> > >>> to our AIS plugin" was established.
>> >> > >>>
>> >> > >>> The /dev/shm is also not full.
>> >> > >>
>> >> > >> Is corosync running?
>> >> > >>
>> >> > >>> Kind regards,
>> >> > >>> Tom
>> >> > >>>
>> >> > >>> 2010/9/3 Michael Smith <msm...@cbnco.com>:
>> >> > >>>> Tom Tux wrote:
>> >> > >>>>
>> >> > >>>>> If I remove one cluster node (node01) from the cluster for
>> >> > >>>>> maintenance purposes (/etc/init.d/openais stop) and reboot this
>> >> > >>>>> node, then it will not rejoin the cluster automatically. After
>> >> > >>>>> the reboot, I have the following error and warning messages in
>> >> > >>>>> the log:
>> >> > >>>>>
>> >> > >>>>> Sep 3 07:34:15 node01 mgmtd: [9202]: info: login to cib failed: live
>> >> > >>>>
>> >> > >>>> Do you have messages like this, too?
>> >> > >>>>
>> >> > >>>> Aug 30 15:48:10 xen-test1 corosync[5851]: [IPC ] Invalid IPC
>> >> > >>>> credentials.
>> >> > >>>> Aug 30 15:48:10 xen-test1 cib: [5858]: info: init_ais_connection:
>> >> > >>>> Connection to our AIS plugin (9) failed: unknown (100)
>> >> > >>>>
>> >> > >>>> Aug 30 15:48:10 xen-test1 cib: [5858]: CRIT: cib_init: Cannot sign
>> >> > >>>> in to the cluster... terminating
>> >> > >>>>
>> >> > >>>> http://news.gmane.org/find-root.php?message_id=%3c4C7C0EC7.2050708%40cbnco.com%3e
>> >> > >>>>
>> >> > >>>> Mike
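
(The suggestions traded back and forth above amount to a few quick
checks; a sketch, assuming syslog is written to /var/log/messages as on
SLES:)

    # is corosync actually running?
    ps -ef | grep '[c]orosync'

    # is /dev/shm full?  (one of the questions raised earlier in the thread)
    df -h /dev/shm

    # do the logs contain any of the errors mentioned in this thread?
    grep -E 'crm_abort|Invalid IPC credentials|ipc delivery failed' /var/log/messages
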

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs:
http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker