On Fri, Jan 14, 2011 at 4:59 PM, Bob Haxo <bh...@sgi.com> wrote:
>
>> Were there (m)any logs containing the text "crm_abort" ...
> Sorry Andrew,
>
> Since I'm testing installations, all of the nodes in the cluster have
> been installed several times since I solved this issue, and the original
> log files are gone.
>
> I did not see "crm_abort" logged; otherwise I would have captured the
> messages in my notes.
>
> I searched my notes (to be certain), and I searched the history of all
> of the windows in which I had been tailing the messages files, without
> finding a single instance of the string "crm_abort". Some logging also
> goes to the head node of these HA clusters, but there is no "crm_abort"
> there either.

Very strange. If you ever see the symptoms again, please see if you can
figure out which processes opened the file descriptors and look for any
logging from them.

>
> Are there (by default) any logs other than in /var/log?

No, that should be it.

>
> Bob Haxo
>
>
> On Fri, 2011-01-14 at 13:50 +0100, Andrew Beekhof wrote:
>> On Thu, Jan 13, 2011 at 9:31 PM, Bob Haxo <bh...@sgi.com> wrote:
>> > Hi Tom (and Andrew),
>> >
>> > I figured out an easy fix for the problem that I encountered. However,
>> > there would seem to be a problem lurking in the code.
>>
>> Were there (m)any logs containing the text "crm_abort" from the PE in
>> your history (on the bad node)?
>> That's the only way I can imagine so many copies of that file being open.
>>
>> >
>> > Here is what I found. On one of the servers that was online and hosting
>> > resources:
>> >
>> > r2lead1:~ # netstat -a | grep crm
>> > Proto RefCnt Flags       Type       State         I-Node  Path
>> > unix  2      [ ACC ]     STREAM     LISTENING     18659   /var/run/crm/st_command
>> > unix  2      [ ACC ]     STREAM     LISTENING     18826   /var/run/crm/cib_rw
>> > unix  2      [ ACC ]     STREAM     LISTENING     19373   /var/run/crm/crmd
>> > unix  2      [ ACC ]     STREAM     LISTENING     18675   /var/run/crm/attrd
>> > unix  2      [ ACC ]     STREAM     LISTENING     18694   /var/run/crm/pengine
>> > unix  2      [ ACC ]     STREAM     LISTENING     18824   /var/run/crm/cib_callback
>> > unix  2      [ ACC ]     STREAM     LISTENING     18825   /var/run/crm/cib_ro
>> > unix  2      [ ACC ]     STREAM     LISTENING     18662   /var/run/crm/st_callback
>> > unix  3      [ ]         STREAM     CONNECTED     20659   /var/run/crm/cib_callback
>> > unix  3      [ ]         STREAM     CONNECTED     20656   /var/run/crm/cib_rw
>> > unix  3      [ ]         STREAM     CONNECTED     19952   /var/run/crm/attrd
>> > unix  3      [ ]         STREAM     CONNECTED     19944   /var/run/crm/st_callback
>> > unix  3      [ ]         STREAM     CONNECTED     19941   /var/run/crm/st_command
>> > unix  3      [ ]         STREAM     CONNECTED     19359   /var/run/crm/cib_callback
>> > unix  3      [ ]         STREAM     CONNECTED     19356   /var/run/crm/cib_rw
>> > unix  3      [ ]         STREAM     CONNECTED     19353   /var/run/crm/cib_callback
>> > unix  3      [ ]         STREAM     CONNECTED     19350   /var/run/crm/cib_rw
>> >
>> > On the node that was failing to join the HA cluster, this command
>> > returned nothing.
>> >
>> > However, on one of the functioning servers the above stream information
>> > was returned, but it also included an additional ** 941 ** instances of
>> > the following (with different I-Node numbers):
>> >
>> > unix  3      [ ]         STREAM     CONNECTED     1238243 /var/run/crm/pengine
>> > unix  3      [ ]         STREAM     CONNECTED     1237524 /var/run/crm/pengine
>> > unix  3      [ ]         STREAM     CONNECTED     1236698 /var/run/crm/pengine
>> > unix  3      [ ]         STREAM     CONNECTED     1235930 /var/run/crm/pengine
>> > unix  3      [ ]         STREAM     CONNECTED     1235094 /var/run/crm/pengine
>> >
>> > Here is how I corrected the situation:
>> >
>> > "service openais stop" on the system with the 941 pengine streams;
>> > "service openais restart" on the server that was failing to join the HA
>> > cluster.
>> >
>> > Results:
>> >
>> > The previously failing server joined the HA cluster and supports
>> > migration of resources to that server.
>> >
>> > Then "service openais start" on the server that had had the 941 pengine
>> > streams, and that too came online.
>> >
>> > Regards,
>> > Bob Haxo
>> >
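
(For anyone else chasing this: a minimal sketch of one way to count the
leaked pengine connections and trace them back to the processes holding
them open. The socket path and the example I-Node number are taken from
the netstat output quoted above; run it as root.)

    # count connections to the pengine socket
    netstat -a | grep -c '/var/run/crm/pengine'

    # take one I-Node from the netstat output (e.g. 1238243) and find the
    # process whose fd table references that socket inode
    for pid in /proc/[0-9]*; do
        ls -l "$pid/fd" 2>/dev/null | grep -q 'socket:\[1238243\]' && echo "$pid"
    done

The process reported there is the one keeping the connection open; if
lsof is installed, "lsof -p <pid>" then shows everything else that
process has open.
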
>> > On Thu, 2011-01-13 at 11:15 -0800, Bob Haxo wrote:
>> >> So, Tom ... how do you get the failed node online?
>> >>
>> >> I've re-installed with the same image that is running on three other
>> >> nodes, but it still fails. This node was quite happy for the past 3
>> >> months. As I'm testing installs, this and other nodes have been
>> >> installed a significant number of times without this sort of failure.
>> >> I'd whack the whole HA cluster ... except that I don't want to run into
>> >> this failure again without a better solution than "reinstall the
>> >> system" ;-)
>> >>
>> >> I'm looking at the information returned with corosync debug enabled.
>> >> After startup, everything looks fine to me until hitting this apparent
>> >> local IPC delivery failure:
>> >>
>> >> Jan 13 10:09:10 corosync [TOTEM ] Delivering 2 to 3
>> >> Jan 13 10:09:10 corosync [TOTEM ] Delivering MCAST message with seq 3 to
>> >> pending delivery queue
>> >> Jan 13 10:09:10 corosync [pcmk ] WARN: route_ais_message: Sending
>> >> message to local.crmd failed: ipc delivery failed (rc=-2)
>> >> Jan 13 10:09:10 corosync [pcmk ] Msg[6486] (dest=local:crmd,
>> >> from=r1lead1:crmd.11229, remote=true, size=181): <create_request_adv
>> >> origin="post_cache_update" t="crmd" version="3.0.2" subt="request" ref
>> >> Jan 13 10:09:10 corosync [TOTEM ] mcasted message added to pending queue
>> >>
>> >> Guess that I'll have to renew my acquaintance with IPC.
>> >>
>> >> Bob Haxo
>> >>
>> >>
>> >> On Thu, 2011-01-13 at 19:17 +0100, Tom Tux wrote:
>> >> > I don't know. I still have this issue (and it seems that I'm not the
>> >> > only one...). I'll have a look to see whether there are Pacemaker
>> >> > updates available through the zypper update channel (SLES 11 SP1).
>> >> >
>> >> > Regards,
>> >> > Tom
>> >> >
>> >> >
>> >> > 2011/1/13 Bob Haxo <bh...@sgi.com>:
>> >> > > Tom, others,
>> >> > >
>> >> > > Please, what was the solution to this issue?
>> >> > >
>> >> > > Thanks,
>> >> > > Bob Haxo
>> >> > >
>> >> > > On Mon, 2010-09-06 at 09:50 +0200, Tom Tux wrote:
>> >> > >
>> >> > > Yes, corosync is running after the reboot. It comes up with the
>> >> > > regular init procedure (runlevel 3 in my case).
>> >> > >
>> >> > > 2010/9/6 Andrew Beekhof <and...@beekhof.net>:
>> >> > >> On Mon, Sep 6, 2010 at 7:57 AM, Tom Tux <tomtu...@gmail.com> wrote:
>> >> > >>> No, I don't have such failed messages. In my case, the "Connection
>> >> > >>> to our AIS plugin" was established.
>> >> > >>>
>> >> > >>> The /dev/shm is also not full.
>> >> > >>
>> >> > >> Is corosync running?
>> >> > >>
>> >> > >>> Kind regards,
>> >> > >>> Tom
>> >> > >>>
>> >> > >>> 2010/9/3 Michael Smith <msm...@cbnco.com>:
>> >> > >>>> Tom Tux wrote:
>> >> > >>>>
>> >> > >>>>> If I remove one cluster node (node01) from the cluster for
>> >> > >>>>> maintenance purposes (/etc/init.d/openais stop) and reboot this
>> >> > >>>>> node, then it will not rejoin the cluster automatically. After
>> >> > >>>>> the reboot, I have the following error and warning messages in
>> >> > >>>>> the log:
>> >> > >>>>>
>> >> > >>>>> Sep 3 07:34:15 node01 mgmtd: [9202]: info: login to cib failed: live
>> >> > >>>>
>> >> > >>>> Do you have messages like this, too?
>> >> > >>>>
>> >> > >>>> Aug 30 15:48:10 xen-test1 corosync[5851]: [IPC ] Invalid IPC
>> >> > >>>> credentials.
>> >> > >>>> Aug 30 15:48:10 xen-test1 cib: [5858]: info: init_ais_connection:
>> >> > >>>> Connection to our AIS plugin (9) failed: unknown (100)
>> >> > >>>>
>> >> > >>>> Aug 30 15:48:10 xen-test1 cib: [5858]: CRIT: cib_init: Cannot sign
>> >> > >>>> in to the cluster... terminating
>> >> > >>>>
>> >> > >>>> http://news.gmane.org/find-root.php?message_id=%3c4C7C0EC7.2050708%40cbnco.com%3e
>> >> > >>>>
>> >> > >>>> Mike
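
(The suggestions traded back and forth above amount to a few quick
checks; a sketch, assuming syslog is written to /var/log/messages as on
SLES:)

    # is corosync actually running?
    ps -ef | grep '[c]orosync'

    # is /dev/shm full?  (one of the questions raised earlier in the thread)
    df -h /dev/shm

    # do the logs contain any of the errors mentioned in this thread?
    grep -E 'crm_abort|Invalid IPC credentials|ipc delivery failed' /var/log/messages
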

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs:
http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker