On 27/08/2010 16:29, Andrew Beekhof wrote:
On Tue, Aug 3, 2010 at 4:40 PM, Guillaume Chanaud
<guillaume.chan...@connecting-nature.com>  wrote:
Hello,
sorry for the delay; July is not the best month to get things
done quickly.
Neither is August :-)

lol sure :)
Here is the core dump file (55MB) :
http://www.connecting-nature.com/corosync/core
The corosync version is 1.2.3
Sorry, but I can't do anything with that file.
Core files are only usable on the machine they came from.

you'll have to open it with gdb and type "bt" to get a backtrace.
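Something like this, on the machine that produced the core (the binary and core paths below are guesses for a typical install, adjust them for yours):

    gdb /usr/sbin/corosync /path/to/core
    (gdb) bt                    # backtrace of the crashing thread
    (gdb) thread apply all bt   # backtraces of every thread, often more useful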
Sorry, I saw that after sending my last mail. In fact I tried to debug/backtrace it, but:
1. I'm not a C developer (I understand a little of it...)
2. I've never used gdb before, so it's hard to step through the corosync code

I'm not sure the trace will be useful, but here it is:
Core was generated by `corosync'.
Program terminated with signal 6, Aborted.
#0 0x0000003506a329a5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64      return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0 0x0000003506a329a5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003506a34185 in abort () at abort.c:92
#2 0x0000003506a2b935 in __assert_fail (assertion=0x7fce14f0b2ae "token_memb_entries >= 1", file=<value optimized out>, line=1194,
    function=<value optimized out>) at assert.c:81
#3 0x00007fce14efb716 in memb_consensus_agreed (instance=0x7fce12338010) at totemsrp.c:1194
#4 0x00007fce14f01723 in memb_join_process (instance=0x7fce12338010, memb_join=0x822bf8) at totemsrp.c:3922
#5 0x00007fce14f01a3a in message_handler_memb_join (instance=0x7fce12338010, msg=<value optimized out>, msg_len=<value optimized out>,
    endian_conversion_needed=<value optimized out>) at totemsrp.c:4165
#6 0x00007fce14ef7644 in rrp_deliver_fn (context=<value optimized out>, msg=0x822bf8, msg_len=420) at totemrrp.c:1404
#7 0x00007fce14ef6569 in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>, data=0x822550)
    at totemudp.c:1244
#8 0x00007fce14ef259a in poll_run (handle=2240235047305084928) at coropoll.c:435
#9 0x0000000000405594 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:1558

I tried to compile it from source (the 1.2.7 tag and svn trunk), but I'm unable to backtrace it because gdb tells me it can't find the debug info. I did a ./configure --enable-debug, but gdb seems to need a /usr/lib/debug/.build-id/... entry matching the current executable, and I don't know how to generate this. On the 1.2.7 version, the init script reports that it started correctly, but after one or two seconds only the lrmd and pengine processes are still alive.
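This is roughly what I ran, in case it matters (whether exec/corosync is the right path to the in-tree binary is a guess on my part):

    ./configure --enable-debug CFLAGS="-g -O0"   # keep symbols, disable optimization
    make
    gdb ./exec/corosync    # debug the unstripped in-tree binary,
                           # not the packaged /usr/sbin/corosync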

On the trunk version, the init script fails to start (and so the processes are correctly killed).

In 1.2.7, when I'm stepping, I'm unable to go further than
service.c:201        res = service->exec_init_fn (corosync_api);
as I think it creates a new process for the pacemaker services
(I don't know how to step inside this new process and debug it)
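From the gdb manual it looks like fork following might be what I need, something like the below, but I'm not sure it applies here:

    (gdb) set follow-fork-mode child   # switch to the child process after fork()
    (gdb) set detach-on-fork off       # keep debugging the parent as well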

If you need/want, I'll give you SSH access to this VM so you can test/debug it.

It's probably related to the other posts about "Could not connect to the CIB service: connection failed" (I saw some messages describing problems more or less like mine).

I've put the end of the messages log here:
Aug 30 16:30:50 www01 crmd: [19821]: notice: ais_dispatch: Membership 208656: quorum acquired
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node www01.connecting-nature.com: id=1006676160 state=member (new) addr=r(0) ip(192.168.0.60) (
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node <null> now has id: 83929280
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node (null): id=83929280 state=member (new) addr=r(0) ip(192.168.0.5) votes=0 born=0 seen=20865
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node filer2.connecting-nature.com now has id: 100706496
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node 100706496 is now known as filer2.connecting-nature.com
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node filer2.connecting-nature.com: id=100706496 state=member (new) addr=r(0) ip(192.168.0.6) vo
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node <null> now has id: 1174448320
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node (null): id=1174448320 state=member (new) addr=r(0) ip(192.168.0.70) votes=0 born=0 seen=20
Aug 30 16:30:50 www01 crmd: [19821]: info: do_started: The local CRM is operational
Aug 30 16:30:50 www01 crmd: [19821]: info: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_st
Aug 30 16:30:50 www01 corosync[19809]:   [TOTEM ] FAILED TO RECEIVE
Aug 30 16:30:51 www01 crmd: [19821]: info: ais_dispatch: Membership 208656: quorum retained
Aug 30 16:30:51 www01 crmd: [19821]: info: te_connect_stonith: Attempting connection to fencing daemon...
Aug 30 16:30:52 www01 crmd: [19821]: info: te_connect_stonith: Connected
Aug 30 16:30:52 www01 cib: [19817]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
Aug 30 16:30:52 www01 cib: [19817]: ERROR: ais_dispatch: AIS connection failed
Aug 30 16:30:52 www01 stonith-ng: [19816]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
Aug 30 16:30:52 www01 cib: [19817]: ERROR: cib_ais_destroy: AIS connection terminated
Aug 30 16:30:52 www01 stonith-ng: [19816]: ERROR: ais_dispatch: AIS connection failed
Aug 30 16:30:52 www01 stonith-ng: [19816]: ERROR: stonith_peer_ais_destroy: AIS connection terminated
Aug 30 16:30:52 www01 attrd: [19819]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Invalid argument (22)
Aug 30 16:30:52 www01 attrd: [19819]: ERROR: ais_dispatch: AIS connection failed
Aug 30 16:30:52 www01 attrd: [19819]: CRIT: attrd_ais_destroy: Lost connection to OpenAIS service!
Aug 30 16:30:52 www01 attrd: [19819]: info: main: Exiting...
Aug 30 16:30:52 www01 crmd: [19821]: info: cib_native_msgready: Lost connection to the CIB service [19817].
Aug 30 16:30:52 www01 crmd: [19821]: CRIT: cib_native_dispatch: Lost connection to the CIB service [19817/callback].
Aug 30 16:30:52 www01 crmd: [19821]: CRIT: cib_native_dispatch: Lost connection to the CIB service [19817/command].
Aug 30 16:30:52 www01 crmd: [19821]: ERROR: crmd_cib_connection_destroy: Connection to the CIB terminated...
Aug 30 16:30:52 www01 crmd: [19821]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Invalid argument (22)
Aug 30 16:30:52 www01 crmd: [19821]: ERROR: ais_dispatch: AIS connection failed
Aug 30 16:30:52 www01 crmd: [19821]: ERROR: crm_ais_destroy: AIS connection terminated

The strange thing is that crmd finds the hostname for filer2.connecting-nature.com (which is the DC), but sets it to <null> for all the other cluster nodes.

Thanks !
Guillaume


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
