On 27/08/2010 16:29, Andrew Beekhof wrote:
On Tue, Aug 3, 2010 at 4:40 PM, Guillaume Chanaud
<guillaume.chan...@connecting-nature.com>  wrote:
Hello,
sorry for the delay; July is not the best month to get things
done quickly.
Neither is August :-)

lol sure :)
Here is the core dump file (55MB) :
http://www.connecting-nature.com/corosync/core
The corosync version is 1.2.3
Sorry, but I can't do anything with that file.
Core files are only usable on the machine they came from.

you'll have to open it with gdb and type "bt" to get a backtrace.
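Something like this, on the machine that produced the core (the binary and core paths below are guesses for a typical install, adjust them for yours):

    gdb /usr/sbin/corosync /path/to/core
    (gdb) bt                    # backtrace of the crashing thread
    (gdb) thread apply all bt   # backtraces of every thread, often more useful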
Sorry, I saw that after sending my last mail. In fact I tried to debug/backtrace it, but:
1. I'm not a C developer (I understand a little of it...)
2. I've never used gdb before, so it's hard to step through the corosync code

I'm not sure the trace will be useful, but here it is:
Core was generated by `corosync'.
Program terminated with signal 6, Aborted.
#0 0x0000003506a329a5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64      return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0 0x0000003506a329a5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003506a34185 in abort () at abort.c:92
#2 0x0000003506a2b935 in __assert_fail (assertion=0x7fce14f0b2ae "token_memb_entries >= 1", file=<value optimized out>, line=1194,
    function=<value optimized out>) at assert.c:81
#3 0x00007fce14efb716 in memb_consensus_agreed (instance=0x7fce12338010) at totemsrp.c:1194
#4 0x00007fce14f01723 in memb_join_process (instance=0x7fce12338010, memb_join=0x822bf8) at totemsrp.c:3922
#5 0x00007fce14f01a3a in message_handler_memb_join (instance=0x7fce12338010, msg=<value optimized out>, msg_len=<value optimized out>,
    endian_conversion_needed=<value optimized out>) at totemsrp.c:4165
#6 0x00007fce14ef7644 in rrp_deliver_fn (context=<value optimized out>, msg=0x822bf8, msg_len=420) at totemrrp.c:1404
#7 0x00007fce14ef6569 in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>, data=0x822550)
    at totemudp.c:1244
#8 0x00007fce14ef259a in poll_run (handle=2240235047305084928) at coropoll.c:435
#9 0x0000000000405594 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:1558

I tried to compile it from source (the 1.2.7 tag and svn trunk), but I'm unable to backtrace it because gdb tells me it can't find the debug info. I did a ./configure --enable-debug, but gdb seems to need a /usr/lib/debug/.build-id/... entry matching the current executable, and I don't know how to generate this. On the 1.2.7 version, the init script reports that it started correctly, but after one or two seconds only the lrmd and pengine processes are still alive.
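This is roughly what I ran, in case it matters (whether exec/corosync is the right path to the in-tree binary is a guess on my part):

    ./configure --enable-debug CFLAGS="-g -O0"   # keep symbols, disable optimization
    make
    gdb ./exec/corosync    # debug the unstripped in-tree binary,
                           # not the packaged /usr/sbin/corosync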

On the trunk version, the init script fails to start (and so the processes are correctly killed).

In 1.2.7, when I'm stepping, I'm unable to go further than
service.c:201        res = service->exec_init_fn (corosync_api);
as I think it creates a new process for the pacemaker services
(I don't know how to step inside this new process and debug it)
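From the gdb manual it looks like fork following might be what I need, something like the below, but I'm not sure it applies here:

    (gdb) set follow-fork-mode child   # switch to the child process after fork()
    (gdb) set detach-on-fork off       # keep debugging the parent as well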

If you need/want, I'll give you SSH access to this VM so you can test/debug it.

It's probably related to the other posts about "Could not connect to the CIB service: connection failed" (I saw some messages describing problems more or less like mine).

I've put the end of the messages log here:
Aug 30 16:30:50 www01 crmd: [19821]: notice: ais_dispatch: Membership 208656: quorum acquired
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node www01.connecting-nature.com: id=1006676160 state=member (new) addr=r(0) ip(192.168.0.60) (
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node <null> now has id: 83929280
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node (null): id=83929280 state=member (new) addr=r(0) ip(192.168.0.5) votes=0 born=0 seen=20865
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node filer2.connecting-nature.com now has id: 100706496
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node 100706496 is now known as filer2.connecting-nature.com
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node filer2.connecting-nature.com: id=100706496 state=member (new) addr=r(0) ip(192.168.0.6) vo
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node <null> now has id: 1174448320
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node (null): id=1174448320 state=member (new) addr=r(0) ip(192.168.0.70) votes=0 born=0 seen=20
Aug 30 16:30:50 www01 crmd: [19821]: info: do_started: The local CRM is operational
Aug 30 16:30:50 www01 crmd: [19821]: info: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_st
Aug 30 16:30:50 www01 corosync[19809]:   [TOTEM ] FAILED TO RECEIVE
Aug 30 16:30:51 www01 crmd: [19821]: info: ais_dispatch: Membership 208656: quorum retained
Aug 30 16:30:51 www01 crmd: [19821]: info: te_connect_stonith: Attempting connection to fencing daemon...
Aug 30 16:30:52 www01 crmd: [19821]: info: te_connect_stonith: Connected
Aug 30 16:30:52 www01 cib: [19817]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
Aug 30 16:30:52 www01 cib: [19817]: ERROR: ais_dispatch: AIS connection failed
Aug 30 16:30:52 www01 stonith-ng: [19816]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
Aug 30 16:30:52 www01 cib: [19817]: ERROR: cib_ais_destroy: AIS connection terminated
Aug 30 16:30:52 www01 stonith-ng: [19816]: ERROR: ais_dispatch: AIS connection failed
Aug 30 16:30:52 www01 stonith-ng: [19816]: ERROR: stonith_peer_ais_destroy: AIS connection terminated
Aug 30 16:30:52 www01 attrd: [19819]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Invalid argument (22)
Aug 30 16:30:52 www01 attrd: [19819]: ERROR: ais_dispatch: AIS connection failed
Aug 30 16:30:52 www01 attrd: [19819]: CRIT: attrd_ais_destroy: Lost connection to OpenAIS service!
Aug 30 16:30:52 www01 attrd: [19819]: info: main: Exiting...
Aug 30 16:30:52 www01 crmd: [19821]: info: cib_native_msgready: Lost connection to the CIB service [19817].
Aug 30 16:30:52 www01 crmd: [19821]: CRIT: cib_native_dispatch: Lost connection to the CIB service [19817/callback].
Aug 30 16:30:52 www01 crmd: [19821]: CRIT: cib_native_dispatch: Lost connection to the CIB service [19817/command].
Aug 30 16:30:52 www01 crmd: [19821]: ERROR: crmd_cib_connection_destroy: Connection to the CIB terminated...
Aug 30 16:30:52 www01 crmd: [19821]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Invalid argument (22)
Aug 30 16:30:52 www01 crmd: [19821]: ERROR: ais_dispatch: AIS connection failed
Aug 30 16:30:52 www01 crmd: [19821]: ERROR: crm_ais_destroy: AIS connection terminated

The strange thing is that crmd finds the hostname for filer2.connecting-nature.com (which is the DC), but sets it to <null> for all the other cluster nodes.

Thanks !
Guillaume


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
