Andrew,

thanks for the valgrind report (even though it didn't show anything useful) and the blackbox.
We believe the problem is caused by access to invalid memory mapped by an mmap operation. There are basically three places where we do mmap:

1.) corosync cpg_zcb functions (I don't believe this is the case)
2.) LibQB IPC
3.) LibQB blackbox

Now, because neither Angus nor I can reproduce the bug, can you please:
- apply the patches "Check successful initialization of IPC" and "Add support for selecting IPC type" (the later versions), or use corosync from git (either the needle or master branch, they are the same)
- compile corosync
- add qb { ipc_type: socket } to corosync.conf (see the sketch below)
- try running corosync

This may or may not solve the problem, but it should help us diagnose whether or not the problem is in the IPC layer.

Thanks,
  Honza
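The corosync.conf addition from the checklist above would look roughly like this. It is a minimal sketch; the exact placement alongside the existing totem/logging/quorum sections is an assumption, not a requirement:

    # tell libqb to use socket-based IPC instead of shared memory (mmap),
    # which helps isolate whether the crash is in the libqb IPC layer
    qb {
        ipc_type: socket
    }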
Andrew Martin wrote:
> Angus and Honza,
>
> I recompiled corosync with --enable-debug. Below is a capture of the valgrind output when corosync dies, after switching rrp_mode to passive:
>
> # valgrind corosync -f
> ==5453== Memcheck, a memory error detector
> ==5453== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
> ==5453== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
> ==5453== Command: corosync -f
> ==5453==
> notice [MAIN ] Corosync Cluster Engine ('2.1.0'): started and ready to provide service.
> info [MAIN ] Corosync built-in features: debug pie relro bindnow
> ==5453== Syscall param socketcall.sendmsg(msg) points to uninitialised byte(s)
> ==5453== at 0x54D233D: ??? (syscall-template.S:82)
> ==5453== by 0x4E391E8: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== by 0x4E3BFC8: totemudp_token_send (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== by 0x4E38CF0: totemnet_token_send (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== by 0x4E3F1AF: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== by 0x4E40FB5: totemrrp_token_send (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== by 0x4E47E84: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== by 0x4E45770: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== by 0x4E40AD2: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== by 0x4E3C1A4: totemudp_token_target_set (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== by 0x4E38EBC: totemnet_token_target_set (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== by 0x4E3F3A8: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== Address 0x7feff7f58 is on thread 1's stack
> ==5453==
> ==5453== Syscall param socketcall.sendmsg(msg.msg_iov[i]) points to uninitialised byte(s)
> ==5453== at 0x54D233D: ??? (syscall-template.S:82)
> ==5453== by 0x4E39427: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== by 0x4E3C042: totemudp_mcast_noflush_send (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== by 0x4E38D8A: totemnet_mcast_noflush_send (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== by 0x4E3F03D: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== by 0x4E4104D: totemrrp_mcast_noflush_send (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== by 0x4E46CB8: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== by 0x4E49A04: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== by 0x4E4C8E0: main_deliver_fn (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== by 0x4E3F0A6: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== by 0x4E409A8: rrp_deliver_fn (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== by 0x4E39967: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== Address 0x7feffb9da is on thread 1's stack
> ==5453==
> ==5453== Syscall param socketcall.sendmsg(msg.msg_iov[i]) points to uninitialised byte(s)
> ==5453== at 0x54D233D: ??? (syscall-template.S:82)
> ==5453== by 0x4E39526: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== by 0x4E3C042: totemudp_mcast_noflush_send (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== by 0x4E38D8A: totemnet_mcast_noflush_send (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== by 0x4E3F03D: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== by 0x4E4104D: totemrrp_mcast_noflush_send (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== by 0x4E46CB8: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== by 0x4E49A04: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== by 0x4E4C8E0: main_deliver_fn (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== by 0x4E3F0A6: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== by 0x4E409A8: rrp_deliver_fn (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== by 0x4E39967: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
> ==5453== Address 0x7feffb9da is on thread 1's stack
> ==5453==
> Ringbuffer:
> ->OVERWRITE
> ->write_pt [0]
> ->read_pt [0]
> ->size [2097152 words]
> =>free [8388608 bytes]
> =>used [0 bytes]
> ==5453==
> ==5453== HEAP SUMMARY:
> ==5453== in use at exit: 13,175,149 bytes in 1,648 blocks
> ==5453== total heap usage: 70,091 allocs, 68,443 frees, 67,724,863 bytes allocated
> ==5453==
> ==5453== LEAK SUMMARY:
> ==5453== definitely lost: 0 bytes in 0 blocks
> ==5453== indirectly lost: 0 bytes in 0 blocks
> ==5453== possibly lost: 2,100,062 bytes in 35 blocks
> ==5453== still reachable: 11,075,087 bytes in 1,613 blocks
> ==5453== suppressed: 0 bytes in 0 blocks
> ==5453== Rerun with --leak-check=full to see details of leaked memory
> ==5453==
> ==5453== For counts of detected and suppressed errors, rerun with: -v
> ==5453== Use --track-origins=yes to see where uninitialised values come from
> ==5453== ERROR SUMMARY: 715 errors from 3 contexts (suppressed: 2 from 2)
> Bus error (core dumped)
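A follow-up run that simply combines the options valgrind itself suggests in the summary above might look like this (nothing corosync-specific is assumed here):

    # valgrind -v --track-origins=yes --leak-check=full corosync -f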
>
> I was also able to capture non-truncated fdata:
> http://sources.xes-inc.com/downloads/fdata-20121107
>
> Here is the coredump:
> http://sources.xes-inc.com/downloads/vgcore.5453
>
> I was not able to get corosync to crash without pacemaker also running, though I was not able to test for a long period of time.
>
> Another thing I discovered tonight was that the 127.0.1.1 entry in /etc/hosts (on both storage0 and storage1) was the source of the extra "localhost" entry in the cluster. I have removed this extraneous node so now only the 3 real nodes remain and commented out this line in /etc/hosts on all nodes in the cluster.
> http://burning-midnight.blogspot.com/2012/07/cluster-building-ubuntu-1204-revised.html
>
> Thanks,
>
> Andrew
>
> ----- Original Message -----
> From: "Jan Friesse" <jfrie...@redhat.com>
> To: "Andrew Martin" <amar...@xes-inc.com>
> Cc: "Angus Salkeld" <asalk...@redhat.com>, disc...@corosync.org, pacemaker@oss.clusterlabs.org
> Sent: Wednesday, November 7, 2012 2:00:20 AM
> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster
>
> Andrew,
>
> Andrew Martin wrote:
>> A bit more data on this problem: I was doing some maintenance and had to briefly disconnect storagequorum's connection to the STONITH network (ethernet cable #7 in this diagram):
>> http://sources.xes-inc.com/downloads/storagecluster.png
>>
>> Since corosync has two rings (and is in active mode), this should cause no disruption to the cluster. However, as soon as I disconnected cable #7, corosync on storage0 died (corosync was already stopped on storage1), which caused pacemaker on storage0 to also shut down. I was not able to obtain a coredump this time as apport is still running on storage0.
>
> I strongly believe the corosync fault is because of the original problem you have. Also I would recommend you to try passive mode. Passive mode is better, because if one link fails, passive mode makes progress (delivers messages), where active mode doesn't (up to the moment when the ring is marked as failed; after that, passive and active behave the same). Also passive mode is much better tested.
>
>> What else can I do to debug this problem? Or, should I just try to downgrade to corosync 1.4.2 (the version available in the Ubuntu repositories)?
>
> I would really like to find the main issue (which looks like a libqb one, rather than a corosync one). But if you decide to downgrade, please downgrade to the latest 1.4.x series (1.4.4 for now). 1.4.2 has A LOT of known bugs.
>
>> Thanks,
>>
>> Andrew
>
> Regards,
> Honza
>
>> ----- Original Message -----
>> From: "Andrew Martin" <amar...@xes-inc.com>
>> To: "Angus Salkeld" <asalk...@redhat.com>
>> Cc: disc...@corosync.org, pacemaker@oss.clusterlabs.org
>> Sent: Tuesday, November 6, 2012 2:01:17 PM
>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster
>>
>> Hi Angus,
>>
>> I recompiled corosync with the changes you suggested in exec/main.c to generate fdata when SIGBUS is triggered. Here's the corresponding coredump and fdata files:
>> http://sources.xes-inc.com/downloads/core.13027
>> http://sources.xes-inc.com/downloads/fdata.20121106
>>
>> (gdb) thread apply all bt
>>
>> Thread 1 (Thread 0x7ffff7fec700 (LWP 13027)):
>> #0 0x00007ffff775bda3 in qb_rb_chunk_alloc () from /usr/lib/libqb.so.0
>> #1 0x00007ffff77656b9 in ?? () from /usr/lib/libqb.so.0
>> #2 0x00007ffff77637ba in qb_log_real_va_ () from /usr/lib/libqb.so.0
>> #3 0x0000555555571700 in ?? ()
>> #4 0x00007ffff7bc7df6 in ?? () from /usr/lib/libtotem_pg.so.5
>> #5 0x00007ffff7bc1a6d in rrp_deliver_fn () from /usr/lib/libtotem_pg.so.5
>> #6 0x00007ffff7bbc8e2 in ?? () from /usr/lib/libtotem_pg.so.5
>> #7 0x00007ffff775d46f in ?? () from /usr/lib/libqb.so.0
>> #8 0x00007ffff775cfe7 in qb_loop_run () from /usr/lib/libqb.so.0
>> #9 0x0000555555560945 in main ()
>>
>> I've also been doing some hardware tests to rule it out as the cause of this problem: mcelog has found no problems and memtest finds the memory to be healthy as well.
>>
>> Thanks,
>>
>> Andrew
>>
>> ----- Original Message -----
>> From: "Angus Salkeld" <asalk...@redhat.com>
>> To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
>> Sent: Friday, November 2, 2012 8:18:51 PM
>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster
>>
>> On 02/11/12 13:07 -0500, Andrew Martin wrote:
>>> Hi Angus,
>>>
>>> Corosync died again while using libqb 0.14.3. Here is the coredump from today:
>>> http://sources.xes-inc.com/downloads/corosync.nov2.coredump
>>>
>>> # corosync -f
>>> notice [MAIN ] Corosync Cluster Engine ('2.1.0'): started and ready to provide service.
>>> info [MAIN ] Corosync built-in features: pie relro bindnow
>>> Bus error (core dumped)
>>>
>>> Here's the log: http://pastebin.com/bUfiB3T3
>>>
>>> Did your analysis of the core dump reveal anything?
>>
>> I can't get any symbols out of these coredumps. Can you try to get a backtrace?
>>
>>> Is there a way for me to make it generate fdata with a bus error, or how else can I gather additional information to help debug this?
>>>
>>
>> if you look in exec/main.c and look for SIGSEGV you will see how the mechanism for fdata works. Just add a handler for SIGBUS and hook it up. Then you should be able to get the fdata for both.
>>
>> I'd rather be able to get a backtrace if possible.
>>
>> -Angus
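As an illustration of the hook Angus describes, here is a minimal sketch. It assumes a helper corresponding to whatever the existing SIGSEGV path in exec/main.c already calls to flush the blackbox; the names below are placeholders, not corosync's actual symbols:

    #include <signal.h>
    #include <string.h>

    /* Placeholder for the routine the existing SIGSEGV handler uses to dump
     * the libqb blackbox to /var/lib/corosync/fdata-DATETIME-PID. */
    extern void dump_blackbox_to_fdata(void);

    static void sigbus_handler(int num)
    {
        (void)num;
        dump_blackbox_to_fdata();   /* write fdata, same as the SIGSEGV path */
        signal(SIGBUS, SIG_DFL);    /* restore the default action ...        */
        raise(SIGBUS);              /* ... and re-raise so a core dump is still produced */
    }

    /* Called once at startup, next to the existing SIGSEGV registration. */
    static void install_sigbus_handler(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = sigbus_handler;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGBUS, &sa, NULL);
    }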
>>
>>> Thanks,
>>>
>>> Andrew
>>>
>>> ----- Original Message -----
>>> From: "Angus Salkeld" <asalk...@redhat.com>
>>> To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
>>> Sent: Thursday, November 1, 2012 5:47:16 PM
>>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster
>>>
>>> On 01/11/12 17:27 -0500, Andrew Martin wrote:
>>>> Hi Angus,
>>>>
>>>> I'll try upgrading to the latest libqb tomorrow and see if I can reproduce this behavior with it. I was able to get a coredump by running corosync manually in the foreground (corosync -f):
>>>> http://sources.xes-inc.com/downloads/corosync.coredump
>>>
>>> Thanks, looking...
>>>
>>>> There still isn't anything added to /var/lib/corosync however. What do I need to do to enable the fdata file to be created?
>>>
>>> Well if it crashes with SIGSEGV it will generate it automatically. (I see you are getting a bus error) - :(.
>>>
>>> -A
>>>
>>>> Thanks,
>>>>
>>>> Andrew
>>>>
>>>> ----- Original Message -----
>>>> From: "Angus Salkeld" <asalk...@redhat.com>
>>>> To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
>>>> Sent: Thursday, November 1, 2012 5:11:23 PM
>>>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster
>>>>
>>>> On 01/11/12 14:32 -0500, Andrew Martin wrote:
>>>>> Hi Honza,
>>>>>
>>>>> Thanks for the help. I enabled core dumps in /etc/security/limits.conf but didn't have a chance to reboot and apply the changes, so I don't have a core dump this time. Do core dumps need to be enabled for the fdata-DATETIME-PID file to be generated? Right now all that is in /var/lib/corosync are the ringid_XXX files. Do I need to set something explicitly in the corosync config to enable this logging?
>>>>>
>>>>> I did find something else interesting with libqb this time. I compiled libqb 0.14.2 for use with the cluster. This time when corosync died I noticed the following in dmesg:
>>>>> Nov 1 13:21:01 storage1 kernel: [31036.617236] corosync[13305] trap divide error ip:7f657a52e517 sp:7fffd5068858 error:0 in libqb.so.0.14.2[7f657a525000+1f000]
>>>>> This error was only present for one of the many other times corosync has died.
>>>>>
>>>>> I see that there is a newer version of libqb (0.14.3) out, but didn't see a fix for this particular bug. Could this libqb problem be related to corosync hanging up? Here's the corresponding corosync log file (next time I should have a core dump as well):
>>>>> http://pastebin.com/5FLKg7We
>>>>
>>>> Hi Andrew
>>>>
>>>> I can't see much wrong with the log either. If you could run with the latest (libqb-0.14.3) and post a backtrace if it still happens, that would be great.
>>>>
>>>> Thanks
>>>> Angus
>>>>
>>>>> Thanks,
>>>>>
>>>>> Andrew
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "Jan Friesse" <jfrie...@redhat.com>
>>>>> To: "Andrew Martin" <amar...@xes-inc.com>
>>>>> Cc: disc...@corosync.org, "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
>>>>> Sent: Thursday, November 1, 2012 7:55:52 AM
>>>>> Subject: Re: [corosync] Corosync 2.1.0 dies on both nodes in cluster
>>>>>
>>>>> Andrew,
>>>>> I was not able to find anything interesting (from the corosync point of view) in the configuration/logs (corosync related).
>>>>>
>>>>> What would be helpful:
>>>>> - if corosync died, there should be a /var/lib/corosync/fdata-DATETIME-PID file from the dead corosync. Can you please xz them and store them somewhere (they are quite large but compress well)?
>>>>> - If you are able to reproduce the problem (which it seems like you are), can you please allow generating of coredumps and store a backtrace of the coredump somewhere? (Coredumps are stored in /var/lib/corosync as core.PID, and the way to get a backtrace is gdb corosync /var/lib/corosync/core.PID, and there thread apply all bt.) If you are running a distribution with ABRT support, you can also use ABRT to generate a report.
>>>>>
>>>>> Regards,
>>>>> Honza
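Put together, the steps Honza lists above might look like this in practice. This is only a sketch: the limits.conf line assumes that allowing core dumps system-wide is acceptable, and the actual core and fdata file names depend on the PID and date:

    # allow core dumps, either for the current shell before running corosync -f:
    ulimit -c unlimited
    # or persistently via /etc/security/limits.conf:
    #   *   soft   core   unlimited

    # compress the flight data of the dead corosync before uploading it
    xz /var/lib/corosync/fdata-*

    # extract a backtrace from the core dump
    gdb corosync /var/lib/corosync/core.PID
    (gdb) thread apply all bt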
>>>>>
>>>>> Andrew Martin wrote:
>>>>>> Corosync died an additional 3 times during the night on storage1. I wrote a daemon to attempt to start it as soon as it fails, so only one of those times resulted in a STONITH of storage1.
>>>>>>
>>>>>> I enabled debug in the corosync config, so I was able to capture a period when corosync died with debug output:
>>>>>> http://pastebin.com/eAmJSmsQ
>>>>>> In this example, Pacemaker finishes shutting down by Nov 01 05:53:02. For reference, here is my Pacemaker configuration:
>>>>>> http://pastebin.com/DFL3hNvz
>>>>>>
>>>>>> It seems that an extra node, 16777343 "localhost", has been added to the cluster after storage1 was STONITHed (it must be the localhost interface on storage1). Is there any way to prevent this?
>>>>>>
>>>>>> Does this help to determine why corosync is dying, and what I can do to fix it?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Andrew
>>>>>>
>>>>>> ----- Original Message -----
>>>>>> From: "Andrew Martin" <amar...@xes-inc.com>
>>>>>> To: disc...@corosync.org
>>>>>> Sent: Thursday, November 1, 2012 12:11:35 AM
>>>>>> Subject: [corosync] Corosync 2.1.0 dies on both nodes in cluster
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I recently configured a 3-node fileserver cluster by building Corosync 2.1.0 and Pacemaker 1.1.8 from source. All of the nodes are running Ubuntu 12.04 amd64. Two of the nodes (storage0 and storage1) are "real" nodes where the resources run (a DRBD disk, filesystem mount, and samba/nfs daemons), while the third node (storagequorum) is in standby mode and acts as a quorum node for the cluster. Today I discovered that corosync died on both storage0 and storage1 at the same time. Since corosync died, pacemaker shut down as well on both nodes. Because the cluster no longer had quorum (and the no-quorum-policy="freeze"), storagequorum was unable to STONITH either node and just left the resources frozen where they were running, on storage0. I cannot find any log information to determine why corosync crashed, and this is a disturbing problem as the cluster and its messaging layer must be stable. Below is my corosync configuration file as well as the corosync log file from each node during this period.
>>>>>>
>>>>>> corosync.conf:
>>>>>> http://pastebin.com/vWQDVmg8
>>>>>> Note that I have two redundant rings. On one of them, I specify the IP address (in this example 10.10.10.7) so that it binds to the correct interface (since potentially in the future those machines may have two interfaces on the same subnet).
>>>>>>
>>>>>> corosync.log from storage0:
>>>>>> http://pastebin.com/HK8KYDDQ
>>>>>>
>>>>>> corosync.log from storage1:
>>>>>> http://pastebin.com/sDWkcPUz
>>>>>>
>>>>>> corosync.log from storagequorum (the DC during this period):
>>>>>> http://pastebin.com/uENQ5fnf
>>>>>>
>>>>>> Issuing service corosync start && service pacemaker start on storage0 and storage1 resolved the problem and allowed the nodes to successfully reconnect to the cluster. What other information can I provide to help diagnose this problem and prevent it from recurring?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Andrew Martin

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org