Andrew,

Andrew Martin napsal(a):
> Hi Angus,
>
> I recompiled corosync with the changes you suggested in exec/main.c to
> generate fdata when SIGBUS is triggered. Here's the corresponding coredump
> and fdata files:
> http://sources.xes-inc.com/downloads/core.13027
> http://sources.xes-inc.com/downloads/fdata.20121106
The fdata file is completely useless because it is truncated.

>
> (gdb) thread apply all bt
>
> Thread 1 (Thread 0x7ffff7fec700 (LWP 13027)):
> #0  0x00007ffff775bda3 in qb_rb_chunk_alloc () from /usr/lib/libqb.so.0
> #1  0x00007ffff77656b9 in ?? () from /usr/lib/libqb.so.0
> #2  0x00007ffff77637ba in qb_log_real_va_ () from /usr/lib/libqb.so.0
> #3  0x0000555555571700 in ?? ()
> #4  0x00007ffff7bc7df6 in ?? () from /usr/lib/libtotem_pg.so.5
> #5  0x00007ffff7bc1a6d in rrp_deliver_fn () from /usr/lib/libtotem_pg.so.5
> #6  0x00007ffff7bbc8e2 in ?? () from /usr/lib/libtotem_pg.so.5
> #7  0x00007ffff775d46f in ?? () from /usr/lib/libqb.so.0
> #8  0x00007ffff775cfe7 in qb_loop_run () from /usr/lib/libqb.so.0
> #9  0x0000555555560945 in main ()

Can you please compile corosync with --enable-debug so the backtrace is more
complete? Also, because you are hitting this failure quite often and reliably,
can you please run corosync under valgrind and report the results? I mean
something like "valgrind corosync -f" (in screen, for example) and then just
copy/paste the output.

> I've also been doing some hardware tests to rule it out as the cause of this
> problem: mcelog has found no problems and memtest finds the memory to be
> healthy as well.

This was one of the things I wanted to recommend.

> Thanks,
>
> Andrew
>
> ----- Original Message -----
> From: "Angus Salkeld" <asalk...@redhat.com>
> To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
> Sent: Friday, November 2, 2012 8:18:51 PM
> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster
>
> On 02/11/12 13:07 -0500, Andrew Martin wrote:
>> Hi Angus,
>>
>> Corosync died again while using libqb 0.14.3. Here is the coredump from
>> today:
>> http://sources.xes-inc.com/downloads/corosync.nov2.coredump
>>
>> # corosync -f
>> notice [MAIN ] Corosync Cluster Engine ('2.1.0'): started and ready to provide service.
>> info [MAIN ] Corosync built-in features: pie relro bindnow
>> Bus error (core dumped)
>>
>> Here's the log: http://pastebin.com/bUfiB3T3
>>
>> Did your analysis of the core dump reveal anything?
>
> I can't get any symbols out of these coredumps. Can you try to get a backtrace?
>
>> Is there a way for me to make it generate fdata with a bus error, or how
>> else can I gather additional information to help debug this?
>
> If you look in exec/main.c for SIGSEGV, you will see how the mechanism
> for fdata works. Just add a handler for SIGBUS and hook it up. Then you
> should be able to get the fdata for both.
>
> I'd rather be able to get a backtrace if possible.
>
> -Angus
>
>> Thanks,
>>
>> Andrew
>>
>> ----- Original Message -----
>> From: "Angus Salkeld" <asalk...@redhat.com>
>> To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
>> Sent: Thursday, November 1, 2012 5:47:16 PM
>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster
>>
>> On 01/11/12 17:27 -0500, Andrew Martin wrote:
>>> Hi Angus,
>>>
>>> I'll try upgrading to the latest libqb tomorrow and see if I can reproduce
>>> this behavior with it. I was able to get a coredump by running corosync
>>> manually in the foreground (corosync -f):
>>> http://sources.xes-inc.com/downloads/corosync.coredump
>>
>> Thanks, looking...
>>
>>> There still isn't anything added to /var/lib/corosync however. What do I
>>> need to do to enable the fdata file to be created?
>>
>> Well, if it crashes with SIGSEGV it will generate it automatically.
>> (I see you are getting a bus error) - :(.
>>
>> -A
>>
>>> Thanks,
>>>
>>> Andrew
>>>
>>> ----- Original Message -----
>>> From: "Angus Salkeld" <asalk...@redhat.com>
>>> To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
>>> Sent: Thursday, November 1, 2012 5:11:23 PM
>>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster
>>>
>>> On 01/11/12 14:32 -0500, Andrew Martin wrote:
>>>> Hi Honza,
>>>>
>>>> Thanks for the help. I enabled core dumps in /etc/security/limits.conf but
>>>> didn't have a chance to reboot and apply the changes, so I don't have a
>>>> core dump this time. Do core dumps need to be enabled for the
>>>> fdata-DATETIME-PID file to be generated? Right now all that is in
>>>> /var/lib/corosync are the ringid_XXX files. Do I need to set something
>>>> explicitly in the corosync config to enable this logging?
>>>>
>>>> I did find something else interesting with libqb this time. I compiled
>>>> libqb 0.14.2 for use with the cluster. This time when corosync died I
>>>> noticed the following in dmesg:
>>>> Nov 1 13:21:01 storage1 kernel: [31036.617236] corosync[13305] trap divide
>>>> error ip:7f657a52e517 sp:7fffd5068858 error:0 in
>>>> libqb.so.0.14.2[7f657a525000+1f000]
>>>> This error was only present for one of the many other times corosync has
>>>> died.
>>>>
>>>> I see that there is a newer version of libqb (0.14.3) out, but didn't see
>>>> a fix for this particular bug. Could this libqb problem be related to
>>>> corosync hanging up? Here's the corresponding corosync log file (next time
>>>> I should have a core dump as well):
>>>> http://pastebin.com/5FLKg7We
>>>
>>> Hi Andrew
>>>
>>> I can't see much wrong with the log either. If you could run with the
>>> latest (libqb-0.14.3) and post a backtrace if it still happens, that
>>> would be great.
>>>
>>> Thanks
>>> Angus
>>>
>>>> Thanks,
>>>>
>>>> Andrew
>>>>
>>>> ----- Original Message -----
>>>> From: "Jan Friesse" <jfrie...@redhat.com>
>>>> To: "Andrew Martin" <amar...@xes-inc.com>
>>>> Cc: disc...@corosync.org, "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
>>>> Sent: Thursday, November 1, 2012 7:55:52 AM
>>>> Subject: Re: [corosync] Corosync 2.1.0 dies on both nodes in cluster
>>>>
>>>> Andrew,
>>>> I was not able to find anything interesting (from the corosync point of
>>>> view) in the configuration/logs (corosync related).
>>>>
>>>> What would be helpful:
>>>> - If corosync died, there should be a /var/lib/corosync/fdata-DATETIME-PID
>>>> file from the dead corosync. Can you please xz them and store them
>>>> somewhere (they are quite large but compress well)?
>>>> - If you are able to reproduce the problem (which it seems you are), can
>>>> you please enable generation of coredumps and store a backtrace of the
>>>> coredump somewhere? (Coredumps are stored in /var/lib/corosync as
>>>> core.PID; the way to obtain a backtrace is "gdb corosync
>>>> /var/lib/corosync/core.PID", and there "thread apply all bt".) If you are
>>>> running a distribution with ABRT support, you can also use ABRT to
>>>> generate a report.
>>>>
>>>> Regards,
>>>> Honza
>>>>
>>>> Andrew Martin napsal(a):
>>>>> Corosync died an additional 3 times during the night on storage1. I wrote
>>>>> a daemon to attempt to restart it as soon as it fails, so only one of
>>>>> those times resulted in a STONITH of storage1.
>>>>>
>>>>> I enabled debug in the corosync config, so I was able to capture a period
>>>>> when corosync died with debug output:
>>>>> http://pastebin.com/eAmJSmsQ
>>>>> In this example, Pacemaker finishes shutting down by Nov 01 05:53:02.
>>>>> For reference, here is my Pacemaker configuration:
>>>>> http://pastebin.com/DFL3hNvz
>>>>>
>>>>> It seems that an extra node, 16777343 "localhost", has been added to the
>>>>> cluster after storage1 was STONITHed (it must be the localhost interface
>>>>> on storage1). Is there any way to prevent this?
>>>>>
>>>>> Does this help to determine why corosync is dying, and what I can do to
>>>>> fix it?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Andrew
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "Andrew Martin" <amar...@xes-inc.com>
>>>>> To: disc...@corosync.org
>>>>> Sent: Thursday, November 1, 2012 12:11:35 AM
>>>>> Subject: [corosync] Corosync 2.1.0 dies on both nodes in cluster
>>>>>
>>>>> Hello,
>>>>>
>>>>> I recently configured a 3-node fileserver cluster by building Corosync
>>>>> 2.1.0 and Pacemaker 1.1.8 from source. All of the nodes are running
>>>>> Ubuntu 12.04 amd64. Two of the nodes (storage0 and storage1) are "real"
>>>>> nodes where the resources run (a DRBD disk, filesystem mount, and
>>>>> samba/nfs daemons), while the third node (storagequorum) is in standby
>>>>> mode and acts as a quorum node for the cluster. Today I discovered that
>>>>> corosync died on both storage0 and storage1 at the same time. Since
>>>>> corosync died, pacemaker shut down as well on both nodes. Because the
>>>>> cluster no longer had quorum (and no-quorum-policy="freeze"),
>>>>> storagequorum was unable to STONITH either node and just left the
>>>>> resources frozen where they were running, on storage0. I cannot find any
>>>>> log information to determine why corosync crashed, and this is a
>>>>> disturbing problem as the cluster and its messaging layer must be stable.
>>>>> Below is my corosync configuration file as well as the corosync log file
>>>>> from each node during this period.
>>>>>
>>>>> corosync.conf:
>>>>> http://pastebin.com/vWQDVmg8
>>>>> Note that I have two redundant rings. On one of them, I specify the IP
>>>>> address (in this example 10.10.10.7) so that it binds to the correct
>>>>> interface (since potentially in the future those machines may have two
>>>>> interfaces on the same subnet).
>>>>>
>>>>> corosync.log from storage0:
>>>>> http://pastebin.com/HK8KYDDQ
>>>>>
>>>>> corosync.log from storage1:
>>>>> http://pastebin.com/sDWkcPUz
>>>>>
>>>>> corosync.log from storagequorum (the DC during this period):
>>>>> http://pastebin.com/uENQ5fnf
>>>>>
>>>>> Issuing "service corosync start && service pacemaker start" on storage0
>>>>> and storage1 resolved the problem and allowed the nodes to successfully
>>>>> reconnect to the cluster. What other information can I provide to help
>>>>> diagnose this problem and prevent it from recurring?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Andrew Martin
>>>>>
>>>>> _______________________________________________
>>>>> discuss mailing list
>>>>> disc...@corosync.org
>>>>> http://lists.corosync.org/mailman/listinfo/discuss

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org