Hi Angus,
I recompiled corosync with the changes you suggested in exec/main.c to generate fdata when SIGBUS is triggered. Here's the corresponding coredump and fdata files:

http://sources.xes-inc.com/downloads/core.13027
http://sources.xes-inc.com/downloads/fdata.20121106

(gdb) thread apply all bt

Thread 1 (Thread 0x7ffff7fec700 (LWP 13027)):
#0 0x00007ffff775bda3 in qb_rb_chunk_alloc () from /usr/lib/libqb.so.0
#1 0x00007ffff77656b9 in ?? () from /usr/lib/libqb.so.0
#2 0x00007ffff77637ba in qb_log_real_va_ () from /usr/lib/libqb.so.0
#3 0x0000555555571700 in ?? ()
#4 0x00007ffff7bc7df6 in ?? () from /usr/lib/libtotem_pg.so.5
#5 0x00007ffff7bc1a6d in rrp_deliver_fn () from /usr/lib/libtotem_pg.so.5
#6 0x00007ffff7bbc8e2 in ?? () from /usr/lib/libtotem_pg.so.5
#7 0x00007ffff775d46f in ?? () from /usr/lib/libqb.so.0
#8 0x00007ffff775cfe7 in qb_loop_run () from /usr/lib/libqb.so.0
#9 0x0000555555560945 in main ()
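For reference, the hook follows the same pattern as the existing SIGSEGV fdata handler in exec/main.c: dump the flight data, restore the default disposition, and re-raise so a normal coredump is still produced. Below is a minimal standalone sketch of that pattern (not the actual diff); fdata_dump_placeholder() only stands in for whatever dump routine exec/main.c really calls, so the names are illustrative, not real corosync symbols:

/*
 * Sketch of a SIGBUS hook modelled on the SIGSEGV fdata path in
 * exec/main.c.  fdata_dump_placeholder() is a stand-in for the real
 * flight-data dump routine and is not a corosync symbol.
 */
#include <signal.h>
#include <stdio.h>
#include <string.h>

static void fdata_dump_placeholder(void)
{
    /* the real routine writes /var/lib/corosync/fdata-DATETIME-PID */
    fprintf(stderr, "dumping flight data\n");
}

static void sigbus_handler(int num)
{
    fdata_dump_placeholder();

    /* restore the default action and re-raise so the kernel still
     * produces a normal coredump after the fdata has been written */
    signal(num, SIG_DFL);
    raise(num);
}

static void install_sigbus_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sigbus_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, NULL);
}

int main(void)
{
    install_sigbus_handler();
    raise(SIGBUS);  /* demo only: runs the handler, then core dumps */
    return 0;
}

In the real exec/main.c change, the handler is simply registered alongside the existing SIGSEGV one rather than in a main() of its own.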
I've also been doing some hardware tests to rule hardware out as the cause of this problem: mcelog has found no problems, and memtest finds the memory to be healthy as well.

Thanks,

Andrew

----- Original Message -----

From: "Angus Salkeld" <asalk...@redhat.com>
To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
Sent: Friday, November 2, 2012 8:18:51 PM
Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster

On 02/11/12 13:07 -0500, Andrew Martin wrote:
> Hi Angus,
>
> Corosync died again while using libqb 0.14.3. Here is the coredump from today:
> http://sources.xes-inc.com/downloads/corosync.nov2.coredump
>
> # corosync -f
> notice [MAIN ] Corosync Cluster Engine ('2.1.0'): started and ready to provide service.
> info [MAIN ] Corosync built-in features: pie relro bindnow
> Bus error (core dumped)
>
> Here's the log: http://pastebin.com/bUfiB3T3
>
> Did your analysis of the core dump reveal anything?

I can't get any symbols out of these coredumps. Can you try to get a backtrace?

> Is there a way for me to make it generate fdata with a bus error, or how else can I gather additional information to help debug this?

If you look in exec/main.c and look for SIGSEGV, you will see how the mechanism for fdata works. Just add a handler for SIGBUS and hook it up. Then you should be able to get the fdata for both. I'd rather be able to get a backtrace if possible.

-Angus

> Thanks,
>
> Andrew
>
> ----- Original Message -----
>
> From: "Angus Salkeld" <asalk...@redhat.com>
> To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
> Sent: Thursday, November 1, 2012 5:47:16 PM
> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster
>
> On 01/11/12 17:27 -0500, Andrew Martin wrote:
>> Hi Angus,
>>
>> I'll try upgrading to the latest libqb tomorrow and see if I can reproduce this behavior with it. I was able to get a coredump by running corosync manually in the foreground (corosync -f):
>> http://sources.xes-inc.com/downloads/corosync.coredump
>
> Thanks, looking...
>
>> There still isn't anything added to /var/lib/corosync however. What do I need to do to enable the fdata file to be created?
>
> Well, if it crashes with SIGSEGV it will generate it automatically.
> (I see you are getting a bus error) - :(.
>
> -A
>
>> Thanks,
>>
>> Andrew
>>
>> ----- Original Message -----
>>
>> From: "Angus Salkeld" <asalk...@redhat.com>
>> To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
>> Sent: Thursday, November 1, 2012 5:11:23 PM
>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster
>>
>> On 01/11/12 14:32 -0500, Andrew Martin wrote:
>>> Hi Honza,
>>>
>>> Thanks for the help. I enabled core dumps in /etc/security/limits.conf but didn't have a chance to reboot and apply the changes, so I don't have a core dump this time. Do core dumps need to be enabled for the fdata-DATETIME-PID file to be generated? Right now all that is in /var/lib/corosync are the ringid_XXX files. Do I need to set something explicitly in the corosync config to enable this logging?
>>>
>>> I did find something else interesting with libqb this time. I compiled libqb 0.14.2 for use with the cluster. This time when corosync died I noticed the following in dmesg:
>>> Nov 1 13:21:01 storage1 kernel: [31036.617236] corosync[13305] trap divide error ip:7f657a52e517 sp:7fffd5068858 error:0 in libqb.so.0.14.2[7f657a525000+1f000]
>>> This error was only present for one of the many times corosync has died.
>>>
>>> I see that there is a newer version of libqb (0.14.3) out, but didn't see a fix for this particular bug. Could this libqb problem be related to corosync dying? Here's the corresponding corosync log file (next time I should have a core dump as well):
>>> http://pastebin.com/5FLKg7We
>>
>> Hi Andrew,
>>
>> I can't see much wrong with the log either. If you could run with the latest (libqb-0.14.3) and post a backtrace if it still happens, that would be great.
>>
>> Thanks,
>> Angus
>>
>>> Thanks,
>>>
>>> Andrew
>>>
>>> ----- Original Message -----
>>>
>>> From: "Jan Friesse" <jfrie...@redhat.com>
>>> To: "Andrew Martin" <amar...@xes-inc.com>
>>> Cc: disc...@corosync.org, "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
>>> Sent: Thursday, November 1, 2012 7:55:52 AM
>>> Subject: Re: [corosync] Corosync 2.1.0 dies on both nodes in cluster
>>>
>>> Andrew,
>>> I was not able to find anything interesting (from a corosync point of view) in the configuration/logs (corosync related).
>>>
>>> What would be helpful:
>>> - If corosync died, there should be a /var/lib/corosync/fdata-DATETIME-PID file from the dead corosync. Can you please xz them and store them somewhere (they are quite large but compress well)?
>>> - If you are able to reproduce the problem (it seems like you are), can you please enable generation of coredumps and store a backtrace of the coredump somewhere? (Coredumps are stored in /var/lib/corosync as core.PID; the way to get a backtrace is gdb corosync /var/lib/corosync/core.PID, and there "thread apply all bt".) If you are running a distribution with ABRT support, you can also use ABRT to generate a report.
>>>
>>> Regards,
>>> Honza
>>>
>>> Andrew Martin napsal(a):
>>>> Corosync died an additional 3 times during the night on storage1. I wrote a daemon to attempt to restart it as soon as it fails, so only one of those times resulted in a STONITH of storage1.
>>>>
>>>> I enabled debug in the corosync config, so I was able to capture a period when corosync died with debug output:
>>>> http://pastebin.com/eAmJSmsQ
>>>> In this example, Pacemaker finishes shutting down by Nov 01 05:53:02.
>>>> For reference, here is my Pacemaker configuration:
>>>> http://pastebin.com/DFL3hNvz
>>>>
>>>> It seems that an extra node, 16777343 "localhost", has been added to the cluster after storage1 was STONITHed (it must be the localhost interface on storage1). Is there any way to prevent this?
>>>>
>>>> Does this help to determine why corosync is dying, and what I can do to fix it?
>>>>
>>>> Thanks,
>>>>
>>>> Andrew
>>>>
>>>> ----- Original Message -----
>>>>
>>>> From: "Andrew Martin" <amar...@xes-inc.com>
>>>> To: disc...@corosync.org
>>>> Sent: Thursday, November 1, 2012 12:11:35 AM
>>>> Subject: [corosync] Corosync 2.1.0 dies on both nodes in cluster
>>>>
>>>> Hello,
>>>>
>>>> I recently configured a 3-node fileserver cluster by building Corosync 2.1.0 and Pacemaker 1.1.8 from source. All of the nodes are running Ubuntu 12.04 amd64. Two of the nodes (storage0 and storage1) are "real" nodes where the resources run (a DRBD disk, filesystem mount, and samba/nfs daemons), while the third node (storagequorum) is in standby mode and acts as a quorum node for the cluster. Today I discovered that corosync died on both storage0 and storage1 at the same time. Since corosync died, pacemaker shut down as well on both nodes. Because the cluster no longer had quorum (and no-quorum-policy="freeze" is set), storagequorum was unable to STONITH either node and just left the resources frozen where they were running, on storage0. I cannot find any log information to determine why corosync crashed, and this is a disturbing problem as the cluster and its messaging layer must be stable. Below is my corosync configuration file as well as the corosync log file from each node during this period.
>>>>
>>>> corosync.conf:
>>>> http://pastebin.com/vWQDVmg8
>>>> Note that I have two redundant rings. On one of them, I specify the IP address (in this example 10.10.10.7) so that corosync binds to the correct interface (since in the future those machines may have two interfaces on the same subnet).
>>>>
>>>> corosync.log from storage0:
>>>> http://pastebin.com/HK8KYDDQ
>>>>
>>>> corosync.log from storage1:
>>>> http://pastebin.com/sDWkcPUz
>>>>
>>>> corosync.log from storagequorum (the DC during this period):
>>>> http://pastebin.com/uENQ5fnf
>>>>
>>>> Issuing "service corosync start && service pacemaker start" on storage0 and storage1 resolved the problem and allowed the nodes to successfully reconnect to the cluster. What other information can I provide to help diagnose this problem and prevent it from recurring?
>>>>
>>>> Thanks,
>>>>
>>>> Andrew Martin

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org