Andrew,

Andrew Martin napsal(a):
> Hi Angus,
>
> I recompiled corosync with the changes you suggested in exec/main.c to
> generate fdata when SIGBUS is triggered. Here's the corresponding coredump
> and fdata files:
> http://sources.xes-inc.com/downloads/core.13027
> http://sources.xes-inc.com/downloads/fdata.20121106
The fdata file is completely useless because it is truncated.

>
> (gdb) thread apply all bt
>
> Thread 1 (Thread 0x7ffff7fec700 (LWP 13027)):
> #0  0x00007ffff775bda3 in qb_rb_chunk_alloc () from /usr/lib/libqb.so.0
> #1  0x00007ffff77656b9 in ?? () from /usr/lib/libqb.so.0
> #2  0x00007ffff77637ba in qb_log_real_va_ () from /usr/lib/libqb.so.0
> #3  0x0000555555571700 in ?? ()
> #4  0x00007ffff7bc7df6 in ?? () from /usr/lib/libtotem_pg.so.5
> #5  0x00007ffff7bc1a6d in rrp_deliver_fn () from /usr/lib/libtotem_pg.so.5
> #6  0x00007ffff7bbc8e2 in ?? () from /usr/lib/libtotem_pg.so.5
> #7  0x00007ffff775d46f in ?? () from /usr/lib/libqb.so.0
> #8  0x00007ffff775cfe7 in qb_loop_run () from /usr/lib/libqb.so.0
> #9  0x0000555555560945 in main ()

Can you please compile corosync with --enable-debug so the backtrace is more
complete? Also, because you are hitting this failure quite often and reliably,
can you please run corosync under valgrind and report the results? I mean
something like "valgrind corosync -f" (in screen, for example) and then just
copy/paste the output.

> I've also been doing some hardware tests to rule it out as the cause of this
> problem: mcelog has found no problems and memtest finds the memory to be
> healthy as well.

This was one of the things I wanted to recommend.

> Thanks,
>
> Andrew
>
> ----- Original Message -----
> From: "Angus Salkeld" <asalk...@redhat.com>
> To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
> Sent: Friday, November 2, 2012 8:18:51 PM
> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster
>
> On 02/11/12 13:07 -0500, Andrew Martin wrote:
>> Hi Angus,
>>
>> Corosync died again while using libqb 0.14.3. Here is the coredump from
>> today:
>> http://sources.xes-inc.com/downloads/corosync.nov2.coredump
>>
>> # corosync -f
>> notice [MAIN ] Corosync Cluster Engine ('2.1.0'): started and ready to provide service.
>> info [MAIN ] Corosync built-in features: pie relro bindnow
>> Bus error (core dumped)
>>
>> Here's the log: http://pastebin.com/bUfiB3T3
>>
>> Did your analysis of the core dump reveal anything?
>
> I can't get any symbols out of these coredumps. Can you try to get a backtrace?
>
>> Is there a way for me to make it generate fdata with a bus error, or how
>> else can I gather additional information to help debug this?
>
> If you look in exec/main.c for SIGSEGV, you will see how the mechanism
> for fdata works. Just add a handler for SIGBUS and hook it up. Then you
> should be able to get the fdata for both.
>
> I'd rather be able to get a backtrace if possible.
>
> -Angus
>
>> Thanks,
>>
>> Andrew
>>
>> ----- Original Message -----
>> From: "Angus Salkeld" <asalk...@redhat.com>
>> To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
>> Sent: Thursday, November 1, 2012 5:47:16 PM
>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster
>>
>> On 01/11/12 17:27 -0500, Andrew Martin wrote:
>>> Hi Angus,
>>>
>>> I'll try upgrading to the latest libqb tomorrow and see if I can reproduce
>>> this behavior with it. I was able to get a coredump by running corosync
>>> manually in the foreground (corosync -f):
>>> http://sources.xes-inc.com/downloads/corosync.coredump
>>
>> Thanks, looking...
>>
>>> There still isn't anything added to /var/lib/corosync however. What do I
>>> need to do to enable the fdata file to be created?
>>
>> Well, if it crashes with SIGSEGV it will generate it automatically.
>> (I see you are getting a bus error) - :(.
>>
>> -A
>>
>>> Thanks,
>>>
>>> Andrew
>>>
>>> ----- Original Message -----
>>> From: "Angus Salkeld" <asalk...@redhat.com>
>>> To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
>>> Sent: Thursday, November 1, 2012 5:11:23 PM
>>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster
>>>
>>> On 01/11/12 14:32 -0500, Andrew Martin wrote:
>>>> Hi Honza,
>>>>
>>>> Thanks for the help. I enabled core dumps in /etc/security/limits.conf but
>>>> didn't have a chance to reboot and apply the changes, so I don't have a
>>>> core dump this time. Do core dumps need to be enabled for the
>>>> fdata-DATETIME-PID file to be generated? Right now all that is in
>>>> /var/lib/corosync are the ringid_XXX files. Do I need to set something
>>>> explicitly in the corosync config to enable this logging?
>>>>
>>>> I did find something else interesting with libqb this time. I compiled
>>>> libqb 0.14.2 for use with the cluster. This time when corosync died I
>>>> noticed the following in dmesg:
>>>> Nov 1 13:21:01 storage1 kernel: [31036.617236] corosync[13305] trap divide
>>>> error ip:7f657a52e517 sp:7fffd5068858 error:0 in
>>>> libqb.so.0.14.2[7f657a525000+1f000]
>>>> This error was only present for one of the many other times corosync has
>>>> died.
>>>>
>>>> I see that there is a newer version of libqb (0.14.3) out, but didn't see
>>>> a fix for this particular bug. Could this libqb problem be related to
>>>> corosync hanging up? Here's the corresponding corosync log file (next time
>>>> I should have a core dump as well):
>>>> http://pastebin.com/5FLKg7We
>>>
>>> Hi Andrew
>>>
>>> I can't see much wrong with the log either. If you could run with the
>>> latest (libqb-0.14.3) and post a backtrace if it still happens, that
>>> would be great.
>>>
>>> Thanks
>>> Angus
>>>
>>>> Thanks,
>>>>
>>>> Andrew
>>>>
>>>> ----- Original Message -----
>>>> From: "Jan Friesse" <jfrie...@redhat.com>
>>>> To: "Andrew Martin" <amar...@xes-inc.com>
>>>> Cc: disc...@corosync.org, "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
>>>> Sent: Thursday, November 1, 2012 7:55:52 AM
>>>> Subject: Re: [corosync] Corosync 2.1.0 dies on both nodes in cluster
>>>>
>>>> Andrew,
>>>> I was not able to find anything interesting (from the corosync point of
>>>> view) in the configuration/logs (corosync related).
>>>>
>>>> What would be helpful:
>>>> - If corosync died, there should be a /var/lib/corosync/fdata-DATETIME-PID
>>>> file from the dead corosync. Can you please xz them and store them
>>>> somewhere (they are quite large but compress well)?
>>>> - If you are able to reproduce the problem (which it seems you are), can
>>>> you please enable generation of coredumps and store a backtrace of the
>>>> coredump somewhere? (Coredumps are stored in /var/lib/corosync as
>>>> core.PID; the way to obtain a backtrace is "gdb corosync
>>>> /var/lib/corosync/core.PID", and there "thread apply all bt".) If you are
>>>> running a distribution with ABRT support, you can also use ABRT to
>>>> generate a report.
>>>>
>>>> Regards,
>>>> Honza
>>>>
>>>> Andrew Martin napsal(a):
>>>>> Corosync died an additional 3 times during the night on storage1. I wrote
>>>>> a daemon to attempt to restart it as soon as it fails, so only one of
>>>>> those times resulted in a STONITH of storage1.
>>>>>
>>>>> I enabled debug in the corosync config, so I was able to capture a period
>>>>> when corosync died with debug output:
>>>>> http://pastebin.com/eAmJSmsQ
>>>>> In this example, Pacemaker finishes shutting down by Nov 01 05:53:02.
>>>>> For reference, here is my Pacemaker configuration:
>>>>> http://pastebin.com/DFL3hNvz
>>>>>
>>>>> It seems that an extra node, 16777343 "localhost", has been added to the
>>>>> cluster after storage1 was STONITHed (it must be the localhost interface
>>>>> on storage1). Is there any way to prevent this?
>>>>>
>>>>> Does this help to determine why corosync is dying, and what I can do to
>>>>> fix it?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Andrew
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "Andrew Martin" <amar...@xes-inc.com>
>>>>> To: disc...@corosync.org
>>>>> Sent: Thursday, November 1, 2012 12:11:35 AM
>>>>> Subject: [corosync] Corosync 2.1.0 dies on both nodes in cluster
>>>>>
>>>>> Hello,
>>>>>
>>>>> I recently configured a 3-node fileserver cluster by building Corosync
>>>>> 2.1.0 and Pacemaker 1.1.8 from source. All of the nodes are running
>>>>> Ubuntu 12.04 amd64. Two of the nodes (storage0 and storage1) are "real"
>>>>> nodes where the resources run (a DRBD disk, filesystem mount, and
>>>>> samba/nfs daemons), while the third node (storagequorum) is in standby
>>>>> mode and acts as a quorum node for the cluster. Today I discovered that
>>>>> corosync died on both storage0 and storage1 at the same time. Since
>>>>> corosync died, pacemaker shut down as well on both nodes. Because the
>>>>> cluster no longer had quorum (and no-quorum-policy="freeze"),
>>>>> storagequorum was unable to STONITH either node and just left the
>>>>> resources frozen where they were running, on storage0. I cannot find any
>>>>> log information to determine why corosync crashed, and this is a
>>>>> disturbing problem as the cluster and its messaging layer must be stable.
>>>>> Below is my corosync configuration file as well as the corosync log file
>>>>> from each node during this period.
>>>>>
>>>>> corosync.conf:
>>>>> http://pastebin.com/vWQDVmg8
>>>>> Note that I have two redundant rings. On one of them, I specify the IP
>>>>> address (in this example 10.10.10.7) so that it binds to the correct
>>>>> interface (since potentially in the future those machines may have two
>>>>> interfaces on the same subnet).
>>>>>
>>>>> corosync.log from storage0:
>>>>> http://pastebin.com/HK8KYDDQ
>>>>>
>>>>> corosync.log from storage1:
>>>>> http://pastebin.com/sDWkcPUz
>>>>>
>>>>> corosync.log from storagequorum (the DC during this period):
>>>>> http://pastebin.com/uENQ5fnf
>>>>>
>>>>> Issuing "service corosync start && service pacemaker start" on storage0
>>>>> and storage1 resolved the problem and allowed the nodes to successfully
>>>>> reconnect to the cluster. What other information can I provide to help
>>>>> diagnose this problem and prevent it from recurring?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Andrew Martin
>>>>>
>>>>> _______________________________________________
>>>>> discuss mailing list
>>>>> disc...@corosync.org
>>>>> http://lists.corosync.org/mailman/listinfo/discuss

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org