Honza and Angus,

Glad to hear about this possible breakthrough! Here's the output of df:

root@storage1:~# df
Filesystem                1K-blocks      Used Available Use% Mounted on
/dev/mapper/vg00-lv_root  228424996   3376236 213445408   2% /
udev                        3041428         4   3041424   1% /dev
tmpfs                       1220808       340   1220468   1% /run
none                           5120         8      5112   1% /run/lock
none                        3052016    160652   2891364   6% /run/shm
/dev/sda1                    112039     88040     18214  83% /boot
root@storage1:~# ls -la /dev/shm
lrwxrwxrwx 1 root root 8 Nov 6 08:11 /dev/shm -> /run/shm



root@storage0:~# df
Filesystem                1K-blocks      Used Available Use% Mounted on
/dev/mapper/vg00-lv_root  228424996 140301080  76520564  65% /
udev                        3041264         4   3041260   1% /dev
tmpfs                       1220808       356   1220452   1% /run
none                           5120         4      5116   1% /run/lock
none                        3052012     37868   3014144   2% /run/shm
/dev/sda1                    112039     88973     17281  84% /boot

root@storage0:~# ls -la /dev/shm
lrwxrwxrwx 1 root root 8 Nov 7 21:07 /dev/shm -> /run/shm



root@storagequorum:~# df
Filesystem  1K-blocks    Used Available Use% Mounted on
/dev/sda1    77012644 4014620  69140924   6% /
udev           467564       4    467560   1% /dev
tmpfs          190548     384    190164   1% /run
none             5120       0      5120   0% /run/lock
none           476368   53260    423108  12% /run/shm
root@storagequorum:~# ls -la /dev/shm
lrwxrwxrwx 1 root root 8 Sep 12 12:42 /dev/shm -> /run/shm


It isn't full now, but corosync has been dead on storage1 for several hours. I
am running it in the foreground again this morning to try to reproduce a
higher used value for /run/shm.
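
To track the growth while it runs, I'm logging /run/shm usage once a minute
with a small loop along these lines (the log path is just a scratch location
I picked; the rm command in the comment is the workaround you describe below):

#!/bin/sh
# Log /run/shm usage and the libqb qb-* files once a minute while
# corosync runs in the foreground in another terminal.
# (If /run/shm fills up again, the quick fix is: rm -f /run/shm/qb-*)
LOG=/root/shm-usage.log        # scratch location, adjust as needed
while true; do
    date >> "$LOG"
    df /run/shm >> "$LOG"
    ls -l /run/shm/qb-* >> "$LOG" 2>&1
    sleep 60
done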


I will also compile corosync from git to evaluate the IPC possibility.
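
For reference, this is roughly the build sequence I plan to follow, per your
instructions below (I'm assuming the GitHub mirror is the right place to
clone from):

# Build corosync from git (either the needle or master branch,
# per Honza's note that they are currently the same)
git clone git://github.com/corosync/corosync.git
cd corosync
./autogen.sh
./configure --enable-debug
make && make install

# Then add the qb section to /etc/corosync/corosync.conf to switch
# libqb to socket IPC:
# qb {
#     ipc_type: socket
# }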


Thanks,


Andrew
----- Original Message -----

From: "Jan Friesse" <jfrie...@redhat.com>
To: "Andrew Martin" <amar...@xes-inc.com>
Cc: "Angus Salkeld" <asalk...@redhat.com>, disc...@corosync.org, 
pacemaker@oss.clusterlabs.org
Sent: Thursday, November 8, 2012 7:39:45 AM
Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster

Andrew,
good news. I believe I've found a reproducer for the problem you are
facing. Now, to be sure it's really the same one, can you please run:
df (the interesting entry is /dev/shm)
and send the output of ls -la /dev/shm?

I believe /dev/shm is full.

Now, as a quick workaround, just delete all qb-* files from /dev/shm and the
cluster should work. There are basically two problems:
- ipc_shm is leaking memory
- if there is no memory left, libqb mmaps non-allocated memory and receives SIGBUS

Angus is working on both issues.

Regards,
Honza

Jan Friesse napsal(a):
> Andrew,
> thanks for the valgrind report (even though it didn't show anything useful) and
> the blackbox.
>
> We believe the problem is caused by access to invalid memory mapped by an
> mmap operation. There are basically 3 places where we do mmap.
> 1.) corosync cpg_zcb functions (I don't believe this is the case)
> 2.) LibQB IPC
> 3.) LibQB blackbox
>
> Now, because neither Angus nor I are able to reproduce the bug, can you
> please:
> - apply the patches "Check successful initialization of IPC" and "Add
> support for selecting IPC type" (later versions), or use corosync from
> git (either the needle or master branch; they are the same)
> - compile corosync
> - Add
>
> qb {
> ipc_type: socket
> }
>
> to corosync.conf
> - Try running corosync
>
> This may or may not help solve the problem, but it should help us
> diagnose whether the problem is an IPC one.
>
> Thanks,
> Honza
>
> Andrew Martin napsal(a):
>> Angus and Honza,
>>
>>
>> I recompiled corosync with --enable-debug. Below is a capture of the 
>> valgrind output when corosync dies, after switching rrp_mode to passive:
>>
>> # valgrind corosync -f
>> ==5453== Memcheck, a memory error detector
>> ==5453== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
>> ==5453== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
>> ==5453== Command: corosync -f
>> ==5453==
>> notice [MAIN ] Corosync Cluster Engine ('2.1.0'): started and ready to 
>> provide service.
>> info [MAIN ] Corosync built-in features: debug pie relro bindnow
>> ==5453== Syscall param socketcall.sendmsg(msg) points to uninitialised 
>> byte(s)
>> ==5453== at 0x54D233D: ??? (syscall-template.S:82)
>> ==5453== by 0x4E391E8: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== by 0x4E3BFC8: totemudp_token_send (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== by 0x4E38CF0: totemnet_token_send (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== by 0x4E3F1AF: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== by 0x4E40FB5: totemrrp_token_send (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== by 0x4E47E84: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== by 0x4E45770: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== by 0x4E40AD2: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== by 0x4E3C1A4: totemudp_token_target_set (in 
>> /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== by 0x4E38EBC: totemnet_token_target_set (in 
>> /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== by 0x4E3F3A8: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== Address 0x7feff7f58 is on thread 1's stack
>> ==5453==
>> ==5453== Syscall param socketcall.sendmsg(msg.msg_iov[i]) points to 
>> uninitialised byte(s)
>> ==5453== at 0x54D233D: ??? (syscall-template.S:82)
>> ==5453== by 0x4E39427: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== by 0x4E3C042: totemudp_mcast_noflush_send (in 
>> /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== by 0x4E38D8A: totemnet_mcast_noflush_send (in 
>> /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== by 0x4E3F03D: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== by 0x4E4104D: totemrrp_mcast_noflush_send (in 
>> /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== by 0x4E46CB8: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== by 0x4E49A04: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== by 0x4E4C8E0: main_deliver_fn (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== by 0x4E3F0A6: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== by 0x4E409A8: rrp_deliver_fn (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== by 0x4E39967: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== Address 0x7feffb9da is on thread 1's stack
>> ==5453==
>> ==5453== Syscall param socketcall.sendmsg(msg.msg_iov[i]) points to 
>> uninitialised byte(s)
>> ==5453== at 0x54D233D: ??? (syscall-template.S:82)
>> ==5453== by 0x4E39526: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== by 0x4E3C042: totemudp_mcast_noflush_send (in 
>> /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== by 0x4E38D8A: totemnet_mcast_noflush_send (in 
>> /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== by 0x4E3F03D: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== by 0x4E4104D: totemrrp_mcast_noflush_send (in 
>> /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== by 0x4E46CB8: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== by 0x4E49A04: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== by 0x4E4C8E0: main_deliver_fn (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== by 0x4E3F0A6: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== by 0x4E409A8: rrp_deliver_fn (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== by 0x4E39967: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453== Address 0x7feffb9da is on thread 1's stack
>> ==5453==
>> Ringbuffer:
>> ->OVERWRITE
>> ->write_pt [0]
>> ->read_pt [0]
>> ->size [2097152 words]
>> =>free [8388608 bytes]
>> =>used [0 bytes]
>> ==5453==
>> ==5453== HEAP SUMMARY:
>> ==5453== in use at exit: 13,175,149 bytes in 1,648 blocks
>> ==5453== total heap usage: 70,091 allocs, 68,443 frees, 67,724,863 bytes 
>> allocated
>> ==5453==
>> ==5453== LEAK SUMMARY:
>> ==5453== definitely lost: 0 bytes in 0 blocks
>> ==5453== indirectly lost: 0 bytes in 0 blocks
>> ==5453== possibly lost: 2,100,062 bytes in 35 blocks
>> ==5453== still reachable: 11,075,087 bytes in 1,613 blocks
>> ==5453== suppressed: 0 bytes in 0 blocks
>> ==5453== Rerun with --leak-check=full to see details of leaked memory
>> ==5453==
>> ==5453== For counts of detected and suppressed errors, rerun with: -v
>> ==5453== Use --track-origins=yes to see where uninitialised values come from
>> ==5453== ERROR SUMMARY: 715 errors from 3 contexts (suppressed: 2 from 2)
>> Bus error (core dumped)
>>
>>
>> I was also able to capture non-truncated fdata:
>> http://sources.xes-inc.com/downloads/fdata-20121107
>>
>>
>> Here is the coredump:
>> http://sources.xes-inc.com/downloads/vgcore.5453
>>
>>
>> I was not able to get corosync to crash without pacemaker also running,
>> though I was only able to test for a short period of time.
>>
>>
>> Another thing I discovered tonight was that the 127.0.1.1 entry in
>> /etc/hosts (on both storage0 and storage1) was the source of the extra
>> "localhost" entry in the cluster. I have commented out this line in
>> /etc/hosts on all nodes and removed the extraneous node, so now only the
>> 3 real nodes remain in the cluster.
>> http://burning-midnight.blogspot.com/2012/07/cluster-building-ubuntu-1204-revised.html
>>
>>
>> Thanks,
>>
>>
>> Andrew
>> ----- Original Message -----
>>
>> From: "Jan Friesse" <jfrie...@redhat.com>
>> To: "Andrew Martin" <amar...@xes-inc.com>
>> Cc: "Angus Salkeld" <asalk...@redhat.com>, disc...@corosync.org, 
>> pacemaker@oss.clusterlabs.org
>> Sent: Wednesday, November 7, 2012 2:00:20 AM
>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in 
>> cluster
>>
>> Andrew,
>>
>> Andrew Martin napsal(a):
>>> A bit more data on this problem: I was doing some maintenance and had to 
>>> briefly disconnect storagequorum's connection to the STONITH network 
>>> (ethernet cable #7 in this diagram):
>>> http://sources.xes-inc.com/downloads/storagecluster.png
>>>
>>>
>>> Since corosync has two rings (and is in active mode), this should cause no 
>>> disruption to the cluster. However, as soon as I disconnected cable #7, 
>>> corosync on storage0 died (corosync was already stopped on storage1), which 
>>> caused pacemaker on storage0 to also shutdown. I was not able to obtain a 
>>> coredump this time as apport is still running on storage0.
>>
>> I strongly believe the corosync fault is caused by the original problem you
>> have. I would also recommend trying passive mode. Passive mode is
>> better, because if one link fails, passive mode makes progress (delivers
>> messages) where active mode doesn't (up to the moment the ring is marked
>> as failed; after that, passive and active behave the same). Passive mode
>> is also much better tested.
>>
>>>
>>>
>>> What else can I do to debug this problem? Or, should I just try to 
>>> downgrade to corosync 1.4.2 (the version available in the Ubuntu 
>>> repositories)?
>>
>> I would really like to find the main issue (which looks like a libqb one
>> rather than a corosync one). But if you decide to downgrade, please downgrade
>> to the latest 1.4.x release (1.4.4 for now); 1.4.2 has A LOT of known bugs.
>>
>>>
>>>
>>> Thanks,
>>>
>>>
>>> Andrew
>>
>> Regards,
>> Honza
>>
>>>
>>> ----- Original Message -----
>>>
>>> From: "Andrew Martin" <amar...@xes-inc.com>
>>> To: "Angus Salkeld" <asalk...@redhat.com>
>>> Cc: disc...@corosync.org, pacemaker@oss.clusterlabs.org
>>> Sent: Tuesday, November 6, 2012 2:01:17 PM
>>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in 
>>> cluster
>>>
>>>
>>> Hi Angus,
>>>
>>>
>>> I recompiled corosync with the changes you suggested in exec/main.c to
>>> generate fdata when SIGBUS is triggered. Here are the corresponding coredump
>>> and fdata files:
>>> http://sources.xes-inc.com/downloads/core.13027
>>> http://sources.xes-inc.com/downloads/fdata.20121106
>>>
>>>
>>>
>>> (gdb) thread apply all bt
>>>
>>>
>>> Thread 1 (Thread 0x7ffff7fec700 (LWP 13027)):
>>> #0 0x00007ffff775bda3 in qb_rb_chunk_alloc () from /usr/lib/libqb.so.0 
>>> #1 0x00007ffff77656b9 in ?? () from /usr/lib/libqb.so.0
>>> #2 0x00007ffff77637ba in qb_log_real_va_ () from /usr/lib/libqb.so.0
>>> #3 0x0000555555571700 in ?? ()
>>> #4 0x00007ffff7bc7df6 in ?? () from /usr/lib/libtotem_pg.so.5
>>> #5 0x00007ffff7bc1a6d in rrp_deliver_fn () from /usr/lib/libtotem_pg.so.5
>>> #6 0x00007ffff7bbc8e2 in ?? () from /usr/lib/libtotem_pg.so.5
>>> #7 0x00007ffff775d46f in ?? () from /usr/lib/libqb.so.0
>>> #8 0x00007ffff775cfe7 in qb_loop_run () from /usr/lib/libqb.so.0
>>> #9 0x0000555555560945 in main ()
>>>
>>>
>>>
>>>
>>> I've also been doing some hardware tests to rule it out as the cause of 
>>> this problem: mcelog has found no problems and memtest finds the memory to 
>>> be healthy as well.
>>>
>>>
>>> Thanks,
>>>
>>>
>>> Andrew
>>> ----- Original Message -----
>>>
>>> From: "Angus Salkeld" <asalk...@redhat.com>
>>> To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
>>> Sent: Friday, November 2, 2012 8:18:51 PM
>>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in 
>>> cluster
>>>
>>> On 02/11/12 13:07 -0500, Andrew Martin wrote:
>>>> Hi Angus,
>>>>
>>>>
>>>> Corosync died again while using libqb 0.14.3. Here is the coredump from 
>>>> today:
>>>> http://sources.xes-inc.com/downloads/corosync.nov2.coredump
>>>>
>>>>
>>>>
>>>> # corosync -f
>>>> notice [MAIN ] Corosync Cluster Engine ('2.1.0'): started and ready to 
>>>> provide service.
>>>> info [MAIN ] Corosync built-in features: pie relro bindnow
>>>> Bus error (core dumped)
>>>>
>>>>
>>>> Here's the log: http://pastebin.com/bUfiB3T3
>>>>
>>>>
>>>> Did your analysis of the core dump reveal anything?
>>>>
>>>
>>> I can't get any symbols out of these coredumps. Can you try to get a backtrace?
>>>
>>>>
>>>> Is there a way for me to make it generate fdata with a bus error, or how 
>>>> else can I gather additional information to help debug this?
>>>>
>>>
>>> If you look in exec/main.c for SIGSEGV you will see how the fdata
>>> mechanism works. Just add a handler for SIGBUS and hook it up. Then you
>>> should be able to get the fdata for both.
>>>
>>> I'd rather be able to get a backtrace if possible.
>>>
>>> -Angus
>>>
>>>>
>>>> Thanks,
>>>>
>>>>
>>>> Andrew
>>>>
>>>> ----- Original Message -----
>>>>
>>>> From: "Angus Salkeld" <asalk...@redhat.com>
>>>> To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
>>>> Sent: Thursday, November 1, 2012 5:47:16 PM
>>>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in 
>>>> cluster
>>>>
>>>> On 01/11/12 17:27 -0500, Andrew Martin wrote:
>>>>> Hi Angus,
>>>>>
>>>>>
>>>>> I'll try upgrading to the latest libqb tomorrow and see if I can 
>>>>> reproduce this behavior with it. I was able to get a coredump by running 
>>>>> corosync manually in the foreground (corosync -f):
>>>>> http://sources.xes-inc.com/downloads/corosync.coredump
>>>>
>>>> Thanks, looking...
>>>>
>>>>>
>>>>>
>>>>> There still isn't anything added to /var/lib/corosync however. What do I 
>>>>> need to do to enable the fdata file to be created?
>>>>
>>>> Well if it crashes with SIGSEGV it will generate it automatically.
>>>> (I see you are getting a bus error) - :(.
>>>>
>>>> -A
>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Andrew
>>>>>
>>>>> ----- Original Message -----
>>>>>
>>>>> From: "Angus Salkeld" <asalk...@redhat.com>
>>>>> To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
>>>>> Sent: Thursday, November 1, 2012 5:11:23 PM
>>>>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in 
>>>>> cluster
>>>>>
>>>>> On 01/11/12 14:32 -0500, Andrew Martin wrote:
>>>>>> Hi Honza,
>>>>>>
>>>>>>
>>>>>> Thanks for the help. I enabled core dumps in /etc/security/limits.conf
>>>>>> but didn't have a chance to reboot and apply the changes, so I don't have
>>>>>> a core dump this time. Do core dumps need to be enabled for the
>>>>>> fdata-DATETIME-PID file to be generated? Right now all that is in
>>>>>> /var/lib/corosync is the ringid_XXX files. Do I need to set something
>>>>>> explicitly in the corosync config to enable this logging?
>>>>>>
>>>>>>
>>>>>> I did find something else interesting with libqb this time. I
>>>>>> compiled libqb 0.14.2 for use with the cluster. This time when corosync
>>>>>> died I noticed the following in dmesg:
>>>>>> Nov 1 13:21:01 storage1 kernel: [31036.617236] corosync[13305] trap 
>>>>>> divide error ip:7f657a52e517 sp:7fffd5068858 error:0 in 
>>>>>> libqb.so.0.14.2[7f657a525000+1f000]
>>>>>> This error was only present for one of the many other times corosync has 
>>>>>> died.
>>>>>>
>>>>>>
>>>>>> I see that there is a newer version of libqb (0.14.3) out, but didn't
>>>>>> see a fix for this particular bug. Could this libqb problem be causing
>>>>>> corosync to hang up? Here's the corresponding corosync log file
>>>>>> (next time I should have a core dump as well):
>>>>>> http://pastebin.com/5FLKg7We
>>>>>
>>>>> Hi Andrew
>>>>>
>>>>> I can't see much wrong with the log either. If you could run with the 
>>>>> latest
>>>>> (libqb-0.14.3) and post a backtrace if it still happens, that would be 
>>>>> great.
>>>>>
>>>>> Thanks
>>>>> Angus
>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>>
>>>>>> Andrew
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>
>>>>>> From: "Jan Friesse" <jfrie...@redhat.com>
>>>>>> To: "Andrew Martin" <amar...@xes-inc.com>
>>>>>> Cc: disc...@corosync.org, "The Pacemaker cluster resource manager" 
>>>>>> <pacemaker@oss.clusterlabs.org>
>>>>>> Sent: Thursday, November 1, 2012 7:55:52 AM
>>>>>> Subject: Re: [corosync] Corosync 2.1.0 dies on both nodes in cluster
>>>>>>
>>>>>> Andrew,
>>>>>> I was not able to find anything interesting (from the corosync point of
>>>>>> view) in the corosync-related configuration/logs.
>>>>>>
>>>>>> What would be helpful:
>>>>>> - if corosync died, there should be a
>>>>>> /var/lib/corosync/fdata-DATETIME-PID file from the dead corosync. Can you
>>>>>> please xz them and store them somewhere (they are quite large but compress
>>>>>> well)?
>>>>>> - If you are able to reproduce the problem (which it seems you are), can
>>>>>> you please enable generation of coredumps and store a backtrace of the
>>>>>> coredump somewhere? (Coredumps are stored in /var/lib/corosync as core.PID;
>>>>>> the way to obtain a backtrace is gdb corosync /var/lib/corosync/core.PID,
>>>>>> then thread apply all bt.) If you are running a distribution with ABRT
>>>>>> support, you can also use ABRT to generate a report.
>>>>>>
>>>>>> Regards,
>>>>>> Honza
>>>>>>
>>>>>> Andrew Martin napsal(a):
>>>>>>> Corosync died an additional 3 times during the night on storage1. I
>>>>>>> wrote a daemon to attempt to restart it as soon as it fails, so only one
>>>>>>> of those times resulted in a STONITH of storage1.
>>>>>>>
>>>>>>> I enabled debug in the corosync config, so I was able to capture a 
>>>>>>> period when corosync died with debug output:
>>>>>>> http://pastebin.com/eAmJSmsQ
>>>>>>> In this example, Pacemaker finishes shutting down by Nov 01 05:53:02. 
>>>>>>> For reference, here is my Pacemaker configuration:
>>>>>>> http://pastebin.com/DFL3hNvz
>>>>>>>
>>>>>>> It seems that an extra node, 16777343 "localhost", has been added to the
>>>>>>> cluster after storage1 was STONITHed (it must be the localhost interface
>>>>>>> on storage1). Is there any way to prevent this?
>>>>>>>
>>>>>>> Does this help to determine why corosync is dying, and what I can do to 
>>>>>>> fix it?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Andrew
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>
>>>>>>> From: "Andrew Martin" <amar...@xes-inc.com>
>>>>>>> To: disc...@corosync.org
>>>>>>> Sent: Thursday, November 1, 2012 12:11:35 AM
>>>>>>> Subject: [corosync] Corosync 2.1.0 dies on both nodes in cluster
>>>>>>>
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I recently configured a 3-node fileserver cluster by building Corosync 
>>>>>>> 2.1.0 and Pacemaker 1.1.8 from source. All of the nodes are running 
>>>>>>> Ubuntu 12.04 amd64. Two of the nodes (storage0 and storage1) are "real" 
>>>>>>> nodes where the resources run (a DRBD disk, filesystem mount, and 
>>>>>>> samba/nfs daemons), while the third node (storagequorum) is in standby 
>>>>>>> mode and acts as a quorum node for the cluster. Today I discovered that 
>>>>>>> corosync died on both storage0 and storage1 at the same time. Since 
>>>>>>> corosync died, pacemaker shut down as well on both nodes. Because the 
>>>>>>> cluster no longer had quorum (and the no-quorum-policy="freeze"), 
>>>>>>> storagequorum was unable to STONITH either node and just left the 
>>>>>>> resources frozen where they were running, on storage0. I cannot find 
>>>>>>> any log information to determine why corosync crashed, and this is a 
>>>>>>> disturbing problem as the cluster and its messaging layer must be 
>>>>>>> stable. Below is my corosync configuration file as well as the corosync 
>>>>>>> log file from each node during this period.
>>>>>>>
>>>>>>> corosync.conf:
>>>>>>> http://pastebin.com/vWQDVmg8
>>>>>>> Note that I have two redundant rings. On one of them, I specify the IP 
>>>>>>> address (in this example 10.10.10.7) so that it binds to the correct 
>>>>>>> interface (since potentially in the future those machines may have two 
>>>>>>> interfaces on the same subnet).
>>>>>>>
>>>>>>> corosync.log from storage0:
>>>>>>> http://pastebin.com/HK8KYDDQ
>>>>>>>
>>>>>>> corosync.log from storage1:
>>>>>>> http://pastebin.com/sDWkcPUz
>>>>>>>
>>>>>>> corosync.log from storagequorum (the DC during this period):
>>>>>>> http://pastebin.com/uENQ5fnf
>>>>>>>
>>>>>>> Issuing service corosync start && service pacemaker start on storage0 
>>>>>>> and storage1 resolved the problem and allowed the nodes to successfully 
>>>>>>> reconnect to the cluster. What other information can I provide to help 
>>>>>>> diagnose this problem and prevent it from recurring?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Andrew Martin
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
