Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

2012-11-08 Thread Andrew Martin
Honza and Angus, Glad to hear about this possible breakthrough! Here's the output of df: root@storage1:~# df Filesystem 1K-blocks Used Available Use% Mounted on /dev/mapper/vg00-lv_root 228424996 3376236 213445408 2% / udev 3041428 4 3041424 1% /dev tmpfs 1220808 340 1220468 1% /run none 5120 8

Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

2012-11-08 Thread Jan Friesse
Andrew, good news. I believe that I've found a reproducer for the problem you are facing. Now, to be sure it's really the same, can you please run df (the interesting part is /dev/shm) and send the output of ls -la /dev/shm? I believe /dev/shm is full. Now, as a quick workaround, just delete all qb-* from /dev/shm an
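
The check requested in this message (how full /dev/shm is, and which qb-* files it holds) could be sketched in Python roughly as follows. This is a minimal, read-only sketch: the function name shm_report and its parameters are hypothetical and not part of corosync or libqb, and it deliberately deletes nothing.

```python
import glob
import os
import shutil


def shm_report(path="/dev/shm", pattern="qb-*"):
    """Report usage of the filesystem holding `path` and the files matching `pattern`.

    `shm_report` is a hypothetical helper for illustration only.
    """
    usage = shutil.disk_usage(path)
    files = sorted(glob.glob(os.path.join(path, pattern)))
    return {
        "total": usage.total,
        "free": usage.free,
        # Guard against a zero-sized filesystem to avoid division by zero.
        "percent_used": 100.0 * usage.used / usage.total if usage.total else 0.0,
        # (file name, size in bytes) for every match; nothing is removed.
        "files": [(name, os.lstat(name).st_size) for name in files],
    }


if __name__ == "__main__":
    if os.path.isdir("/dev/shm"):
        report = shm_report()
        print(f"/dev/shm: {report['percent_used']:.1f}% used, "
              f"{report['free']} bytes free")
        for name, size in report["files"]:
            print(f"{size:>12}  {name}")
```

Run as root on the affected node, this prints roughly the same information as df plus ls -la /dev/shm, restricted to the qb-* files. Whether it is safe to remove them depends on whether corosync and its clients are still running, so that step is left to the operator.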

Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

2012-11-08 Thread Jan Friesse
Andrew, thanks for the valgrind report (even though it didn't show anything useful) and blackbox. We believe that the problem is caused by access to invalid memory mapped by an mmap operation. There are basically 3 places where we are doing mmap. 1.) corosync cpg_zcb functions (I don't believe this is the case)

Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

2012-11-07 Thread Andrew Martin
Angus and Honza, I recompiled corosync with --enable-debug. Below is a capture of the valgrind output when corosync dies, after switching rrp_mode to passive: # valgrind corosync -f ==5453== Memcheck, a memory error detector ==5453== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et a

Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

2012-11-07 Thread Jan Friesse
Andrew, Andrew Martin wrote: > A bit more data on this problem: I was doing some maintenance and had to > briefly disconnect storagequorum's connection to the STONITH network > (ethernet cable #7 in this diagram): > http://sources.xes-inc.com/downloads/storagecluster.png > > > Since coro

Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

2012-11-07 Thread Jan Friesse
Andrew, Andrew Martin wrote: > Hi Angus, > > > I recompiled corosync with the changes you suggested in exec/main.c to > generate fdata when SIGBUS is triggered. Here's the corresponding coredump > and fdata files: > http://sources.xes-inc.com/downloads/core.13027 > http://sources.xes-i

Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

2012-11-06 Thread Angus Salkeld
On 06/11/12 17:47 -0600, Andrew Martin wrote: A bit more data on this problem: I was doing some maintenance and had to briefly disconnect storagequorum's connection to the STONITH network (ethernet cable #7 in this diagram): http://sources.xes-inc.com/downloads/storagecluster.png Since corosy

Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

2012-11-06 Thread Andrew Martin
A bit more data on this problem: I was doing some maintenance and had to briefly disconnect storagequorum's connection to the STONITH network (ethernet cable #7 in this diagram): http://sources.xes-inc.com/downloads/storagecluster.png Since corosync has two rings (and is in active mode), this s

Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

2012-11-06 Thread Andrew Martin
Hi Angus, I recompiled corosync with the changes you suggested in exec/main.c to generate fdata when SIGBUS is triggered. Here's the corresponding coredump and fdata files: http://sources.xes-inc.com/downloads/core.13027 http://sources.xes-inc.com/downloads/fdata.20121106 (gdb) thread apply

Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

2012-11-05 Thread Andrew Martin
jfrie...@redhat.com > To: pacemaker@oss.clusterlabs.org, disc...@corosync.org Sent: Monday, November 5, 2012 2:21:09 AM Subject: Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster Angus Salkeld wrote: > On 02/11/12 13:07 -0500, Andrew Martin wrote: >>

Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

2012-11-05 Thread Jan Friesse
Angus Salkeld wrote: > On 02/11/12 13:07 -0500, Andrew Martin wrote: >> Hi Angus, >> >> >> Corosync died again while using libqb 0.14.3. Here is the coredump >> from today: >> http://sources.xes-inc.com/downloads/corosync.nov2.coredump >> >> >> >> # corosync -f >> notice [MAIN ] Corosync Cluste

Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

2012-11-02 Thread Angus Salkeld
On 02/11/12 13:07 -0500, Andrew Martin wrote: Hi Angus, Corosync died again while using libqb 0.14.3. Here is the coredump from today: http://sources.xes-inc.com/downloads/corosync.nov2.coredump # corosync -f notice [MAIN ] Corosync Cluster Engine ('2.1.0'): started and ready to provide ser

Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

2012-11-02 Thread Andrew Martin
Hi Angus, Corosync died again while using libqb 0.14.3. Here is the coredump from today: http://sources.xes-inc.com/downloads/corosync.nov2.coredump # corosync -f notice [MAIN ] Corosync Cluster Engine ('2.1.0'): started and ready to provide service. info [MAIN ] Corosync built-in features: p

Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

2012-11-01 Thread Angus Salkeld
On 01/11/12 17:27 -0500, Andrew Martin wrote: Hi Angus, I'll try upgrading to the latest libqb tomorrow and see if I can reproduce this behavior with it. I was able to get a coredump by running corosync manually in the foreground (corosync -f): http://sources.xes-inc.com/downloads/corosync.co

Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

2012-11-01 Thread Andrew Martin
Hi Angus, I'll try upgrading to the latest libqb tomorrow and see if I can reproduce this behavior with it. I was able to get a coredump by running corosync manually in the foreground (corosync -f): http://sources.xes-inc.com/downloads/corosync.coredump There still isn't anything added to /va

Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

2012-11-01 Thread Angus Salkeld
On 01/11/12 14:32 -0500, Andrew Martin wrote: Hi Honza, Thanks for the help. I enabled core dumps in /etc/security/limits.conf but didn't have a chance to reboot and apply the changes so I don't have a core dump this time. Do core dumps need to be enabled for the fdata-DATETIME-PID file to b

Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

2012-11-01 Thread Andrew Martin
Hi Honza, Thanks for the help. I enabled core dumps in /etc/security/limits.conf but didn't have a chance to reboot and apply the changes so I don't have a core dump this time. Do core dumps need to be enabled for the fdata-DATETIME-PID file to be generated? Right now all that is in /var/lib/c

Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

2012-11-01 Thread Jan Friesse
Andrew, I was not able to find anything interesting (from corosync point of view) in configuration/logs (corosync related). What would be helpful: - if corosync died, there should be /var/lib/corosync/fdata-DATETIME-PID of dead corosync. Can you please xz them and store somewhere (they are quite

Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

2012-11-01 Thread Andrew Martin
Corosync died an additional 3 times during the night on storage1. I wrote a daemon to attempt to start it as soon as it fails, so only one of those times resulted in a STONITH of storage1. I enabled debug in the corosync config, so I was able to capture a period when corosync died with debug o