Honza and Angus,
Glad to hear about this possible breakthrough! Here's the output of df:

root@storage1:~# df
Filesystem               1K-blocks      Used Available Use% Mounted on
/dev/mapper/vg00-lv_root 228424996   3376236 213445408   2% /
udev                       3041428         4   3041424   1% /dev
tmpfs                      1220808       340   1220468   1% /run
none                          5120         8      5112   1% /run/lock
none                       3052016    160652   2891364   6% /run/shm
/dev/sda1                   112039     88040     18214  83% /boot
root@storage1:~# ls -la /dev/shm
lrwxrwxrwx 1 root root 8 Nov  6 08:11 /dev/shm -> /run/shm

root@storage0:~# df
Filesystem               1K-blocks      Used Available Use% Mounted on
/dev/mapper/vg00-lv_root 228424996 140301080  76520564  65% /
udev                       3041264         4   3041260   1% /dev
tmpfs                      1220808       356   1220452   1% /run
none                          5120         4      5116   1% /run/lock
none                       3052012     37868   3014144   2% /run/shm
/dev/sda1                   112039     88973     17281  84% /boot
root@storage0:~# ls -la /dev/shm
lrwxrwxrwx 1 root root 8 Nov  7 21:07 /dev/shm -> /run/shm

root@storagequorum:~# df
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/sda1       77012644 4014620  69140924   6% /
udev              467564       4    467560   1% /dev
tmpfs             190548     384    190164   1% /run
none                5120       0      5120   0% /run/lock
none              476368   53260    423108  12% /run/shm
root@storagequorum:~# ls -la /dev/shm
lrwxrwxrwx 1 root root 8 Sep 12 12:42 /dev/shm -> /run/shm

It isn't full now, but corosync has been dead on storage1 for several hours. I am running it in the foreground again this morning to try to reproduce a higher used value for /run/shm. I will also compile corosync from git to evaluate the IPC possibility.
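In case it's useful, here is roughly what I plan to run. The watch interval and the autogen.sh/configure/make steps are just my assumption of the usual build procedure for the git tree; the --enable-debug flag is what I used before, and the qb-* cleanup and the qb section come from your instructions:

# watch shared-memory usage while corosync runs in the foreground
watch -n 60 'df /run/shm; ls -la /run/shm'

# clear any leftover segments before starting (your workaround)
rm -f /dev/shm/qb-*

# build the patched corosync from git (needle branch)
./autogen.sh
./configure --enable-debug
make && make install

# addition to corosync.conf
qb {
    ipc_type: socket
}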
Thanks,

Andrew

----- Original Message -----
From: "Jan Friesse" <jfrie...@redhat.com>
To: "Andrew Martin" <amar...@xes-inc.com>
Cc: "Angus Salkeld" <asalk...@redhat.com>, disc...@corosync.org, pacemaker@oss.clusterlabs.org
Sent: Thursday, November 8, 2012 7:39:45 AM
Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster

Andrew,
good news. I believe I've found a reproducer for the problem you are facing. Now, to be sure it's really the same one, can you please run df (the interesting entry is /dev/shm) and send the output of ls -la /dev/shm? I believe /dev/shm is full.

Now, as a quick workaround, just delete all qb-* files from /dev/shm and the cluster should work.

There are basically two problems:
- ipc_shm is leaking memory
- if there is no memory, libqb mmaps non-allocated memory and receives SIGBUS

Angus is working on both issues.

Regards,
  Honza

Jan Friesse napsal(a):
> Andrew,
> thanks for the valgrind report (even though it didn't show anything useful) and the blackbox.
>
> We believe the problem is an access to invalid memory mapped by an mmap operation. There are basically 3 places where we do mmap:
> 1.) corosync cpg_zcb functions (I don't believe this is the case)
> 2.) LibQB IPC
> 3.) LibQB blackbox
>
> Now, because neither I nor Angus is able to reproduce the bug, can you please:
> - apply the patches "Check successful initialization of IPC" and "Add support for selecting IPC type" (later versions), or use corosync from git (either the needle or master branch, they are the same)
> - compile corosync
> - Add
>
> qb {
>     ipc_type: socket
> }
>
> to corosync.conf
> - Try running corosync
>
> This may or may not solve the problem, but it should help us diagnose whether or not the problem is an IPC one.
>
> Thanks,
>   Honza
>
> Andrew Martin napsal(a):
>> Angus and Honza,
>>
>> I recompiled corosync with --enable-debug. Below is a capture of the valgrind output when corosync dies, after switching rrp_mode to passive:
>>
>> # valgrind corosync -f
>> ==5453== Memcheck, a memory error detector
>> ==5453== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
>> ==5453== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
>> ==5453== Command: corosync -f
>> ==5453==
>> notice [MAIN ] Corosync Cluster Engine ('2.1.0'): started and ready to provide service.
>> info [MAIN ] Corosync built-in features: debug pie relro bindnow
>> ==5453== Syscall param socketcall.sendmsg(msg) points to uninitialised byte(s)
>> ==5453==    at 0x54D233D: ??? (syscall-template.S:82)
>> ==5453==    by 0x4E391E8: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==    by 0x4E3BFC8: totemudp_token_send (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==    by 0x4E38CF0: totemnet_token_send (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==    by 0x4E3F1AF: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==    by 0x4E40FB5: totemrrp_token_send (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==    by 0x4E47E84: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==    by 0x4E45770: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==    by 0x4E40AD2: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==    by 0x4E3C1A4: totemudp_token_target_set (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==    by 0x4E38EBC: totemnet_token_target_set (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==    by 0x4E3F3A8: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==  Address 0x7feff7f58 is on thread 1's stack
>> ==5453==
>> ==5453== Syscall param socketcall.sendmsg(msg.msg_iov[i]) points to uninitialised byte(s)
>> ==5453==    at 0x54D233D: ??? (syscall-template.S:82)
>> ==5453==    by 0x4E39427: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==    by 0x4E3C042: totemudp_mcast_noflush_send (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==    by 0x4E38D8A: totemnet_mcast_noflush_send (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==    by 0x4E3F03D: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==    by 0x4E4104D: totemrrp_mcast_noflush_send (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==    by 0x4E46CB8: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==    by 0x4E49A04: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==    by 0x4E4C8E0: main_deliver_fn (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==    by 0x4E3F0A6: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==    by 0x4E409A8: rrp_deliver_fn (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==    by 0x4E39967: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==  Address 0x7feffb9da is on thread 1's stack
>> ==5453==
>> ==5453== Syscall param socketcall.sendmsg(msg.msg_iov[i]) points to uninitialised byte(s)
>> ==5453==    at 0x54D233D: ??? (syscall-template.S:82)
>> ==5453==    by 0x4E39526: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==    by 0x4E3C042: totemudp_mcast_noflush_send (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==    by 0x4E38D8A: totemnet_mcast_noflush_send (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==    by 0x4E3F03D: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==    by 0x4E4104D: totemrrp_mcast_noflush_send (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==    by 0x4E46CB8: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==    by 0x4E49A04: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==    by 0x4E4C8E0: main_deliver_fn (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==    by 0x4E3F0A6: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==    by 0x4E409A8: rrp_deliver_fn (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==    by 0x4E39967: ??? (in /usr/lib/libtotem_pg.so.5.0.0)
>> ==5453==  Address 0x7feffb9da is on thread 1's stack
>> ==5453==
>> Ringbuffer:
>>  ->OVERWRITE
>>  ->write_pt [0]
>>  ->read_pt [0]
>>  ->size [2097152 words]
>>  =>free [8388608 bytes]
>>  =>used [0 bytes]
>> ==5453==
>> ==5453== HEAP SUMMARY:
>> ==5453==     in use at exit: 13,175,149 bytes in 1,648 blocks
>> ==5453==   total heap usage: 70,091 allocs, 68,443 frees, 67,724,863 bytes allocated
>> ==5453==
>> ==5453== LEAK SUMMARY:
>> ==5453==    definitely lost: 0 bytes in 0 blocks
>> ==5453==    indirectly lost: 0 bytes in 0 blocks
>> ==5453==      possibly lost: 2,100,062 bytes in 35 blocks
>> ==5453==    still reachable: 11,075,087 bytes in 1,613 blocks
>> ==5453==         suppressed: 0 bytes in 0 blocks
>> ==5453== Rerun with --leak-check=full to see details of leaked memory
>> ==5453==
>> ==5453== For counts of detected and suppressed errors, rerun with: -v
>> ==5453== Use --track-origins=yes to see where uninitialised values come from
>> ==5453== ERROR SUMMARY: 715 errors from 3 contexts (suppressed: 2 from 2)
>> Bus error (core dumped)
>>
>> I was also able to capture non-truncated fdata:
>> http://sources.xes-inc.com/downloads/fdata-20121107
>>
>> Here is the coredump:
>> http://sources.xes-inc.com/downloads/vgcore.5453
>>
>> I was not able to get corosync to crash without pacemaker also running, though I was not able to test for a long period of time.
>>
>> Another thing I discovered tonight was that the 127.0.1.1 entry in /etc/hosts (on both storage0 and storage1) was the source of the extra "localhost" entry in the cluster. I have removed this extraneous node, so now only the 3 real nodes remain, and commented out this line in /etc/hosts on all nodes in the cluster.
>> http://burning-midnight.blogspot.com/2012/07/cluster-building-ubuntu-1204-revised.html
>>
>> Thanks,
>>
>> Andrew
>>
>> ----- Original Message -----
>> From: "Jan Friesse" <jfrie...@redhat.com>
>> To: "Andrew Martin" <amar...@xes-inc.com>
>> Cc: "Angus Salkeld" <asalk...@redhat.com>, disc...@corosync.org, pacemaker@oss.clusterlabs.org
>> Sent: Wednesday, November 7, 2012 2:00:20 AM
>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster
>>
>> Andrew,
>>
>> Andrew Martin napsal(a):
>>> A bit more data on this problem: I was doing some maintenance and had to briefly disconnect storagequorum's connection to the STONITH network (ethernet cable #7 in this diagram):
>>> http://sources.xes-inc.com/downloads/storagecluster.png
>>>
>>> Since corosync has two rings (and is in active mode), this should cause no disruption to the cluster. However, as soon as I disconnected cable #7, corosync on storage0 died (corosync was already stopped on storage1), which caused pacemaker on storage0 to also shut down. I was not able to obtain a coredump this time as apport is still running on storage0.
>>
>> I strongly believe the corosync fault is because of the original problem you have. I would also recommend you try passive mode. Passive mode is better because if one link fails, passive mode still makes progress (delivers messages), whereas active mode doesn't (up to the moment when the ring is marked as failed; after that, passive and active behave the same). Passive mode is also much better tested.
>>
>>> What else can I do to debug this problem? Or, should I just try to downgrade to corosync 1.4.2 (the version available in the Ubuntu repositories)?
>>
>> I would really like to find the main issue (which looks like a libqb one rather than a corosync one). But if you decide to downgrade, please downgrade to the latest 1.4.x series (1.4.4 for now). 1.4.2 has A LOT of known bugs.
>>
>>> Thanks,
>>>
>>> Andrew
>>
>> Regards,
>>   Honza
>>
>>> ----- Original Message -----
>>> From: "Andrew Martin" <amar...@xes-inc.com>
>>> To: "Angus Salkeld" <asalk...@redhat.com>
>>> Cc: disc...@corosync.org, pacemaker@oss.clusterlabs.org
>>> Sent: Tuesday, November 6, 2012 2:01:17 PM
>>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster
>>>
>>> Hi Angus,
>>>
>>> I recompiled corosync with the changes you suggested in exec/main.c to generate fdata when SIGBUS is triggered. Here's the corresponding coredump and fdata files:
>>> http://sources.xes-inc.com/downloads/core.13027
>>> http://sources.xes-inc.com/downloads/fdata.20121106
>>>
>>> (gdb) thread apply all bt
>>>
>>> Thread 1 (Thread 0x7ffff7fec700 (LWP 13027)):
>>> #0  0x00007ffff775bda3 in qb_rb_chunk_alloc () from /usr/lib/libqb.so.0
>>> #1  0x00007ffff77656b9 in ?? () from /usr/lib/libqb.so.0
>>> #2  0x00007ffff77637ba in qb_log_real_va_ () from /usr/lib/libqb.so.0
>>> #3  0x0000555555571700 in ?? ()
>>> #4  0x00007ffff7bc7df6 in ?? () from /usr/lib/libtotem_pg.so.5
>>> #5  0x00007ffff7bc1a6d in rrp_deliver_fn () from /usr/lib/libtotem_pg.so.5
>>> #6  0x00007ffff7bbc8e2 in ?? () from /usr/lib/libtotem_pg.so.5
>>> #7  0x00007ffff775d46f in ?? () from /usr/lib/libqb.so.0
>>> #8  0x00007ffff775cfe7 in qb_loop_run () from /usr/lib/libqb.so.0
>>> #9  0x0000555555560945 in main ()
>>>
>>> I've also been doing some hardware tests to rule it out as the cause of this problem: mcelog has found no problems, and memtest finds the memory to be healthy as well.
>>>
>>> Thanks,
>>>
>>> Andrew
>>>
>>> ----- Original Message -----
>>> From: "Angus Salkeld" <asalk...@redhat.com>
>>> To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
>>> Sent: Friday, November 2, 2012 8:18:51 PM
>>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster
>>>
>>> On 02/11/12 13:07 -0500, Andrew Martin wrote:
>>>> Hi Angus,
>>>>
>>>> Corosync died again while using libqb 0.14.3. Here is the coredump from today:
>>>> http://sources.xes-inc.com/downloads/corosync.nov2.coredump
>>>>
>>>> # corosync -f
>>>> notice [MAIN ] Corosync Cluster Engine ('2.1.0'): started and ready to provide service.
>>>> info [MAIN ] Corosync built-in features: pie relro bindnow
>>>> Bus error (core dumped)
>>>>
>>>> Here's the log: http://pastebin.com/bUfiB3T3
>>>>
>>>> Did your analysis of the core dump reveal anything?
>>>>
>>>
>>> I can't get any symbols out of these coredumps. Can you try to get a backtrace?
>>>
>>>> Is there a way for me to make it generate fdata with a bus error, or how else can I gather additional information to help debug this?
>>>>
>>>
>>> If you look in exec/main.c and look for SIGSEGV you will see how the mechanism for fdata works. Just add a handler for SIGBUS and hook it up. Then you should be able to get the fdata for both.
>>>
>>> I'd rather be able to get a backtrace if possible.
>>>
>>> -Angus
>>>
>>>> Thanks,
>>>>
>>>> Andrew
>>>>
>>>> ----- Original Message -----
>>>> From: "Angus Salkeld" <asalk...@redhat.com>
>>>> To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
>>>> Sent: Thursday, November 1, 2012 5:47:16 PM
>>>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster
>>>>
>>>> On 01/11/12 17:27 -0500, Andrew Martin wrote:
>>>>> Hi Angus,
>>>>>
>>>>> I'll try upgrading to the latest libqb tomorrow and see if I can reproduce this behavior with it. I was able to get a coredump by running corosync manually in the foreground (corosync -f):
>>>>> http://sources.xes-inc.com/downloads/corosync.coredump
>>>>
>>>> Thanks, looking...
>>>>
>>>>> There still isn't anything added to /var/lib/corosync, however. What do I need to do to enable the fdata file to be created?
>>>>
>>>> Well, if it crashes with SIGSEGV it will generate it automatically.
>>>> (I see you are getting a bus error) - :(.
>>>>
>>>> -A
>>>>
>>>>> Thanks,
>>>>>
>>>>> Andrew
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "Angus Salkeld" <asalk...@redhat.com>
>>>>> To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
>>>>> Sent: Thursday, November 1, 2012 5:11:23 PM
>>>>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster
>>>>>
>>>>> On 01/11/12 14:32 -0500, Andrew Martin wrote:
>>>>>> Hi Honza,
>>>>>>
>>>>>> Thanks for the help. I enabled core dumps in /etc/security/limits.conf but didn't have a chance to reboot and apply the changes, so I don't have a core dump this time. Do core dumps need to be enabled for the fdata-DATETIME-PID file to be generated? Right now all that is in /var/lib/corosync are the ringid_XXX files. Do I need to set something explicitly in the corosync config to enable this logging?
>>>>>>
>>>>>> I did find something else interesting with libqb this time. I compiled libqb 0.14.2 for use with the cluster. This time when corosync died I noticed the following in dmesg:
>>>>>> Nov 1 13:21:01 storage1 kernel: [31036.617236] corosync[13305] trap divide error ip:7f657a52e517 sp:7fffd5068858 error:0 in libqb.so.0.14.2[7f657a525000+1f000]
>>>>>> This error was only present for one of the many other times corosync has died.
>>>>>>
>>>>>> I see that there is a newer version of libqb (0.14.3) out, but didn't see a fix for this particular bug. Could this libqb problem be related to corosync hanging up? Here's the corresponding corosync log file (next time I should have a core dump as well):
>>>>>> http://pastebin.com/5FLKg7We
>>>>>
>>>>> Hi Andrew
>>>>>
>>>>> I can't see much wrong with the log either. If you could run with the latest (libqb-0.14.3) and post a backtrace if it still happens, that would be great.
>>>>>
>>>>> Thanks
>>>>> Angus
>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Andrew
>>>>>>
>>>>>> ----- Original Message -----
>>>>>> From: "Jan Friesse" <jfrie...@redhat.com>
>>>>>> To: "Andrew Martin" <amar...@xes-inc.com>
>>>>>> Cc: disc...@corosync.org, "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
>>>>>> Sent: Thursday, November 1, 2012 7:55:52 AM
>>>>>> Subject: Re: [corosync] Corosync 2.1.0 dies on both nodes in cluster
>>>>>>
>>>>>> Andrew,
>>>>>> I was not able to find anything interesting (from the corosync point of view) in the configuration/logs (corosync related).
>>>>>>
>>>>>> What would be helpful:
>>>>>> - If corosync died, there should be a /var/lib/corosync/fdata-DATETIME-PID file from the dead corosync. Can you please xz them and store them somewhere (they are quite large but compress well)?
>>>>>> - If you are able to reproduce the problem (which it seems like you are), can you please enable generation of coredumps and store the backtrace of the coredump somewhere? (Coredumps are stored in /var/lib/corosync as core.PID; the way to examine a coredump is gdb corosync /var/lib/corosync/core.pid, and there thread apply all bt.) If you are running a distribution with ABRT support, you can also use ABRT to generate a report.
>>>>>>
>>>>>> Regards,
>>>>>>   Honza
>>>>>>
>>>>>> Andrew Martin napsal(a):
>>>>>>> Corosync died an additional 3 times during the night on storage1. I wrote a daemon to attempt to restart it as soon as it fails, so only one of those times resulted in a STONITH of storage1.
>>>>>>>
>>>>>>> I enabled debug in the corosync config, so I was able to capture a period when corosync died with debug output:
>>>>>>> http://pastebin.com/eAmJSmsQ
>>>>>>> In this example, Pacemaker finishes shutting down by Nov 01 05:53:02. For reference, here is my Pacemaker configuration:
>>>>>>> http://pastebin.com/DFL3hNvz
>>>>>>>
>>>>>>> It seems that an extra node, 16777343 "localhost", has been added to the cluster after storage1 was STONITHed (it must be the localhost interface on storage1). Is there any way to prevent this?
>>>>>>>
>>>>>>> Does this help to determine why corosync is dying, and what I can do to fix it?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Andrew
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>> From: "Andrew Martin" <amar...@xes-inc.com>
>>>>>>> To: disc...@corosync.org
>>>>>>> Sent: Thursday, November 1, 2012 12:11:35 AM
>>>>>>> Subject: [corosync] Corosync 2.1.0 dies on both nodes in cluster
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I recently configured a 3-node fileserver cluster by building Corosync 2.1.0 and Pacemaker 1.1.8 from source. All of the nodes are running Ubuntu 12.04 amd64. Two of the nodes (storage0 and storage1) are "real" nodes where the resources run (a DRBD disk, filesystem mount, and samba/nfs daemons), while the third node (storagequorum) is in standby mode and acts as a quorum node for the cluster. Today I discovered that corosync died on both storage0 and storage1 at the same time. Since corosync died, pacemaker shut down as well on both nodes. Because the cluster no longer had quorum (and the no-quorum-policy="freeze"), storagequorum was unable to STONITH either node and just left the resources frozen where they were running, on storage0.
>>>>>>> I cannot find any log information to determine why corosync crashed, and this is a disturbing problem, as the cluster and its messaging layer must be stable. Below is my corosync configuration file as well as the corosync log file from each node during this period.
>>>>>>>
>>>>>>> corosync.conf:
>>>>>>> http://pastebin.com/vWQDVmg8
>>>>>>> Note that I have two redundant rings. On one of them, I specify the IP address (in this example 10.10.10.7) so that it binds to the correct interface (since potentially in the future those machines may have two interfaces on the same subnet).
>>>>>>>
>>>>>>> corosync.log from storage0:
>>>>>>> http://pastebin.com/HK8KYDDQ
>>>>>>>
>>>>>>> corosync.log from storage1:
>>>>>>> http://pastebin.com/sDWkcPUz
>>>>>>>
>>>>>>> corosync.log from storagequorum (the DC during this period):
>>>>>>> http://pastebin.com/uENQ5fnf
>>>>>>>
>>>>>>> Issuing service corosync start && service pacemaker start on storage0 and storage1 resolved the problem and allowed the nodes to successfully reconnect to the cluster. What other information can I provide to help diagnose this problem and prevent it from recurring?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Andrew Martin
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> discuss mailing list
>>>>>>> disc...@corosync.org
>>>>>>> http://lists.corosync.org/mailman/listinfo/discuss
>>>>>>
>>>>>> _______________________________________________
>>>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>
>>>>>> Project Home: http://www.clusterlabs.org
>>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>> Bugs: http://bugs.clusterlabs.org
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org