On Thu, Oct 21, 2010 at 6:46 PM, Stephan-Frank Henry <frank.he...@gmx.net> wrote:
>> Andrew Beekhof
>> Mon, 13 Sep 2010 06:25:48 -0700
>>
>> Looks like corosync can't talk to itself - ie. it never sees the
>> multicast messages it sends out.
>> This would result in the pacemaker errors you're seeing.
>>
>> Almost always this is a firewall issue :-)
>> Perhaps try disabling it completely?
>
> Sorry for the late reply. I have been seeing a lot of strange stuff and I
> wanted them confirmed before wasting more of your time. I also tried some
> fixes I had found on the net and naturally had other stuff to do.
>
> Even though I do not always reply immediately or at all, I am always very
> grateful for everyone's help.
>
> I also upgraded all our libs to the latest version, including those from
> the Lenny backports.
> I had initially thought it was a problem with our kernel but that did not pan
> out.
> But I'll just stick with the newer versions for now.
>
> The strangest thing is, the behavior is completely random.
>
> Sometimes it just works.
> Sometimes it just dies after the start (could be the race condition mentioned
> below)
> Sometimes I get these:
> crmd: [3086]: WARN: lrm_signon: can not initiate connection
> crmd: [3086]: WARN: do_lrm_control: Failed to sign on to the LRM 2 (30 max) times
> crmd: [3086]: info: ais_dispatch: Membership 92: quorum still lost
> Sometimes I get these:
> crmd: [3067]: info: crm_timer_popped: Wait Timer (I_NULL) just popped!
> crmd: [3067]: info: do_cib_control: Could not connect to the CIB service: connection failed
> crmd: [3067]: WARN: do_cib_control: Couldn't complete CIB registration 2 times... pause and retry

Warnings are just that: warnings. In this case a couple of processes are
taking a little while to start.

> Or these:
> crmd: [2649]: notice: Not currently connected.
> crmd: [2649]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry
> crmd: [2649]: info: te_connect_stonith: Attempting connection to fencing daemon...
> crmd: [2649]: ERROR: stonithd_signon: Can't initiate connection to stonithd

What does "ps axf" show?
I'm guessing you'll see one or more zombie processes where stonithd is
supposed to be, which would mean you're hitting this:
http://theclusterguy.clusterlabs.org/post/907043024/introducing-the-pacemaker-master-control-process-for

Grab Pacemaker 1.1.4 and use option 2.

> And I get these all the time.
> (all above were actually taken from one machine and from subsequent reboots,
> but applies to all)
>
> Strangest of all, if I stop (and kill) corosync and restart it via init.d
> manually, it works fine.
> Even without any of the changes mentioned below.
>
> I have tried a lot of (crazy) stuff:
> * different network setups
> * different hardware
> * created resolv.conf
> * inet config
> * ntp corrections
> * disabled firehol (chmod -x on the script)
> * disabled bind9 (same)
> * disabled drbd from the runlevels (not the script as those above)
> * change runlevels as per this post (moved mine to rcS.d/S98corosync):
>   http://oss.clusterlabs.org/pipermail/pacemaker/2010-February/005010.html
> * upgrading corosync to 1.2.1-2 due to this race condition bug and
>   subsequent fix:
>   http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=596694
> * removed stuff from the root tag in the cib.xml I thought might be a problem
>   I have a template cib.xml that I use and I removed everything that
>   crm_verify did not complain about.
> * sacrificed coffee to the IT gods.
>
> All on different machines but all with nearly the same outcome. Changes were
> not all done on the same machines to confirm a fix had I ever found it.
> I also have a proof-of-concept machine with just the net-install+latest
> kernel from backports and the HA setup and it is showing similar behavior.
>
> Summary:
> Debian Lenny 64bit
> linux-image-2.6.33.3
>
> Packages:
> (default)
> corosync          1.2.1-1~bpo50+1
> libcorosync4      1.2.1-1~bpo50+1
> (updated)
> corosync          1.2.1-2
> libcorosync4      1.2.1-2
>
> cluster-glue      1.0.6-1~bpo50+1
> libcluster-glue   1.0.6-1~bpo50+1
> pacemaker         1.0.9.1+hg15626-1~bpo50+1
> libheartbeat2     1:3.0.3-2~bpo50+1
> drbd8-utils       2:8.3.7-1~bpo50+1
>
> Anything else you'd need?
>
> thanks again.
>
> Frank
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
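
For anyone repeating the "ps axf" check on several nodes or across reboots, it can be scripted. The following is only a rough sketch, not part of Pacemaker or corosync: the function names `find_zombies` and `list_zombies` are made up for illustration, and it assumes a procps-style `ps` that accepts `-eo pid,stat,comm` and marks zombie processes with a state beginning with `Z`.

```python
import subprocess


def find_zombies(ps_output):
    """Return (pid, command) pairs for processes whose state starts with 'Z'.

    Expects the text output of `ps -eo pid,stat,comm`, header line included.
    """
    zombies = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)  # pid, state, rest of the line
        if len(fields) < 3:
            continue  # ignore blank or truncated lines
        pid, stat, command = fields
        if stat.startswith("Z"):
            zombies.append((int(pid), command.strip()))
    return zombies


def list_zombies():
    """Run ps (assumed: procps on Linux) and return any zombie processes."""
    result = subprocess.run(
        ["ps", "-eo", "pid,stat,comm"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return find_zombies(result.stdout)


if __name__ == "__main__":
    for pid, command in list_zombies():
        print("zombie: pid=%d command=%s" % (pid, command))
```

If stonithd (or another Pacemaker child) shows up in that list right after a failed start, that would be consistent with the zombie-process guess above; an empty list points elsewhere.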