On Mon, May 25, 2009 at 5:08 PM, Florian Haas <flor...@linbit.com> wrote:
> Hello everyone,
>
> I realize this is primarily an OpenAIS issue, but let's discuss it here
> anyway to share some thoughts.
>
> In Heartbeat-based clusters, we've always advised customers to use
> redundant network communication links. Given the fact that most of the
> clusters we build are DRBD based, we practically always have a second
> network link (the dedicated DRBD replication link) available for this
> purpose. In Heartbeat, when links get interrupted it's actually somewhat
> nontrivial to notice (which sucks), but links recover automatically when
> they are re-established (which is good).
>
> Now in OpenAIS, when we configure RRP and a link breaks, OpenAIS
> complains very loudly (which is good), but eventually the link settles
> in a faulty state from which it can only be re-enabled using
> "openais-cfgtool -r". Clearly this breaks the concept of a self-healing
> system.
>
> This discussion has been had before over on the openais list
> (http://www.mail-archive.com/open...@lists.linux-foundation.org/msg01205.html),
> but AFAICS it hasn't come to any reasonable conclusion. So my question
> is, what is the best practice for redundant network setups that should
> be included in the Pacemaker docs?

SUSE is currently recommending NIC bonding.
We've not been able to get satisfactory behavior from clusters using RRP.

> 1. Set rrp_problem_count_timeout and/or rrp_problem_count_threshold
> ridiculously high so the ring status never goes to faulty. (It seems
> that RRP "problem counting" can't be disabled altogether).
>
> 2. Have package maintainers include some magic that does
> "openais-cfgtool -r" every time a network link changes its status to UP
> (where the network management subsystem permits this).
>
> 3. Instruct users to install cron jobs that do "openais-cfgtool -r" in
> specified intervals, causing OpenAIS to re-check the link status
> periodically.

You could add it to the drbd monitor action I guess.
But it does seem sub-optimal.

I think the best solution is to work with upstream to get the feature
working properly.

>
> 4. Something else I haven't thought about.
>
> Thoughts? Comments?
>
> Cheers,
> Florian
>
> _______________________________________________
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
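
For illustration only, option 2 from the quoted list (re-run
"openais-cfgtool -r" whenever a link comes back UP) could be sketched
roughly as below. This is not something any package ships. The function
names, the polling approach, the 5-second interval and the use of
/sys/class/net/<iface>/operstate are all assumptions made for the sketch;
the only thing taken from the thread is the "openais-cfgtool -r" command
itself. Note that some drivers report "unknown" rather than "up" in
operstate, in which case this would never fire.

```python
import subprocess
import time
from pathlib import Path


def read_operstate(iface, sysfs_root="/sys/class/net"):
    """Return the kernel's operstate for iface (e.g. 'up', 'down').

    Assumes a Linux sysfs layout; returns 'unknown' if the file is missing.
    """
    try:
        return (Path(sysfs_root) / iface / "operstate").read_text().strip()
    except OSError:
        return "unknown"


def came_up(previous, current):
    """True only on a transition into the 'up' state."""
    return previous != "up" and current == "up"


def reenable_rings(run=subprocess.call):
    """Ask OpenAIS to re-enable faulty rings (command named in the thread)."""
    return run(["openais-cfgtool", "-r"])


def watch(ifaces, poll_seconds=5.0, read_state=read_operstate,
          reset=reenable_rings, sleep=time.sleep, max_polls=None):
    """Poll the given interfaces; call reset() once per poll in which
    at least one of them changed to 'up'.

    max_polls=None loops forever; a number bounds the loop (for testing).
    """
    last = {iface: read_state(iface) for iface in ifaces}
    polls = 0
    while max_polls is None or polls < max_polls:
        sleep(poll_seconds)
        polls += 1
        recovered = False
        for iface in ifaces:
            state = read_state(iface)
            if came_up(last[iface], state):
                recovered = True
            last[iface] = state
        if recovered:
            reset()
```

Calling watch(["eth1"]) with a hypothetical ring interface eth1 would run
until killed. It is still polling, so it shares the drawback of option 3;
it just avoids resetting the rings when nothing has changed.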