On Mon, May 25, 2009 at 6:10 PM, Florian Haas <florian.h...@linbit.com> wrote:
> On 2009-05-25 17:45, Andrew Beekhof wrote:
>> SUSE is currently recommending NIC bonding.
>> We've not been able to get satisfactory behavior from clusters using RRP.
>
> I've repeatedly told customers that NIC bonding is not a valid
> substitute for redundant Heartbeat links, I will stubbornly insist it
> isn't one for OpenAIS RRP links either.
>
> Some reasons:
> - You're not protected against bugs, currently known or unknown, in the
>   bonding driver. If bonding itself breaks, you're screwed.
> - Most people actually run bonding over interfaces of the same make,
>   model, and chipset. That's not necessarily optimal, but it's a reality.
>   Thus, if your driver breaks, you're screwed again. Granted, this is
>   probably true if you ran two RRP links in that same configuration too.
> - Finally, you can't bond between a switched and a direct back-to-back
>   connection, which makes bonding entirely unsuitable for the redundant
>   links use case I described earlier.
>
>>> 1. Set rrp_problem_count_timeout and/or rrp_problem_count_threshold
>>> ridiculously high so the ring status never goes to faulty. (It seems
>>> that RRP "problem counting" can't be disabled altogether).
>>>
>>> 2. Have package maintainers include some magic that does
>>> "openais-cfgtool -r" every time a network link changes its status to UP
>>> (where the network management subsystem permits this).
>>>
>>> 3. Instruct users to install cron jobs that do "openais-cfgtool -r" in
>>> specified intervals, causing OpenAIS to re-check the link status
>>> periodically.
>>
>> You could add it to the drbd monitor action I guess.
>> But it does seem sub-optimal.
>
> I already made my point with regard to Juha's suggestion that it seems
> odd for Pacemaker to fiddle with its own communication infrastructure.

Agreed so far.

> To instead defer that task to a Pacemaker resource agent seems
> positively disturbing.

No more disturbing than #2, and what are the recurring monitor operations
if not a "cron" job?

>> I think the best solution is to work with upstream to get the feature
>> working properly.
>
> That I fully agree with. The question is what "working properly" means
> in this case -- should it be capable of auto-recovery, or should it not?

Absolutely. It's both pointless and useless if it doesn't.

_______________________________________________
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
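
For illustration only, option 3 from the quoted list (re-running "openais-cfgtool -r" at fixed intervals) could look roughly like the sketch below. The command string is the one quoted in the thread; the function names, the interval handling, and the assumption that the tool is on the PATH and exits with status 0 on success are mine, not anything OpenAIS documents here. It is a workaround sketch, not a substitute for proper auto-recovery upstream.

```python
import subprocess
import time

# Command as quoted in the thread; assumed (not verified) to be on PATH.
REENABLE_CMD = ["openais-cfgtool", "-r"]


def reenable_rings(run=subprocess.run):
    """Run the re-enable command once.

    Returns True if the command exited with status 0, False otherwise
    (including when the binary cannot be started at all).
    `run` is injectable so the logic can be exercised without OpenAIS.
    """
    try:
        result = run(REENABLE_CMD, check=False)
    except OSError:
        # Binary missing or not executable on this node.
        return False
    return result.returncode == 0


def reenable_periodically(interval_s, iterations,
                          run=subprocess.run, sleep=time.sleep):
    """Call reenable_rings() `iterations` times, pausing `interval_s`
    seconds between calls. Returns the number of successful calls.

    Both names and the bounded loop are hypothetical; a real deployment
    would more likely use a cron entry, as the thread suggests.
    """
    successes = 0
    for i in range(iterations):
        if reenable_rings(run):
            successes += 1
        if i < iterations - 1:
            sleep(interval_s)
    return successes
```

Something like reenable_periodically(60, 10) would attempt a re-enable once a minute for ten minutes. Like the cron approach, this only makes OpenAIS re-check the rings; it does nothing to tell a recovered link from one that is still down.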