It would appear my issues are related to timekeeping on these boxes (Compaq DL360 G1).
If I bump advbase to '3' on each box, everything is more stable. Given this, I now have a roughly 10 second fail-over time, but that is still acceptable.

Since these are production boxes, I'll probably wait until my 3.9 arrives to see if any of the kern_time/kern_clock changes help. I'll let everyone know more when I do.

Thanks for all the pointers and assistance!

Steve's corollary to Henning's carp theorem ("carp works."): Unless the system clock is broken :-)

-Steve S.