On Tue, Sep 20, 2011 at 10:34 PM, Brad Johnson <bjohn...@ecessa.com> wrote:
> It is not necessarily the case that the outside world can't reach the
> cluster. Ours is a multi-homed device connecting to multiple WANs and LANs.
> We want the device with the best connectivity to be the active device. To
> get around the problem of failovers occurring when a ping node reboots for
> example, I have written an fping OCF RA that uses different dampening delays
> based on if it is running on the active or idle device. I have also patched
> pacemaker attrd.c to fix it so it doesn't send an immediate update when it
> receives a flush message from the other node. This was causing it to ignore
> any running delay timer.
That's the point of the flush message though.
So that all nodes write their current value at the same time.

> Here is that patch:
>
> --- tools/attrd.orig.c 2011-09-13 08:29:46.946820348 -0500
> +++ tools/attrd.c 2011-09-14 13:33:59.606894754 -0500
> @@ -348,10 +348,14 @@
>          attrd_local_callback(xml);
>
>      } else if(ignore == NULL || safe_str_neq(from, attrd_uname)) {
> +        const char *attr = crm_element_value(xml, F_ATTRD_ATTRIBUTE);
> +        /* Don't send update for score if msg is from other node */
> +        if(safe_str_eq(from, attrd_uname) || safe_str_neq(attr, "pingd")) {
>          crm_info("%s message from %s", op, from);
>          hash_entry = find_hash_entry(xml);
>          stop_attrd_timer(hash_entry);
>          attrd_perform_update(hash_entry);
> +        }
>      }
>      free_xml(xml);
>  }
>
>
> On 09/19/2011 10:51 PM, Andrew Beekhof wrote:
>>
>> On Sun, Sep 11, 2011 at 2:30 AM, Vadym Chepkov<vchep...@gmail.com> wrote:
>>>
>>> On Sep 8, 2011, at 3:40 PM, Florian Haas wrote:
>>>
>>>>>> On 09/08/11 20:59, Brad Johnson wrote:
>>>>>>>
>>>>>>> We have a 2 node cluster with a single resource. The resource must
>>>>>>> run on only a single node at one time. Using the pacemaker:ocf:ping
>>>>>>> RA we are pinging a WAN gateway and a LAN host on each node so the
>>>>>>> resource runs on the node with the greatest connectivity. The problem
>>>>>>> is when a ping host goes down (so both nodes lose connectivity to
>>>>>>> it), the resource moves to the other node due to timing differences
>>>>>>> in how fast they update the score attribute. The dampening value has
>>>>>>> no effect, since it delays both nodes by the same amount. These
>>>>>>> unnecessary fail-overs aren't acceptable since they are disruptive to
>>>>>>> the network for no reason.
>>>>>>> Is there a way to dampen the ping update by different amounts on the
>>>>>>> active and passive nodes? Or some other way to configure the cluster
>>>>>>> to try to keep the resource where it is during these tie score
>>>>>>> scenarios?
>>>>
>>>> location pingd-constraint group_1 \
>>>> rule $id="pingd-constraint-rule" pingd: defined pingd
>>>>
>>>> May I suggest that you simply change this constraint to
>>>>
>>>> location pingd-constraint group_1 \
>>>> rule $id="pingd-constraint-rule" \
>>>> -inf: not_defined pingd or pingd lte 0
>>>>
>>>> That way, only a host that definitely has _no_ connectivity carries a
>>>> -INF score for that resource group. And I believe that is what you
>>>> really want, rather than take the actual ping score as a placement
>>>> weight (your "best connectivity" approach).
>>>>
>>>> Just my 2 cents, though.
>>>>
>>> Even though this approach was recommended many times, there is a problem
>>> with it.
>>> What if all nodes for some reason are not able to ping ?
>>> This rule would cause a resource to be brought down completely, whereas
>>> if you use "best connectivity" approach it will stay up where it was before
>>> network failed.
>>
>> If the outside[1] world can't reach the cluster, is there much benefit
>> in having it running?
>>
>> [1] Substitute "outside" for wherever your users are, hopefully you
>> picked a ping node from the same area.
>>
>>> Vadym
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs:
>>> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker