On Tue, Aug 28, 2012 at 6:44 AM, Andrew Martin <amar...@xes-inc.com> wrote:
> Hi Jake,
>
> Thank you for the detailed analysis of this problem. The original reason I was utilizing ocf:pacemaker:ping was to ensure that the node with the best network connectivity (network connectivity being judged by the ability to communicate with 192.168.0.128 and 192.168.0.129) would be the one running the resources. However, it is possible that either of these IPs could be down for maintenance or a hardware failure, and the cluster should not be affected by this. It seems that a synchronous ping check from all of the nodes would ensure this behavior without this unfortunate side-effect.
>
> Is there another way to achieve the same network connectivity check instead of using ocf:pacemaker:ping? I know the other *ping* resource agents are deprecated.
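One option, if you want to keep ocf:pacemaker:ping but stop an outage of a shared target from moving anything, is to change what the location constraint does with the score: rather than preferring the better-connected node, only ban a node that has lost contact with every ping target. A rough sketch only, reusing the p_ping and g_vm names from your configuration; the constraint id is made up, and it would replace the score-based loc_run_on_most_connected rule rather than sit alongside it:

    location loc_avoid_disconnected g_vm \
            rule -inf: not_defined p_ping or p_ping lte 0

With that rule a node is only excluded when it can reach neither 192.168.0.128 nor 192.168.0.129, so a single target going away for maintenance leaves both nodes equally eligible and nothing moves. The trade-off is that you lose the "run where more targets are visible" behavior when both nodes still have partial connectivity.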
That said, with the correct value of dampen, things should behave as expected regardless of which ping variant is used.

>
> Thanks,
>
> Andrew
>
> ________________________________
> From: "Jake Smith" <jsm...@argotec.com>
> To: "Andrew Martin" <amar...@xes-inc.com>
> Cc: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
> Sent: Monday, August 27, 2012 1:47:25 PM
> Subject: Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?
>
>
> ----- Original Message -----
>> From: "Andrew Martin" <amar...@xes-inc.com>
>> To: "Jake Smith" <jsm...@argotec.com>, "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
>> Sent: Monday, August 27, 2012 1:01:54 PM
>> Subject: Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?
>>
>> Jake,
>>
>> Attached is the log from the same period for node2. If I am reading this correctly, it looks like there was a 7 second difference between when node1 set its score to 1000 and when node2 set its score to 1000?
>
> I agree, and (I think) more importantly this is what caused the issue, to the best of my knowledge - not necessarily fact ;-)
>
> At 10:40:43 node1 updates its pingd to 1000, causing the policy engine to recalculate node preference.
> At 10:40:44 transition 760 is initiated to move everything to the more preferred node2, because its pingd value is 2000.
> At 10:40:50 node2's pingd value drops to 1000. The policy engine doesn't stop/change the in-process transition - node1 and node2 are equal now, but the transition is in process and node1 isn't more preferred, so it continues.
> At 10:41:02 ping is back on node1 and ready to update pingd to 2000.
> At 10:41:07, after dampen, node1 updates pingd to 2000, which is greater than node2's value.
> At 10:41:08 the cluster recognizes a change in pingd value that requires a recalculation of node preference and aborts the in-process transition (760).
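To put a number on "the correct value of dampen" for the timeline above: the outage itself lasted about 30 seconds and the two nodes pushed their new scores about 7 seconds apart, so dampen has to comfortably cover the whole window in which the nodes can disagree before it changes anything. A sketch only - the 60s figure is illustrative, taken from the 45-60 second range Jake arrives at further down, and everything else is copied unchanged from the primitive already in this thread:

    primitive p_ping ocf:pacemaker:ping \
            params name="p_ping" host_list="192.168.0.128 192.168.0.129" dampen="60s" multiplier="1000" attempts="8" debug="true" \
            op start interval="0" timeout="60" \
            op monitor interval="10s" timeout="60"

As Jake also notes below, the cost is that a genuine loss of connectivity on one node alone takes correspondingly longer to trigger a failover.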
> I believe the cluster then waits for all in-process actions to complete so the cluster is in a known state to recalculate.
> At 10:42:10 I'm guessing the shutdown timeout is reached without completing, so then the VirtualDomain is forcibly shut down.
> Once all of that is done, transition 760 finishes stopping/aborting, with some actions completed and some not:
>
> Aug 22 10:42:13 node1 crmd: [4403]: notice: run_graph: Transition 760 (Complete=20, Pending=0, Fired=0, Skipped=39, Incomplete=30, Source=/var/lib/pengine/pe-input-2952.bz2): Stopped
>
> Then the cluster recalculates the node preference and restarts those services that are stopped on node1, because the pingd scores between node1 and node2 are equal, so there is a preference to stay on node1 where some services are still active (drbd or such, I'm guessing, are still running on node1).
>
>> Aug 22 10:40:38 node1 attrd_updater: [1860]: info: Invoked: attrd_updater -n p_ping -v 1000 -d 5s
>
> Before this is the ping fail:
>
> Aug 22 10:40:31 node1 ping[1668]: [1823]: WARNING: 192.168.0.128 is inactive: PING 192.168.0.128 (192.168.0.128) 56(84) bytes of data.#012#012--- 192.168.0.128 ping statistics ---#0128 packets transmitted, 0 received, 100% packet loss, time 7055ms
>
> Then you get the 7 second delay to do the 8 attempts, I believe, and then the 5 second dampen (-d 5s) brings us to:
>
>> Aug 22 10:40:43 node1 attrd: [4402]: notice: attrd_trigger_update: Sending flush op to all hosts for: p_ping (1000)
>> Aug 22 10:40:44 node1 attrd: [4402]: notice: attrd_perform_update: Sent update 265: p_ping=1000
>
> Same thing on node2 - fails at 10:40:38 and then 7 seconds later:
>
>> Aug 22 10:40:45 node2 attrd_updater: [27245]: info: Invoked: attrd_updater -n p_ping -v 1000 -d 5s
>
> 5s dampen:
>
>> Aug 22 10:40:50 node2 attrd: [4069]: notice: attrd_trigger_update: Sending flush op to all hosts for: p_ping (1000)
>> Aug 22 10:40:50 node2 attrd: [4069]: notice: attrd_perform_update: Sent update 122: p_ping=1000
>>
>> I had changed the attempts value to 8 (from the default 2) to address this same issue - to avoid resource migration based on brief connectivity problems with these IPs - however if we can get dampen configured correctly I'll set it back to the default.
>
> Well, after looking through both more closely, I'm not sure dampen is what you'll need to fix the deeper problem. The time between fail and return was 10:40:31 to 10:41:02, or 32 seconds (31 on node2). I believe if you had a dampen value that was greater than the monitor interval plus the time failed, then nothing would have happened (dampen > 10 + 32). However, I'm not sure I would call 32 seconds a blip in connection - that's up to you. And since the dampen applies to all of the ping clones equally, assuming a ping failure longer than your dampen value you would still have the same problem. For example, assuming a dampen of 45 seconds:
> Node1 fails at 1:01, node2 fails at 1:08.
> Node1 will still update its pingd value at 1:52 - 7 seconds before node2 will - and the transition will still happen even though both nodes have the same connectivity in reality.
>
> I guess what I'm saying in the end is that dampen is there to prevent movement for a momentary outage/blip in the pings, with the idea being that the pings will return before the dampen expires. It isn't going to wait out the dampen on the other node(s) before making a decision.
> You would need to be able to add something like a sleep 10s in there AFTER the pingd value is updated, BEFORE evaluating the node preference scoring!
>
> So in the end I don't have a fix for you, except maybe to set dampen in the 45-60 second range if you expect the roughly 30-second outages you want to ride out without moving to be commonplace in your setup. However, that would extend the time to wait until failover in case of a complete failure of pings on one node only.
>
> :-(
>
> Jake
>
>> Thanks,
>>
>> Andrew
>>
>> ----- Original Message -----
>>
>> From: "Jake Smith" <jsm...@argotec.com>
>> To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
>> Sent: Monday, August 27, 2012 9:39:30 AM
>> Subject: Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?
>>
>>
>> ----- Original Message -----
>> > From: "Andrew Martin" <amar...@xes-inc.com>
>> > To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
>> > Sent: Thursday, August 23, 2012 7:36:26 PM
>> > Subject: Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?
>> >
>> > Hi Florian,
>> >
>> > Thanks for the suggestion. I gave it a try, but even with a dampen value greater than 2x the monitoring interval the same behavior occurred (pacemaker restarted the resources on the same node). Here are my current ocf:pacemaker:ping settings:
>> >
>> > primitive p_ping ocf:pacemaker:ping \
>> >     params name="p_ping" host_list="192.168.0.128 192.168.0.129" dampen="25s" multiplier="1000" attempts="8" debug="true" \
>> >     op start interval="0" timeout="60" \
>> >     op monitor interval="10s" timeout="60"
>> >
>> > Any other ideas on what is causing this behavior? My understanding is that the above config tells the cluster to attempt 8 pings to each of the IPs, and it will assume that an IP is down if none of the 8 come back. Thus, an IP would have to be down for more than 8 seconds to be considered down. The dampen parameter tells the cluster to wait before making any decision, so that if the IP comes back online within the dampen period then no action is taken. Is this correct?
>>
>> I'm no expert on this either, but I believe the dampen isn't long enough - I think what you say above is correct, but not only does the IP need to come back online, the cluster must also attempt to ping it successfully. I would suggest trying dampen with greater than 3x the monitor value.
>>
>> I don't think it's a problem, but why change the attempts from the default 2 to 8?
>>
>> > Thanks,
>> >
>> > Andrew
>> >
>> > ----- Original Message -----
>> >
>> > From: "Florian Crouzat" <gen...@floriancrouzat.net>
>> > To: pacemaker@oss.clusterlabs.org
>> > Sent: Thursday, August 23, 2012 3:57:02 AM
>> > Subject: Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?
>> >
>> > On 22/08/2012 18:23, Andrew Martin wrote:
>> > > Hello,
>> > >
>> > > I have a 3-node Pacemaker + Heartbeat cluster (two real nodes and one quorum node that cannot run resources) running on Ubuntu 12.04 Server amd64. This cluster has a DRBD resource that it mounts and then runs a KVM virtual machine from.
>> > > I have configured the cluster to use ocf:pacemaker:ping with two other devices on the network (192.168.0.128, 192.168.0.129), and set constraints to move the resources to the most well-connected node (whichever node can see more of these two devices):
>> > >
>> > > primitive p_ping ocf:pacemaker:ping \
>> > >     params name="p_ping" host_list="192.168.0.128 192.168.0.129" multiplier="1000" attempts="8" debug="true" \
>> > >     op start interval="0" timeout="60" \
>> > >     op monitor interval="10s" timeout="60"
>> > > ...
>> > > clone cl_ping p_ping \
>> > >     meta interleave="true"
>> > > ...
>> > > location loc_run_on_most_connected g_vm \
>> > >     rule $id="loc_run_on_most_connected-rule" p_ping: defined p_ping
>> > >
>> > > Today, 192.168.0.128's network cable was unplugged for a few seconds and then plugged back in. During this time, pacemaker recognized that it could not ping 192.168.0.128 and restarted all of the resources, but left them on the same node. My understanding was that since neither node could ping 192.168.0.128 during this period, pacemaker would do nothing with the resources (leave them running). It would only migrate or restart the resources if, for example, node2 could ping 192.168.0.128 but node1 could not (move the resources to where things are better connected). Is this understanding incorrect? If so, is there a way I can change my configuration so that it will only restart/migrate resources if one node is found to be better connected?
>> > >
>> > > Can you tell me why these resources were restarted? I have attached the syslog as well as my full CIB configuration.
>>
>> As was said already, the log shows node1 changed its value for pingd to 1000, waited the 5 seconds of dampening, and then started actions to move the resources. In the midst of stopping everything, ping ran again successfully and the value increased back to 2000. This caused the policy engine to recalculate scores for all resources (before they had the chance to start on node2). I'm no scoring expert, but I know there is additional value given to keeping resources collocated with their partners that are already running, plus resource stickiness to not move. So in this situation the score to stay/run on node1, once pingd was back at 2000, was greater than the score to move, so the things that were stopped or stopping restarted on node1. So increasing the dampen value should help/fix it.
>>
>> Unfortunately you didn't include the log from node2, so we can't correlate node2's pingd values at the same times as node1's. I believe if you look at the pingd values and the times at which movement is started on each node, you will be able to make a better guess at how high a dampen value would ensure the nodes had the same pingd value *before* the dampen time ran out, and that should prevent movement.
>>
>> HTH
>>
>> Jake
>>
>> > > Thanks,
>> > >
>> > > Andrew Martin
>> >
>> > This is an interesting question and I'm also interested in answers.
>> >
>> > I had the same observations, and there is also the case where the monitor()s aren't synced across all nodes, so: "Node 1 issues a monitor() on the ping resource and finds the ping-node dead; node2 hasn't pinged yet, so node1 moves things to node2, but node2 now issues a monitor() and also finds the ping-node dead."
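If you want to size dampen from real data rather than guesswork, the skew Florian describes above is easy to measure: crm_mon with the -A flag lists the node attributes, so you can watch each node's p_ping value, and the attrd lines already quoted in this thread record exactly when each node committed its new value. Illustrative only, and it assumes the cluster logs go to /var/log/syslog as in the excerpts earlier in the thread:

    # current p_ping value per node, as the cluster sees it
    crm_mon -A -1 | grep p_ping

    # run on each node: when did it actually push its new value?
    grep attrd_perform_update /var/log/syslog

Comparing those timestamps across the nodes tells you how long a window dampen needs to outlast.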
>> > The only solution I found was to adjust the dampen parameter to at least 2*monitor().interval, so that I can be *sure* that all nodes have issued a monitor() and have all decreased their scores, so that when a decision occurs, nothing moves.
>> >
>> > It's been a long time since I tested this; my cluster is very, very stable, so I guess I should retry to validate that it's still a working trick.
>> >
>> > ====
>> >
>> > dampen (integer, [5s]): Dampening interval
>> > The time to wait (dampening) further changes occur
>> >
>> > Eg:
>> >
>> > primitive ping-nq-sw-swsec ocf:pacemaker:ping \
>> >     params host_list="192.168.10.1 192.168.2.11 192.168.2.12" dampen="35s" attempts="2" timeout="2" multiplier="100" \
>> >     op monitor interval="15s"
>> >
>> > --
>> > Cheers,
>> > Florian Crouzat

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org