On 02/09/2011 05:47 PM, Pentarh Udi wrote:
> I noticed that Pacemaker does not correctly fail over nodes under heavy
> load, when they go into deep swap or heavy IO.
>
> I configured more than one node running Apache with MaxClients set high
> enough to push the node into swap, put some heavy PHP scripts on it
> (Wordpress ^_^) and then ran heavy webserver benchmarks.
>
> When a node goes into deep swap, its load average climbs into the
> thousands and the node is effectively stunned (though it still answers
> pings), yet for some reason Pacemaker does not mark the node as failed
> and does not migrate resources away.
>
> Worse, under certain conditions Pacemaker does start to migrate resources
> away, but they then fail to start on the other nodes (while under normal
> conditions they start fine):
>
> httpd_start_0 (node=node1, call=32, rc=1, status=complete): unknown error
> httpd_start_0 (node=node2, call=43, rc=1, status=complete): unknown error
>
> Sometimes there is a timeout error, sometimes there is no error at all,
> but the result is that the resources are down.
>
> In this case ocf::heartbeat:apache runs in a group with
> ocf::heartbeat:IPaddr2, so maybe Pacemaker failed to stop IPaddr2 and
> therefore could not move ocf::heartbeat:apache, since they are in the
> same group.
>
> Is this "normal" corosync behavior, or am I doing something wrong? 90% of
> my "down conditions" are heavy load, and corosync does not handle this
> in my case.
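For reference, the setup described above boils down to something like the
following in the crm shell; the address, config file path and timeouts here
are purely illustrative, not taken from the original post:

    primitive cluster-ip ocf:heartbeat:IPaddr2 \
        params ip="192.168.1.100" cidr_netmask="24" \
        op monitor interval="30s" timeout="20s"
    primitive website ocf:heartbeat:apache \
        params configfile="/etc/httpd/conf/httpd.conf" \
        op monitor interval="30s" timeout="60s" \
        op stop interval="0" timeout="60s"
    group web-group cluster-ip website

Within a group, members start in the listed order and stop in reverse order,
so a stop that fails or times out on an overloaded node blocks the whole
group from moving (or, with fencing enabled, gets the node fenced instead).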
I get this question a lot in classes and workshops. My usual response is
this: you have a highly available application running on a particular node.
That node now freaks out in terms of load average, swap or whatever. What's
the misbehaving application? That's right, 99% of the time it's your
cluster-managed HA application. What's causing the load? Your clients are.

So when you fail over, you can be near-certain that the same load spike hits
you right back on the node that took over the service. Worse, in
active/active clusters you'll actually see higher load, because there are
now fewer nodes to handle the cluster's workload.

You can force failover in this situation with a combination of node fencing
and watchdog devices (a rough sketch is in the P.S. below), but if you set
that up it will most likely make your problem worse. You need to fix your
scalability issue; there is little that high-availability clustering can do
for you here.

Hope this helps.

Florian
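P.S.: for anyone who does want to go the fencing-plus-watchdog route anyway,
a rough sketch of an SBD-based setup is below. Every device path and value
is a placeholder, and the exact plugin and sysconfig names depend on your
distribution:

    # Load a watchdog driver so a wedged node can reset itself
    # (use a real hardware watchdog if the box has one)
    modprobe softdog

    # /etc/sysconfig/sbd (placeholder values):
    SBD_DEVICE="/dev/disk/by-id/your-shared-sbd-partition"
    SBD_WATCHDOG_DEV="/dev/watchdog"

    # Point the cluster at SBD for fencing and enable STONITH
    crm configure primitive stonith-sbd stonith:external/sbd \
        params sbd_device="/dev/disk/by-id/your-shared-sbd-partition"
    crm configure property stonith-enabled="true"

That reliably removes a swamped node from the cluster, but as noted above,
its load then simply lands on the surviving nodes.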