Hi all,

Here at Bump we currently have our handset traffic routed through a single server. For obvious reasons, we want to expand this to multiple nodes for redundancy. The load balancer is doing two tasks: TLS termination, then directing traffic to one of our internal application servers.
We want to split the single load balancer into an HA cluster. Our chosen solution involves creating one public-facing VIP for each machine, and then floating those VIPs between the load balancer machines. Ideally there is one public IP per machine and we use DNS round robin to send traffic to the IPs. We considered having two nodes and floating a single VIP between them (the canonical heartbeat setup), but would prefer to avoid that because we know we're going to run into a situation where our TLS termination takes more CPU than a single node has available. Balancing across N nodes seems the most obvious way to address that. We have allocated three (3) nodes to our cluster.

I want to run our design by this group, tell you our problems, and see if anybody has some advice. (I've also put a few rough sketches, covering the DNS records, location preferences, and a possible fencing layout, at the end of this mail.)

* no-quorum-policy set to ignore. We would, ideally, like the cluster to continue operating even if we lose the majority of nodes. Even in a CPU-limited situation, it would be better to serve slowly than to drop 33% or 66% of our traffic on the floor because we lost quorum and the floating VIPs weren't migrated to the remaining nodes.

* STONITH disabled. Originally I tried to enable this, but with no-quorum-policy set to ignore, it seems to go on killing sprees. It has fenced healthy nodes for no reason I could determine:

  - "node standby lb1"
    * resources properly migrate to lb2, lb3
    * everything looks stable and correct
  - "node online lb1"
    * resources start migrating back to lb1
    * lb2 gets fenced! (why? it was healthy)
    * resources migrate off of lb2

  I have seen it double-fence, too, with lb1 being the only surviving node and lb2 and lb3 being unceremoniously rebooted. I'm not sure why. STONITH seems to be suboptimal (heh) in this particular setup.

Anyway -- that means our configuration is very, very simple:

node $id="65c71911-737e-4848-b7d7-897d0ede172a" patron
node $id="b5f2fd18-acf1-4b25-a571-a0827e07188b" oldfashioned
node $id="ef11cced-0062-411b-93dd-d03c2b8b198c" nattylight
primitive cluster-monitor ocf:pacemaker:ClusterMon \
        params extra_options="--mail-to blah" htmlfile="blah" \
        meta target-role="Started"
primitive floating_216 ocf:heartbeat:IPaddr \
        params ip="173.192.13.216" cidr_netmask="255.255.255.252" nic="eth1" \
        op monitor interval="60s" timeout="30s" \
        meta target-role="Started"
primitive floating_217 ocf:heartbeat:IPaddr \
        params ip="173.192.13.217" cidr_netmask="255.255.255.252" nic="eth1" \
        op monitor interval="60s" timeout="30s" \
        meta target-role="Started"
primitive floating_218 ocf:heartbeat:IPaddr \
        params ip="173.192.13.218" cidr_netmask="255.255.255.252" nic="eth1" \
        op monitor interval="60s" timeout="30s" \
        meta target-role="Started"
property $id="cib-bootstrap-options" \
        dc-version="1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd" \
        cluster-infrastructure="Heartbeat" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        symmetric-cluster="true" \
        last-lrm-refresh="1317079926"

Am I on the right track with this? Am I missing something obvious? Am I misapplying this tool to our problem and should I go in a different direction?

In the real world, I would use ECMP (or something like that) between the router and my load balancers. However, I'm living in the world of managed server hosting (we're not quite big enough to colo), so I don't have that option. :-)
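For the DNS round robin piece, the plan is simply multiple A records on a single name. Something like this (BIND-style zone snippet; the "lb" hostname is a placeholder):

    ; round-robin A records pointing at the three floating VIPs
    lb    IN  A  173.192.13.216
    lb    IN  A  173.192.13.217
    lb    IN  A  173.192.13.218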
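To spread the three VIPs across the three nodes by default, rather than letting them pile up on whichever node comes up first, I'm considering location preferences plus some stickiness. Rough, untested sketch (the scores are arbitrary; the node names are from the config above):

    location prefer_216 floating_216 100: patron
    location prefer_217 floating_217 100: oldfashioned
    location prefer_218 floating_218 100: nattylight
    rsc_defaults resource-stickiness="100"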
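And for completeness, the fencing layout I would try if I re-enable STONITH: one fencing primitive per node, with a -inf location constraint so a node never runs its own fencing device. Untested sketch, assuming the hosting provider gives us IPMI access (the ipaddr/userid/passwd values are placeholders):

    primitive st-patron stonith:external/ipmi \
            params hostname="patron" ipaddr="192.0.2.10" userid="ADMIN" passwd="xxx" interface="lan" \
            op monitor interval="300s"
    location st-patron-not-self st-patron -inf: patron

(and likewise st-oldfashioned and st-nattylight)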
--
Mark Smith // Operations Lead
m...@bumptechnologies.com