On Thu, 19 Sep 2013, Florian Crouzat wrote:

On 19/09/2013 11:43, David Lang wrote:

I've been running active/failover firewall clusters with heartbeat since
about 2000, and I have one suggestion to make. If you can leave all
the daemons running all the time, the failover process is far more
robust (and faster, since you don't have daemons to start). If you set
net.ipv4.ip_nonlocal_bind to 1 you can even have the daemons start up
bound to VIP addresses that don't yet exist on the box.
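
A minimal sketch of how one might check this on a Linux box, assuming the sysctl is exposed at its usual /proc path. The helper names nonlocal_bind_enabled and can_bind are hypothetical, made up for this illustration.

```python
import socket

# Usual location of the sysctl on Linux (an assumption about the host).
SYSCTL_PATH = "/proc/sys/net/ipv4/ip_nonlocal_bind"


def nonlocal_bind_enabled(path=SYSCTL_PATH):
    """Return True if the sysctl file at `path` contains "1"."""
    try:
        with open(path) as f:
            return f.read().strip() == "1"
    except OSError:
        # File missing or unreadable: treat as not enabled.
        return False


def can_bind(address, port=0):
    """Try to bind a TCP socket to `address`; return True on success."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((address, port))
            return True
        except OSError:
            return False
```

On the standby node, can_bind() with the VIP as its argument would be expected to fail while the sysctl is 0 and succeed once it is 1, even though the VIP is not configured on any interface.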

If you do not have to have the daemons bound to the VIP, the fact that
they are always running on the backup box gives you a quick way to check
if a failover would solve the problem or not by having a client connect
directly to the second box. The drawback is that someone may configure
something to point directly at a box and not at a VIP and you won't
detect it (without log analysis) until the box they point at actually
goes down.
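
The "connect directly to the second box" check could be sketched roughly like this. It only tests that a TCP connection can be opened, not that the service behind it answers correctly, and the function name can_connect is hypothetical.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, timed out, or name lookup failed.
        return False
```

For example, can_connect("fw-b.example.net", 3128) would probe squid's default port on the standby node; the host name is a placeholder.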

David Lang

I never thought about that, it seems it could be interesting, especially with slow (start|stop)ing daemons such as squid.

yes, if the daemons are started at boot time, you don't have to worry about some subtle config error creeping in that prevents them from running when you need them.

you can also monitor the availability of the backup firewall from your network monitoring systems. Nothing's worse than having your primary fail, only to discover that your backup wasn't working (especially over something like a bad route that's not detected by the HA software that just runs on the local subnet)


In my case, my daemons would be protected by the "passive firewall state" that my nodes have when they don't host resources.

Why? I know, the real answer is 'because it's the standby, and standby boxes aren't active'. But is there really a need to do this? Or is it just because?

If your systems are hardened to be a firewall, what difference does it make if they are exposed or 'protected by the passive firewall state'?

What do you gain by changing your firewall rules when you switch between active and passive? And are you sure there is never an instant when your defenses are down during this switch? (I bring up the iptables rules before bringing up the interfaces at boot.)

if having something running on the primary and backup at the same time would cause a conflict, then the HA software needs to manage it (shared disk or IP is a good example), but otherwise it should be running at all times so that you know it's healthy (you can monitor it) and to reduce the work needed at failover time.

You should have both systems sending their logs to a central server, so from the point of view of knowing what's happening, there really shouldn't be a difference between the two systems, even if someone does deliberately hit your 'backup' box.



And speaking of primary and backup: if the boxes are identical hardware, it really shouldn't matter which is active, so 'primary' and 'backup' are bad names. It's best practice to regularly exercise your backup systems, so having your HA system treat the two as equal (except when both boot at the same time or when recovering from split-brain, where you need to designate who wins the tie) lets you run for an extended time on either box.

This also helps you avoid flapping, where the primary has something wrong that slows it down so it can't handle full load but could handle partial load. Under load the primary fails, you fail over to the backup, the primary recovers and looks healthy, so you fail back to the primary, which goes down because of the load....

I've seen this be something as simple as blocked cooling, where a box was fine when idle but overheated under load (and therefore the CPU throttled down to slower speeds automatically).
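
The flapping loop can be illustrated with a toy simulation. It assumes a deliberately simplified model in which node "A" fails whenever it carries the load and looks healthy again as soon as it is idle; count_failovers and its parameters are hypothetical names for this sketch, not part of any HA software.

```python
def count_failovers(steps, failback):
    """Count switches of the active node over `steps` health checks.

    Toy model: node "A" fails whenever it is active (e.g. it overheats
    under load) and looks healthy again whenever it is idle. Node "B"
    never fails. With failback=True the cluster moves back to "A" as
    soon as "A" looks healthy.
    """
    active = "A"
    failovers = 0
    for _ in range(steps):
        # A is only healthy while it is not carrying the load.
        a_healthy = active != "A"
        if active == "A" and not a_healthy:
            active = "B"
            failovers += 1
        elif active == "B" and a_healthy and failback:
            active = "A"
            failovers += 1
    return failovers


if __name__ == "__main__":
    print("with automatic failback:", count_failovers(6, failback=True))
    print("treating nodes as equal:", count_failovers(6, failback=False))
```

With automatic failback the active node changes on every check; when the two nodes are treated as equal, the service moves once and stays on the healthy box.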

Ideally you do something like schedule a failover every month or quarter from one box to the other, and just keep running on that box until the next failover.

It does mean that you need to check which box is active when you work on them, but you should do that anyway :-)

David Lang

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
