Marc Volovic wrote:
A number of issues:
First - what to replicate. You need to replicate (a) links, (b) storage,
(c) service machines.
Links are internal and external. Multipath internet connexions.
Multipath LAN connexions. Multipath storage links. Redundant network
infrastructure (switches, routers, firewalls, IDS/IPS).
Replicate storage. If you use a SAN with dedicated links, multipath both
the links and the storage. Make the storage hardware redundant and add
storage replication. Add auto-promotion, takeover, and (if possible)
partition prevention mechanisms. Use STONITH.
Service machines are the easiest to replicate. A simple heartbeat will
provide a significant level of failover and/or failback. Here, likewise,
use STONITH or other partition prevention mechanisms.
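A minimal sketch of the heartbeat-watcher idea (this is not the Linux-HA
heartbeat package itself, just the concept; the port, deadline and
takeover script are placeholders I made up):

    # The passive node listens for UDP beacons from the active node and
    # triggers a takeover when they stop arriving.
    import socket
    import subprocess
    import time

    HB_PORT = 694      # placeholder; heartbeat traditionally uses 694/udp
    DEADLINE = 10      # seconds without a beacon before the peer is "dead"

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", HB_PORT))
    sock.settimeout(1.0)

    last_seen = time.time()
    while True:
        try:
            sock.recvfrom(64)          # any datagram from the peer = "alive"
            last_seen = time.time()
        except socket.timeout:
            pass
        if time.time() - last_seen > DEADLINE:
            # Fence the peer first (STONITH), then take over its resources.
            subprocess.run(["/usr/local/sbin/takeover.sh"])  # hypothetical
            break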
Under-utilize. 70% duty cycle is good.
Expect cost hikes.
and - test test test.....
many people fail to test their "highly-available" setup, and as a
result, think they are "highly available" when they are not.
testing should include various scenarios that will expose bugs in your
tools as well as configuration errors.
examples: you set up multipath to the storage, but the default I/O
timeouts are too large -> in some scenarios this easily causes multipath
failover to take several minutes.
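a rough way to catch this in testing (python sketch, assuming GNU dd is
available and using the placeholder device name /dev/mapper/mpatha) -
run it while you pull one path and watch how long a single direct read
stalls:

    # Repeatedly time a small direct read from the multipath device.
    # If a read stalls for minutes instead of seconds, the multipath /
    # SCSI timeouts need tuning.
    import random
    import subprocess
    import time

    DEVICE = "/dev/mapper/mpatha"   # hypothetical multipath device
    ACCEPTABLE = 30.0               # longest stall the application tolerates (s)

    while True:
        skip = random.randrange(0, 100000)  # random 4K block to defeat caching
        start = time.time()
        subprocess.run(
            ["dd", f"if={DEVICE}", "of=/dev/null", "bs=4096", "count=1",
             f"skip={skip}", "iflag=direct"],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        elapsed = time.time() - start
        status = "OK" if elapsed < ACCEPTABLE else "TOO SLOW"
        print(f"read took {elapsed:7.2f}s  {status}")
        time.sleep(1)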
you set up heartbeat and think everything is ok - but then you find that
it doesn't really notice a failure in access to the storage system: when
there's a connectivity problem to your SAN just from the active node, it
doesn't fail over to the passive node.
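this is the kind of check you'd want the cluster monitor to run on the
active node, so that losing the SAN (and not just the node) also
triggers failover. a python sketch along those lines - the mount point
and timeout are placeholders; exit 0 means "storage ok", non-zero means
"fail me over":

    import os
    import signal
    import sys
    import time

    MOUNT = "/mnt/shared"                  # hypothetical shared-storage mount
    SENTINEL = os.path.join(MOUNT, ".ha-probe")
    CHECK_TIMEOUT = 15                     # seconds before storage counts as dead

    def on_timeout(signum, frame):
        print("storage probe hung - treating storage as failed", file=sys.stderr)
        os._exit(2)                        # hard exit: the write may never return

    signal.signal(signal.SIGALRM, on_timeout)
    signal.alarm(CHECK_TIMEOUT)

    # Write, sync and read back a timestamp; any error or hang means this
    # node can no longer serve data and should give up the resources.
    try:
        stamp = str(time.time())
        with open(SENTINEL, "w") as f:
            f.write(stamp)
            f.flush()
            os.fsync(f.fileno())
        with open(SENTINEL) as f:
            if f.read() != stamp:
                print("storage probe read back wrong data", file=sys.stderr)
                sys.exit(1)
    except OSError as e:
        print(f"storage probe failed: {e}", file=sys.stderr)
        sys.exit(1)

    signal.alarm(0)
    print("storage OK")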
only with rigorous testing will you find these issues - and usually not
the first time you test (because this testing is tedious, and because
some problems are not easy to simulate - e.g. try to simulate a
hard-disk failure - plus, sometimes there are races, and a given test
type will fail only once every few attempts...)
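scripting the scenario and repeating it many times is the only realistic
way to catch those races. a rough harness (python sketch; the
inject/verify/restore scripts are placeholders for whatever failure you
are simulating - pulling a path, killing the active node, dropping a
switch port, ...):

    import subprocess
    import time

    ITERATIONS = 50
    INJECT  = ["/usr/local/sbin/inject-fault.sh"]   # hypothetical: break something
    VERIFY  = ["/usr/local/sbin/check-service.sh"]  # hypothetical: did we fail over?
    RESTORE = ["/usr/local/sbin/restore.sh"]        # hypothetical: undo the fault

    failures = 0
    for i in range(1, ITERATIONS + 1):
        subprocess.run(INJECT, check=True)
        time.sleep(60)                              # give the cluster time to react
        ok = subprocess.run(VERIFY).returncode == 0
        subprocess.run(RESTORE, check=True)
        time.sleep(60)                              # let the cluster settle / failback
        if not ok:
            failures += 1
        print(f"run {i:3d}: {'ok' if ok else 'FAILED'}")

    print(f"{failures} failures out of {ITERATIONS} runs")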
--guy