Marc Volovic wrote:
A number of issues:

First - what. You need to replicate (a) links, (b) storage, (c) service machines.

Links are internal and external. Multipath internet connexions. Multipath LAN connexions. Multipath storage links. Redundant network infrastructure (switches, routers, firewalls, IDS/IPS).

Replicate storage. If you use SAN with dedicated links, multipath links and storage. Redundant storage hardware and add storage replication. Add auto-promotion, takeover, and (if possible) partition prevention mechanisms. Use STONITH.

Service machines are the easiest to replicate. Simple heartbeat will provide a significant level of failover and/or failback. Here, likewise, use STONITH or other partition prevention mechanisms.

Under-utilize. 70% duty cycle is good.
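
One way to read the 70% figure is as headroom for absorbing a failed node's load. The sketch below assumes load is spread evenly across active nodes and redistributed evenly to the survivors; the function name and that model are illustrative assumptions, not something stated in the post.

```python
def utilization_after_failure(per_node_utilization, nodes, failed=1):
    """Per-node utilization once `failed` nodes drop out.

    Assumes (hypothetically) that total load stays constant and is
    shared evenly among the surviving nodes.
    """
    survivors = nodes - failed
    if survivors <= 0:
        raise ValueError("no surviving nodes to carry the load")
    return per_node_utilization * nodes / survivors


if __name__ == "__main__":
    # Four active nodes at 70% each, one fails: survivors carry the rest.
    print(round(utilization_after_failure(0.70, 4), 3))
    # Two active nodes at 70% each, one fails: the survivor is overloaded.
    print(round(utilization_after_failure(0.70, 2), 3))
```

Under this model, a value above 1.0 means the surviving nodes cannot carry the load, so the safe duty cycle depends on how many nodes share the work.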

Expect cost hikes.

and - test test test.....

many people fail to test their "highly-available" setup, and as a result, think they are "highly available" when they are not.

testing should include various types of scenarios that will show you bugs in various tools as well as configuration errors.

examples: you set up multi-path to the storage, but the default I/O timeouts are too large -> this can easily make a multi-path failover take several minutes in some scenarios.
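
A rough way to see how timeouts add up to minutes: if a failed path is only given up on after every retry of a command has timed out, the delays multiply. The sketch below is a simplified model under that assumption; the function name, the parameters and the sample numbers are illustrative, not the measured defaults of any particular storage stack.

```python
def worst_case_path_failover_seconds(command_timeout_s, retries,
                                     path_check_interval_s=0):
    """Upper bound on the time before I/O moves to another path.

    Simplified, hypothetical model: the first attempt plus every retry
    each wait out the full command timeout, and the path checker may add
    up to one more polling interval before the path is marked failed.
    """
    return command_timeout_s * (retries + 1) + path_check_interval_s


if __name__ == "__main__":
    # Illustrative values only: 30 s per command, 5 retries.
    seconds = worst_case_path_failover_seconds(30, 5)
    print(seconds, "seconds, about", seconds / 60, "minutes")
```

Plugging in the values your own setup actually uses, and then measuring a real path failure, shows whether the failover time is acceptable.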

you set up heartbeat and think everything is ok - but then you find that it doesn't really notice a failure in access to the storage system, and when there's a connectivity problem only between the active node and your SAN system - it doesn't fail over to the passive node.
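
One possible way to close that gap is an explicit storage probe whose result feeds the failover decision. The following is a minimal sketch, not heartbeat's own mechanism: the function name, the probe file name and the timeout are all hypothetical, and wiring the result into the cluster manager is left out.

```python
import os
import threading


def storage_is_writable(mount_point, timeout_s=5.0):
    """Return True if a small write+fsync on mount_point finishes in time.

    The probe runs in a daemon thread so that I/O hanging on a dead
    storage link cannot block the caller past timeout_s.
    """
    result = {"ok": False}

    def probe():
        # Hypothetical probe file name; one file per checking process.
        path = os.path.join(mount_point, ".ha_probe_%d" % os.getpid())
        try:
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
            try:
                os.write(fd, b"probe")
                os.fsync(fd)  # force the write out to the storage device
            finally:
                os.close(fd)
            os.unlink(path)
            result["ok"] = True
        except OSError:
            result["ok"] = False

    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout_s)
    # A still-running worker means the I/O hung: treat that as a failure.
    return (not worker.is_alive()) and result["ok"]


if __name__ == "__main__":
    print(storage_is_writable("/tmp"))
```

A check like this still has to be tested by actually cutting the storage link on the active node, since a hung probe and a failed probe can behave differently.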

only with rigorous testing will you find these issues - and usually not the first time you test (because this testing is tedious, and because some problems are not easy to simulate - e.g. try to simulate a hard-disk failure - plus, sometimes there are races - and a given test type will fail only once every few attempts...)
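
To put a number on "once every few attempts": if a race makes a test fail with probability p on each run, and runs are assumed to be independent (an assumption, real races may not behave that way), the number of repetitions needed to see at least one failure with a given confidence can be sketched as follows. The function name is made up for this example.

```python
import math


def runs_needed(failure_probability, confidence=0.95):
    """Smallest n such that 1 - (1 - p)**n >= confidence.

    Assumes each test run fails independently with probability p.
    """
    if not 0.0 < failure_probability < 1.0:
        raise ValueError("failure_probability must be between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return math.ceil(math.log(1.0 - confidence)
                     / math.log(1.0 - failure_probability))


if __name__ == "__main__":
    # A race that shows up roughly once in five attempts.
    print(runs_needed(0.2, 0.95))
```

So a failover scenario that passes two or three times in a row says little about a race of that kind; each scenario needs to be repeated many times.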

--guy

_______________________________________________
Linux-il mailing list
Linux-il@cs.huji.ac.il
http://mailman.cs.huji.ac.il/mailman/listinfo/linux-il
