On Apr 5, 2010, at 2:10 PM, Maciej Jan Broniarz wrote:
> On 2010-04-05 22:43, jfar...@goldsword.com wrote:
>> Quoting Maciej Jan Broniarz <gau...@gausus.net>:
>> So first you have to define your workload, then define what errors you
>> must avoid or allow, and then define how to deal with failures, errors,
>> etc.
>> Then you can start talking about High Availability vs. level of Fault
>> tolerance, vs. ....
> 
> Let's say I need to run a few PHP/SQL-based web sites and I would like to 
> maintain uptime of about 99.99% per month. No matter how good the hardware, 
> it will always fail at some point. My goal is to build a system that can 
> maintain that uptime.

You're attempting to move from ~1 hour of downtime per month to ~1 hour of 
downtime per year, or less than 5 minutes per month.  To begin with, you must 
implement adequate monitoring to detect, track, and send notification of 
service outages (e.g., Nagios, Big Brother, or commercial test services like 
SiteScope), and you need a 24/7/365 team available to respond immediately to 
pages/email/etc to minimize outage duration.
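
For reference, here is the arithmetic behind those figures as a quick Python 
sketch (the 30-day month is an assumption, just to keep the numbers round):

    # Downtime budget at a 99.99%-per-month availability target.
    target = 0.9999
    month_hours = 30 * 24                    # assume a 30-day month

    budget_minutes = month_hours * (1 - target) * 60
    print("%.1f minutes/month" % budget_minutes)             # ~4.3 minutes
    print("%.1f hours/year" % (budget_minutes * 12 / 60))    # ~0.9 hours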

Covering 168 hours a week at a nominal 40-hour workweek takes a team of 4.2 
people, which also implies that downtime should cost the business more than 
about $35K per hour to justify keeping such a team available.  (Four people 
can do it with one working an extra 8-hour shift per week; however, at least 
local to me, California state law mandates that people on call for pager duty 
be paid hourly overtime if they are expected to respond to issues.)
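
One way to arrive at a break-even figure in that ballpark, with the caveat 
that the fully-loaded cost per engineer below is my assumption, not a quoted 
number; substitute your own:

    # Rough break-even: yearly team cost vs. downtime hours avoided by
    # going from ~1 hour/month to ~1 hour/year of downtime.
    team_size = 168.0 / 40                   # 4.2 people for 24/7 coverage
    cost_per_person = 95000                  # assumed fully-loaded $/year
    team_cost = team_size * cost_per_person  # ~$400K/year

    avoided_hours = 12 - 1                   # ~11 downtime hours avoided/year
    print("break-even: $%.0f per downtime hour" % (team_cost / avoided_hours))
    # -> roughly $36K/hour, consistent with the $35K figure above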

> From what you say, I need some level of HA system to maintain the required 
> uptime.
> 
> So, as I've said earlier (correct me if I'm wrong), the setup could look 
> something like this:
> 
> - 2 web servers with CARP
> - 2 storage servers with an on-line sync mechanism running
> - 2 MySQL servers with on-line database replication
> 
> (I'm skipping power and network issues for the moment.)

Do you already know what your causes of downtime have been?

To my mind, you must consider all parts of the system, gather data, and 
resolve the problems with the greatest downtime cost in a cost-effective 
fashion.  The most common sources of machine failure are hard drives and 
PSUs; setting up RAID-1 mirrors on all machines and using redundant power 
supplies on separate breakers is a minimal starting point for avoiding the 
likely single points of failure within an individual machine.
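
Tying that back to the monitoring point: below is a minimal sketch of the 
sort of RAID health check you would hang off Nagios.  It assumes FreeBSD's 
gmirror(8), whose status output reports COMPLETE for a healthy mirror; adapt 
it for whatever RAID stack you actually run:

    #!/usr/bin/env python
    # Nagios-style check: exit 0 (OK) if the mirror is COMPLETE, else
    # exit 2 (CRITICAL).  The mirror name "gm0" is only an example.
    import subprocess, sys

    def check_mirror(name="gm0"):
        p = subprocess.run(["gmirror", "status", name],
                           capture_output=True, text=True)
        if p.returncode == 0 and "COMPLETE" in p.stdout:
            print("OK: mirror %s is COMPLETE" % name)
            return 0
        print("CRITICAL: mirror %s is degraded or missing" % name)
        return 2

    sys.exit(check_mirror())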

Beyond that, the suggestion to have at least two of every component of the 
system is the right notion, but you also need the glue that implements 
failover.  That can be RFC 2391-style NAT (LSNAT) to round-robin requests 
across multiple webservers, but a hardware load-balancer (ServerIron, 
NetScaler, etc) with aliveness or health checks will do a better job.
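
To make "aliveness or health checks" concrete, here is a toy Python probe of 
the kind a software balancer runs before round-robining requests; the backend 
addresses and the /health path are hypothetical:

    # Probe each backend and round-robin only over the ones that answer.
    import itertools, urllib.request

    BACKENDS = ["http://10.0.0.11", "http://10.0.0.12"]  # hypothetical pool

    def alive(base, timeout=2):
        try:
            with urllib.request.urlopen(base + "/health", timeout=timeout) as r:
                return r.status == 200
        except OSError:              # URLError and timeouts both land here
            return False

    live = [b for b in BACKENDS if alive(b)]
    if live:
        rr = itertools.cycle(live)
        print("next backend:", next(rr))
    else:
        print("no live backends -- time to page the on-call team")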

The better ones also support full redundancy, so you want a pair of those; 
or, if you would rather use software-based load-balancing, a pair of 
router/firewall/NAT boxes running VRRP or CARP for the network connectivity.

Of course, all of this assumes that the software is more reliable than the 
hardware it runs on.  Good software can be, but it's more common for software 
failures, admin mistakes, and the like to contribute just as much to overall 
system downtime.

Regards,
-- 
-Chuck

_______________________________________________
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
