Alexander Lind wrote:
>> As for RAID on a firewall, uh...no, all things considered, I'd rather
>> AVOID that, actually.  Between added complexity,
> what complexity?

RAID, kiddo.
It's more complex.  It is something else that can go wrong.
And...it DOES go wrong.  Either believe me now, or wish you'd believed me
later.  Your call.  I've spent a lot of time profiting from people who
ignored my advice. :)

>>  added boot time, and
>> disks that can't be used without the RAID controller,
> why would you want to use your disk WITHOUT the raid controller?

Oh, say, maybe your RAID controller failed?
Or the spare machine you had didn't happen to have the same brand and
model RAID card?
Or the replacement RAID card happened to have a different firmware on
it, and the newer firmware wouldn't read your old disk pack?  (yes,
that's a real issue).

>>  it is a major
>> loser when it comes to total up-time if you do things right.  Put a
>> second disk in the machine, and regularly dump the primary to the
>> secondary.  Blow the primary drive, you simply remove it, and boot off
>> the secondary (and yes, you test test test this to make sure you did it
>> right!). 
> Now you're talking crazy. Let's consider the two setups:
> No-raid setup:
>   - two separately controlled disks, you are in charge of syncing
> between them

yep.  you better test your work from time to time.
(wow...come to think of it, you better test your RAID assumptions, too.
 Few people do that, they just assume "it works".  This leads to people
proving me right about simplicity vs. complexity)
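
The syncing itself is close to trivial.  Roughly something like this,
run from cron or by hand (device names and mount point are examples,
not gospel; the standby disk needs its own boot blocks and fstab, and
yes, you test that it actually boots):

    # wipe and recreate the standby root, then copy the live one onto it
    # (wd1a and /altroot are examples -- use your own layout)
    newfs /dev/rwd1a
    mount /dev/wd1a /altroot
    cd /altroot && dump -0af - / | restore -rf -
    umount /altroot

(If you'd rather not roll your own, look at ROOTBACKUP in daily(8),
which can copy the root filesystem to an /altroot disk for you.)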

>   - if one dies, the machine goes down, and you go to the machine, and
> manually boot from the backup disk

yep.  Meanwhile, the system has been running just fine on the SECONDARY
SYSTEM.

>   - IF you had important data on the dead disk not yet backed up, you
> are screwed.

Ah, so you are in the habit of keeping important, non-backed up data on
your firewall?  wow.

> you could almost look at this as poor mans manual pretend raid.

Or as part of RAIC: Redundant Array of Inexpensive Computers.

> Raid setup:
>   - two disks, constantly synced, if one dies, the machine does NOT go down

you are funny.  Or inexperienced.

>   - if a disk fails, just go and plug a new one in _at your
> convenience*_ and it will automatically rebuild, a task any person could
> perform with proper direction. Not a second's downtime.

That's the way it is SUPPOSED to work.
Reality is very, very different sometimes.

Simple systems have simple problems.
Complex systems have complex problems.

The worst down-time events I've ever seen always seem to involve a RAID
system, usually managed by someone who said, "does NOT go down!", who
believed that complexity was the solution to a problem.

A RAID controller never causes downtime in a system it's not installed
in.  Power distribution boards don't fail on machines that don't have
them.  Hotplug backplanes don't fail on machines that don't have them.
(seen 'em all happen).

> * this is _very_ important if your machine is hosted where you don't
> have easy physical access to it. Machines at a colo center would be a
> very common scenario.

That is correct... IF that was what we were talking about.  It isn't.
You keep trying to use the wrong special case for the topic at hand.

Design your solutions to meet the problem in front of you, not a totally
unrelated problem.

>>  RAID is great when you have constantly changing data and you
>> don't want to lose ANYTHING EVER (i.e., mail server).  When you have a
>> mostly-static system like a firewall, there are simpler and better ways.
>>   
> RAID is great for any server.

WRONG.
It is good for the right systems in the right places.  There are a lot
of those places.
It is great when administered by someone who understands the limitations
of it.  That, sadly, is uncommon.

> So are scsi drives. 

I've been hearing that "SCSI is better!" stuff for 20 years, most of
that while working in service and support of LOTS of companies' computers.

It *may* be true that SCSI drives are more reliable than IDE drives,
though I suspect that even if it is true on average, the variation
between models is probably greater than the difference between
interfaces.  But that's just the drive, and I'm giving you that.

HOWEVER, by the time you add the SCSI controller, the software and the
other stuff in a SCSI solution, you have a much more cranky beast than
your IDE disk systems usually are.  No, it isn't supposed to be that
way, but experience has shown me that SCSI cards suck, SCSI drivers
suck, you rarely have the right cables and terminators on hand, and
people rarely screw up IDE drivers or chips as badly as they do the SCSI
chips and drivers (and I am most certainly not talking just OpenBSD
here).  No question in my mind on this.  I've seen too many bad things
happen with SCSI...none of which should have...but they did, anyway.

> If you are a company
> that loses more money on a few hours (or even minutes) downtime than it
> costs to invest in proper servers with proper hw raid + scsi disks, then
> you are ill-advised _not_ to raid all your mission-critical servers. And
> have backup machines, too!  Preferably load-balanced.

No, if controlling downtime is important to you, you have to look at the
ENTIRE solution, not chant mantras that you don't fully understand about
tiny little bits of individual computers that make up whole systems
(note: "system" here being used to indicate much more than one computer).

>> A couple months ago, our Celeron 600 firewall seemed to be having
>> "problems", which we thought may have been due to processor load.  We
>> were able to pull the disk out of it, put it in a much faster machine,
>> adjust a few files, and we were back up and running quickly...and found
>> that the problem was actually due to a router misconfig and a run-away
>> nmap session.  Would not have been able to do that with a RAID card.
>>   
> Next time, you may want to check what the machine is actually doing
> before you start blaming your hardware.
> I personally would not trust the OS setup on one machine to run smoothly
> in any machine not more or less identical to itself as far as the hw
> goes. Especially not for a production unit.

Ah, a windows user, I see. ;)
Understand how OpenBSD works, and you will understand that this is not a
problem.  It is the same kernel, same supporting files installed to the
same places in the same way and doing the same thing, whether it be on a
486 or a P4.  It is just a (very) few config files that are different.
It's truly wonderful.  It's how things should be.
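
If you are wondering which files: on the boxes I've moved around, it has
mostly been the handful that name the hardware.  Roughly (a sketch, not
a complete list):

    /etc/fstab          # disk device names change (wd0a -> sd0a, etc.)
    /etc/hostname.*     # NIC driver names follow the card (fxp0 -> em0, ...)
    /etc/pf.conf        # only if it names interfaces directly instead of
                        # using macros
    /etc/myname         # only if the box is taking on a new identity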

> But if you really wanted to, you could move the entire raid array over
> to a different machine, if that makes you happy.

Assuming you have practiced and practiced and practiced this process.
Do it wrong, you can kiss all copies of your data bye-bye, too.  Some
RAID controllers make it really easy to do this.  Others make it really
easy to clear your disks of all data...And sometimes, two cards with
really similar model numbers in machines you thought were really close
to being the same have really big differences you didn't anticipate.

Don't get me wrong, RAID has its place, and it has a very good place on
a lot of systems, maybe even most systems that call themselves servers
(and if it wasn't for the cost, most systems that call themselves
workstations, too).  I have a snootload of different types of RAID
systems around here (and btw, bioctl(8) rocks!).  My firewall runs
ccd(4) mirroring, in fact (mostly because I'm curious how it fails in
real life.  All things considered, I much prefer the design I described
earlier).
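
Part of why bioctl(8) rocks: checking an array is one command, easy to
drop into a cron job.  A rough sketch (the controller name is just an
example; use whatever bio(4)-capable device you have):

    # show volume and disk status
    bioctl ami0

    # from cron: only makes noise (and thus mail) when something is unhappy
    bioctl ami0 | grep -Ei 'degraded|offline|rebuild|fail'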

But in /this/ case, we are talking about a particular application,
firewalls in an office.  It doesn't really matter one bit what would be
more appropriate at a CoLo site if that's not what we are talking about.

OpenBSD makes it almost trivial to make entire redundant pairs of
machines.  Think of it as RAID on steroids...it isn't just redundant
disks, it is redundant NICs, power supplies, disk controllers, cables,
processors, memory, cases...everything.  PLUS, it not only helps you
with your uptime in case of failure, it also makes a lot of other things
easier, such as software and hardware upgrades, so you are more
likely to do upgrades when needed.  At this point, RAID and redundant
power supplies and such just make life more expensive and more complex,
not better.
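
The redundant pair itself is basically carp(4) for the shared addresses
plus pfsync(4) to keep the pf state tables in step.  A rough sketch of
one box (addresses, interface names and the password are made up; the
partner looks the same except for advskew, and the exact syntax shifts
a bit between releases):

    # /etc/hostname.carp0 -- the address the LAN actually points at
    inet 192.168.1.1 255.255.255.0 NONE vhid 1 carpdev fxp0 pass sekrit advskew 0

    # /etc/hostname.pfsync0 -- state sync over a dedicated crossover link
    up syncdev fxp1

    # and in pf.conf, let both protocols through:
    pass on fxp1 proto pfsync
    pass proto carp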


Last February, I had an opportunity to replace a bunch of old servers at
my company's branch offices.  About 11 branches got big, "classic"
servers; about 15 smaller branches got workstations converted into
servers by adding an Accusys box for simple SATA mirroring.  The big
branches needed the faster performance of the big server, the small
branches just needed to get rid of the old servers that were getting
very unreliable (guess what?  It was the SCSI backplanes and the RAID
controllers that were causing us no end of trouble).  It has been an
interesting test of "server vs. workstation" and "SATA vs. SCSI".

It is actually hard to tell who is ahead...
    Disk failures have been a close call: about the same number of SCSI
disks have failed as SATA disks, but the SCSI systems have ten disks vs.
three for the SATA machines, so there are more SCSI disks, but fewer
SCSI systems.  You might look at the disk count; the users look at "is my
system working?".
    Three SCSI disks have off-lined themselves, but simply unplugging
and plugging them back in has resulted in them coming back up, and
staying up.  Scary.
    Most of the SATA failures have been "clean" failures, though one
drive was doing massive retries, and eventually "succeeded", so the
system performance was horrible until we manually off-lined the
offending drive (which was easy to spot by looking at the disk activity
lights).
    One system actually lost data: one of the SCSI systems had a disk
off-line itself that was not noted by on-site staff, and a week later,
that drive's mirror failed, too (note: the first drive just off-lined
itself...no apparent reason, and it happily went back on-line).
Unfortunately, the second-rate OS on these things lacked something like
bioctl(8) to easily monitor the machines...  Complexity doesn't save you
from user error...though it might add to it.
    The drive failures on the SATA systems immediately result in a
phone call, "there's this loud beeping coming from my server room!".
    We went through a month or two where drives seemed to be popping all
over the place...and since then, things have been very reliable...
    Working with the RAID system on the SCSI machines is something that
needs to be practiced...working with the Accusys boxes is simplicity in
the extreme.
    The little machines are cheap enough we have three spares in boxes
waiting to be next-day shipped to anywhere we MIGHT have a problem.
    The big machines cost a lot of money to next-day ship anywhere, so
we don't even think of it unless we are sure we've got a big problem.
    Only one machine has been next-day shipped: one of the big ones, at
a price of about 1/4th the cost of an entire little machine (after the
dual-disk failures, I figured let's get a new machine on site, get them
back up, and we'll worry about the old machine later).

About eight months into the project, I can say, the performance of the
big RAID 1+0 systems rocks, but I love the simplicity of the little
machines...Ask me again in three years. :)

Nick.
