Alexander Lind wrote:
>> As for RAID on a firewall, uh...no, all things considered, I'd rather
>> AVOID that, actually. Between added complexity,
> what complexity?
RAID, kiddo. It's more complex. It is something else that can go wrong. And...it DOES go wrong. Either believe me now, or wish you believed me later. Your call. I spent a lot of time profiting from people who ignored my advice. :)

>> added boot time, and
>> disks that can't be used without the RAID controller,
> why would you want to use your disk WITHOUT the raid controller?

Oh, say, maybe your RAID controller failed? Or the spare machine you had didn't happen to have the same brand and model RAID card? Or the replacement RAID card happened to have different firmware on it, and the newer firmware wouldn't read your old disk pack? (yes, that's a real issue).

>> it is a major
>> loser when it comes to total up-time if you do things right. Put a
>> second disk in the machine, and regularly dump the primary to the
>> secondary. Blow the primary drive, you simply remove it, and boot off
>> the secondary (and yes, you test test test this to make sure you did it
>> right!).
> Now you're talking crazy. Let's consider the two setups:
> No-raid setup:
> - two separately controlled disks, you are in charge of syncing
> between them

yep. You better test your work from time to time (a rough sketch of the sort of sync I mean is a bit further down). (wow...come to think of it, you better test your RAID assumptions, too. Few people do that, they just assume "it works". This leads to people proving me right about simplicity vs. complexity.)

> - if one dies, the machine goes down, and you go to the machine, and
> manually boot from the backup disk

yep. Meanwhile, the system has been running just fine on the SECONDARY SYSTEM.

> - IF you had important data on the dead disk not yet backed up, you
> are screwed.

Ah, so you are in the habit of keeping important, non-backed-up data on your firewall? wow.

> you could almost look at this as poor man's manual pretend raid.

Or as part of RAIC: Redundant Array of Inexpensive Computers.

> Raid setup:
> - two disks, constantly synced, if one dies, the machine does NOT go down

you are funny. Or inexperienced.

> - if a disk fails, just go and plug a new one in _at your
> convenience*_ and it will automatically rebuild, a task any person could
> perform with proper direction. Not a second's downtime.

That's the way it is SUPPOSED to work. Reality is very, very different sometimes. Simple systems have simple problems. Complex systems have complex problems. The worst down-time events I've ever seen always seem to involve a RAID system, usually managed by someone who said "does NOT go down!", who believed that complexity was the solution to a problem. A RAID controller never causes downtime in a system it's not installed in. Power distribution boards don't fail on machines that don't have them. Hotplug backplanes don't fail on machines that don't have them. (seen 'em all happen).

> * this is _very_ important if your machine is hosted where you don't
> have easy physical access to it. Machines at a colo center would be a
> very common scenario.

That is correct... IF that was what we were talking about. It isn't. You keep trying to use the wrong special case for the topic at hand. Design your solutions to meet the problem in front of you, not a totally unrelated problem.

>> RAID is great when you have constantly changing data and you
>> don't want to lose ANYTHING EVER (i.e., mail server). When you have a
>> mostly-static system like a firewall, there are simpler and better ways.
>>
> RAID is great for any server.

WRONG. It is good for the right systems in the right places. There are a lot of those places. It is great when administered by someone who understands the limitations of it. That, sadly, is uncommon.
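For what it's worth, the "dump the primary to the secondary" business is nothing exotic. Here's a rough sketch of the idea (and I do mean sketch: the device names, wd0 primary and wd1 secondary, and the /backup mount point are made up for the example, and you test this on YOUR hardware before you trust it):

   # one-time prep: partition and newfs wd1 like wd0, and make it bootable
   # with installboot(8) for your platform/release
   mount /dev/wd1a /backup
   dump -0af - / | (cd /backup && restore -rf -)
   # repeat for the other filesystems (/usr, /var, ...), clean up the
   # restoresymtable files restore(8) leaves behind, then umount /backup

Cron it, note that the stock /etc/daily script can already copy the root filesystem to an /altroot partition for you (see daily(8)), and actually boot from the secondary now and then to prove it works.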
> So are scsi drives.

I've been hearing that "SCSI is better!" stuff for 20 years, most of that while working in service and support of LOTS of companies' computers. It *may* be true that SCSI drives are more reliable than IDE drives, though I suspect that even if it is true on average, the variation between models is probably greater than the difference between interfaces. But that's just the drive, and I'm giving you that. HOWEVER, by the time you add the SCSI controller, the software and the other stuff in a SCSI solution, you have a much more cranky beast than your IDE disk systems usually are. No, it isn't supposed to be that way, but experience has shown me that SCSI cards suck, SCSI drivers suck, you rarely have the right cables and terminators on hand, and people rarely screw up IDE drivers or chips as badly as they do the SCSI chips and drivers (and I am most certainly not talking just OpenBSD here). No question in my mind on this. I've seen too many bad things happen with SCSI...none of which should have...but they did, anyway.

> If you are a company
> that loses more money on a few hours (or even minutes) of downtime than it
> costs to invest in proper servers with proper hw raid + scsi disks, then
> you are ill-advised _not_ to raid all your mission-critical servers. And
> have backup machines, too! Preferably load-balanced.

No. If controlling downtime is important to you, you have to look at the ENTIRE solution, not chant mantras that you don't fully understand about tiny little bits of individual computers that make up whole systems (note: "system" here being used to indicate much more than one computer).

>> A couple months ago, our Celeron 600 firewall seemed to be having
>> "problems", which we thought may have been due to processor load. We
>> were able to pull the disk out of it, put it in a much faster machine,
>> adjust a few files, and we were back up and running quickly...and found
>> that the problem was actually due to a router misconfig and a run-away
>> nmap session. Would not have been able to do that with a RAID card.
>>
> Next time, you may want to check what the machine is actually doing
> before you start blaming your hardware.
> I personally would not trust the OS setup on one machine to run smoothly
> in any machine not more or less identical to itself as far as the hw
> goes. Especially not for a production unit.

Ah, a Windows user, I see. ;) Understand how OpenBSD works, and you will understand that this is not a problem. It is the same kernel, the same supporting files installed to the same places in the same way and doing the same thing, whether it be on a 486 or a P4. It is just a (very) few config files that are different (a short sketch of which ones is below). It's truly wonderful. It's how things should be.

> But if you really wanted to, you could move the entire raid array over
> to a different machine, if that makes you happy.

Assuming you have practiced and practiced and practiced this process. Do it wrong, and you can kiss all copies of your data bye-bye, too. Some RAID controllers make it really easy to do this. Others make it really easy to clear your disks of all data...And sometimes, two cards with really similar model numbers in machines you thought were really close to being the same have really big differences you didn't anticipate.
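For anyone wondering what "a (very) few config files" means in practice, it is roughly this (purely illustrative; the fxp0/em0 driver names are made up):

   # check dmesg on the new hardware to see what the NICs are called, then
   # rename the hostname.if(5) files to match, e.g. old fxp0 -> new em0:
   mv /etc/hostname.fxp0 /etc/hostname.em0
   # if pf.conf names interfaces directly, fix those too (or use a macro
   # such as ext_if="em0" so there is exactly one place to change);
   # /etc/fstab only matters if the disk device name changes (e.g. wd0 -> sd0)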
Don't get me wrong, RAID has its place, and it has a very good place on a lot of systems, maybe even most systems that call themselves servers (and if it wasn't for the cost, most systems that call themselves workstations, too). I have a snootload of different types of RAID systems around here (and btw, bioctl(8) rocks!). My firewall runs ccd(4) mirroring, in fact (mostly because I'm curious how it fails in real life; all things considered, I much prefer the design I described earlier).

But in /this/ case, we are talking about a particular application: firewalls in an office. It doesn't really matter one bit what would be more appropriate at a colo site if that's not what we are talking about. OpenBSD makes it almost trivial to make entire redundant pairs of machines (a bare-bones sketch of the usual way is a bit further down). Think of it as RAID on steroids...it isn't just redundant disks, it is redundant NICs, power supplies, disk controllers, cables, processors, memory, cases...everything. PLUS, it not only helps you with your uptime in case of failure, it also makes a lot of other things easier, such as software and hardware upgrades, so you are more likely to do upgrades when needed. At this point, RAID and redundant power supplies and such just make life more expensive and more complex, not better.

Last February, I had an opportunity to replace a bunch of old servers at my company's branch offices. About 11 branches got big, "classic" servers; about 15 smaller branches got workstations converted into servers by adding an Accusys box for simple SATA mirroring. The big branches needed the faster performance of the big server; the small branches just needed to get rid of the old servers that were getting very unreliable (guess what? It was the SCSI backplanes and the RAID controllers that were causing us no end of trouble). It has been an interesting test of "server vs. workstation" and "SATA vs. SCSI". It is actually hard to tell who is ahead...

Disk failures have been a close call: about the same number of SCSI disks have failed as SATA disks, but the SCSI systems have ten disks vs. three for the SATA machines, so there are more SCSI disks, but fewer SCSI systems. You will look at the disk count; the users look at "is my system working?". Three SCSI disks have off-lined themselves, but simply unplugging them and plugging them back in has resulted in them coming back up, and staying up. Scary. Most of the SATA failures have been "clean" failures, though one drive was doing massive retries and eventually "succeeded", so system performance was horrible until we manually off-lined the offending drive (which was easy to spot by looking at the disk activity lights).

One system actually lost data: one of the SCSI systems had a disk off-line itself, which was not noted by on-site staff, and a week later that drive's mirror failed, too (note: the first drive just off-lined itself, no apparent reason, and it happily went back on-line). Unfortunately, the second-rate OS on these things lacked something like bioctl(8) to easily monitor the machines... Complexity doesn't save you from user error...though it might add to it.

The drive failures on the SATA systems immediately result in a phone call: "there's this loud beeping coming from my server room!". We went through a month or two where drives seemed to be popping all over the place...and since then, things have been very reliable... Working with the RAID system on the SCSI machines is something that needs to be practiced...working with the Accusys boxes is simplicity in the extreme.
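Since I didn't spell out the "redundant pairs" bit above: the usual tools are carp(4) for the shared addresses and pfsync(4) for sharing firewall state. A bare-bones sketch for one member of a pair (addresses, interface names and the password are all made up; check carp(4), pfsync(4) and hostname.if(5) on your release before copying any of it):

   # /etc/hostname.carp0: the shared address the clients point at
   # (the second box uses the same vhid and pass, plus "advskew 100")
   inet 192.168.1.1 255.255.255.0 192.168.1.255 vhid 1 pass somepassword

   # /etc/hostname.pfsync0: state sync over a dedicated crossover cable
   # (older releases spelled it "syncif" rather than "syncdev"; see pfsync(4))
   up syncdev fxp2

   # /etc/sysctl.conf: let the preferred box take the address back when it
   # returns to service
   net.inet.carp.preempt=1

   # pf.conf needs to pass "proto carp" on the real interfaces and
   # "proto pfsync" on the sync interface.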
The little machines are cheap enough that we have three spares in boxes, waiting to be next-day shipped to anywhere we MIGHT have a problem. The big machines cost a lot of money to next-day ship anywhere, so we don't even think of it unless we are sure we've got a big problem. Only one machine has been next-day shipped: one of the big ones, at a price of about 1/4 the cost of an entire little machine (after the dual-disk failures, I figured let's get a new machine on site, get them back up, and we'll worry about the old machine later).

About eight months into the project, I can say the performance of the big RAID 1+0 systems rocks, but I love the simplicity of the little machines...Ask me again in three years. :)

Nick.