>> what complexity?
>>     
>
> RAID, kiddo.
> It's more complex.  It is something else that can go wrong.
> And...it DOES go wrong.  Either believe me now, or wish you believed me
> later.  Your call.  I spent a lot of time profiting from people who
> ignored my advice. :)
>   
Of course RAID is more complex on a hardware level, but that doesn't
exactly make it more complex for _me_, the user, does it?
I have deployed lots and lots of servers, both with and without RAID and
using various different OSes, and I grant you that it used to be a
little tricky to get, for example, Slackware to boot off some
semi-supported RAID devices back in the day, but nowadays it's all pretty
simple imho.
And the times disks have failed, we plopped in new disks, they got
rebuilt, and I lived happily ever after.
So really, where is your profit margin on someone like me? ;)
>   
>>>  added boot time, and
>>> disks that can't be used without the RAID controller,
>>>       
>> why would you want to use your disk WITHOUT the raid controller?
>>     
>
> Oh, say, maybe your RAID controller failed?
> Or the spare machine you had didn't happen to have the same brand and
> model RAID card?
> Or the replacement RAID card happened to have a different firmware on
> it, and the newer firmware wouldn't read your old disk pack?  (yes,
> that's a real issue).
>   
If indeed the RAID card failed, unlikely as that would be, then that could
be a little messy. Not that I have ever had this problem, but you ought to
be able to downgrade a RAID card's firmware if you run into the firmware
problem?
>   
>>>  it is a major
>>> loser when it comes to total up-time if you do things right.  Put a
>>> second disk in the machine, and regularly dump the primary to the
>>> secondary.  Blow the primary drive, you simply remove it, and boot off
>>> the secondary (and yes, you test test test this to make sure you did it
>>> right!). 
>>>       
>> Now you're talking crazy. Let's consider the two setups:
>> No-raid setup:
>>   - two separately controlled disks, you are in charge of syncing
>> between them
>>     
>
> yep.  you better test your work from time to time.
> (wow...come to think of it, you better test your RAID assumptions, too.
>  Few people do that, they just assume "it works".  This leads to people
> proving me right about simplicity vs. complexity)
>   
If you configure it right, it tends to work right. At least it does for me.
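And to be fair, the manual-sync half of your scheme isn't much machinery
either. Here is a minimal sketch of the idea, run from cron -- the device
names and the raw dd copy are my assumptions for illustration, not a
recommendation for any particular box (dump(8)/restore(8) per filesystem
would be kinder to a live system):

    #!/usr/bin/env python
    # Illustrative sketch only: periodically copy the primary disk onto the
    # secondary, block for block, so the secondary can be booted if the
    # primary dies.  Device names below are assumptions, not a real setup.
    import subprocess
    import sys
    import time

    SRC = "/dev/rwd0c"   # raw device for the whole primary disk (assumed)
    DST = "/dev/rwd1c"   # raw device for the whole secondary disk (assumed)

    def copy_disk():
        """Run dd and report whether it exited cleanly."""
        status = subprocess.call(["dd", "if=" + SRC, "of=" + DST, "bs=1m"])
        return status == 0

    if __name__ == "__main__":
        ok = copy_disk()
        # Leave a trail so the "test, test, test" step has something to check.
        with open("/var/log/disk-sync.log", "a") as log:
            log.write("%s sync %s\n" % (time.ctime(), "ok" if ok else "FAILED"))
        sys.exit(0 if ok else 1)

The point is just that the whole "poor man's mirror" is a cron job and a
log file; whether that is simpler or scarier than a RAID card is exactly
what we are arguing about.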
>   
>>   - if one dies, the machine goes down, and you go to the machine, and
>> manually boot from the backup disk
>>     
>
> yep.  Meanwhile, the system has been running just fine on the SECONDARY
> SYSTEM.
>   
>   
>>   - IF you had important data on the dead disk not yet backed up, you
>> are screwed.
>>     
>
> Ah, so you are in the habit of keeping important, non-backed up data on
> your firewall?  wow.
>   
Of course, that's where I store my porn.
>   
>> you could almost look at this as poor mans manual pretend raid.
>>     
>
> Or as part of RAIC: Redundant Array of Inexpensive Computers.
>   
which may not always be feasible in an already densely packed rack where
every U is expensive.
>   
>> Raid setup:
>>   - two disks, constantly synced, if one dies, the machine does NOT go down
>>     
>
> you are funny.  Or inexperienced.
>   
Master, you flatter me!
Maybe I'm a lucky bastard, but every single disk failure I have seen in
a RAIDed machine has been solved by pulling the dead disk out and putting
a new one back in.
It rebuilds for a while, and then the machine is happy again.
I think this has happened to servers I maintain or help maintain five or
so times now.
>   
>>   - if a disk fails, just go and plug a new one in _at your
>> convenience*_ and it will automatically rebuild, a task any person could
>> perform with proper direction. Not a second's downtime.
>>     
>
> That's the way it is SUPPOSED to work.
> Reality is very, very different some times.
>   
My servers must be living in fantasy land or something.
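Part of why they stay in fantasy land is probably that somebody actually
notices when an array goes degraded. A tiny monitoring sketch along those
lines, assuming an OpenBSD box where the volume shows up as sd0 and
bioctl(8) reports a status word like "Online" -- the device name and the
exact status strings are assumptions, so check them against your own
controller's output before trusting it:

    #!/usr/bin/env python
    # Sketch: complain loudly if the RAID volume is anything but healthy.
    # Assumes the volume is sd0 and that bioctl(8) output contains a status
    # word such as "Online" or "Degraded"; verify both against your own
    # hardware before relying on this.
    import subprocess
    import sys

    VOLUME = "sd0"   # assumed device name of the RAID volume

    def volume_status():
        """Return bioctl's output for the volume, or None if it failed."""
        try:
            out = subprocess.check_output(["bioctl", VOLUME])
        except (OSError, subprocess.CalledProcessError):
            return None
        return out.decode("ascii", "replace")

    if __name__ == "__main__":
        status = volume_status()
        if status is None or "Online" not in status:
            # In real life: mail the admin, page someone, make noise.
            sys.stderr.write("RAID volume %s is not healthy!\n%s\n"
                             % (VOLUME, status or "(no bioctl output)"))
            sys.exit(1)
        print("RAID volume %s looks fine." % VOLUME)

Cron that hourly and a degraded mirror stops being a silent time bomb.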
> Simple systems have simple problems.
> Complex systems have complex problems.
>
> Worst down-time events I've ever seen always seem to involve a RAID
> system, usually managed by someone who said, "does NOT go down!", who
> believed that complexity was the solution to a problem
>   
How exactly did the machine go down then, I wonder?
> A RAID controller never causes downtime in a system it's not installed
> in.  Power distribution boards don't fail on machines that don't have
> them.  Hotplug backplanes don't fail on machines that don't have them.
> (seen 'em all happen).
>   
Flawless logic, sir. I wish courts would apply it the same way
concerning rapists' genitals and lying politicians' left brain halves (a
study I read suggested the left side is most active when you lie).
>   
>> * this is _very_ important if your machine is hosted where you don't
>> have easy physical access to it. Machines at a colo center would be a
>> very common scenario.
>>     
>
> That is correct... IF that was what we were talking about.  It isn't.
> You keep trying to use the wrong special case for the topic at hand.
>   
I don't think an office firewall should be any less failsafe or easy to
maintain than one at a colo, BUT the colo is for sure more important in
that respect.
> Design your solutions to meet the problem in front of you, not a totally
> unrelated problem.
>   
I don't think they are unrelated.
>   
>>>  RAID is great when you have constantly changing data and you
>>> don't want to lose ANYTHING EVER (i.e., mail server).  When you have a
>>> mostly-static system like a firewall, there are simpler and better ways.
>>>   
>>>       
>> RAID is great for any server.
>>     
>
> WRONG.
> It is good for the right systems in the right places.  There are a lot
> of those places.
> It is great when administered by someone who understands the limitations
> of it.  That, sadly, is uncommon.
>   
OK, maybe it's not so good for someone who doesn't understand what it
does, or how to set it up. But that applies to much more than just
RAID systems.
That's not to say it's always _necessary_. But it's still good to have, imho.
>   
>> So are scsi drives. 
>>     
>
> I've been hearing that "SCSI is better!" stuff for 20 years, most of
> that while working in service and support of LOTS of companies' computers.
>
> It *may* be true that SCSI drives are more reliable than IDE drives,
> though I really suspect if it is really true on average, the variation
> between models is probably greater than the difference between
> interfaces.  But that's just the drive, and I'm giving you that.
>
> HOWEVER, by the time you add the SCSI controller, the software and the
> other stuff in a SCSI solution, you have a much more cranky beast than
> your IDE disk systems usually are.  No, it isn't supposed to be that
> way, but experience has shown me that SCSI cards suck, SCSI drivers
> suck, you rarely have the right cables and terminators on hand, and
> people rarely screw up IDE drivers or chips as badly as they do the SCSI
> chips and drivers (and I am most certainly not talking just OpenBSD
> here).  No question in my mind on this.  I've seen too many bad things
> happen with SCSI...none of which that should have...but they did, anyway.
>   
Well, there is something we can agree on. The umpteen different interface
standards, those are very annoying.
I have not really had any problems with either SCSI card drivers or
RAID controller drivers in Linux or any BSD that I've used, but
you may have had different experiences?
Since SCSI _is_ a more complex system than IDE/SATA, it's not surprising
that those drivers historically have had more bugs in them. Especially
with some manufacturers' stupid non-disclosure BS.
>   
>> If you are a company
>> that loses more money on a few hours (or even minutes) downtime than it
>> costs to invest in proper servers with proper hw raid + scsi disks, then
>> you are ill-advised _not_ to RAID all your mission-critical servers. And
>> have backup machines, too!  Preferably load-balanced.
>>     
>
> No, if controlling downtime is important to you, you have to look at the
> ENTIRE solution, not chant mantras that you don't fully understand about
> tiny little bits of individual computers that make up whole systems
> (note: "system" here being used to indicate much more than one computer).
>   
What do you know about the rest of my systems, eh?  :p
I never said it was about _one_ computer only. I'm just saying that, to
me, spending a little more and using RAID/SCSI on an enterprise firewall
makes more sense than saving a few dollars with some dual IDE disk setup.
But I'm still CARPing them.
>   
>>> A couple months ago, our Celeron 600 firewall seemed to be having
>>> "problems", which we thought may have been due to processor load.  We
>>> were able to pull the disk out of it, put it in a much faster machine,
>>> adjust a few files, and we were back up and running quickly...and found
>>> that the problem was actually due to a router misconfig and a run-away
>>> nmap session.  Would not have been able to do that with a RAID card.
>>>   
>>>       
>> Next time, you may want to check what the machine is actually doing
>> before you start blaming your hardware.
>> I personally would not trust the OS setup on one machine to run smoothly
>> in any machine not more or less identical to itself as far as the hw
>> goes. Especially not for a production unit.
>>     
>
> Ah, a windows user, I see. ;)
> Understand how OpenBSD works, you will understand that this is not a
> problem.  It is the same kernel, same supporting files installed to the
> same places in the same way and doing the same thing, whether it be on a
> 486 or a P4.  It is just a (very) few config files that are different.
> It's truly wonderful.  It's how things should be.
>   
Hehe, you are right about that. I did try to transplant a Windows disk a
looong time ago, and that didn't go very well ;)
I have never transplanted an OpenBSD disk, but I imagine that would go
better, provided you hadn't customized the kernel for the one machine it
was running on.
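If it helps, the machine-specific files are few enough to eyeball by hand.
A quick sketch of what I'd review after such a transplant -- the list is
just my guess at the usual suspects (hostname, gateway, per-NIC config,
fstab, pf rules that name interfaces), not an authoritative set:

    #!/usr/bin/env python
    # Sketch: print the handful of per-machine files worth reviewing after
    # moving an OpenBSD disk into different hardware.  The file list is a
    # guess at the usual suspects, not an exhaustive or official set.
    import glob
    import os

    PER_MACHINE = ["/etc/myname", "/etc/mygate", "/etc/fstab", "/etc/pf.conf"]
    # One hostname.* file per NIC; the names change when the card changes.
    PER_MACHINE += sorted(glob.glob("/etc/hostname.*"))

    for path in PER_MACHINE:
        print("==> %s" % path)
        if os.path.exists(path):
            with open(path) as f:
                print(f.read().rstrip())
        else:
            print("(missing)")
        print()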
>   
>> But if you really wanted too, you could move the entire raid array over
>> to a different machine, if that makes you happy.
>>     
>
> Assuming you have practiced and practiced and practiced this process.
> Do it wrong, you can kiss all copies of your data bye-bye, too.  Some
> RAID controllers make it really easy to do this.  Others make it really
> easy to clear your disks of all data...And sometimes, two cards with
> really similar model numbers in machines you thought were really close
> to being the same have really big differences you didn't anticipate.
>   
Good point. I haven't ever done this myself, and I'm hoping it stays that
way.
> Don't get me wrong, RAID has its place, and it has a very good place on
> a lot of systems, maybe even most systems that call themselves servers
> (and if it wasn't for the cost, most systems that call themselves
> workstations, too).  I have a snootload of different types of RAID
> systems around here (and btw, bioctl(8) rocks!).  My firewall runs
> ccd(4) mirroring, in fact (mostly because I'm curious how it fails in
> real life.  All things considered, I much prefer the design I described
> earlier).
>
> But in /this/ case, we are talking about a particular application,
> firewalls in an office.  It doesn't really matter one bit what would be
> more appropriate at a CoLo site if that's not what we are talking about.
>   
Actually, it does matter for some installations I do, because I administer
them remotely from another office!
> OpenBSD makes it almost trivial to make entire redundant pairs of
> machines.  Think of it as RAID on steroids...it isn't just redundant
> disks, it is redundant NICs, power supplies, disk controllers, cables,
> processors, memory, cases...everything.  PLUS, it not only helps you
> with your uptime in case of failure, it also makes a lot of other things
> easier, too, such as software and hardware upgrades, so you are more
> likely to do upgrades when needed.  At this point, RAID and redundant
> power supplies and such just make life more expensive and more complex,
> not better.
>   
I very much agree with all this.
The only bad thing here is that you sometimes cannot fit more machines
into the space you have.
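For anyone reading along, the redundant-pair setup is basically carp(4)
plus pfsync(4), and even keeping an eye on it is simple. A small sketch of
checking which box currently holds the shared address, assuming the
virtual IP lives on carp0 and that ifconfig prints a "carp: MASTER ..."
line -- both are assumptions about a typical setup, not anyone's actual
config:

    #!/usr/bin/env python
    # Sketch: report whether this firewall is currently the CARP master or
    # the backup.  Assumes the shared address lives on carp0 and that
    # ifconfig prints a line beginning with "carp:" followed by the state.
    import subprocess
    import sys

    IFACE = "carp0"   # assumed CARP interface name

    def carp_state():
        """Return MASTER, BACKUP, or UNKNOWN for the CARP interface."""
        out = subprocess.check_output(["ifconfig", IFACE]).decode("ascii", "replace")
        for line in out.splitlines():
            line = line.strip()
            if line.startswith("carp:"):
                return line.split()[1]
        return "UNKNOWN"

    if __name__ == "__main__":
        state = carp_state()
        print("%s is %s" % (IFACE, state))
        sys.exit(0 if state in ("MASTER", "BACKUP") else 1)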
>
> Last February, I had an opportunity to replace a bunch of old servers at
> my company's branch offices.  About 11 branches got big, "classic"
> servers, about 15 smaller branches got workstations converted into
> servers by adding an Accusys box for simple SATA mirroring.  The big
> branches needed the faster performance of the big server, the small
> branches just needed to get rid of the old servers that were getting
> very unreliable (guess what?  It was the SCSI back planes and the RAID
> controllers that were causing us no end of trouble).  It has been an
> interesting test of "server vs. workstation" and "SATA vs. SCSI".
>   
Really? What kind of RAID controllers were they?
> It is actually hard to tell who is ahead...
>     Disk failures have been a close call: about the same number of SCSI
> disks have failed as SATA disks, but the SCSI systems have ten disks vs.
> three for the SATA machines, so there are more SCSI disks, but fewer
> SCSI systems.  You will look at the disk count, the users look at "is my
> system working?".
>     Three SCSI disks have off-lined themselves, but simply unplugging
> and plugging them back in has resulted in them coming back up, and
> staying up.  Scary.
>   
Hmm, that would worry me a little too :p .. how old are these disks?
>     Most of the SATA failures have been "clean" failures, though one
> drive was doing massive retries, and eventually "succeeded", so the
> system performance was horrible until we manually off-lined the
> offending drive (which was easy to spot by looking at the disk activity
> lights).
>     One system actually lost data: one of the SCSI systems had a disk
> off-line itself that was not noted by on-site staff, and a week later,
> that drive's mirror failed, too (note: the first drive just off-lined
> itself..no apparent reason, and it happily went back on-line).
> Unfortunately, the second-rate OS on these things lacked something like
> bioctl(8) to easily monitor the machines...  Complexity doesn't save you
> from user error...though it might add to it.
>     The drive failures on the SATA systems immediately results in a
> phone call, "there's this loud beeping coming from my server room!".
>     We went through a month or two where drives seemed to be popping all
> over the place...and since then, things have been very reliable...
>     Working with the RAID system on the SCSI machines is something that
> needs to be practiced...working with the Accusys boxes is simplicity in
> the extreme.
>     The little machines are cheap enough we have three spares in boxes
> waiting to be next-day shipped to anywhere we MIGHT have a problem.
>     The big machines cost a lot of money to next-day ship anywhere, so
> we don't even think of it unless we are sure we got a big problem.
>     Only one machine has been next-day shipped: one of the big ones, at
> a price of about 1/4th the cost of an entire little machine (after the
> dual-disk failures, I figured let's get a new machine on site, get them
> back up, and we'll worry about the old machine later).
>   
Sounds like you're working for a pretty massive organization there... may
I ask which one?
> About eight months into the project, I can say, the performance of the
> big RAID 1+0 systems rock, but I love the simplicity of the little
> machines...Ask me again in three years. :)
>   
Heh, I'll put that in my calendar :p

Alec
> Nick.
