Theodore Wynnychenko wrote:
...
> Anyway, I had not considered the possibility of a controller failure.
bad...

> I also wondered if it was possible to remove a drive from the mirrored
> hardware array, and see if it is recognized by a plain old SATA controller.
> So, I did this by shutting the system down, enabling the motherboard's SATA
> controller, and moving one of the drive cables to this standard SATA
> controller.

good...

> Unfortunately, while the array comes up and is accessible, even though it is
> degraded; I cannot access the drive now attached to the standard SATA
> controller. If I try to fsck it, I get an "unknown special file or file
> system" message.
>
> So, it seems, with ami and this megaraid card, I will not be able to recover
> from a controller failure by hooking a drive up to a standard SATA
> controller.

and thus, you find that one person's experience with ONE set of hardware
cannot be universally generalized.

> So, my question: How likely is a raid controller failure (with the LSI
> Megaraid PCI cards),

Wrong question.
The right question would be: so what do you do when it does fail?

Spare RAID hardware is required if you want rapid repair -- if diving for
your backup system after the failure of one little component is not what
you are after. And...I do believe that is the point of RAID.

> and would I be better off just chucking the Megaraid
> card and using software raid with the drives connected via the standard SATA
> controllers?

If you have a single, non-redundant drive, you KNOW you have a
significant exposure to failure, and you will have backups and such, or
be ready to take a bit of downtime and data loss. When you implement
RAID, you start telling yourself you have all kinds of tolerance to
unhappy events, and then you start to believe it.

With a spare controller, you could have rapid repair. Without a spare
controller, the failure of the controller has the EXACT same impact as
the failure of a drive on a single-drive, non-RAID system: complete and
total loss of the data on the system, and recovery of your data from the
backup which you wish you had been making.

Here's the fun (=terrifying) part: WITH a spare controller, things can
get at least as unpleasant as above. If you just set up your RAID
system, "get it working", toss it into production, and hope magic will
happen when something breaks, you would probably have been better off
with a simpler configuration with no RAID. I'm serious. The worst
data-loss events I have seen involved RAID. If you don't understand
your chosen RAID solution, given enough time and/or enough opportunity,
your downtime will be extended and your data will be lost.

Not only should you have "similar" spare controllers, you need IDENTICAL
spare controllers -- right down to the firmware versions. Yes, I have
seen firmware notes with big warnings of "unable to read arrays made by
version X of the firmware or hardware". Are you ready to bet that every
option on the card you have is compatible with every option of the card
you hope to buy when your existing one dies? If you believe that, I have
a question for you: why are there so many updates to RAID controller
firmware if the firmware is so perfect? What's the first bit of advice
you always seem to hear on a new server setup? "Update the firmware."
Why? Because the old one was crap. Amazingly, the one you just installed
is now perfect. Yeah, right.

It's a numbers game. If you have five disks in a machine, maybe you have
a 1 in 4 chance of a disk failing in the two-year life cycle of the
machine. If you put a sixth disk in the machine and RAID the bunch of
'em, you get a MUCH lower likelihood of failure due to a disk, but a
higher likelihood of failure due to the RAID controller (which had no
chance of failing on the system that didn't have one) -- maybe 1 in 20
(or maybe 1 in 200, or 1 in 2000...whatever). You also add the
possibility of failure of process: one drive fails, you say "what's the
rush?", and rather than rushing out and buying a new drive, you send the
old one off for warranty replacement, and in the weeks you wait, a
second drive fails.

I saw this recently... a big array with a few terabytes of important
data blew out a disk. The first failure was not having a spare disk on
hand. The second failure was that rather than running out and BUYING a
new disk and putting it in the machine, the old drive was sent off for
warranty replacement, and the system ran without redundancy until it
came back. At that point, they put a very firm value on the safety of
their data: <$100. In this case, the disaster did not happen.

Pulling numbers out of thin air, I'd say failure of process might be
something like 1 in 3 when you think you have hardware to save you from
failures...more like 1 in 10 when you know you are living on the edge.
(90% of all statistics are made up. 100% of those were...but they may be
closer than you would like.)

If that spare controller costs you $400, but loss of data or extended
downtime would cost you $4000, ok, maybe it is worth the gamble of
living without a spare. If your data is worth, say, $40k, you had better
just spend the money on a spare controller, regardless of what you think
the likelihood of failure is.
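If you want to see how that arithmetic shakes out, here is a trivial
sketch (Python, with my made-up numbers plugged in -- they are
illustrations, not measurements, so swap in your own guesses):

    # Back-of-envelope only: every probability here is a made-up number
    # from the discussion above, not a measurement.
    p_disk_loss_no_raid = 0.25     # ~1 in 4: a disk takes out the non-RAID box
    p_controller_dies   = 0.05     # ~1 in 20: the RAID controller itself fails
    p_process_screwup   = 1 / 3    # ~1 in 3: you fumble the recovery because
                                   # you trusted the hardware to save you

    # With RAID, plain disk failures mostly stop costing you data; what is
    # left is the controller dying times the chance you botch what follows.
    p_loss_with_raid = p_controller_dies * p_process_screwup

    print(f"chance of data loss, no RAID:   ~{p_disk_loss_no_raid:.0%}")
    print(f"chance of data loss, with RAID: ~{p_loss_with_raid:.0%}")

    spare_controller_cost = 400    # dollars
    for data_value in (4_000, 40_000):
        expected_loss = p_loss_with_raid * data_value
        verdict = "buy the spare" if expected_loss > spare_controller_cost else "maybe gamble"
        print(f"data worth ${data_value}: expected loss ~${expected_loss:.0f} -> {verdict}")

The exact numbers don't matter; the point is that the value of your
data, not your guess at the likelihood of a controller failure, is what
decides whether the spare is worth buying.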
But now that you have your RAID system in place, you had better
experiment with it to make sure you know how to handle drive failure and
replacement, and in the case of hardware RAID, how to handle HARDWARE
failure and replacement (i.e., moving the disk pack to a new
controller). Hint: it is often not as easy as you think to move the disk
pack to new hardware. It seems hardware vendors often don't consider
that anything other than disks fail. *sigh*

> I figure there must be some performance loss, but I can't
> imagine I will ever notice it looking at old pictures.

Here's an alternate potential solution: put two or more simple disks in
the computer, and periodically copy the data from the primary to the
backup... Not applicable in all cases, but actually superior in some. It
gives you a nice opportunity to think about and evaluate changes you
made before committing them to all media. (DNS servers and firewalls are
two cases where I have used this effectively: the data isn't all that
volatile, so losing a day's changes is not the end of the world, but if
you don't do it right, it is really nice to just copy the old file back
quickly rather than go digging through tape.) Note that this system has
its own potential failure modes -- if your primary disk has lost data
that the system hasn't noticed yet, it WILL notice during the copy
operation, and probably destroy the "mirror" data in the process of
noticing...so your backups are still important.
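If you go that route, the "mirror" can be as dumb as a script kicked off
by cron. A minimal sketch, assuming the second disk is mounted at
/backup, the data lives under /data (both paths are examples), and rsync
is installed:

    #!/usr/bin/env python3
    # Poor man's mirror: copy the primary disk's data to a second, plain
    # disk on a schedule.  Paths are examples only -- point them at your
    # own mount points before trusting this with anything.
    import subprocess
    import sys

    SRC = "/data/"          # primary disk; trailing slash = copy contents
    DST = "/backup/data/"   # second plain disk, mounted like any other

    def main() -> int:
        # -a preserves permissions/ownership/times; --delete makes the
        # backup track deletions on the primary.  Drop --delete if you
        # want deleted files to linger on the backup disk.
        return subprocess.run(["rsync", "-a", "--delete", SRC, DST]).returncode

    if __name__ == "__main__":
        sys.exit(main())

Because the copy only happens when the job runs, you get exactly the
window I mentioned above: time to look at a change and decide it was a
mistake before it gets replicated to the second disk.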
> Finally, I am wondering. I had assumed that the hardware controller really
> didn't do that much when in RAID1, and just passed the writes/reads to/from
> both of the disks, resulting in 2 (basically) normal drives. Obviously, I
> was wrong. I am wondering why/how the raid controller needs to modify the
> disk's file system when it's only mirroring 2 drives? (I really could not
> find anything by google-ing around on this.)

Here's where I say "you need to think about and simulate all imaginable
failure modes", and you start to understand how these things work.

The hardware RAID systems I've worked with treat RAID1 as one of several
different RAID levels supported...they don't treat it significantly
differently than RAID5, RAID1+0, etc. There is more to RAID1 than duping
the data to two drives and keeping them the same.

All HW RAID systems that I have seen use some kind of signature to mark
the drives as part of a RAID set (RAID1 or otherwise). The signature is
quite important:

* Let's say you have six drives, three RAID1 pairs. While servicing the
computer, you unplug the drive cables to extract or install some other
card in the machine...then you realize you didn't make note of which
cable went where. How do you want your RAID controller to handle this?
You probably really hope it spots the pairs and re-connects them
appropriately.

* What if you need to swap out a drive while the system is off? Which
drive does it use on power-up?

* What if you replace a drive while the system is off with another drive
that was recycled from a system with a compatible RAID card? Which is
the one that should be used, and which should be ignored?

* What if you power up a system and it has two drives attached that it
knows nothing about? Pick one and blindly copy it to the other? Assume
they are in sync?

* What if you can't replace the failed drive with one that is identical
to it? What if the new drive is bigger, or a few blocks smaller?

The signature that helps resolve all of the above has to live somewhere
on the disk. Some systems try to hide it some place the OS would never
notice (I believe I read tech notes on one system that stuck it on the
very last sector of the disk, on the assumption that very few OSs ever
put anything there; I've seen one other RAID system that seemed to do
the same, as the drives COULD be removed from the RAID system and used
directly on a standard controller), but others just plop it at the front
of the physical disk and create the array in the remaining space.

The point of RAID1 isn't to dupe the data onto two drives; the point of
RAID1 is to have the system rapidly recoverable when something goes
horribly wrong. Duping the data between two drives is the way to meet
that end, but there is more to it than two blind copies of the same
data.

> I hope I don't sound too clueless for asking.

No more clueless than 90% of the people out there setting up RAID
disasters in waiting... Many very smart people seem to think that Magic
Happens (or that they will have a job elsewhere) when things go wrong.

Nick.
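P.S. If you are curious where a given controller stashes its signature,
one low-tech check is to look at the very first and very last sectors of
a member disk once it is attached to a plain controller, and see whether
something that is obviously not filesystem data is sitting there. A
rough, read-only sketch -- the device name is only an example, so point
it at the right raw disk yourself (as root):

    #!/usr/bin/env python3
    # Dump the first and last 512-byte sectors of a disk so you can eyeball
    # whether metadata is parked where a plain filesystem wouldn't put it.
    # Read-only, but the device path is an example -- use your actual disk.
    import os

    DEV = "/dev/sd1c"       # example device name; substitute your own
    SECTOR = 512

    def dump(label, data):
        # crude printable dump; non-printable bytes become '.'
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in data)
        print(label)
        for i in range(0, len(text), 64):
            print("   ", text[i:i + 64])

    with open(DEV, "rb") as disk:
        first = disk.read(SECTOR)            # very front of the disk
        disk.seek(-SECTOR, os.SEEK_END)      # jump to the last sector
        last = disk.read(SECTOR)

    dump("first sector:", first)
    dump("last sector:", last)

No promises about what any particular controller's metadata looks like;
this just shows you whether *something* unexpected is living at either
end of the disk.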