Todd H. Poole wrote:
> Hmmm. Alright, but supporting hot-swap isn't the issue, is it? I mean,
> like I said in my response to myxiplx, if I have to bring down the
> machine in order to replace a faulty drive, that's perfectly acceptable -
> I can do that whenever it's most convenient for me.
>
> What is _not_ perfectly acceptable (indeed, what is quite _unacceptable_)
> is if the machine hangs/freezes/locks up or is otherwise brought down by
> an isolated failure in a supposedly redundant array... Yanking the drive
> is just how I chose to simulate that failure. I could just as easily have
> decided to take a sledgehammer or power drill to it,
But you're not attempting hot-swap, you're doing hot-plug... and unless
you're using the onboard BIOS' concept of an actual RAID array, you don't
have an array, you've got a JBOD - and not even a real JBOD at that. It's
a PC motherboard, which does _not_ have the same electronic and electrical
protections that a JBOD has *by design*.

> http://www.youtube.com/watch?v=CN6iDzesEs0 (fast-forward to the 2:30
> part)
> http://www.youtube.com/watch?v=naKd9nARAes
>
> and the machine shouldn't have skipped a beat. After all, that's the
> whole point behind the "redundant" part of RAID, no?

Sigh.

> And besides, RAID's been around for almost 20 years now... It's nothing
> new. I've seen (countless times, mind you) plenty of regular old IDE
> drives fail in a simple software RAID5 array and not bring the machine
> down at all. Granted, you still had to power down to re-insert a new one
> (unless you were using some fancy controller card), but the point
> remains: the machine would still work perfectly with only 3 out of 4
> drives present... So I know for a fact this type of stability can be
> achieved with IDE.

And you're right, it can. But what you've been doing is outside the bounds
of what IDE hardware on a PC motherboard is designed to cope with.

> What I'm getting at is this: I don't think the method by which the drives
> are connected - or whether or not that method supports hot-swap - should
> matter.

Well sorry, it does. Welcome to an OS which does care.

> A machine _should_not_ crash when a single drive (out of a 4
> drive ZFS RAID-Z array) is ungracefully removed, regardless of how
> abruptly that drive is excised (be it by a slow failure of the drive
> motor's spindle, by yanking the drive's power cable, by yanking the
> drive's SATA connector, by smashing it to bits with a sledgehammer, or by
> drilling into it with a power drill).

If the controlling electronics for your disk can't handle it, then you're
hosed. That's why FC, SATA (in native SATA mode) and SAS are much more
likely to handle this out of the box. Parallel SCSI requires funky
hardware, which is why those old 6- or 12-disk multipacks are so useful
to have.

Of the failure modes you suggest above, only one - drive motor
degradation - is going to give you anything other than catastrophic
failure, and that is because the drive's electronics will realise what is
happening and send warnings to the host... which should have its drivers
written so that those messages are logged for the sysadmin to act upon.

The other failure modes are what we call catastrophic. And where your
hardware isn't designed with certain protections around the drive
connections, you're hosed. No two ways about it. If your system suffered
that sort of failure, would you seriously expect non-hardened hardware to
survive it?

> So we've established that one potential work around is to use the ahci
> instead of the pci-ide driver. Good! I like this kind of problem solving!

Good! I like this kind of problem solving. (There's a quick sketch below
of how to confirm which driver the controller is actually bound to.)

> But that's still side-stepping the problem... While this machine is
> entirely SATA II, what about those who have a mix between SATA and IDE?
> Or even much larger entities whose vast majority of hardware is only a
> couple of years old, and still entirely IDE?

If you've got newer hardware which can support SATA in native SATA mode,
USE IT. Don't _ever_ try that sort of thing with IDE. As I mentioned
above, IDE is not designed to cope with what you've been inflicting on
this machine.

> I'm grateful for your help, but is there another way that you can think
> of to get this to work?
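On the ahci point: before trusting hot-plug at all, make sure the
controller really is bound to ahci rather than pci-ide, and tell the OS
about a removal before you do it. Roughly something like this - a sketch
only, assuming an OpenSolaris box; the pool name (tank), device name
(c1t3d0) and attachment point (sata0/3) are just examples you would
substitute from your own system:

    # which driver actually owns the controller?
    prtconf -D | grep -i -e ahci -e pci-ide

    # with ahci, the SATA framework exposes each port as an
    # attachment point that cfgadm can manage
    cfgadm -al | grep sata

    # tell ZFS and the OS the disk is going away *before* you pull it
    zpool offline tank c1t3d0
    cfgadm -c unconfigure sata0/3

    # ...swap the disk, then bring the port back...
    cfgadm -c configure sata0/3
    zpool replace tank c1t3d0     # or "zpool online" if it's the same disk
    zpool status tank

If that first command shows the controller sitting behind pci-ide rather
than ahci, none of the rest applies - there is no safe removal path.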
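And to answer the question directly: if what you want is to watch RAID-Z
ride out a dead device, you don't need to abuse the hardware at all. A
throwaway pool built on plain files shows you the same behaviour. Again
just a sketch - the pool name (testpool), file paths and sizes are made
up, adjust to taste:

    # build a scratch RAID-Z pool out of plain files
    mkfile 256m /var/tmp/vdev1 /var/tmp/vdev2 /var/tmp/vdev3 /var/tmp/vdev4
    zpool create testpool raidz /var/tmp/vdev1 /var/tmp/vdev2 \
        /var/tmp/vdev3 /var/tmp/vdev4

    # put some data in it so there's something to protect
    dd if=/dev/urandom of=/testpool/junk bs=1024k count=64

    # clean failure: take one "disk" away, watch the pool go DEGRADED
    zpool offline testpool /var/tmp/vdev3
    zpool status testpool
    zpool online testpool /var/tmp/vdev3

    # nastier failure: scribble over one "disk", let a scrub find it
    dd if=/dev/urandom of=/var/tmp/vdev4 bs=1024k count=64 conv=notrunc
    zpool scrub testpool
    zpool status -v testpool

    # recover by swapping in a fresh "disk", then clean up
    mkfile 256m /var/tmp/vdev5
    zpool replace testpool /var/tmp/vdev4 /var/tmp/vdev5
    zpool destroy testpool
    rm /var/tmp/vdev[1-5]

When vdev4 comes back riddled with checksum errors and the pool keeps
serving data, that's the behaviour you were trying to provoke with the
power drill - minus the hung machine.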
You could start by taking us seriously when we tell you that what you've
been doing is not a good idea, and find other ways to simulate drive
failures - such as the ones sketched above.

James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog