Todd H. Poole wrote:
> Hmmm. Alright, but supporting hot-swap isn't the issue, is it? I mean,
> like I said in my response to myxiplx, if I have to bring down the
> machine in order to replace a faulty drive, that's perfectly acceptable -
> I can do that whenever it's most convenient for me.
> 
> What is _not_ perfectly acceptable (indeed, what is quite _unacceptable_)
> is if the machine hangs/freezes/locks up or is otherwise brought down by
> an isolated failure in a supposedly redundant array... Yanking the drive
> is just how I chose to simulate that failure. I could just as easily have
> decided to take a sledgehammer or power drill to it,

But you're not attempting hot-swap, you're doing hot-plug...
and unless you're using the onboard BIOS's concept of an actual
RAID array, you don't have an array, you've got a JBOD. And it's
not even a real JBOD - it's a PC motherboard, which does _not_
have the same electronic and electrical protections that a JBOD
enclosure has *by design*.

> http://www.youtube.com/watch?v=CN6iDzesEs0 (fast-forward to the 2:30
> part) http://www.youtube.com/watch?v=naKd9nARAes
> 
> and the machine shouldn't have skipped a beat. After all, that's the
> whole point behind the "redundant" part of RAID, no?

Sigh.

> And besides, RAID's been around for almost 20 years now... It's nothing
> new. I've seen (countless times, mind you) plenty of regular old IDE
> drives fail in a simple software RAID5 array and not bring the machine
> down at all. Granted, you still had to power down to re-insert a new one
> (unless you were using some fancy controller card), but the point
> remains: the machine would still work perfectly with only 3 out of 4
> drives present... So I know for a fact this type of stability can be
> achieved with IDE.

And you're right, it can. But what you've been doing is outside
the bounds of what IDE hardware on a PC motherboard is designed
to cope with.

> What I'm getting at is this: I don't think the method by which the drives
> are connected - or whether or not that method supports hot-swap - should
> matter.

Well sorry, it does. Welcome to an OS which does care.

> A machine _should_not_ crash when a single drive (out of a 4
> drive ZFS RAID-Z array) is ungracefully removed, regardless of how
> abruptly that drive is excised (be it by a slow failure of the drive
> motor's spindle, by yanking the drive's power cable, by yanking the
> drive's SATA connector, by smashing it to bits with a sledgehammer, or by
> drilling into it with a power drill).

If the controlling electronics for your disk can't handle
it, then you're hosed. That's why FC, SATA (in native SATA
mode, rather than IDE compatibility mode) and SAS are much
more likely to handle this out of the box. Parallel SCSI
requires funky hardware, which is why those old 6- or 12-disk
multipacks are so useful to have.
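
For what it's worth, when the OS and controller do support it,
a coordinated removal on OpenSolaris looks roughly like this.
This is only a sketch - "tank", the sata0/3 port and the c2t3d0
device name are made-up examples, yours will differ:

  # zpool offline tank c2t3d0        <- stop ZFS issuing I/O to the disk
  # cfgadm -c unconfigure sata0/3    <- quiesce the port before you pull it
  (physically swap the drive)
  # cfgadm -c configure sata0/3
  # zpool replace tank c2t3d0

That coordination step is exactly what yanking a cable skips.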

Of the failure modes you suggest above, only one - gradual
degradation of the drive motor - will give you anything other
than catastrophic failure, and that is because the drive's
electronics will realise what is happening and send warnings
to the host... whose drivers should be written so that these
messages are logged for the sysadmin to act upon.
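
If you want to see what the drive and the drivers have been
reporting, this is roughly where to look on OpenSolaris (the
commands are standard; the output obviously depends on your
hardware):

  # iostat -En               <- per-device soft/hard/transport error counts
  # fmdump -eV               <- raw FMA error telemetry, including disk ereports
  # fmadm faulty             <- anything FMA has actually diagnosed as faulty
  # tail /var/adm/messages   <- driver warnings logged via syslog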

The other failure modes are what we call catastrophic. And
where your hardware isn't designed with certain protections
around drive connections, you're hosed. No two ways about it.
If your system suffers that sort of failure, would you seriously
expect that non-hardened hardware would survive it?

> So we've established that one potential work around is to use the ahci
> instead of the pci-ide driver. Good! I like this kind of problem solving!
> But that's still side-stepping the problem... While this machine is
> entirely SATA II, what about those who have a mix between SATA and IDE?
> Or even much larger entities whose vast majority of hardware is only a
> couple of years old, and still entirely IDE?

If you've got newer hardware, which can support SATA in
native SATA mode, USE IT.

Don't _ever_ try that sort of thing with IDE. As I mentioned
above, IDE is simply not designed to cope with what you've been
inflicting on this machine.
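
You can check which driver has actually claimed your controller
before you rely on it - a rough sketch, and the grep pattern is
just illustrative:

  # prtconf -D | grep -i -e ahci -e pci-ide   <- which driver is bound
  # cfgadm -a                                 <- sata0/0-style attachment points
                                                 only appear in native SATA mode

If you see pci-ide, look for an AHCI or "native SATA" option in
your BIOS setup.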

> I'm grateful for your help, but is there another way that you can think
> of to get this to work?

You could start by taking us seriously when we tell you
that what you've been doing is not a good idea, and find
other ways to simulate drive failures.
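
For instance, you can rehearse the entire fail-and-replace cycle
on file-backed vdevs without abusing any hardware at all. A
sketch - the paths and sizes here are arbitrary:

  # mkfile 128m /var/tmp/d1 /var/tmp/d2 /var/tmp/d3 /var/tmp/d4
  # zpool create sandbox raidz /var/tmp/d1 /var/tmp/d2 /var/tmp/d3 /var/tmp/d4
  # dd if=/dev/urandom of=/var/tmp/d3 bs=1024k count=128 conv=notrunc
  # zpool scrub sandbox
  # zpool status -v sandbox          <- watch ZFS notice and report the damage
  # mkfile 128m /var/tmp/d3          <- "insert" a fresh disk
  # zpool replace sandbox /var/tmp/d3

And zpool offline/online against a real pool exercises the clean
removal path, again without any yanking.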


James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp       http://www.jmcp.homeunix.com/blog