>>>>> "jcm" == James C McPherson <[EMAIL PROTECTED]> writes: >>>>> "thp" == Todd H Poole <[EMAIL PROTECTED]> writes: >>>>> "mh" == Matt Harrison <[EMAIL PROTECTED]> writes: >>>>> "js" == John Sonnenschein <[EMAIL PROTECTED]> writes: >>>>> "re" == Richard Elling <[EMAIL PROTECTED]> writes: >>>>> "cg" == Carson Gaspar <[EMAIL PROTECTED]> writes:
jcm> Don't _ever_ try that sort of thing with IDE. As I mentioned
jcm> above, IDE is not designed to be able to cope with [unplugging
jcm> a cable]

It shouldn't have to be designed for it, if there's controller
redundancy. On Linux, one drive per IDE bus (not using any ``slave''
drives) seems like it should be enough for any electrical issue, but
in my experience it is not quite good enough when there are two PATA
busses per chip. One hard drive per chip seems to be mostly okay. In
this SATA-based case, not even that much separation was necessary for
Linux to survive on the same hardware, but I agree with you and
haven't found that level of robustness with PATA either.

OTOH, if the IDE drivers are written such that a confusing interaction
with one controller chip brings down the whole machine, then I expect
the IDE drivers to do better. If they don't, why advise people to buy
twice as much hardware ``because, you know, controllers can also fail,
so you should have some controller redundancy''? The advice is worse
than a waste of money; it's snake oil---a false sense of security.

jcm> You could start by taking us seriously when we tell you that
jcm> what you've been doing is not a good idea, and find other ways
jcm> to simulate drive failures.

Well, you could suggest a method. Except that the whole point of the
story is that Linux, without any blather about ``green-line'' and
``self-healing,'' without any concerted platform-wide effort toward
availability at all, simply works more reliably.

thp> So aside from telling me to "[never] try this sort of thing
thp> with IDE" does anyone else have any other ideas on how to
thp> prevent OpenSolaris from locking up whenever an IDE drive is
thp> abruptly disconnected from a ZFS RAID-Z array?

Yeah, get a Sil3124 card, which will run in native SATA mode and be
more likely to work. Then redo your test and let us know what happens.

The not-fully-voiced suggestion to run your ATI SB600 in native/AHCI
mode instead of pci-ide/compatibility mode is probably a bad one
because of bug 6665032: the chip is only reliable in compatibility
mode.

You could trade your ATI board for an nVidia board for about the same
price as the Sil3124 add-on card. AIUI from the Linux wiki,

  http://ata.wiki.kernel.org/index.php/SATA_hardware_features

the old nVidia chips use the nv_sata driver and the new ones use the
ahci driver, so both of these are different from pci-ide and more
likely to work. Get an old one (MCP61 or older) and a new one (MCP65
or newer), repeat your test, and let us know what happens.

If the Sil3124 doesn't work, and nv_sata doesn't work, and AHCI on
newer nVidia doesn't work, then hook the drives up to Linux running
IET on basically any old chip, and mount them from Solaris using the
built-in iSCSI initiator.

If you use iSCSI, you will find that you get a pause like with NT.
Also, if one of the iSCSI targets is down, 'zpool status' might hang
_every time_ you run it, not just the first time when the failure is
detected. The pool itself will only hang the first time. Also, you
cannot boot unless all iSCSI targets are available, but you can
continue running if some go away after booting. Overall IMHO it's not
as good as LVM2, but it's more robust than plugging the drives into
Solaris. It also gives you the ability to run smartctl on the drives
(by running it natively on Linux) with full support for all commands,
while someone here whom I told to run smartctl reported that on
Solaris 'smartctl -a' worked but 'smartctl -t' did not.
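FWIW, here is roughly what that IET-plus-built-in-initiator setup
looks like. The IQN, the address, and the device names below are made
up, so treat it as a sketch rather than a recipe:

  (on the Linux box: /etc/ietd.conf for IET, exporting a raw disk)
  Target iqn.2008-08.org.example:disks.sdb
      Lun 0 Path=/dev/sdb,Type=blockio

  (on the Solaris box, using the built-in initiator)
  # iscsiadm add discovery-address 192.168.1.10:3260
  # iscsiadm modify discovery --sendtargets enable
  # devfsadm -i iscsi
  # zpool create tank raidz <the new devices, as shown by format(1m)>

  (smartctl, run natively on the Linux box)
  # smartctl -a /dev/sdb            (full SMART report)
  # smartctl -t short /dev/sdb      (start a short self-test)
  # smartctl -l selftest /dev/sdb   (read the self-test results)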
I still have performance problems with iSCSI. I'm not sure yet whether
they're unresolvable: there are a lot of tweakables with iSCSI, like
disabling Nagle's algorithm and enabling RED on the initiator's switch
port, but first I need to buy faster CPUs for the targets.

mh> Dying or dead disks will still normally be able to
mh> communicate with the driver to some extent, so they are still
mh> "there".

The dead disks I have which don't spin also don't respond to
IDENTIFY(0), so they don't really communicate with the driver at all.
Now, possibly, *possibly*, they are still responsive right after they
fail and only become unresponsive after the first time they're
rebooted---because I think they load part of their firmware off the
platters. Also, the ATAPI standard says that drives which are ``still
communicating'' are allowed to take up to 30 seconds to answer each
command, which is probably too long a time to freeze a whole system.
And still, on the strength of that ``possibly,'' it doesn't make sense
to replace a tested-working system with a tested-broken one, not even
after someone tells a complicated story trying to convince you the
broken system is actually secretly working, just completely impossible
to test, so you have to accept it based on stardust and fantasy.

js> yanking the drives like that can seriously damage the
js> drives or your motherboard.

No, it can't. And if I want a software developer's opinion on what
will electrically damage my machine, I'll be sure to let you know
first.

jcm> If you absolutely must do something like this, then please use
jcm> what's known as "coordinated hotswap" using the cfgadm(1m)
jcm> command.
jcm>
jcm> Viz:
jcm>
jcm> (detect fault in disk c2t3d0, in some way)
jcm> # cfgadm -c unconfigure c2::dsk/c2t3d0
jcm> # cfgadm -c disconnect c2::dsk/c2t3d0

So... don't, don't, DON'T do it, because it's STUPID and it might FRY
YOUR DISK AND MOTHERBOARD. But if you must do it, please warn our
software first? I shouldn't have to say it, but aside from being
absurd, this warning-command completely defeats the purpose of the
test.

jcm> Yes, but you're running a new operating system, new
jcm> filesystem... that's a mountain of difference right in front
jcm> of you.

So we do agree that Linux's not freezing in the same scenario
indicates the difference is inside that mountain, which, however
large, is composed entirely of SOFTWARE.

re> The behavior of ZFS to an error reported by an underlying
re> device driver is tunable by the zpool failmode property. By
re> default, it is set to "wait."

I think you like speculation well enough, so long as it's optimistic.
Which is the tunable setting that causes other pools, ones not even
including failed devices, to freeze? Why is the failmode property
involved at all in a pool that still has enough replicas to keep
functioning?
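For anyone who wants to play with it anyway, the property is easy to
inspect and change; 'tank' below is a made-up pool name:

  # zpool get failmode tank            (wait | continue | panic; "wait" is the default)
  # zpool set failmode=continue tank   (return EIO on new writes instead of blocking)

"continue" is supposed to keep the machine limping along instead of
blocking I/O forever, but that only governs what ZFS does once a pool
has actually faulted; it says nothing about why an unrelated, healthy
pool should be affected at all.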
cg> We really need to fix (B). It seems the "easy" fixes are:
cg>
cg> - Configure faster timeouts and fewer retries on redundant
cg>   devices, similar to drive manufacturers' RAID edition
cg>   firmware. This could be via driver config file, or (better)
cg>   automatically via ZFS, similar to write cache behaviour.
cg>
cg> - Propagate timeouts quickly between layers (immediate soft
cg>   fail without retry) or perhaps just to the fault management
cg>   system

It's also important that things unrelated to the failure aren't
frozen. This was how I heard the ``green line'' marketing campaign
when it was pitched to me, and I found it really compelling because I
felt Linux had too little of this virtue. However compelling, I just
don't find it even slightly acquainted with reality.

I can understand ``unrelated'' is a tricky concept when the boot pool
is involved, but here is an example where it isn't involved: I've had
problems where one exported data pool's becoming FAULTED stops NFS
service from all other pools. The pool that FAULTED contained no
Solaris binaries. And then there are the 'zpool status' hangs people
keep discovering.

I think this is a good test in general: configure two
almost-completely independent stacks through the same kernel (a
command-level sketch appears at the end of this message):

  NFS export      NFS export
  filesystem      filesystem
  pool            pool
        ZFS/NFS
  driver          driver
  controller      controller
  disks           disks

Simulate whatever you regard as a ``catastrophic'' or ``unplanned'' or
``really stupid'' failure on one side, and see how big the shared
region in the middle can be without affecting the other stack. Right
now, my experience is that even the stack above does not work. Maybe
mountd gets blocked or something, I don't know.

Optimistically, we would of course like the stack below to remain
failure-separate:

  NFS export      NFS export
  filesystem      filesystem
  pool            pool
        ZFS/NFS
        driver
      controller
  disks           disks

The OP is implying that on Linux this stack DOES keep failures
separate. However, even if ``hot plug'' (or ``hot unplug'' for
demanding Linux users) is not supported, at least the stack below
should still be failure-independent:

  NFS export      NFS export
  filesystem      filesystem
  pool            pool
        ZFS/NFS
        driver
  controller      controller
  disks           disks

I suspect it isn't, because the less-demanding stack I started with
isn't failure-independent.

There is probably more than one problem making these failures spread
more widely than they should, but so far we can't even agree on what
we wish were working. I do think the failures need to be isolated
better first, independent of time. It's not ``a failure of a drive on
the left should propagate up the stack faster so that the stack on the
right unfreezes before anyone gets too upset.'' The stack on the right
shouldn't freeze at all.
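And if anyone wants to reproduce the test, here is a minimal sketch,
assuming two hypothetical mirrored pools named 'left' and 'right' on
made-up device names:

  # zpool create left  mirror c1t0d0 c1t1d0
  # zpool create right mirror c2t0d0 c2t1d0
  # zfs set sharenfs=on left
  # zfs set sharenfs=on right

  (now yank one of the 'left' disks, or fail it however you consider fair)

  # zpool status right                 (should return promptly, still ONLINE)

  (and from an NFS client)
  $ time ls /net/<server>/right        (should not hang while 'left' is broken)

If the right-hand column stays responsive, the failure was contained;
in my experience so far, it isn't.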