> But you're not attempting hotswap, you're doing hot plug....

Do you mean hot UNplug? Because I'm not trying to get this thing to recognize any new disks without a restart... Honest. I'm just trying to prevent the machine from freezing up when a drive fails. I have no problem restarting the machine with a new drive in it later so that it recognizes the new disk.
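(For what it's worth, the post-restart replacement I have in mind is just the standard resilver path - roughly the following, where "tank" and c0d1 are placeholders for my actual pool and the failed disk, not names I've verified on this box:)

  # after physically swapping in the new disk and rebooting,
  # ask ZFS to rebuild the raidz data onto it
  zpool replace tank c0d1
  zpool status tank    # watch the resilver progress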
> and unless you're using the onboard bios' concept of an actual
> RAID array, you don't have an array, you've got a JBOD and
> it's not a real JBOD - it's a PC motherboard which does _not_
> have the same electronic and electrical protections that a
> JBOD has *by design*.

I'm confused by what your definition of a RAID array is, and for that matter, what a JBOD is... I've got plenty of experience with both, but just to make sure I wasn't off my rocker, I consulted the demigod:

http://en.wikipedia.org/wiki/RAID
http://en.wikipedia.org/wiki/JBOD

and I think what I'm doing is indeed RAID. I'm not using a controller card or any specialized hardware, so it's certainly not hardware RAID (and thus doesn't have the fancy electronic or electrical protections you mentioned), but lacking those protections doesn't preclude the setup from being considered a RAID. All the disks are the same capacity, the OS sees the zpool I've created as one large volume, and since I'm using RAID-Z (RAID5), it should be redundant - there's a sketch of exactly what I mean further down. What other qualifiers are necessary before a system can be called a RAID? If the bar is hot-swappable hardware, or a controller hiding the details from the OS and presenting a single volume, then I would argue those are extras - not fundamental prerequisites for calling a system a RAID.

Furthermore, while I'm not sure what the difference between a "real JBOD" and a plain old JBOD is, this setup wouldn't qualify as either. There is no concatenation going on, redundancy should be present (though, due to this issue, I haven't been able to verify that yet), and all the drives are the same size... Am I missing something in the definition of a JBOD? I don't think so...

> And you're right, it can. But what you've been doing is outside
> the bounds of what IDE hardware on a PC motherboard is designed
> to cope with.

Well, yes, you're right, but it's not like I'm making some radical departure beyond the bounds of the hardware... A departure this modest shouldn't be a problem, because that's exactly where software comes in: when the hardware can't cut it, software picks up the slack. Now, obviously, I'm not saying software can do anything with any piece of hardware you give it - no matter how many lines of code you write, your keyboard isn't going to turn into a speaker - but reasonable stuff like ensuring a machine doesn't crash because a user did something with the hardware that he or she wasn't supposed to do? That's a prime target for software, and that's the way it's always been... The whole push behind the ZFS promise (or, more generally, the attractiveness of RAID) was that "RAID-Z [wouldn't] require any special hardware. It doesn't need NVRAM for correctness, and it doesn't need write buffering for good performance. With RAID-Z, ZFS makes good on the original RAID promise: it provides fast, reliable storage using cheap, commodity disks." (http://blogs.sun.com/bonwick/entry/raid_z)

> Well sorry, it does. Welcome to an OS which does care.

The half-hearted apology wasn't necessary... I understand that OpenSolaris cares about the method those disks use to plug into the motherboard; what I don't understand is why that limitation exists in the first place.
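(And for the record, the "array" I keep referring to is nothing more exotic than something along these lines - the pool and device names here are placeholders rather than my actual ones:)

  # single-parity RAID-Z pool built from four same-size commodity IDE disks
  zpool create tank raidz c0d0 c0d1 c1d0 c1d1
  zpool status tank    # shows a single pool with one raidz vdev spanning the disks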
It would seem much better to me to have an OS that doesn't care (but developers who do) and just finds a way to work, versus an OS that does care (but developers who don't) and instead gets picky and inflexible... I'm not saying OpenSolaris is the latter, but I'm not getting the impression it's the former either.

> If the controlling electronics for your disk can't
> handle it, then you're hosed. That's why FC, SATA (in SATA
> mode) and SAS are much more likely to handle this out of
> the box. Parallel SCSI requires funky hardware, which is why
> those old 6- or 12-disk multipacks are so useful to have.
>
> Of the failure modes that you suggest above, only one
> is going to give you anything other than catastrophic
> failure (drive motor degradation) - and that is because the
> drive's electronics will realise this, and send warnings to
> the host.... which should have its drivers written so
> that these messages are logged for the sysadmin to act upon.
>
> The other failure modes are what we call catastrophic. And
> where your hardware isn't designed with certain protections
> around drive connections, you're hosed. No two ways
> about it. If your system suffers that sort of failure, would
> you seriously expect that non-hardened hardware would survive it?

Yes, I would. At the risk of sounding repetitive, I'll summarize what I've been getting at in my previous responses: I certainly _do_ think it's reasonable to expect non-hardened hardware to survive this type of failure. In fact, I think it's unreasonable _not_ to expect it to. The Linux kernel, the BSD kernels, and the NT kernel (or whatever chunk of code runs Windows) all provide this type of functionality, and have for some time. Granted, they may all do it in different ways, but at the end of the day, unplugging an IDE hard drive from a software RAID5 array in OpenSuSE, RedHat, FreeBSD, or Windows XP Professional will not bring the machine down. And it shouldn't in OpenSolaris either. There might be a noticeable bump (Windows, for example, pauses for a few seconds while it tries to figure out what the hell just happened to one of its disks), but there's nothing show-stopping...

> If you've got newer hardware, which can support SATA
> in native SATA mode, USE IT.

I'll see what I can do - this may be a BIOS setting that can be changed (see the P.S. at the end of this message).

> > I'm grateful for your help, but is there another way that you can think
> > of to get this to work?
>
> You could start by taking us seriously when we tell
> you that what you've been doing is not a good idea, and
> find other ways to simulate drive failures.

Let's drop the confrontational attitude - I'm not trying to dick around with you here. I did my due diligence researching this issue on Google, these forums, and Sun's documentation before posting, I've provided every piece of clarifying information that's been requested by those kind enough to respond, and I've yet to resort to any witty or curt remarks in my correspondence with you, tcook, or myxiplx. Whatever is causing you to think I'm not taking anyone seriously, let me reassure you: I am. The only thing I'm doing is testing a system by applying the worst-case scenario of survivable torture to it and seeing how it recovers. If that's not a good idea, then I guess we disagree. But that's OK - you're James C. McPherson, Senior Kernel Software Engineer, Solaris, and I'm just some user who's trying to find a solution to his problem.
My bad for expecting the same level of respect I've given two other members of this community to be returned in kind by one of its leaders.

So aside from telling me to "[never] try this sort of thing with IDE", does anyone else have any ideas on how to prevent OpenSolaris from locking up whenever an IDE drive is abruptly disconnected from a ZFS RAID-Z array?

-Todd
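P.S. Two things I'm planning to try in the meantime, in case they're useful to anyone following along. Both are sketches based on my reading of the docs rather than anything I've verified on this box yet, and the pool/device names are placeholders:

  # 1) Check whether the controller is actually running in native SATA mode.
  #    If it is, the Solaris sata framework should list the disks as
  #    hot-pluggable attachment points here; if the BIOS has the controller
  #    in IDE/compatibility mode, they shouldn't show up that way.
  cfgadm -al

  # 2) Simulate a drive failure without yanking the cable, by taking one
  #    disk offline and watching the pool go DEGRADED and then resilver.
  #    (Not the same as a surprise disconnect, but it at least exercises
  #    the redundancy.)
  zpool offline tank c0d1
  zpool status tank
  zpool online tank c0d1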