>>>>> "ok" == Orvar Korvar <knatte_fnatte_tja...@yahoo.com> writes:
ok> Nobody really knows for sure.

ok> I will tell people that ZFS + HW raid is good enough, and I
ok> will not recommend against HW raid anymore.

jesus, ok, fine.  If you threaten to let ignorant Windows morons strut around like arrogant experts, voicing some baseless feel-good opinion and then burdening you with proving a negative, then I'll take the bait and do your homework for you.

What people ``don't really know for sure'' is what hardware RAID you mean.  There are worlds of difference between NetApp RAID and Dell PERC RAID: differences in their ability to avoid data loss in common failure scenarios, not just in some bulleted features or admin-tool GUIs that the Windows admins understand.  For example, among dirt-cheap RAID5s, whether or not there's a battery-backed write cache influences how likely they are to corrupt the filesystem on top, while the Windows admins probably think it's there for ``performance issues'' or something.  And there are other things.  NetApp has most of them, so if you read a few of the papers they publish bragging about their features, you'll get an idea how widely the robustness of RAIDs can vary:

http://pages.cs.wisc.edu/~krioukov/Krioukov-ParityLost.pdf

Advantages of ZFS-on-some-other-RAID:

* probably better performance for RAID6 than for raidz2.  It is not because of ``checksum overhead''.  It's because RAID6 stripes use less seek bandwidth than ZFS, which mostly does full-stripe writes like RAID3.

* you can run a filesystem other than ZFS on the hardware RAID and still get decent performance, which maybe protects your investment somewhat.

* better availability, sometimes.  It's important to distinguish between availability and data loss.  Availability means that when a disk goes bad, applications don't notice.  Data loss is about what happens AFTER the system panics, the sysadmin notices, unplugs bad drives, replaces things, reboots: is the data still there or not?  And is *ALL* the data there, satisfying ACID rules for databases and MTAs, or just most of it?  Because it is supposed to be ALL there.  Even when you have to intervene, that is not license to break all the promises and descend into chaos.  Some (probably most) hardware RAID has better availability and is better at handling failing or disconnected drives than ZFS.  But ZFS on JBOD is probably better at avoiding data loss than ZFS on HW RAID.

* better exception handling, sometimes.  Really fantastically good hardware RAID can handle the case of two mirrored disks, both with unreadable sectors, where none of the unreadable sectors are in common.  Partial disk failures are the common case, not the exception, while the lower-quality RAID implementations most of us are stuck with treat a disk as either good or bad, as a whole unit.  I expect most hardware RAIDs don't have much grace in this department either, though.  It seems really difficult to do well, because disks lock up when you touch their bad blocks, so to gracefully extract data from a partially-failed disk you have to load special firmware into it and/or keep a list of poison blocks which you must never accidentally read.  My general procedure for recovering hardware RAIDs is: (0) shut down the RAID and use the drives as JBOD, (1) identify the bad disks and buy blank disks to replace them, (2) copy the bad disks onto the blank disks with dd_rescue or dd conv=noerror,sync, and (3) run the RAID recovery tool (step (2) is sketched just below).
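  The device names in the sketch are made up (a failing disk and a blank replacement on a hypothetical Solaris box); substitute your own and triple-check which is which before running anything:

    # clone a partly-failed disk onto a blank one without aborting on
    # read errors.  /dev/rdsk/c2t1d0p0 = bad disk, /dev/rdsk/c2t2d0p0 =
    # blank replacement; these are example names only.

    # dd_rescue keeps going past unreadable sectors:
    dd_rescue /dev/rdsk/c2t1d0p0 /dev/rdsk/c2t2d0p0

    # or with plain dd, padding unreadable blocks with zeros so the
    # offsets on the copy still line up with the original:
    dd if=/dev/rdsk/c2t1d0p0 of=/dev/rdsk/c2t2d0p0 bs=512 conv=noerror,sync

  The copy keeps the same on-disk offsets, so the RAID recovery tool in step (3) sees a disk with the layout it expects, just with zeros or garbage where the unreadable sectors were, which is exactly the silent-corruption risk described next.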
  There are a lot of bad things about this procedure!  It denies the RAID layer access to the disk's own reports about which sectors are bad, which leads to parity pollution and to silent corruption that slips through the RAID layer, discussed below.  It can also simply not work, if the RAID identifies disks the way ZFS sometimes does, by devid/serial number rather than by disk label or port number.  But RAID layers read the report ``bad sector'' as in fact a report of ``bad disk!'', so with the very cheap RAIDs I've used, this corruption-prone 0,1,2,3 procedure, which works around the bad exception handling, saves more data than the supported procedure.

* nonvolatile cache.  Some cheap hardware RAID gives you a battery-backed write cache that can be more expensive to get the ZFS way, with a slog device.

Advantages of ZFS-on-JBOD:

* unified admin tool.  There is one less layer, so you can administer everything with zpool.  With hardware RAID you have to use both zpool and the hardware RAID's configurator.

* most hardware RAID has no way to deal with silent corruption.  It can deal with latent sector errors, when the drive explicitly reports a failed read, but it has no way to deal with drives that silently return incorrect data.  According to NetApp this happens in real life in a huge number of patterns, and they have names for them like ``torn writes'' and ``misdirected writes'' and so on.  RAID5's read-modify-write behavior can magnify this type of error through ``parity pollution'':

http://www.usenix.org/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf

  I think there are some tricks in Solaris for Veritas and SVM mirrors which applications like Oracle can use if they do checksumming above the filesystem level.  Oracle can say, ``no, I don't like that block.  Can you roll the dice again?  What else might be at that lseek offset?''  See DKIODMR under dkio(7I).  I'm not sure in exactly what circumstances the tricks work, but I am sure they will NOT work on hardware RAID-on-a-card.  Also the tricks are not used by ZFS, so if you insert ZFS between your RAID and your application, the DKIODMR-style checksum redundancy is blocked off.  The idea is that you should use ZFS checksums and ZFS redundancy instead, but you can't do that by feeding ZFS a single LUN.

* hardware RAID ties your data to a RAID implementation.  This is more constraining than tying it to ZFS, because the Nexenta and OpenSolaris licenses allow you to archive the ZFS software implementation and share it with your friends.  This has bitten several of my friends, particularly with ``RAID-on-a-card'' products.  It is really uncool to lose your entire dataset to a bad controller, and then be unable to obtain the same controller with the same software revision, because the hardware has gone through so many steppings, and the software isn't freely redistributable, archivable, or even clearly documented to exist at all.  Sometimes the configurator tool is very awkward to run.  There may even be a remedial configurator in the firmware and a full-featured configurator which is difficult to run just when you need it most, when recovering from a failure.  Some setups are even worse and are unclear about where they store the metadata (on the diskset, or in the controller?), so you could lose the whole dataset because of a few bits in some $5 lithium-battery RTC chip.  Being a good sysadmin in this environment requires levels of distrust that are really unhealthy.

  The more expensive implementations hold your data hostage to a support contract, so while you are nominally buying the RAID, you are in fact more like renting a place to put your data, and they reserve the right to raise your rent.  Without the contract you don't just lose their attention: you cannot get basic things like older software, or manuals for the software you have now, and they threaten to sue you for software license violation if you try to sell the hardware to someone else.

* there are bugs, either in ZFS or in the overall system, that make ZFS much more prone to corruption when it's not managing a layer of redundancy.  In particular, we know ZFS does not work well when the underlying storage reports writes as committed to disk when they're not, and this problem seems to be rampant:

  * some SATA disks do it, but the people who know which ones aren't willing to tell.  They only say ``a Major vendor.''

  * Sun's sun4v hypervisor does it for virtualized disk access on the T2000.

  * Sun's iSCSI target does it (iscsitadm; not sure yet about COMSTAR).

  * We think many PeeCee virtualization platforms like VirtualBox and VMware and stuff might do it.

  * Linux does it if you are using the disk through LVM2.

  The problem is rampant, seems to be more dangerous to ZFS than to other filesystems, and progress in tracking it down and fixing it is glacial.  Giving ZFS a mirror or a raidz seems to improve its survival.  To work around this with hardware RAID, you need to make a zpool that's a mirror of two hardware RAIDsets, which wastes a lot of disk.  If you had that much disk with ZFS JBOD, you could make much better use of it as a filesystem-level backup, like backup with rsync to a non-ZFS filesystem, or to a separate zpool with 'zfs send | zfs recv'.  You really need this type of backup with ZFS because of the lack of fsck and the huge number of panics and assertions.  (A sketch of both layouts is below.)
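For concreteness, here is roughly what the two layouts look like.  The pool names, the device names, and the tank/home filesystem are all made up for the example; it's a sketch, not a recipe:

    # workaround on hardware RAID: mirror two whole RAIDset LUNs, so ZFS
    # keeps a redundant copy it can repair from.  Burns half the space.
    zpool create tank mirror c3t0d0 c4t0d0

    # on JBOD instead: give ZFS the raw disks, and spend the spare
    # capacity on a second pool that holds filesystem-level backups.
    zpool create tank raidz2 c5t0d0 c5t1d0 c5t2d0 c5t3d0 c5t4d0 c5t5d0
    zpool create backup raidz c5t6d0 c5t7d0 c5t8d0

    # back up a snapshot into the second pool (first full copy; later
    # runs would use an incremental 'zfs send -i'):
    zfs snapshot tank/home@weekly
    zfs send tank/home@weekly | zfs recv backup/home

The mirrored-LUN layout gives ZFS redundancy it can use to self-heal checksum errors, but you pay for every block twice.  The JBOD layout spends the same disks on raidz redundancy plus an independent copy you can fall back to when the pool itself won't import.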
HTH.  To my reckoning the consensus best practice right now is usually JBOD.  When I defend ZFS-over-RAID5, it is mostly because I think the poor availability during failures and the corruption bugs discussed in the last point need to be tracked down and squashed.

Here's my list of papers that have been mentioned on this list, so you can catch up.  You can also dump all the papers on the annoying Windows admins, and when they say ``I really think hardware RAID has fewer `issues' because I just think so.  It's your job to prove why not,'' then you can answer ``well, have you read the papers?  No?  Then take my word for it.''  If they doubt the papers' authority, then cite the price the people writing the papers charge for their services; the Windows admins should at least understand that.

http://www.usenix.org/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf
http://www.cs.wisc.edu/adsl/Publications/latent-sigmetrics07.ps
http://labs.google.com/papers/disk_failures.html
http://pages.cs.wisc.edu/~krioukov/Krioukov-ParityLost.pdf
http://www.nber.org/sys-admin/linux-nas-raid.html