>>>>> "jb" == Jeff Bonwick <[EMAIL PROTECTED]> writes: >>>>> "rmc" == Ricardo M Correia <[EMAIL PROTECTED]> writes:
jb> We need a little more Code of Hammurabi in the storage
jb> industry.

It seems like most of the work people have to do now is cleaning up after the sloppiness of others.  At least, it's what takes the longest.  You could always mention which disks you found ignoring the command---wouldn't that help the overall problem?  I understand there's a pervasive ``i don' wan' any trouble, mistah'' attitude, but I don't understand where it comes from.

http://www.ferris.edu/news/jimcrow/tom/

jb> displacement flush for disk caches that ignore the sync
jb> command.

Sounds like a good idea, but:

(1) won't this break the NFS guarantees you were just saying should never be broken?  I get it, someone else is breaking a standard, so how can ZFS be expected to yadda yadda yadda.  But I fear it will just push ``blame the sysadmin'' one step further out.  ex.,

  Q. ``with ZFS all my NFS clients become unstable after the server reboots,'' or ``I'm getting silent corruption with NFS.''

  A. ``your drives might have gremlins in them, no way to know,'' and ``well, what do you expect without a single integrity domain and TCP's weak checksums.  / no, i'm using a crossover cable, and FCS is not weak.  / without ZFS managing a layer of redundancy it is probably your RAM, or corruption on the, uh, between the Ethernet MAC chip and the PCI slot.''

(1a) I'm concerned about how it'll be reported when it happens.

(a) if it's not reported at all, then ZFS is hiding the fact that fsync() is not working.  Also, other journaling filesystems sometimes report when they find ``unexpected'' corruption, which is useful for finding both hardware and software problems.  I'm already concerned ZFS is not reporting enough, like when it says a vdev component is ONLINE, but 'zpool offline pool <component>' says 'no valid replicas', then after a scrub there is no change to zpool status, but zpool offline works again.  ZFS should not ``simplify'' the user interface to the point that it hides problems with itself and its environment just to avoid discussion.

(b) if it is reported, then whenever the reporter-blob raises its hand it will have the effect of exonerating ZFS in most people's minds, like the stupid CKSUM column does right now.  ``ZFS-FEED-B33F error?  oh yeah, that's the new ueberblock search code.  that means your disks are ignoring the SYNCHRONIZE CACHE command.  thank GOD you have ZFS, with ANY OTHER FILESYSTEM all bets would be totally off.  lucky you.  / I have tried ten different models from all four brands.  / yeah, sucks, don't it?  flagrant violation of the standard, industry wide.  / my linux testing tool says they're obeying the command fine.  / linux is crap.  / i added a patch to solaris to block the SYNC CACHE command and the disks got faster, so I think it's not being ignored.  / well, the stack is complicated and flushing happens at many levels, like think about controller performance, and that's completely unsupported, you are doing something REALLY UNSAFE there, you should NOT DO THAT, it is STUPID'' and so on, stalling the actual fix literally for years.

The right way to exonerate ZFS is to make a diagnosis tool for the disks which proves they're broken, and then not buy those disks---not to make a new class of ZFS fault report that could potentially capture all kinds of problems and then hazily assign blame to an untestable quantity.
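Something like the following would do (a crude sketch of what I have in mind, not an existing utility; the 7200rpm/~120-per-second ceiling is my assumption, and since flushing happens at several levels of the stack, treat one run as a data point, not proof):

/* flushtest.c --- crude sketch of a disk-exonerating diagnosis tool.
 * rewrite the same 512-byte block N times with O_DSYNC; each write is
 * supposed to be on stable storage before pwrite() returns.  rewriting
 * one LBA on a 7200rpm disk costs ~a full revolution, so an honest
 * disk tops out around 120/s; thousands/s means something in the stack
 * is acking out of cache and the flush is a no-op.
 * build: cc -o flushtest flushtest.c    run: ./flushtest <scratch path>
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

#define NWRITES 200   /* enough for a stable rate, short enough to be quick */

int main(int argc, char **argv)
{
    char buf[512];
    struct timeval t0, t1;
    double secs;
    int i, fd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <scratch-file-or-raw-device>\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_WRONLY | O_CREAT | O_DSYNC, 0600);
    if (fd < 0) { perror(argv[1]); return 1; }
    memset(buf, 0xa5, sizeof(buf));

    gettimeofday(&t0, NULL);
    for (i = 0; i < NWRITES; i++)
        if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf)) {
            perror("pwrite");   /* same offset every time, on purpose */
            return 1;
        }
    gettimeofday(&t1, NULL);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%d O_DSYNC rewrites in %.2fs = %.0f/s "
           "(7200rpm same-LBA ceiling is ~120/s)\n",
           NWRITES, secs, NWRITES / secs);
    close(fd);
    return 0;
}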
(2) disks are probably not the only thing dropping the write barriers.  So far, we're also suspecting (unproven!) iSCSI targets/initiators, particularly around a TCP reconnection event or target reboot, and VM stacks, both VirtualBox and the HVM in UltraSPARC T1.  probably other stuff.  I'm concerned that the assumptions you'll find safe to make about disks after you get started, like nothing is more than 1s stale, or send a CDB to size the on-disk cache and imagine it's a FIFO and it'll be no worse than that, or ``you can get an fsync by pausing reads for 500ms,'' or whatever, will add robustness for current and future broken disks but won't apply to other types of broken storage layer.

rmc> However, it is not so resilient when the storage system
rmc> suffers hiccups which cause phantom writes to occur
rmc> continuously, even if for a small period of time (say less
rmc> than 10 seconds), and then return to normal.

ha!  that is a great idea.  temporal ditto blocks: important writes should be written, aged in RAM for 1 minute, then rewritten. :)  This would help with latent sector errors caused by power sag/vibration, too.

but... even I will admit that at some point you have to give up and let the filesystem get corrupted.  actually, I'm more in the camp of making ZFS fragile to incorrect storage stacks, and offering an offline recovery tool that treats the corrupt pool as read-only and copies it into a new filesystem (so you need a second, same-size, empty pool to use the tool).  I like this painful way better than fsck-like things, and much better than silent workarounds.  but i'm probably in the wrong camp on this one.

My reasoning is, we will not be ultimately happy with a filesystem where fsync() is broken and ``that's the best you can do.''  To compete with Netapp, we need to bang on this thing until it's actually working.  So far I think sysadmins are receptive to the idea that they need to fix <...> about their setup, or make purchases with extreme care, or do testing before production.  We are not lazy and do not expect an appliance-on-a-CD.  it's just that pass-the-buck won't ever deliver something useful.

When ext3 was corrupting filesystems on laptops, ext3 got blamed, and ext3 was not at the root of the problem.  But no one _accepted_ that ext3 was correctly-coded until the overall problem was fixed.  (IIRC it was: you need to send drives a stop-unit command before sending the ACPI powerdown, because even if they ignore synchronize-cache they do still flush when told to stop-unit.  a sketch of sending that stop-unit by hand is at the end of this message.)

It's proper to have a strict separation between ``unclean shutdown'' and ``recovery from corruption.''  UFS does have the separation between log-rolling and fsck-ing, but ZFS could detect the difference between unclean shutdown and corruption a lot better than UFS, and that's good.  Currently ZFS seems to detect it by telling you ``pool's corrupt.  <shrug>, destroy it.''  The fact that the recovery tool is entirely absent isn't good, but keeping recovery actions like this ueberblock-search strictly separate makes delivering something truly correct on the ``unclean shutdown'' front more likely.

I think, if iSCSI target/initiator combinations are silently discarding 10sec worth of writes (ex., when they drop and reconnect their TCP session), then this needs to be proven, and their implementations can and need to be corrected, not speculated on and then worked around.  And I bet this same trick of beefing up performance numbers by discarding cache flushes is as rampant in the virtualization game as in the hard disk game.
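(here is the promised stop-unit sketch.  minimal and hypothetical: it uses Linux's SG_IO ioctl since the ext3-laptop story is a Linux one, the device path is whatever you pass in, and there's no real error recovery.  needs root, and it will spin the disk down, so point it at a scratch box:)

/* stopunit.c --- send SCSI START STOP UNIT (stop) via Linux SG_IO.
 * START bit clear = spin down; drives that ignore SYNCHRONIZE CACHE
 * reportedly still flush their write cache on this one.
 * build: cc -o stopunit stopunit.c    run: ./stopunit /dev/sdX
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

int main(int argc, char **argv)
{
    /* START STOP UNIT: opcode 0x1b; byte 4 bit 0 (START) clear = stop */
    unsigned char cdb[6] = { 0x1b, 0, 0, 0, 0x00, 0 };
    unsigned char sense[32];
    sg_io_hdr_t io;
    int fd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s /dev/sdX\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDWR | O_NONBLOCK);
    if (fd < 0) { perror(argv[1]); return 1; }

    memset(&io, 0, sizeof(io));
    io.interface_id    = 'S';
    io.cmd_len         = sizeof(cdb);
    io.cmdp            = cdb;
    io.dxfer_direction = SG_DXFER_NONE;   /* no data phase */
    io.sbp             = sense;
    io.mx_sb_len       = sizeof(sense);
    io.timeout         = 60000;           /* ms; spindown can take a while */

    if (ioctl(fd, SG_IO, &io) < 0) { perror("SG_IO"); return 1; }
    if (io.status != 0) {
        fprintf(stderr, "stop-unit failed, SCSI status 0x%x\n", io.status);
        return 1;
    }
    printf("%s: stop-unit sent\n", argv[1]);
    close(fd);
    return 0;
}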