On 07/19/09 06:10 PM, Richard Elling wrote:
Not that bad. Uncommitted ZFS data in memory does not tend to live that long. Writes are generally out to media in 30 seconds.
Yes, but memory hits are instantaneous. On a reasonably busy system there may be buffers queued all the time. You may have a buffer in memory for only 100 µs, but it takes just 1 ns for that buffer to be clobbered. If that happened to be metadata about to be written to both sides of a mirror, then you are toast. Good thing this never happens, right :-)
Beware, if you go down this path of thought for very long, you'll soon be afraid to get out of bed in the morning... wait... most people actually die in beds, so perhaps you'll be afraid to go to bed instead :-)
Not at all. As with any rational business, my servers all have ECC, and getting up and out isn't a problem :-). Maybe I've had too many disks go bad, so I have ECC, mirrors, and backup to a system with ECC and mirrors (and copies=2, as well). Maybe I've read too many of your excellent blogs :-).
Sun doesn't even sell machines without ECC. There's a reason for that.
Yes, but all of the discussions in this thread can be classified as systems engineering problems, not product design problems.
Not sure I follow. We've had this discussion before. OSOL+ZFS lets you build enterprise class systems on cheap hardware that has errors. ZFS gives the illusion of being fragile because it, uniquely, reports these errors. Running OSOL as a VM in VirtualBox using MSWanything as a host is a bit like building on sand, but there's nothing in documentation anywhere to even warn folks that they shouldn't rely on software to get them out of trouble on cheap hardware. ECC is just one (but essential) part of that.
On 07/19/09 08:29 PM, David Magda wrote:
It's a nice-to-have, but at some point we're getting into the tinfoil hat-equivalent of data protection.
But it is going to happen! Sun sells only machines with ECC because that is the only way to ensure reliability. Someone who spends weeks building a media server at home isn't going to be happy if they lose one media file, let alone a whole pool. At least they should be warned that without ECC at some point they will lose files.
I'm not convinced that there is any reasonable scenario for losing an entire pool though, which was the original complaint in this thread. Even trusty old SPARCs occasionally hang without a panic (in my experience especially when a disk is about to go bad). If this happens, and you have to power cycle because even stop-A doesn't respond, are you all saying that there is a risk of losing a pool at that point? Surely the whole point of a journalled file system is that it is pretty much proof against any catastrophe, even the one described initially.
There have been a couple of (to me) unconvincing explanations of how this pool was lost. Surely if there is a mechanism whereby unflushed I/Os can cause fatal metadata corruption, this should be a high priority bug since this can happen on /any/ hardware; it is just more likely if the foundations are shaky, so the explanation must require more than that if it isn't a bug.
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss