Richard Elling wrote:
> Adrian Saul wrote:
>> Howdy,
>>
>> I have at several times had issues with consumer grade PC hardware
>> and ZFS not getting along. The problem is not the disks but the fact
>> I dont have ECC and end to end checking on the datapath. What is
>> happening is that random memory errors and bit flips are written out
>> to disk and when read back again ZFS reports it as a checksum
>> failure:
>>
>>   pool: myth
>>  state: ONLINE
>> status: One or more devices has experienced an error resulting in
>>         data corruption.  Applications may be affected.
>> action: Restore the file in question if possible.  Otherwise restore
>>         the entire pool from backup.
>>    see: http://www.sun.com/msg/ZFS-8000-8A
>>  scrub: none requested
>> config:
>>
>>         NAME        STATE     READ WRITE CKSUM
>>         myth        ONLINE       0     0    48
>>           raidz1    ONLINE       0     0    48
>>             c7t1d0  ONLINE       0     0     0
>>             c7t3d0  ONLINE       0     0     0
>>             c6t1d0  ONLINE       0     0     0
>>             c6t2d0  ONLINE       0     0     0
>>
>> errors: Permanent errors have been detected in the following files:
>>
>>         /myth/tv/1504_20080216203700.mpg
>>         /myth/tv/1509_20080217192700.mpg
>>
>> Note there are no disk errors, just entire RAID errors. I get the
>> same thing on a mirror pool where both sides of the mirror have
>> identical errors. All I can assume is that it was corrupted after
>> the checksum was calculated and flushed to disk like that. In the
>> past it was a motherboard capacitor that had popped - but it was
>> enough to generate these errors under load.
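As a side note for the archives: that output is the classic signature
of corruption above the disk layer - the raidz vdev accumulates CKSUM
errors while every leaf disk stays clean. Once the damaged files have
been restored, the usual cleanup is roughly the following (just a
sketch, with the pool name taken from your output; the errors will of
course come back if the hardware problem is not fixed first):

    # show which files contain unrecoverable blocks
    zpool status -v myth

    # after restoring those files, reset the error counters and
    # re-read every block in the pool to confirm nothing else is bad
    zpool clear myth
    zpool scrub myth
    zpool status myth        # check the scrub result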
I got a similar CKSUM error recently in which a block from a different
file ended up in one of my files. So this was not a simple bit-flip;
64K of the file was bad. However, I do not think any disk filesystem
should tolerate even bit flips. Even in video files, I'd want to know
about it. (In my case, I hacked the ZFS source to temporarily ignore
the error so I could see what was wrong.)

So your error(s) might be something of this kind - except that, if so,
I do not understand how both sides of your mirror were affected in the
same way. Do you know this for certain, or did ZFS simply say that the
file was not recoverable? In other words, it might have had different
bad bits on the two mirror halves. On my pool, at least on subsequent
reboots, no read or write errors were reported either, just CKSUM
errors (I do seem to recall other errors listed - read or write - but
they were cleared on reboot, so I cannot recall exactly). And I would
think it is possible to get no disk errors if it is simply a
misdirected block write. Still, I would then wonder why I didn't see
*two* files with errors, if that is what happened to me.

I guess I am saying that this may not be a memory glitch; it could
also be an IDE cable issue (as mine turned out to be). See my post
here:

http://lists.freebsd.org/pipermail/freebsd-stable/2008-February/040355.html

>> At any rate ZFS is doing the right thing by telling me - what I
>> dont like is that from that point on I cant convince ZFS to ignore
>> it. The data in question is video files - a bit flip here or there
>> wont matter. But if ZFS reads the affected block it returns an I/O
>> error, and until I restore the file I have no option but to try and
>> make the application skip over it. If it was UFS for example I would
>> have never known, but ZFS makes a point of stopping anything using
>> it - understandably, but annoyingly as well.

I understand your situation, and I agree that user control might be
nice (in my case, I would not have had to tweak the ZFS code). I do
think that zpool status should still reveal the error, however, even
if the file read does not report it (if you have set ZFS to ignore
the error). I can also imagine this could be a bit dangerous if, e.g.,
the user forgets this option is set.

>> PS: And yes, I am now buying some ECC memory.

Good practice in general - I always use ECC. There is nothing worse
than silent data corruption.

> I don't recall when this arrived in NV, but the failmode parameter
> for storage pools has already been implemented.  From zpool(1m):
>
>      failmode=wait | continue | panic
>
>          Controls the system behavior in the event of catastrophic
>          pool failure.  This condition is typically a result of a
>          loss of connectivity to the underlying storage device(s) or
>          a failure of all devices within the pool.  The behavior of
>          such an event is determined as follows:
>
>          wait        Blocks all I/O access until the device
>                      connectivity is recovered and the errors are
>                      cleared.  This is the default behavior.
>
>          continue    Returns EIO to any new write I/O requests but
>                      allows reads to any of the remaining healthy
>                      devices.  Any write requests that have yet to
>                      be committed to disk would be blocked.
>
>          panic       Prints out a message to the console and
>                      generates a system crash dump.

Is "wait" the default behavior now? When I had CKSUM errors, reading
the file would return EIO and stop reading at that point (returning
only the good data up to there). Do you mean it blocks access on the
errored file, or on the whole device? I've noticed the former, but not
the latter.
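In case it is useful, failmode is an ordinary per-pool property, so on
a build new enough to have it you should be able to inspect and change
it like any other pool property - a sketch, untested here, with the
pool name taken from the earlier output:

    # show the current setting (and whether it is still the default)
    zpool get failmode myth

    # switch between wait / continue / panic
    zpool set failmode=continue myth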
Also, I'm not sure I understand "continue". That also seems more
severe than the current behavior, in which access to any file other
than the one(s) with errors still works.

-Joe