Howdy,

Several times now I have had issues with consumer-grade PC hardware and ZFS not getting along. The problem is not the disks but the fact that I don't have ECC memory or end-to-end checking on the data path. What happens is that random memory errors and bit flips get written out to disk, and when the data is read back ZFS reports a checksum failure:

  pool: myth
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        myth        ONLINE       0     0    48
          raidz1    ONLINE       0     0    48
            c7t1d0  ONLINE       0     0     0
            c7t3d0  ONLINE       0     0     0
            c6t1d0  ONLINE       0     0     0
            c6t2d0  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /myth/tv/1504_20080216203700.mpg
        /myth/tv/1509_20080217192700.mpg

Note there are no errors against the individual disks, only against the raidz1 vdev and the pool as a whole. I get the same thing on a mirrored pool, where both sides of the mirror have identical errors. All I can assume is that the data was corrupted after the checksum was calculated and was flushed to disk in that state. In the past the cause was a motherboard capacitor that had popped, which was enough to generate these errors under load.

At any rate, ZFS is doing the right thing by telling me. What I don't like is that from that point on I can't convince ZFS to ignore it. The data in question is video files; a bit flip here or there won't matter. But if ZFS reads the affected block it returns an I/O error, and until I restore the file I have no option but to try to make the application skip over it. With UFS, for example, I would never have known, but ZFS makes a point of stopping anything that uses the data. Understandable, but annoying as well.

What I would like to see is an option to ZFS in the style of 'onerror' for UFS, i.e. the ability to tell ZFS to join fight club and let what doesn't matter truly slide. For example:

        zfs set erroraction=[iofail|log|ignore]

This would default to the current behaviour, "iofail". If you wanted to try to recover or repair data, you could set it to "log", to generate, say, an FMA event reporting the bad checksums, or to "ignore", to get on with your day.
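
To make the three settings concrete, here is a rough sketch in Python of the read-path behaviour I have in mind. This is not ZFS code and says nothing about how ZFS is implemented: the erroraction property does not exist, the names ErrorAction and read_block are made up for illustration, SHA-256 simply stands in for whatever checksum the pool uses, and a Python log message stands in for the FMA event.

```python
import errno
import hashlib
import logging
from enum import Enum

log = logging.getLogger("erroraction-sketch")


class ErrorAction(Enum):
    """Hypothetical values for the proposed 'erroraction' property."""
    IOFAIL = "iofail"   # current behaviour: fail the read
    LOG = "log"         # report the mismatch, but hand the data back
    IGNORE = "ignore"   # hand the data back silently


def checksum(data: bytes) -> bytes:
    # Stand-in checksum for the sketch; not a claim about ZFS internals.
    return hashlib.sha256(data).digest()


def read_block(data: bytes, stored: bytes,
               action: ErrorAction = ErrorAction.IOFAIL) -> bytes:
    """Return a block's data, applying the chosen action on a mismatch."""
    if checksum(data) == stored:
        return data
    if action is ErrorAction.IOFAIL:
        # What the application sees today: an I/O error on read.
        raise OSError(errno.EIO, "checksum mismatch")
    if action is ErrorAction.LOG:
        # Stand-in for generating an FMA event.
        log.warning("checksum mismatch, returning data anyway")
    # LOG and IGNORE both return the (possibly damaged) data.
    return data


if __name__ == "__main__":
    good = b"video frame"
    stored = checksum(good)
    flipped = b"video frbme"  # one damaged byte
    print(read_block(flipped, stored, ErrorAction.IGNORE))
```

With the default, a damaged block still raises an I/O error exactly as it does now, so nothing changes unless the administrator explicitly opts in for a dataset where a flipped bit is tolerable.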

As mentioned, I see this mostly as an option to help recover data once the underlying problem has been identified or fixed. Of course it's data-specific, but if the application can tolerate or handle the damage, why should ZFS get in the way?

Just a thought.

Cheers,
Adrian

PS: And yes, I am now buying some ECC memory.

This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss