Thanks to everyone who made suggestions! This machine has run memtest for a week and VTS for several days with no errors. It does seem that the problem is probably in the CPU cache.
On 03/24/10 10:07 AM, Damon Atkins wrote:
You could try copying the file to /tmp (i.e. swap/RAM) and doing a continuous loop of checksums
On a variation of your suggestion, I implemented a bash script that runs sha1sum 10,000 times with a pause of 0.1 s between attempts and compares each result against what appears to be the correct digest (the loop is sketched in the P.S. below). The results:

sha1sum on /lib/libdlpi.so.1: 11% incorrect results
sha1sum on /tmp/libdlpi.so.1: 5 failures out of 10,000
sha1sum on /lib/libpam.so.1: zero errors in 10,000
sha1sum on /tmp/libpam.so.1: ditto

So what we have is a pattern-sensitive failure that is also sensitive to how busy the CPU is (and doesn't fail running VTS). md5sum and sha256sum produced similar results, and presumably so would fletcher2. To get really meaningful results the machine should be otherwise idle (but then, maybe it wouldn't fail).

Is anyone willing to speculate (or have any suggestions for further experiments) about what failure mode could cause a checksum calculation to be pattern sensitive, and also thousands of times more likely to fail when the file is read from disk rather than tmpfs? FWIW the failures are pretty consistent, mostly but not always producing the same bad checksum.

So at boot the CPU is busy, increasing the probability of this pattern-sensitive failure, and this one time it failed on every read of /lib/libdlpi.so.1. With copies=1 this was twice as likely to happen, and when it did ZFS returned an error on any attempt to read the file. With copies=2, in this case, it doesn't return an error when attempting to read. Also, there were no set-bit errors this time, but then I have no idea what a set-bit error is.

On 03/24/10 12:32 PM, Richard Elling wrote:
Clearly, fletcher2 identified the problem.
Ironically, on this hardware it seems it created the problem :-). However, you have been vindicated: it was a pattern-sensitive problem, as you have long suggested it might be.

So: that the file is still readable is a mystery, but how it came to be flagged as bad in ZFS isn't, any more.

Cheers -- Frank
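P.S. For anyone who wants to try reproducing this, the loop was roughly the following. This is a minimal sketch rather than the exact script I ran; FILE and GOOD are placeholders (substitute the file to test and a known-good SHA-1 digest taken from a machine that checksums it correctly):

    #!/bin/bash
    # Run sha1sum repeatedly and count mismatches against a known-good digest.
    # FILE and GOOD are placeholders, not the values from the results above.
    FILE=/lib/libdlpi.so.1
    GOOD=0000000000000000000000000000000000000000

    fail=0
    for ((i = 1; i <= 10000; i++)); do
        sum=$(sha1sum "$FILE" | awk '{print $1}')
        if [ "$sum" != "$GOOD" ]; then
            fail=$((fail + 1))
            echo "attempt $i: bad checksum $sum"
        fi
        sleep 0.1
    done
    echo "$fail failures out of 10000"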