On 03/22/10 11:50 PM, Richard Elling wrote:
Look again, the checksums are different.
Whoops, you are correct, as usual. Just 8 bits out of 256 different...

Last year:
expected 4a027c11b3ba4cec bf274565d5615b7b 3ef5fe61b2ed672e ec8692f7fd33094a
actual   4a027c11b3ba4cec bf274567d5615b7b 3ef5fe61b2ed672e ec86a5b3fd33094a

Last month (obviously a different file):
expected 4b454eec8aebddb5 3b74c5235e1963ee c4489bdb2b475e76 fda3474dd1b6b63f
actual   4b454eec8aebddb5 3b74c5255e1963ee c4489bdb2b475e76 fda354c1d1b6b63f

Look at which hex digits differ - digits 24 and 53-56 in both cases. But comparing the bits themselves, there's no discernible pattern. Is this an artifact of the algorithm, where one erring bit always ends up at the same offset?
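To make the digit comparison concrete, XOR-ing the corresponding 64-bit words of the two checksums shows exactly which bits flipped (diff_words is just a throwaway helper I made up, not anything from ZFS; the hex strings are copied from above):

```python
def diff_words(expected, actual):
    # XOR corresponding 64-bit words of two whitespace-separated
    # hex checksum strings; nonzero words mark where bits flipped.
    return [int(e, 16) ^ int(a, 16)
            for e, a in zip(expected.split(), actual.split())]

# The "last year" pair from above
masks = diff_words(
    "4a027c11b3ba4cec bf274565d5615b7b 3ef5fe61b2ed672e ec8692f7fd33094a",
    "4a027c11b3ba4cec bf274567d5615b7b 3ef5fe61b2ed672e ec86a5b3fd33094a")
for m in masks:
    print(f"{m:016x}  {bin(m).count('1')} bit(s) flipped")
```

Running the same thing on the "last month" pair puts the nonzero masks in the same two words, which is why the differing digit positions line up between the two incidents.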
don't forget the -V flag :-)
I didn't. As mentioned, there are subsequent set-bit errors (14 minutes later), but none for this particular incident. I'll send you the results separately since they are so puzzling. These 16 checksum failures on libdlpi.so.1 were the only fmdump -eV entries for the entire boot sequence, except that it started out with one ereport.fs.zfs.data (whatever that is), for a total of exactly 17 records: 9 within 1 µs, then 8 more 40 ms later, also within 1 µs. Then nothing for 4 minutes, one more checksum failure ("bad_range_sets ="), then 10 minutes later, two with the set-bits error, one for each disk. That's it.
o Why is the file flagged by ZFS as fatally corrupted still accessible?
This is the part I was hoping to get answers for since AFAIK this should be impossible. Since none of this is having any operational impact, all of these issues are of interest only, but this is a bit scary!
Broken CPU, HBA, bus, memory, or power supply.
No argument there. Doesn't leave much, does it :-). Since the file itself appears to be uncorrupted and the metadata is consistent for all 16 entries, it would seem that the checksum calculation itself is failing - everything else appears to be OK. Is there a way to apply the fletcher2 algorithm interactively, as with sum(1) or cksum(1) (i.e., outside the scope of ZFS), to see if it is in some way pattern-sensitive on this CPU? Since only a small subset of files is affected, this should be easy to verify: start a scrub to heat things up and then, in parallel, run checksums in a tight loop...
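For what it's worth, the fletcher2 idea is simple enough to try outside ZFS: four 64-bit accumulators over pairs of little-endian 64-bit words, all arithmetic modulo 2^64. Here's a rough sketch - my own transcription, not code lifted from the ZFS source, so the word order, endianness, and zero-padding are assumptions, and since ZFS checksums whole on-disk blocks this won't reproduce pool checksums directly:

```python
import struct

def fletcher2(data: bytes):
    # Four 64-bit running sums over pairs of little-endian 64-bit
    # words; all additions wrap modulo 2^64 (masked with M).
    M = (1 << 64) - 1
    a0 = a1 = b0 = b1 = 0
    data += b"\0" * (-len(data) % 16)   # pad to 16-byte multiple (assumption)
    for off in range(0, len(data), 16):
        w0, w1 = struct.unpack_from("<QQ", data, off)
        a0 = (a0 + w0) & M
        a1 = (a1 + w1) & M
        b0 = (b0 + a0) & M
        b1 = (b1 + a1) & M
    return a0, a1, b0, b1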
Transient failures are some of the most difficult to track down. Not all transient failures are random.
Indeed, although this doesn't seem to be random. The hits to libdlpi.so.1 seem to be quite reproducible, as you've seen from the fmdump log, although I doubt this particular scenario will happen again. Can you think of any tools to investigate this? I suppose I could extract the checksum code from ZFS itself to build one, but that would take quite a lot of time.

Is there any documentation that explains the output of fmdump -eV? What are set-bits, for example? I guess not... from fmdump(1M):

    The error log file contains private telemetry information
    used by Sun's automated diagnosis software.
    ...
    Each problem recorded in the fault log is identified by:

    o The time of its diagnosis

So did ZFS really read 8 copies of libdlpi.so.1 within 1 µs, wait 40 ms, and then read another 8 copies within 1 µs again? I doubt it :-). I bet it took > 1 µs just to (mis)calculate the checksum (1.6 GHz, 16-bit CPU).

Thanks -- Frank

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss