On 03/22/10 11:50 PM, Richard Elling wrote:
> Look again, the checksums are different.

Whoops, you are correct, as usual. Just 6 bits out of 256 different...
Last year:
expected  4a027c11b3ba4cec bf274565d5615b7b 3ef5fe61b2ed672e ec8692f7fd33094a
actual    4a027c11b3ba4cec bf274567d5615b7b 3ef5fe61b2ed672e ec86a5b3fd33094a

Last month (obviously a different file):
expected  4b454eec8aebddb5 3b74c5235e1963ee c4489bdb2b475e76 fda3474dd1b6b63f
actual    4b454eec8aebddb5 3b74c5255e1963ee c4489bdb2b475e76 fda354c1d1b6b63f

Look at which digits differ: digits 24 and 53-56 in both cases. But comparing
the bits themselves, there's no discernible pattern. Is this an artifact of the
algorithm, produced by a single erring bit always sitting at the same offset?
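
For what it's worth, a quick way to enumerate exactly which bits flipped
is to XOR the two checksums word by word. A throwaway sketch (the
constants are hand-typed from the "last year" pair above, so it is
illustrative only):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    /* expected/actual checksum words as printed by fmdump -eV */
    uint64_t expected[4] = { 0x4a027c11b3ba4cecULL, 0xbf274565d5615b7bULL,
                             0x3ef5fe61b2ed672eULL, 0xec8692f7fd33094aULL };
    uint64_t actual[4]   = { 0x4a027c11b3ba4cecULL, 0xbf274567d5615b7bULL,
                             0x3ef5fe61b2ed672eULL, 0xec86a5b3fd33094aULL };

    for (int w = 0; w < 4; w++) {
        uint64_t diff = expected[w] ^ actual[w];

        for (int b = 63; b >= 0; b--) {
            if (diff & (1ULL << b))
                printf("word %d, bit %d flipped\n", w, b);
        }
    }
    return (0);
}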

> don't forget the -V flag :-)

I didn't. As mentioned, there are subsequent set-bits errors (14 minutes
later), but none for this particular incident. I'll send you the results
separately, since they are so puzzling. These 16 checksum failures
on libdlpi.so.1 were the only fmdump -eV entries for the entire boot
sequence, except that it started out with one ereport.fs.zfs.data,
whatever that is, for a total of exactly 17 records: 9 within 1 µs, then
8 more 40 ms later, also within 1 µs. Then nothing for 4 minutes, one
more checksum failure ("bad_range_sets ="), then 10 minutes later,
two with the set-bits error, one for each disk. That's it.

> o Why is the file flagged by ZFS as fatally corrupted still accessible?

This is the part I was hoping to get answers for, since AFAIK this
should be impossible. None of this is having any operational impact,
so all of these issues are of interest only, but this one is a bit scary!

> Broken CPU, HBA, bus, memory, or power supply.

No argument there. Doesn't leave much, does it :-). Since the file itself
appears to be uncorrupted, and the metadata is consistent across all 16
entries, it would seem that the checksum calculation itself is failing;
everything else in this case appears to be OK. Is there a way to apply
the fletcher2 algorithm interactively, as with sum(1) or cksum(1) (i.e.,
outside the scope of ZFS), to see whether it is in some way
pattern-sensitive on this CPU? Since only a small subset of files is
affected, this should be easy to verify: start a scrub to heat things
up and then, in parallel, run checksums in a tight loop...
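
If it helps, the native fletcher2 loop in ZFS (fletcher_2_native in
fletcher.c) is just four 64-bit additions per 16-byte pair, so a
standalone tester is tiny. A minimal sketch, assuming the stock
algorithm from the OpenSolaris source; note that ZFS checksums each
block separately, so running this over a whole file will not reproduce
the values fmdump reports -- the point is only to exercise the same
arithmetic outside ZFS:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int
main(int argc, char **argv)
{
    FILE *fp;
    uint64_t a0 = 0, b0 = 0, a1 = 0, b1 = 0;
    uint64_t buf[2];

    if (argc != 2 || (fp = fopen(argv[1], "rb")) == NULL) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return (1);
    }

    /* Same running sums as the fletcher2 loop, two words at a time.
     * Any trailing partial pair is ignored -- this is only a sketch. */
    while (fread(buf, sizeof (buf), 1, fp) == 1) {
        a0 += buf[0];
        a1 += buf[1];
        b0 += a0;
        b1 += a1;
    }
    (void) fclose(fp);

    printf("%016" PRIx64 " %016" PRIx64 " %016" PRIx64 " %016" PRIx64 "\n",
        a0, a1, b0, b1);
    return (0);
}

Compile it, run it against libdlpi.so.1 in a tight loop while a scrub
is heating things up, and watch for two consecutive runs that disagree.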

> Transient failures are some of the most difficult to track down. Not all
> transient failures are random.

Indeed, although this doesn't seem to be random. The hits to libdlpi.so.1
seem to be quite reproducible, as you've seen from the fmdump log,
although I doubt this particular scenario will happen again. Can you
think of any tools to investigate this? I suppose I could extract the
checksum code from ZFS itself and build one, but that would take quite
a lot of time. Is there any documentation that explains the output of
fmdump -eV? What are set-bits, for example?

I guess not... From fmdump(1M):

       The error log file contains Private telemetry information
       used by Sun's automated diagnosis software.
  ...
       Each problem recorded in the fault log is identified by:

         o    The time of its diagnosis

So did ZFS really read 8 copies of libdlpi.so.1 within 1 µs, wait
40 ms, and then read another 8 copies within 1 µs again? I doubt it :-).
I bet it took more than 1 µs just to (mis)calculate the checksum
(1.6 GHz, 16-bit CPU).

Thanks -- Frank
