On 03/21/10 03:24 PM, Richard Elling wrote:
> I feel confident we are not seeing a b0rken drive here. But something is clearly amiss and we cannot rule out the processor, memory, or controller.
Absolutely no question of that, otherwise this list would be flooded :-). However, the purpose of the post wasn't really to diagnose the hardware but to ask about the behavior of ZFS under certain error conditions.
> Frank reports that he sees this on the same file, /lib/libdlpi.so.1, so I'll go out on a limb and speculate that there is something in the bit pattern for that file that intermittently triggers a bit flip on this system. I'll also speculate that this error will not be reproducible on another system.
Hopefully not, but you never know :-). However, this instance is different. The example you quote shows both expected and actual checksums to be the same. This time the expected and actual checksums are different, and fmdump isn't flagging any bad_ranges or set-bits (the behavior you observed is still happening, but orthogonal to this instance, at different times and not always on this file).

Since the file itself is OK, and the expected checksums are always the same, neither the file nor the metadata appear to be corrupted, so it appears that both are making it into memory without error. It would seem therefore that it is the actual checksum calculation that is failing, but only at boot time, and the calculated (bad) checksums differ (out of 16: 10, 3, and 3 are the same [1]), so it's not consistent.

At this point it would seem to be cpu or memory, but why only at boot? IMO it's an old and feeble power supply under strain pushing cpu or memory to a margin not seen during "normal" operation, which could be why diagnostics never see anything amiss (hence the importance of a good power supply). FWIW the machine passed everything vts could throw at it for a couple of days. Anyone got any suggestions for more targeted diagnostics?

There were several questions embedded in the original post, and I'm not sure any of them have really been answered:

o Why is the file flagged by ZFS as fatally corrupted still accessible? [Is this new behavior from b111b vs b125?]

o What possible mechanism could there be for the /calculated/ checksums of /four/ copies of just one specific file to be bad and no others?

o Why did this only happen at boot, to just this one file, which is also peculiarly subject to the bit flips you observed, also mostly at boot (sometimes at scrub)? I like the feeble power supply answer, but why just this one file? Bizarre...

# zpool get failmode rpool
NAME   PROPERTY  VALUE  SOURCE
rpool  failmode  wait   default

This machine is extremely memory limited, so I suspect that libdlpi.so.1 is not in a cache. Certainly, a brand new copy wouldn't be, and there's no problem writing and (much later) reading the new copy (or the old one, for that matter). It remains to be seen whether the brand new copy gets clobbered at boot (the machine, for all its faults, remains busily up and operational for months at a time). Maybe I should schedule a reboot out of curiosity :-).
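(For reference, this is roughly how I've been pulling those fields out of the error log. A rough sketch only: the -c class filter is standard fmdump, but the payload member names here -- cksum_expected, cksum_actual, bad_ranges, bad_set_bits -- are just what the post-b125 checksum ereports show on this box, so adjust to taste.)

# fmdump -eV -c ereport.fs.zfs.checksum | \
      egrep 'class|cksum_expected|cksum_actual|bad_ranges|bad_set_bits'

That lists, for each checksum ereport, the expected vs. calculated sums plus the bit-level diagnosis, which in this instance comes back empty.)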
> This sort of specific error analysis is possible after b125. See CR6867188 for more details.
Wasn't this in b125? IIRC we upgraded to b125 for this very reason. There certainly seems to be an overwhelming amount of data in the various logs!

Cheers -- Frank

[1] This could be (3+1) * 4, where in one instance all 3+1 happen to be the same. Does ZFS really read all 4 copies 4 times (by fmdump timestamp: 8 within 1uS, then 40mS later another 8, again within 1uS)? Not sure what the fmdump timestamps mean, so it's hard to find any pattern.
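P.S. In case anyone wants to check the counting in [1], this is the sort of thing I've been running (again just a sketch, with the same caveat about the payload member names):

# fmdump -e -c ereport.fs.zfs.checksum
# fmdump -eV -c ereport.fs.zfs.checksum | grep cksum_actual | sort | uniq -c

The first gives the one-line-per-ereport listing whose TIME column the 1uS/40mS spacing is read from; the second tallies how many of the calculated checksums came out identical.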