On Mar 22, 2010, at 4:21 PM, Frank Middleton wrote:

> On 03/21/10 03:24 PM, Richard Elling wrote:
>
>> I feel confident we are not seeing a b0rken drive here. But something is
>> clearly amiss and we cannot rule out the processor, memory, or controller.
>
> Absolutely no question of that, otherwise this list would be flooded :-).
> However, the purpose of the post wasn't really to diagnose the hardware
> but to ask about the behavior of ZFS under certain error conditions.
>
>> Frank reports that he sees this on the same file, /lib/libdlpi.so.1, so
>> I'll go out on a limb and speculate that there is something in the bit
>> pattern for that file that intermittently triggers a bit flip on this
>> system. I'll also speculate that this error will not be reproducible on
>> another system.
>
> Hopefully not, but you never know :-). However, this instance is different.
> The example you quote shows both expected and actual checksums to be
> the same.

Look again, the checksums are different.

> This time the expected and actual checksums are different and
> fmdump isn't flagging any bad_ranges or set-bits (the behavior you observed
> is still happening, but orthogonal to this instance at different times and
> not always on this file).

Don't forget the -V flag :-)

> Since the file itself is OK, and the expected checksums are always the same,
> neither the file nor the metadata appear to be corrupted, so it appears
> that both are making it into memory without error.
>
> It would seem therefore that it is the actual checksum calculation that is
> failing. But, only at boot time, the calculated (bad) checksums differ (out
> of 16, 10, 3, and 3 are the same [1]) so it's not consistent. At this point
> it would seem to be cpu or memory, but why only at boot? IMO it's an
> old and feeble power supply under strain pushing cpu or memory to a
> margin not seen during "normal" operation, which could be why diagnostics
> never see anything amiss (and the importance of a good power supply).
>
> FWIW the machine passed everything vts could throw at it for a couple
> of days. Anyone got any suggestions for more targeted diagnostics?
>
> There were several questions embedded in the original post, and I'm not
> sure any of them have really been answered:
>
> o Why is the file flagged by ZFS as fatally corrupted still accessible?
>   [is this new behavior from b111b vs b125?].
>
> o What possible mechanism could there be for the /calculated/ checksums
>   of /four/ copies of just one specific file to be bad and no others?

Broken CPU, HBA, bus, or memory.

> o Why did this only happen at boot to just this one file which also is
>   peculiarly subject to the bitflips you observed, also mostly at boot
>   (sometimes at scrub)? I like the feeble power supply answer, but why
>   just this one file? Bizarre...

Broken CPU, HBA, bus, memory, or power supply.

> # zpool get failmode rpool
> NAME   PROPERTY  VALUE  SOURCE
> rpool  failmode  wait   default
>
> This machine is extremely memory limited, so I suspect that libdlpi.so.1 is
> not in a cache. Certainly, a brand new copy wouldn't be, and there's no
> problem writing and (much later) reading the new copy (or the old one,
> for that matter). It remains to be seen if the brand new copy gets clobbered
> at boot (the machine, for all its faults, remains busily up and operational
> for months at a time). Maybe I should schedule a reboot out of curiosity :-).
>
>> This sort of specific error analysis is possible after b125. See CR6867188
>> for more details.
>
> Wasn't this in b125? IIRC we upgraded to b125 for this very reason. There
> certainly seems to be an overwhelming amount of data in the various logs!
>
> Cheers -- Frank
>
> [1] This could be (3+1) * 4 where in one instance all 3+1 happen to be the
> same. Does ZFS really read all 4 copies 4 times (by fmdump timestamp, 8
> within 1uS, 40mS later, another 8, again within 1uS)? Not sure what the
> fmdump timestamps mean, so it's hard to find any pattern.

Transient failures are some of the most difficult to track down. Not all
transient failures are random.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
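
As an illustration of the tally discussed in footnote [1] (16 error reports, of which 10, 3, and 3 share a computed checksum), here is a minimal Python sketch that groups reports by their computed checksum. Everything in it is an assumption made for illustration: the function name tally_checksums is hypothetical, and the checksum strings are placeholders, not values from this system. Extracting the (expected, actual) pairs from the fmdump -eV output is not shown, since the thread does not give the exact field layout.

```python
from collections import Counter

def tally_checksums(reports):
    """Count how many reports share each *computed* (actual) checksum.

    `reports` is a list of (expected, actual) pairs, one per error report.
    Returns (actual_checksum, count) pairs, most frequent first.
    """
    counts = Counter(actual for _expected, actual in reports)
    return counts.most_common()

# Placeholder data only: one expected checksum, three distinct bad results,
# arranged to mirror the 10 / 3 / 3 split described in the thread.
expected = "0xEXPECTED"
reports = (
    [(expected, "0xBAD_A")] * 10
    + [(expected, "0xBAD_B")] * 3
    + [(expected, "0xBAD_C")] * 3
)

groups = tally_checksums(reports)
for actual, count in groups:
    print(f"{count:2d} reports computed {actual}")

# If every report carries the same expected checksum, the on-disk metadata is
# at least consistent, and the variation is in what was computed after the read.
same_expected = len({exp for exp, _act in reports}) == 1
print("expected checksum identical in all reports:", same_expected)
```

Sorting the groups by fmdump timestamp as well might show whether the bad results cluster by read pass or by copy, which is the pattern the footnote is looking for.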