On 03/21/10 03:24 PM, Richard Elling wrote:
> I feel confident we are not seeing a b0rken drive here. But something is clearly amiss and we cannot rule out the processor, memory, or controller.
Absolutely no question of that, otherwise this list would be flooded :-). However, the purpose of the post wasn't really to diagnose the hardware but to ask about the behavior of ZFS under certain error conditions.
> Frank reports that he sees this on the same file, /lib/libdlpi.so.1, so I'll go out on a limb and speculate that there is something in the bit pattern for that file that intermittently triggers a bit flip on this system. I'll also speculate that this error will not be reproducible on another system.
Hopefully not, but you never know :-). However, this instance is different. The example you quote shows both expected and actual checksums to be the same. This time the expected and actual checksums are different, and fmdump isn't flagging any bad_ranges or set-bits (the behavior you observed is still happening, but orthogonal to this instance, at different times and not always on this file).

Since the file itself is OK, and the expected checksums are always the same, neither the file nor the metadata appear to be corrupted, so it appears that both are making it into memory without error. It would seem therefore that it is the actual checksum calculation that is failing, but only at boot time, and the calculated (bad) checksums differ (out of 16: 10, 3, and 3 are the same [1]), so it's not consistent.

At this point it would seem to be cpu or memory, but why only at boot? IMO it's an old and feeble power supply under strain pushing cpu or memory to a margin not seen during "normal" operation, which could be why diagnostics never see anything amiss (hence the importance of a good power supply). FWIW the machine passed everything vts could throw at it for a couple of days. Anyone got any suggestions for more targeted diagnostics?

There were several questions embedded in the original post, and I'm not sure any of them have really been answered:

o Why is the file flagged by ZFS as fatally corrupted still accessible? [Is this new behavior from b111b vs b125?]

o What possible mechanism could there be for the /calculated/ checksums of /four/ copies of just one specific file to be bad and no others?

o Why did this only happen at boot, to just this one file, which is also peculiarly subject to the bit flips you observed, also mostly at boot (sometimes at scrub)? I like the feeble power supply answer, but why just this one file? Bizarre...

# zpool get failmode rpool
NAME   PROPERTY  VALUE  SOURCE
rpool  failmode  wait   default

This machine is extremely memory limited, so I suspect that libdlpi.so.1 is not in a cache. Certainly, a brand new copy wouldn't be, and there's no problem writing and (much later) reading the new copy (or the old one, for that matter). It remains to be seen whether the brand new copy gets clobbered at boot (the machine, for all its faults, remains busily up and operational for months at a time). Maybe I should schedule a reboot out of curiosity :-).
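(For reference, this is roughly how I've been pulling those fields out of the error log. A rough sketch only: the -c class filter is standard fmdump, but the payload member names here -- cksum_expected, cksum_actual, bad_ranges, bad_set_bits -- are just what the post-b125 checksum ereports show on this box, so adjust to taste.)

# fmdump -eV -c ereport.fs.zfs.checksum | \
      egrep 'class|cksum_expected|cksum_actual|bad_ranges|bad_set_bits'

That lists, for each checksum ereport, the expected vs. calculated sums plus the bit-level diagnosis, which in this instance comes back empty.)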
> This sort of specific error analysis is possible after b125. See CR6867188 for more details.
Wasn't this in b125? IIRC we upgraded to b125 for this very reason. There certainly seems to be an overwhelming amount of data in the various logs!

Cheers -- Frank

[1] This could be (3+1) * 4, where in one instance all 3+1 happen to be the same. Does ZFS really read all 4 copies 4 times (by fmdump timestamp: 8 within 1uS, then 40mS later another 8, again within 1uS)? Not sure what the fmdump timestamps mean, so it's hard to find any pattern.
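P.S. In case anyone wants to check the counting in [1], this is the sort of thing I've been running (again just a sketch, with the same caveat about the payload member names):

# fmdump -e -c ereport.fs.zfs.checksum
# fmdump -eV -c ereport.fs.zfs.checksum | grep cksum_actual | sort | uniq -c

The first gives the one-line-per-ereport listing whose TIME column the 1uS/40mS spacing is read from; the second tallies how many of the calculated checksums came out identical.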