On Wed, Sep 2, 2009 at 6:27 AM, Frank Middleton <f.middle...@apogeect.com> wrote:
> On 09/02/09 05:40 AM, Henrik Johansson wrote:
>
>> For those of us which have already upgraded and written data to our
>> raidz pools, are there any risks of inconsistency, wrong checksums in
>> the pool? Is there a bug id?
>
> This may not be a new problem insofar as it may also affect mirrors.
> As part of the ancient "mirrored drives should not have checksum
> errors thread", I used Richard Elling's amazing zcksummon script
> http://www.richardelling.com/Home/scripts-and-programs-1/zcksummon
> to help diagnose this (thanks, Richard, for all your help).
>
> The bottom line is that hardware glitches (as found on cheap PCs
> without ECC on buses and memory) can put ZFS into a mode where it
> detects bogus checksum errors. If you set copies=2, it seems to
> always be able to repair them, but they are never actually repaired.
> Every time you scrub, it finds a checksum error on the affected file(s)
> and it pretends to repair it (or may fail if you have copies=1 set).
>
> Note: I have not tried this on raidz, only mirrors, where it is
> highly reproducible. It would be really interesting to see if
> raidz gets results similar to the mirror case when running zcksummon.
> Note I have NEVER had this problem on SPARC, only on certain
> bargain-basement PCs (used as X-Terminals) which as it turns out
> have mobos notorious for not detecting bus parity errors.
>
> If this is the same problem, you can certainly mitigate it by
> setting copies=2 and actually copying the files (e.g., by
> promoting a snapshot, which I believe will do this - can someone
> confirm?). My guess is that snv121 has done something to make
> the problem more likely to occur, but the problem itself is
> quite old (predates snv100). Could you share with us some details
> of your hardware, especially how much memory and if it has ECC
> or bus parity?
>
> Cheers -- Frank
>
> On 09/02/09 05:40 AM, Henrik Johansson wrote:
>>
>> Hi Adam,
>>
>>
>> On Sep 2, 2009, at 1:54 AM, Adam Leventhal wrote:
>>
>>> Hi James,
>>>
>>> After investigating this problem a bit I'd suggest avoiding deploying
>>> RAID-Z until this issue is resolved. I anticipate having it fixed in
>>> build 124.
>>
>
>> Regards
>>
>> Henrik
>> http://sparcv9.blogspot.com
>>
>> _______________________________________________
>> zfs-discuss mailing list
>> zfs-discuss@opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
I see this issue on each of my X4540s, with 64GB of ECC memory and 1TB
drives. Rolling back to snv_118 shows no checksum errors; they appear
only on snv_121.

So the commodity-hardware explanation doesn't hold up here, unless Sun
isn't validating their equipment (not likely, as these servers have had
no hardware issues prior to this build).

--
Brent Jones
br...@servuhome.net