On Wed, Sep 2, 2009 at 6:27 AM, Frank Middleton<f.middle...@apogeect.com> wrote:
> On 09/02/09 05:40 AM, Henrik Johansson wrote:
>
>> For those of us who have already upgraded and written data to our
>> raidz pools, is there any risk of inconsistency or wrong checksums in
>> the pool? Is there a bug id?
>
> This may not be a new problem insofar as it may also affect mirrors.
> As part of the ancient "mirrored drives should not have checksum
> errors" thread, I used Richard Elling's amazing zcksummon script
> http://www.richardelling.com/Home/scripts-and-programs-1/zcksummon
> to help diagnose this (thanks, Richard, for all your help).
>
> The bottom line is that hardware glitches (as found on cheap PCs
> without ECC on buses and memory) can put ZFS into a mode where it
> detects bogus checksum errors. With copies=2 set, it always reports
> a successful repair, but the errors are never actually fixed: every
> scrub finds a checksum error on the same file(s) and claims to
> repair it again (or may fail if you have copies=1 set).
>
> Note: I have not tried this on raidz, only mirrors, where it is
> highly reproducible. It would be really interesting to see if
> raidz gets results similar to the mirror case when running zcksummon.
> Note I have NEVER had this problem on SPARC, only on certain
> bargain-basement PCs (used as X-Terminals) which as it turns out
> have mobos notorious for not detecting bus parity errors.
>
> If this is the same problem, you can certainly mitigate it by
> setting copies=2 and actually copying the files (e.g., by
> promoting a snapshot, which I believe will do this - can someone
> confirm?). My guess is that snv121 has done something to make
> the problem more likely to occur, but the problem itself is
> quite old (predates snv100). Could you share with us some details
> of your hardware, especially how much memory and if it has ECC
> or bus parity?
>
> Cheers -- Frank
>
> On 09/02/09 05:40 AM, Henrik Johansson wrote:
>>
>> Hi Adam,
>>
>>
>> On Sep 2, 2009, at 1:54 AM, Adam Leventhal wrote:
>>
>>> Hi James,
>>>
>>> After investigating this problem a bit I'd suggest avoiding deploying
>>> RAID-Z
>>> until this issue is resolved. I anticipate having it fixed in build 124.
>>
>
>> Regards
>>
>> Henrik
>> http://sparcv9.blogspot.com <http://sparcv9.blogspot.com/>
>>
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> zfs-discuss mailing list
>> zfs-discuss@opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
>

I see this issue on each of my X4540s, which have 64GB of ECC memory and
1TB drives. Rolling back to snv_118 does not reveal any checksum errors;
they appear only on snv_121.

So the commodity-hardware explanation doesn't hold up here, unless Sun
isn't validating their equipment (not likely, as these servers have had
no hardware issues prior to this build).
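
For anyone who wants to compare builds the same way, here is a rough sketch (not something posted in this thread) of how the CKSUM column of saved `zpool status` output could be tallied. The function name `cksum_errors` and the sample text are hypothetical. It assumes the usual NAME/STATE/READ/WRITE/CKSUM table layout with plain integer counts, and it skips any row whose count is abbreviated (such as 1.2K).

```python
def cksum_errors(status_text):
    """Return {device_name: count} for table rows with a nonzero CKSUM column.

    Assumes `zpool status`-style text; rows whose CKSUM field is not a
    plain integer are ignored.
    """
    errors = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            # A blank line ends the device table.
            in_table = False
            continue
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_table = True
            continue
        if in_table and len(fields) >= 5 and fields[4].isdigit():
            count = int(fields[4])
            if count > 0:
                errors[fields[0]] = count
    return errors


# Hypothetical sample text, for illustration only (not real output
# from the systems discussed here).
SAMPLE = """\
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1    ONLINE       0     0     4
            c0t0d0  ONLINE       0     0     2
            c0t1d0  ONLINE       0     0     0

errors: No known data errors
"""

if __name__ == "__main__":
    print(cksum_errors(SAMPLE))
```

Running it against status output captured after a scrub on each build would make it easy to see whether the nonzero counts show up only on one build.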


-- 
Brent Jones
br...@servuhome.net
