It was someone from Sun that recently asked me to repost here
about the checksum problem on mirrored drives. I was reluctant
to do so because you and Bob might start flames again, and you
did! You both sound very defensive, but of course I would never
make an unsubstantiated speculation that you might have vulnerable
hardware :-). But in case you do, please don't shoot the
messenger...

Instead of being negative, how about some conjectures of your
own about this? Here's a summary of what is happening:

An old machine with mirrored drives and a suspect mobo (maybe
not checking PCI parity) gets checksum errors on reboot and scrub.
With copies=1 it fails to repair them. With copies=2 it apparently
fixes them, but zcksummon shows quite clearly that zfs finds and
repairs them again on every scrub, even though the scrub itself
reports no errors. Typically these files are system
libraries and unless you actually replace them, they are
never truly repaired.
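
For anyone following along, here is a toy Python model of the verify-and-repair
step that a scrub is expected to perform on a block with more than one copy.
It is only a sketch under stated assumptions: scrub_block is a hypothetical
helper, SHA-256 stands in for whatever checksum the pool actually uses, and
none of this is taken from the ZFS source.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    # SHA-256 is a stand-in; this is not a claim about the pool's checksum.
    return hashlib.sha256(data).digest()


def scrub_block(copies: list[bytearray], expected: bytes) -> tuple[int, bool]:
    """Verify every copy of one block against its stored checksum.

    Returns (number_of_copies_repaired, block_is_recoverable).
    """
    # Find any copy whose contents still match the stored checksum.
    good = next((bytes(c) for c in copies if checksum(bytes(c)) == expected), None)
    if good is None:
        return 0, False  # no intact copy left to repair from
    repaired = 0
    for c in copies:
        if checksum(bytes(c)) != expected:
            c[:] = good  # rewrite the damaged copy in place
            repaired += 1
    return repaired, True


# Hypothetical sample data: two copies of one block, one with a flipped bit.
original = b"libdlpi.so.1 contents"
expected = checksum(original)
copies = [bytearray(original), bytearray(original)]
copies[1][0] ^= 0x01

first = scrub_block(copies, expected)   # repairs the damaged copy
second = scrub_block(copies, expected)  # finds nothing left to repair
```

In this toy model a repaired block stays repaired, so the second pass has
nothing to do. That is the behaviour the repeated repairs described above
depart from.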

Although I really don't think this is caused by cosmic rays,
are you also saying that PCs without ECC on memory and/or buses
will *never* experience a glitch? You obviously don't play the
lottery :-) [ZFS errors due to memory hits seem far more likely
than winning a 6 ball lottery for typical retail consumer loads]

On 09/02/09 06:54 PM, Tim Cook wrote:

Define "more systems".  How many people do you think are on 121?  And of

Absolutely no idea. Enough, though.

those, how many are on the zfs mailing list?  And of those, how many

Probably - all of them (yes, this is an unsubstantiated speculation).

have done a scrub recently to see the checksum errors?  Do you have some
proof to validate your beliefs?

If you had read the thread carefully, you would note that a scrub actually
clears the errors (but zcksummon shows that they really aren't cleared). And
doesn't the guide tell us to run scrubs frequently? I am sure we all dutifully
do so :-). I'd be quite happy to send you the proof.

REGARDLESS, had you read all the posts to this thread, you'd know you've
already been proven wrong:

Wrong about what? Reading posts before they are posted?

I have read every post most carefully. Having experienced checksum
failures on mirrored drives for 4 months now (and there's a CR
against snv115 for a similar problem), what exactly do you think I
am trying to prove, or what beliefs? After 4 months of hearing the
hardware being blamed for the checksum problem (which is easy to
reproduce against snv111b), all I'm doing is agreeing that it is
likely triggered by some kind of soft hardware glitch; we just
don't know what the glitch might be. The SPoFs on this machine
are the disk controller, the PCI bus, and memory (and CPU, of
course). Take your pick.

FWIW it always picks on SUNWcsl (libdlpi.so.1) - 3 or 4 times now,
and more recently, /usr/share/doc/SUNWmusicbrainz/COPYING.bz2.
I am skeptical that the disk controller is picking on certain
files, so that leaves memory and the bus. Take your pick. New
files get added to the list quite infrequently. But it could also
be a pure software bug - some kind of race condition, perhaps.

    On Wed, Sep 2, 2009 at 11:15 AM, Brent Jones <br...@servuhome.net> wrote:
    I see this issue on each of my X4540's, 64GB of ECC memory, 1TB drives.
    Rolling back to snv_118 does not reveal any checksum errors, only
    snv_121

    So, the commodity hardware here doesn't hold up, unless Sun isn't
    validating their equipment (not likely, as these servers have had no
    hardware issues prior to this build)

Exactly. My whole point. Glad to hear that Sun hardware is as reliable as
ever!  I hope Richard's new and improved zcksummon will shed more light
on this...

Cheers -- Frank
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
