It was someone from Sun that recently asked me to repost here about the checksum problem on mirrored drives. I was reluctant to do so because you and Bob might start flames again, and you did! You both sound very defensive, but of course I would never make an unsubstantiated speculation that you might have vulnerable hardware :-). But in case you do, please don't shoot the messenger...
Instead of being negative, how about some conjectures of your own about this? Here's a summary of what is happening: An old machine with mirrored drives and a suspect mobo (maybe not checking PCI parity) gets checksum errors on reboot and scrub. With copies=1 it fails to repair them. With copies=2 it apparently fixes them, but zcksummon shows quite clearly that zfs finds and repairs them again on every scrub, even though scrub shows no errors. Typically these files are system libraries, and unless you actually replace them, they are never truly repaired.

Although I really don't think this is caused by cosmic rays, are you also saying that PCs without ECC on memory and/or buses will *never* experience a glitch? You obviously don't play the lottery :-) [ZFS errors due to memory hits seem far more likely than winning a 6-ball lottery for typical retail consumer loads]

On 09/02/09 06:54 PM, Tim Cook wrote:
> Define "more systems". How many people do you think are on 121? And of
Absolutely no idea. Enough, though.
> those, how many are on the zfs mailing list? And of those, how many
Probably - all of them (yes, this is an unsubstantiated speculation).
> have done a scrub recently to see the checksum errors? Do you have some proof to validate your beliefs?
If you had read the thread carefully, you would note that a scrub actually clears the errors (but zcksummon shows that they really aren't cleared). And doesn't the guide tell us to run scrubs frequently? I am sure we all dutifully do so :-). I'd be quite happy to send you the proof.
> REGARDLESS, had you read all the posts to this thread, you'd know you've already been proven wrong:
Wrong about what? Reading posts before they are posted? I have read every post most carefully. Having experienced checksum failures on mirrored drives for 4 months now (and there's a CR against snv115 for a similar problem), what exactly do you think I am trying to prove, or what beliefs?

After 4 months of hearing the hardware being blamed for the checksum problem (which is easy to reproduce against snv111b), all I'm doing is agreeing that it is likely triggered by some kind of soft hardware glitch; we just don't know what the glitch might be. The SPoFs on this machine are the disk controller, the PCI bus, and memory (and CPU, of course). Take your pick.

FWIW it always picks on SUNWcsl (libdlpi.so.1), 3 or 4 times now, and more recently on /usr/share/doc/SUNWmusicbrainz/COPYING.bz2. I am skeptical that the disk controller is picking on certain files, so that leaves memory and the bus. New files get added to the list quite infrequently. But it could also be a pure software bug - some kind of race condition, perhaps.
> On Wed, Sep 2, 2009 at 11:15 AM, Brent Jones <br...@servuhome.net> wrote:
> > I see this issue on each of my X4540's, 64GB of ECC memory, 1TB drives.
> > Rolling back to snv_118 does not reveal any checksum errors, only snv_121.
> > So, the commodity hardware here doesn't hold up, unless Sun isn't validating their equipment (not likely, as these servers have had no hardware issues prior to this build)
Exactly. My whole point. Glad to hear that Sun hardware is as reliable as ever! I hope Richard's new and improved zcksummon will shed more light on this...

Cheers -- Frank

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss