On Oct 19, 2011, at 1:52 PM, Richard Elling wrote:
> On Oct 18, 2011, at 5:21 PM, Edward Ned Harvey wrote:
>
>>> From: [email protected] [mailto:zfs-discuss-
>>> [email protected]] On Behalf Of Tim Cook
>>>
>>> I have had, and still have, redundant storage, and it has *NEVER*
>>> automatically fixed anything. You're the first person I've heard of who
>>> has had it automatically fix it.
>>
>> That's probably just because it's normal and expected behavior to
>> automatically fix it - I always have redundancy, and every cksum error I
>> ever find is always automatically fixed. I never tell anyone here because
>> it's normal and expected.
>
> Yes, and in fact the automated tests for ZFS developers intentionally
> corrupt data so that the repair code can be tested. Also, the same
> checksum code is used to calculate the checksum when writing and when
> reading.
>
>> If you have redundancy, and cksum errors, and it's not automatically fixed,
>> then you should report the bug.
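[For anyone new to the thread: the detect-and-repair behavior being described can be sketched in miniature. This is a hypothetical Python model of a two-way mirror, not actual ZFS code; `cksum`, `MirrorVdev`, and the block-id scheme are made up for illustration. The point is the mechanism: the checksum is computed at write time, verified with the same function at read time, and a copy that fails verification is rewritten from a copy that passes.]

```python
import hashlib

def cksum(data: bytes) -> bytes:
    # Stand-in for ZFS's block checksum (fletcher4 or sha256 in real pools).
    return hashlib.sha256(data).digest()

class MirrorVdev:
    """Toy two-way mirror: every block is written to both sides."""
    def __init__(self):
        self.sides = [{}, {}]   # block id -> raw data, one dict per side
        self.checksums = {}     # block id -> checksum kept in the block pointer

    def write(self, bid, data):
        self.checksums[bid] = cksum(data)   # checksum computed at write time
        for side in self.sides:
            side[bid] = data

    def corrupt(self, side, bid, data):
        # Simulate silent corruption on one side, as the developer tests do.
        self.sides[side][bid] = data

    def read(self, bid):
        # Verify with the same checksum code used at write time.
        for i, side in enumerate(self.sides):
            data = side[bid]
            if cksum(data) == self.checksums[bid]:
                repaired = False
                for j, other in enumerate(self.sides):
                    if j != i and cksum(other[bid]) != self.checksums[bid]:
                        other[bid] = data      # self-heal the bad copy
                        repaired = True
                return data, repaired
        raise IOError("unrecoverable: every copy fails its checksum")
```

A read that finds one bad copy returns the good data and quietly rewrites the bad side; only when *every* copy fails verification is the error unrecoverable, which is the case the rest of this thread is about.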
>
> For modern Solaris-based implementations, each checksum mismatch that is
> repaired reports the bitmap of the corrupted vs. expected data. Obviously,
> if the data cannot be repaired, you cannot know the expected data, so the
> error is reported without identification of the broken bits.
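[In case the "bitmap" wording is unclear: it is just the bitwise difference between the block as read and the block as reconstructed from redundancy. A minimal sketch, and my own illustration rather than the actual ereport code:]

```python
def bad_bits(expected: bytes, actual: bytes) -> bytes:
    """Bitmap of the broken bits: a 1 wherever the data as read differs
    from the data reconstructed from redundancy. Only computable when the
    block was repairable, i.e. when the expected data is known."""
    assert len(expected) == len(actual)
    return bytes(e ^ a for e, a in zip(expected, actual))
```

When the block cannot be reconstructed there is no `expected` to XOR against, which is exactly why unrecoverable errors are reported without identifying the broken bits.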
>
> In the archives, you can find reports of recoverable and unrecoverable errors
> attributed to:
> 1. ZFS software (rare, but a bug a few years ago mishandled a raidz
> case)
> 2. SAN switch firmware
> 3. "Hardware" RAID array firmware
> 4. Power supplies
> 5. RAM
> 6. HBA
> 7. PCI-X bus
> 8. BIOS settings
> 9. CPU and chipset errata
>
> Personally, I've seen all of the above except #7, because PCI-X hardware is
> hard to find now.
I've seen #7. I have some PCI-X hardware that is flaky in my home lab. ;-)
There was a case of #1 not very long ago, but it was a difficult-to-trigger
race and it is fixed in illumos and, I presume, in other derivatives
(including NexentaStor).
- Garrett
>
> If you consistently see unrecoverable data on a system that has protected
> data, then there may be an issue with a part of the system that is a
> single point of failure (SPOF). Very, very, very few x86 systems are
> designed with no SPOF.
> -- richard
>
> --
>
> ZFS and performance consulting
> http://www.RichardElling.com
> VMworld Copenhagen, October 17-20
> OpenStorage Summit, San Jose, CA, October 24-27
> LISA '11, Boston, MA, December 4-9
>
> _______________________________________________
> zfs-discuss mailing list
> [email protected]
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss