Re: [zfs-discuss] How recoverable is an 'unrecoverable error'?

Tim Fri, 17 Apr 2009 10:38:39 -0700

On Fri, Apr 17, 2009 at 12:25 PM, Richard Elling
<richard.ell...@gmail.com> wrote:


> Drew Balfour wrote:
>
>>>>>> Now I wonder where that error came from. It was just a single checksum
>>>>>> error. It couldn't go away with an earlier scrub, and seemingly left no
>>>>>> traces of badness on the drive. Something serious? At least it looks a
>>>>>> tad contradictory: "Applications are unaffected.", it is unrecoverable,
>>>>>> and once cleared, there is no error left.
>>>>>>
>>
>> What happens if you rescrub the pool after clearing the errors? If zfs has
>> reused whatever was causing the issue, then it shouldn't be surprising that
>> the error will show up again.
>>
>
> Are you assuming that bad disk blocks are returned to the free pool?
> This is more of a problem for file systems with pre-allocated metadata,
> such as UFS.  In UFS, if a sector in a superblock copy goes bad, it
> will still be reused.  In ZFS, metadata is COW and redundant, so
> there is no forced re-use of disk blocks (except for the uberblocks
> which are 4x redundant and use 128-slot circular queues).
>
>
>>> Could you propose alternate wording?
>>>
>>
>> My $.02, but the wording in the error message is rather obtuse.
>> "Unrecoverable error" indicates to me that something was lost; technically
>> this is true, but zfs was able to replicate the data from another source.
>> This is not all that clear from the error:
>>
>> status: One or more devices has experienced an unrecoverable error.  An
>>        attempt was made to correct the error.  Applications are unaffected.
>>
>> This doesn't indicate if the attempt was successful or not. We all know it
>> was, because if it wasn't, we'd (a) see another error instead and/or (b)
>> see something other than "errors: No known data errors". But, unless you
>> know zfs well enough to make that leap, you're left wondering what actually
>> happened.
>>
>> Granted, the 'verbose' error page (http://www.sun.com/msg/ZFS-8000-9P)
>> does a much better job of explaining. However, confusing terse error
>> messages are never good, and asking the user to go look stuff up in order to
>> understand isn't good either. The verbose error page also doesn't
>> explain that despite not having a replicated configuration, metadata is
>> replicated and so errors can be recovered from a seemingly 'unrecoverable'
>> state.
>>
>> Does anyone know why it's "applications" and not "data"?
>>
>> Perhaps something like:
>>
>> status: One or more devices has experienced an error. A successful
>>        attempt to correct the error was made using a replicated copy of
>>        the data. Data on the pool is unaffected.
>>
>
> I think this is on the right track.  But the repair method, "replicated
> copy of the data," should be more vague because there are other ways to
> repair data.
>
> Does anyone else have better wording?
> -- richard
>
>

Unless you want to have a different response for each of the repair methods,
I'd just drop that part:

status: One or more devices has experienced an error. The error has been
       automatically corrected by zfs.
       Data on the pool is unaffected.


I suppose you could do a "for more information please contact Sun" or
something along those lines as well?
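
In the meantime, the leap Drew describes above (an attempt was made, and the
report still ends with "errors: No known data errors", so the repair must
have worked) can be sketched as a small check over captured `zpool status`
text. This is only a rough illustration: the function name and the sample
report are made up for the example, and the only strings it matches on are
the two quoted earlier in this thread.

```python
def repair_succeeded(status_text):
    """Hypothetical helper: guess whether ZFS repaired the reported error.

    Assumes the report wording quoted in this thread; other ZFS releases
    may phrase these lines differently.
    """
    lines = [line.strip() for line in status_text.splitlines()]
    # Flatten the wrapped "status:" paragraph so the phrase matches even
    # when it is split across lines or padded with double spaces.
    flat = " ".join(" ".join(lines).split())
    attempted = "attempt was made to correct the error" in flat
    clean = "errors: No known data errors" in lines
    return attempted and clean


# Sample report assembled from the two messages quoted in this thread.
SAMPLE = """status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
errors: No known data errors
"""

print(repair_succeeded(SAMPLE))
```

If the final line reported data errors instead, the function would return
False, which is the case the terse message never spells out.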

--Tim

(my reply-to-all skills have been suffering lately, sorry Richard).

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
