On Dec 21, 2011, at 11:45 AM, Gareth de Vaux wrote:

> Hi guys, after a scrub my raidz array status showed:
> 
> # zpool status
>  pool: pool
> state: ONLINE
> status: One or more devices has experienced an unrecoverable error.  An
>        attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>        using 'zpool clear' or replace the device with 'zpool replace'.
>   see: http://www.sun.com/msg/ZFS-8000-9P
> scan: scrub repaired 85.5K in 1h21m with 0 errors on Mon Dec 19 06:24:25 2011
> config:
> 
>        NAME        STATE     READ WRITE CKSUM
>        pool        ONLINE       0     0     0
>          raidz1-0  ONLINE       0     0     0
>            ad18    ONLINE       0     0     1
>            ad19    ONLINE       0     0     0
>            ad10    ONLINE       0     0     1
>            ad4     ONLINE       0     0     0
> 
> errors: No known data errors
> 
> 
> I assume the checksum counts are current and irreconcilable. (Why does
> the scan say 'repaired with 0 errors' then?).
> 
> What should one do at this point?

Be happy. Dance a jig. Buy a lottery ticket.
Notice: scrub repaired 85.5K in 1h21m with 0 errors on Mon Dec 19 06:24:25 2011
ZFS found corruption and fixed it.
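
To make the reading of that output concrete, here is a minimal sketch (not a ZFS tool) that pulls the two relevant facts out of `zpool status` text: which devices show a nonzero CKSUM count, and what the scan line says was repaired. The function names and the embedded SAMPLE text are assumptions for illustration; the sample rows are copied from the status output quoted above. It also assumes plain integer counters, so abbreviated counts would need extra handling.

```python
import re

# Sample rows copied from the quoted 'zpool status' output.
SAMPLE = """\
scan: scrub repaired 85.5K in 1h21m with 0 errors on Mon Dec 19 06:24:25 2011
config:

       NAME        STATE     READ WRITE CKSUM
       pool        ONLINE       0     0     0
         raidz1-0  ONLINE       0     0     0
           ad18    ONLINE       0     0     1
           ad19    ONLINE       0     0     0
           ad10    ONLINE       0     0     1
           ad4     ONLINE       0     0     0
"""


def devices_with_checksum_errors(status_text):
    """Return {device: cksum_count} for config rows with a nonzero CKSUM."""
    flagged = {}
    for line in status_text.splitlines():
        # Drop an optional mail-quote prefix, then split on whitespace.
        fields = line.lstrip("> ").split()
        # A config row has five columns: NAME STATE READ WRITE CKSUM.
        # Assumption: the three counters are plain integers.
        if len(fields) == 5 and all(f.isdigit() for f in fields[2:]):
            cksum = int(fields[4])
            if cksum > 0:
                flagged[fields[0]] = cksum
    return flagged


def scan_summary(status_text):
    """Return (repaired_amount, error_count) from the scrub line, or None."""
    m = re.search(r"scan: scrub repaired (\S+) in \S+ with (\d+) errors",
                  status_text)
    return (m.group(1), int(m.group(2))) if m else None


if __name__ == "__main__":
    print(devices_with_checksum_errors(SAMPLE))
    print(scan_summary(SAMPLE))
```

The per-device counters record that bad checksums were seen; the scan line records that the scrub finished with 0 errors it could not repair. Both can be true at once, which is the situation in the output above.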

> 
> I rebooted and ran another scrub, this time it came up with 0 errors
> and 0 checksum counts, what does that mean?

ZFS found corruption and fixed it.

> 
> I then backed up the array, kicked out ad18 and resilvered it from scratch:

oops... tempting the fates?
Transient errors do occur, frequently. Not all errors are persistent or fatal.
Given the information presented here, IMHO, this system did not warrant
further action.

> 
> # zpool status
>  pool: pool
> state: DEGRADED
> status: One or more devices has experienced an error resulting in data
>        corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
>        entire pool from backup.
>   see: http://www.sun.com/msg/ZFS-8000-8A
> scan: resilvered 218G in 1h25m with 14 errors on Wed Dec 21 14:48:47 2011
> config:
> 
>        NAME             STATE     READ WRITE CKSUM
>        pool             DEGRADED     0     0    14
>          raidz1-0       DEGRADED     0     0    28
>            replacing-0  OFFLINE      0     0     0
>              ad18/old   OFFLINE      0     0     0
>              ad18       ONLINE       0     0     0
>            ad19         ONLINE       0     0     0
>            ad10         ONLINE       0     0     0
>            ad4          ONLINE       0     0     0
> 
> errors: 11 data errors, use '-v' for a list
> 
> 
> and 'zpool status -v' gives me a list of affected files.
> 
> I assume I delete those files, then follow the same procedure on ad10?
> 
> 
> # uname -a
> FreeBSD file 8.2-STABLE FreeBSD 8.2-STABLE #0: Sat Nov 12 17:51:22 SAST 2011 root@file:/usr/obj/usr/src/sys/COWNEL  amd64
> 
> ZFS filesystem version 5
> ZFS storage pool version 28
> 
> 
> ps. I did get 1 disk alert in the logs during this whole process, half an
> hour before resilvering:
> 
> Dec 21 12:41:48 file kernel: ad10: WARNING - READ_DMA48 UDMA ICRC error (retrying request) LBA=306763504
> Dec 21 12:41:48 file kernel: ad10: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=306763504

This appears to be a [S]ATA error generated by the drive. If LBA 306763504 is
a legal LBA, then this can be one of the factors contributing to the original
checksum error.
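
Whether an LBA is "legal" comes down to a range check against the drive's sector count. A minimal sketch follows, under loudly stated assumptions: 512-byte logical sectors, and a purely hypothetical total sector count, since the size of ad10 is not given anywhere in this thread. On FreeBSD the real figure should be available from `diskinfo -v` for the device (the "mediasize in sectors" line). The function and constant names are made up for illustration.

```python
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors

REPORTED_LBA = 306763504  # from the kernel log lines quoted above

# HYPOTHETICAL value for illustration only; the thread does not state
# the size of ad10. Substitute the drive's real sector count.
HYPOTHETICAL_TOTAL_SECTORS = 488397168


def is_legal_lba(lba, total_sectors):
    """An LBA is addressable if it lies in [0, total_sectors)."""
    return 0 <= lba < total_sectors


def lba_byte_offset(lba, sector_bytes=SECTOR_BYTES):
    """Byte offset of the first byte of the given sector."""
    return lba * sector_bytes


if __name__ == "__main__":
    print(is_legal_lba(REPORTED_LBA, HYPOTHETICAL_TOTAL_SECTORS))
    print(lba_byte_offset(REPORTED_LBA))
```

If the reported LBA falls inside the drive's range, the failed read targeted a real sector and is a plausible contributor to a checksum error; if it falls outside, the request itself was malformed somewhere along the path.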
 -- richard

-- 

ZFS and performance consulting
http://www.RichardElling.com

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
