[this seems to be the question of the day, today...]

On Apr 14, 2010, at 2:57 AM, bonso wrote:

> Hi all,
> I recently experienced a disk failure on my home server and observed checksum 
> errors while resilvering the pool and on the first scrub after the resilver 
> had completed. Now everything seems fine, but I'm posting this to get help 
> calming my nerves and detecting any possible future faults.
> 
> Let's start with some specs.
> OSOL 2009.06
> Intel SASUC8i (w LSI 1.30IT FW)
> Gigabyte MA770-UD3 mobo w 8GB ECC RAM
> Hitachi P7K500 harddrives
> 
> When checking the condition of my pool some days ago (yes, I should make it 
> mail me if something like this happens again), one disk in my pool was labeled 
> as "Removed" with a small number of read errors, nine-ish I think; all other 
> disks were fine. I removed and tested the disk (DFT crashed, so the disk 
> seemed very broken), replaced the drive, and started a resilver.
> 
> Checking the status of the resilver, everything looked good from the start, 
> but when it was finished the status report looked like this:
>  pool: sasuc8i
> state: ONLINE
> status: One or more devices has experienced an unrecoverable error.  An
>       attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>       using 'zpool clear' or replace the device with 'zpool replace'.
>   see: http://www.sun.com/msg/ZFS-8000-9P
> scrub: resilver completed after 4h9m with 0 errors on Mon Apr 12 18:12:26 2010
> config:
> 
>       NAME         STATE     READ WRITE CKSUM
>       sasuc8i      ONLINE       0     0     0
>         raidz2     ONLINE       0     0     0
>           c12t4d0  ONLINE       0     0     5  108K resilvered
>           c12t8d0  ONLINE       0     0     0  254G resilvered
>           c12t6d0  ONLINE       0     0     0
>           c12t7d0  ONLINE       0     0     0
>           c12t0d0  ONLINE       0     0     1  21.5K resilvered
>           c12t1d0  ONLINE       0     0     2  43K resilvered
>           c12t2d0  ONLINE       0     0     4  86K resilvered
>           c12t3d0  ONLINE       0     0     1  21.5K resilvered
> 
> errors: No known data errors
> 
> All I really cared about at this point was the "Applications are unaffected" 
> and "No known data errors", and I thought that the checksum errors might be 
> down to the failing drive (c12t5d0 failed, the controller labeled the new 
> drive as c12t8d0) going out during a write. Then again, ZFS is atomic, so I 
> figured I had better clear the errors and run a scrub. It came out like this: 
>  pool: sasuc8i
> state: ONLINE
> status: One or more devices has experienced an unrecoverable error.  An
>       attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>       using 'zpool clear' or replace the device with 'zpool replace'.
>   see: http://www.sun.com/msg/ZFS-8000-9P
> scrub: scrub completed after 1h16m with 0 errors on Tue Apr 13 01:29:32 2010
> config:
> 
>       NAME         STATE     READ WRITE CKSUM
>       sasuc8i      ONLINE       0     0     0
>         raidz2     ONLINE       0     0     0
>           c12t4d0  ONLINE       0     0     5
>           c12t8d0  ONLINE       0     0     0
>           c12t6d0  ONLINE       0     0     0
>           c12t7d0  ONLINE       0     0     4  86K repaired
>           c12t0d0  ONLINE       0     0     1
>           c12t1d0  ONLINE       0     0     6  86K repaired
>           c12t2d0  ONLINE       0     0     4
>           c12t3d0  ONLINE       0     0     6  108K repaired
> 
> errors: No known data errors
> 
> Now I'm getting nervous. Checksum errors, some repaired, others not. Am I 
> going to end up with multiple drive failures or what the * is going on here?

When I see many disks suddenly reporting errors, I suspect a common
element: HBA, cables, backplane, mobo, CPU, power supply, etc.

If you search the zfs-discuss archives you can find instances where
HBA firmware, driver issues, or firmware+driver interactions caused
such reports. Cabling and power supplies are less commonly reported.
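
As a rough illustration of that rule of thumb, here is a minimal Python
sketch that reads the config table from 'zpool status' output and reports
how many devices show nonzero error counters. The function names
(error_counts, suspect_common_element) and the threshold of two devices
are assumptions made for this example, not part of ZFS or any ZFS tool,
and the parser only handles plain integer counters as shown in the
output above.

```python
import re

# Matches a config row such as "c12t4d0  ONLINE  0  0  5  108K resilvered".
# Assumption: counters are plain integers; abbreviated values (e.g. "1.2K")
# are not handled by this sketch and such rows are simply skipped.
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|REMOVED|UNAVAIL)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)"
)

def error_counts(status_text):
    """Return {name: (read, write, cksum)} for rows with any nonzero counter."""
    counts = {}
    for line in status_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header, blank, or non-table line
        name, _state, read, write, cksum = m.groups()
        errs = (int(read), int(write), int(cksum))
        if any(errs):
            counts[name] = errs
    return counts

def suspect_common_element(counts, threshold=2):
    """Hypothetical heuristic: errors on several devices at once point at
    something shared (HBA, cables, backplane, power) rather than one disk."""
    return len(counts) >= threshold

# Config rows copied from the first status report in this thread.
SAMPLE = """\
      NAME         STATE     READ WRITE CKSUM
      sasuc8i      ONLINE       0     0     0
        raidz2     ONLINE       0     0     0
          c12t4d0  ONLINE       0     0     5  108K resilvered
          c12t8d0  ONLINE       0     0     0  254G resilvered
          c12t6d0  ONLINE       0     0     0
          c12t7d0  ONLINE       0     0     0
          c12t0d0  ONLINE       0     0     1  21.5K resilvered
          c12t1d0  ONLINE       0     0     2  43K resilvered
          c12t2d0  ONLINE       0     0     4  86K resilvered
          c12t3d0  ONLINE       0     0     1  21.5K resilvered
"""

counts = error_counts(SAMPLE)
print(sorted(counts), suspect_common_element(counts))
```

On the first report in this thread, five of the eight disks carry checksum
errors, which is the pattern that makes me look past the individual
drives.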

> I ran one more scrub and everything came up roses.
> I checked the SMART status on the drives with checksum errors and they are 
> fine, although I expect only read/write errors would show up there.
> 
> I'm not sure how to turn this into a proper question, but what I'm after is 
> "is this normal to be expected after a resilver and can I start breathing 
> again?". Checksum errors are, as far as I can gather, dodgy data on disk, and 
> read/write errors are somewhere in the physical link (more or less).

Breathing is good.  Then check your firmware releases.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 




