[this seems to be the question of the day, today...]

On Apr 14, 2010, at 2:57 AM, bonso wrote:
> Hi all,
> I recently experienced a disk failure on my home server and observed checksum
> errors while resilvering the pool and on the first scrub after the resilver
> had completed. Now everything seems fine but I'm posting this to get help
> with calming my nerves and detecting any possible future faults.
>
> Let's start with some specs.
> OSOL 2009.06
> Intel SASUC8i (w LSI 1.30IT FW)
> Gigabyte MA770-UD3 mobo w 8GB ECC RAM
> Hitachi P7K500 harddrives
>
> When checking the condition of my pool some days ago (yes I should make it
> mail me if something like this happens again) one disk in my pool was labeled
> as "Removed" with a small number of read errors, nineish I think; all other
> disks were fine. I removed it, tested it (DFT crashed so the disk seemed very
> broken), replaced the drive and started a resilver.
>
> Checking the status of the resilver everything looked good from the start but
> when it was finished the status report looked like this:
>   pool: sasuc8i
>  state: ONLINE
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: resilver completed after 4h9m with 0 errors on Mon Apr 12 18:12:26 2010
> config:
>
>         NAME         STATE     READ WRITE CKSUM
>         sasuc8i      ONLINE       0     0     0
>           raidz2     ONLINE       0     0     0
>             c12t4d0  ONLINE       0     0     5  108K resilvered
>             c12t8d0  ONLINE       0     0     0  254G resilvered
>             c12t6d0  ONLINE       0     0     0
>             c12t7d0  ONLINE       0     0     0
>             c12t0d0  ONLINE       0     0     1  21.5K resilvered
>             c12t1d0  ONLINE       0     0     2  43K resilvered
>             c12t2d0  ONLINE       0     0     4  86K resilvered
>             c12t3d0  ONLINE       0     0     1  21.5K resilvered
>
> errors: No known data errors
>
> All I really cared about at this point was the "Applications are unaffected"
> and "No known data errors" and I thought that the checksum errors might be
> down to the failing drive (c12t5d0 failed, the controller labeled the new
> drive as c12t8d0) going out during a write. Then again ZFS is atomic, better
> clear the errors and run a scrub. It came out like this:
>   pool: sasuc8i
>  state: ONLINE
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: scrub completed after 1h16m with 0 errors on Tue Apr 13 01:29:32 2010
> config:
>
>         NAME         STATE     READ WRITE CKSUM
>         sasuc8i      ONLINE       0     0     0
>           raidz2     ONLINE       0     0     0
>             c12t4d0  ONLINE       0     0     5
>             c12t8d0  ONLINE       0     0     0
>             c12t6d0  ONLINE       0     0     0
>             c12t7d0  ONLINE       0     0     4  86K repaired
>             c12t0d0  ONLINE       0     0     1
>             c12t1d0  ONLINE       0     0     6  86K repaired
>             c12t2d0  ONLINE       0     0     4
>             c12t3d0  ONLINE       0     0     6  108K repaired
>
> errors: No known data errors
>
> Now I'm getting nervous. Checksum errors, some repaired others not. Am I
> going to end up with multiple drive failures or what the * is going on here?

When I see many disks suddenly reporting errors, I suspect a common element:
HBA, cables, backplane, mobo, CPU, power supply, etc.
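As a rough illustration of that rule of thumb, here is a minimal Python sketch
that reads `zpool status` text and flags the case where several disks show
errors at once. The function names, the two-disk threshold, and the
Solaris-style c#t#d# pattern used to pick out disks are all assumptions made
for this example; the sample text is the scrub report quoted above. It only
handles plain integer counters, so abbreviated counts would need extra parsing.

```python
import re

# A row of the "config:" table: NAME STATE READ WRITE CKSUM [optional note].
# Only plain integer counters are handled in this sketch.
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|REMOVED|UNAVAIL)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)"
)

# Assumption: leaf disks use Solaris-style names such as c12t4d0.
LEAF = re.compile(r"^c\d+t\d+d\d+(s\d+)?$")


def parse_error_counts(status_text):
    """Return {name: (read, write, cksum)} for each row of the config table."""
    counts = {}
    for line in status_text.splitlines():
        line = line.lstrip("> ")  # tolerate quoted e-mail text
        match = ROW.match(line)
        if match:
            name, _state, rd, wr, ck = match.groups()
            counts[name] = (int(rd), int(wr), int(ck))
    return counts


def devices_with_errors(counts):
    """Names of leaf disks with any non-zero error counter, sorted."""
    return sorted(n for n, c in counts.items() if LEAF.match(n) and any(c))


def suspect_common_element(counts, min_devices=2):
    """True when errors are spread over at least min_devices disks."""
    return len(devices_with_errors(counts)) >= min_devices


SAMPLE = """\
  pool: sasuc8i
 state: ONLINE
config:

        NAME         STATE     READ WRITE CKSUM
        sasuc8i      ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            c12t4d0  ONLINE       0     0     5
            c12t8d0  ONLINE       0     0     0
            c12t6d0  ONLINE       0     0     0
            c12t7d0  ONLINE       0     0     4  86K repaired
            c12t0d0  ONLINE       0     0     1
            c12t1d0  ONLINE       0     0     6  86K repaired
            c12t2d0  ONLINE       0     0     4
            c12t3d0  ONLINE       0     0     6  108K repaired
"""

if __name__ == "__main__":
    counts = parse_error_counts(SAMPLE)
    print(devices_with_errors(counts))
    print(suspect_common_element(counts))
```

Run from cron against the real `zpool status` output, something like this
could also cover the "make it mail me" wish; the mailing part is left out here.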
If you search the zfs-discuss archives you can find instances where HBA
firmware, driver issues, or firmware+driver interactions caused such reports.
Cabling and power supplies are less commonly reported.

> Ran one more scrub and everything came up roses.
> Checked SMART status on the drives with checksum errors and they are fine,
> although I expect only read/write errors would show up there.
>
> I'm not sure of how to get this into a proper question but what I'm after is
> "is this normal to be expected after a resilver and can I start breathing
> again?". Checksum errors are as far as I can gather dodgy data on disk and
> read/write somewhere in the physical link (more or less).

Breathing is good. Then check your firmware releases.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss