Hi Justin,

This looks like an older Solaris 10 release. If so, you are probably hitting
a known zpool status display bug, where the checksum errors appear to be
charged to the replacement device even though they are not actually
occurring there.

I would review the steps described in the hardware section of the ZFS
troubleshooting wiki to confirm that the new disk is working as
expected:

http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide

Then follow the steps in the "Notify FMA That Device Replacement is Complete"
section to reset FMA, and start monitoring the replacement device
with fmdump to see whether any new error activity is reported for it.
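Roughly, that amounts to something like the following (the UUID is whatever
fmadm faulty reports on your system, and the pool/device names are from your
zpool status output):

# fmadm faulty                  # list outstanding faults and their UUIDs
# fmadm repair <UUID>           # tell FMA the faulted device has been replaced
# zpool clear tank c1t6d0       # reset the ZFS error counters on the new disk
# fmdump -eV                    # then watch for new error events on this device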

Thanks,

Cindy


On 06/21/10 10:21, Justin Daniel Meyer wrote:
I've decided to upgrade my home server capacity by replacing the disks in one 
of my mirror vdevs.  The procedure appeared to work out, but during the resilver a 
couple million checksum errors were logged on the new device. I've read through 
quite a bit of the archive and searched around, but cannot find anything 
definitive to ease my mind on whether to proceed.
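For reference, the swap itself was an in-place replace, roughly along these
lines (from memory, so the exact invocation may have differed slightly):

# zpool replace tank c1t6d0     (after swapping the old 640 GB disk for the new 1 TB drive in the same bay)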


SunOS deepthought 5.10 Generic_142901-13 i86pc i386 i86pc

  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h0m, 0.00% done, 691h28m to go
config:

        NAME              STATE     READ WRITE CKSUM
        tank              DEGRADED     0     0     0
          mirror          DEGRADED     0     0     0
            replacing     DEGRADED   215     0     0
              c1t6d0s0/o  FAULTED      0     0     0  corrupted data
              c1t6d0      ONLINE       0     0   215  3.73M resilvered
            c1t2d0        ONLINE       0     0     0
          mirror          ONLINE       0     0     0
            c1t1d0        ONLINE       0     0     0
            c1t5d0        ONLINE       0     0     0
          mirror          ONLINE       0     0     0
            c1t0d0        ONLINE       0     0     0
            c1t4d0        ONLINE       0     0     0
        logs
          c8t1d0p1        ONLINE       0     0     0
        cache
          c2t1d0p2        ONLINE       0     0     0


During the resilver, the cache device and the log (ZIL) device were both removed 
for errors (1-2k each).  (Despite the c2/c8 discrepancy, they are partitions on 
the same OCZ Vertex II device.)


# zpool status -xv tank
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 9h20m with 0 errors on Sat Jun 19 22:07:27 2010
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          mirror    ONLINE       0     0     0
            c1t6d0  ONLINE       0     0 2.69M  539G resilvered
            c1t2d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c1t5d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c1t4d0  ONLINE       0     0     0
        logs
          c8t1d0p1  REMOVED      0     0     0
        cache
          c2t1d0p2  REMOVED      0     0     0

I cleared the errors (about 5,000 per GB resilvered!), removed the cache device, and 
replaced the ZIL partition with the whole device.  After three pool scrubs with no 
errors, I want a second opinion on whether it looks okay to replace the 
second drive in this mirror vdev.  The one thing I have not tried is a large 
file transfer to the server, as I am also dealing with an NFS mount problem 
that popped up suspiciously close to my most recent patch update.
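For reference, the cleanup amounted to roughly the following, using the device
names shown above (again from memory, so treat the exact syntax as approximate):

# zpool clear tank                          clear the accumulated checksum errors
# zpool remove tank c2t1d0p2                remove the cache partition
# zpool replace tank c8t1d0p1 c0t0d0        replace the ZIL partition with the whole device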


# zpool status -v tank
  pool: tank
 state: ONLINE
 scrub: scrub completed after 3h26m with 0 errors on Mon Jun 21 01:45:00 2010
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t6d0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c1t5d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c1t4d0  ONLINE       0     0     0
        logs
          c0t0d0    ONLINE       0     0     0

errors: No known data errors


/var/adm/messages is positively overrun with these triplets/quadruplets, not all of 
which end up as the "fatal" type.


Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci1043,8...@7/d...@1,0 (sd14):
Jun 19 21:43:19 deepthought     Error for Command: write(10)    Error Level: Retryable
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Requested Block: 26721062    Error Block: 26721062
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Vendor: ATA    Serial Number:
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Sense Key: Aborted Command
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       ASC: 0x8 (LUN communication failure), ASCQ: 0x0, FRU: 0x0
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci1043,8...@7/d...@1,0 (sd14):
Jun 19 21:43:19 deepthought     Error for Command: write(10)    Error Level: Retryable
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Requested Block: 26721062    Error Block: 26721062
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Vendor: ATA    Serial Number:
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Sense Key: Aborted Command
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci1043,8...@7/d...@1,0 (sd14):
Jun 19 21:43:19 deepthought     Error for Command: write(10)    Error Level: Fatal
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Requested Block: 26721062    Error Block: 26721062
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Vendor: ATA    Serial Number:
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Sense Key: Aborted Command


In the past this kern.notice ID has come up as "informational" for others, 
and in my case it _only_ occurred during the initial resilver.  One last point of 
interest: the new drive is a WD Green WD10EARS, and the old ones are WD Green WD6400AACS 
drives (all of which I have tested on another system with the WD read-test utility).  I know 
these drives get their share of ridicule (and occasional praise/satisfaction), but I'd 
appreciate any thoughts on proceeding with the mirror upgrade.  [Backups: check.]

Justin
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
