I've decided to upgrade my home server capacity by replacing the disks in one of my mirror vdevs. The procedure appeared to work out, but during the resilver, a couple million checksum errors were logged on the new device. I've read through quite a bit of the archive and searched around, but cannot find anything definitive to ease my mind about whether to proceed.
SunOS deepthought 5.10 Generic_142901-13 i86pc i386 i86pc

  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h0m, 0.00% done, 691h28m to go
config:

        NAME              STATE     READ WRITE CKSUM
        tank              DEGRADED     0     0     0
          mirror          DEGRADED     0     0     0
            replacing     DEGRADED   215     0     0
              c1t6d0s0/o  FAULTED      0     0     0  corrupted data
              c1t6d0      ONLINE       0     0   215  3.73M resilvered
            c1t2d0        ONLINE       0     0     0
          mirror          ONLINE       0     0     0
            c1t1d0        ONLINE       0     0     0
            c1t5d0        ONLINE       0     0     0
          mirror          ONLINE       0     0     0
            c1t0d0        ONLINE       0     0     0
            c1t4d0        ONLINE       0     0     0
        logs
          c8t1d0p1        ONLINE       0     0     0
        cache
          c2t1d0p2        ONLINE       0     0     0

During the resilver, the cache device and the ZIL were both removed for errors (1-2k each). (Despite the c2/c8 discrepancy, they are partitions on the same OCZ Vertex II device.)

# zpool status -xv tank
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 9h20m with 0 errors on Sat Jun 19 22:07:27 2010
config:

        NAME          STATE     READ WRITE CKSUM
        tank          DEGRADED     0     0     0
          mirror      ONLINE       0     0     0
            c1t6d0    ONLINE       0     0 2.69M  539G resilvered
            c1t2d0    ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c1t1d0    ONLINE       0     0     0
            c1t5d0    ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c1t0d0    ONLINE       0     0     0
            c1t4d0    ONLINE       0     0     0
        logs
          c8t1d0p1    REMOVED      0     0     0
        cache
          c2t1d0p2    REMOVED      0     0     0

I cleared the errors (about 5000/GB resilvered!), removed the cache device, and replaced the ZIL partition with the whole device. After 3 pool scrubs with no errors, I want to check with someone else that it appears okay to replace the second drive in this mirror vdev.
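For reference, the recovery steps amounted to roughly the sequence below. This is a dry-run sketch, not a transcript: the `zpool` calls are shadowed so the commands are printed rather than executed, `NEWDISK` is a placeholder for the second replacement drive, and the exact command used to swap the log partition for the whole device may have differed.

```shell
#!/bin/sh
# Dry-run sketch: shadow zpool so each command is printed, not executed.
zpool() { echo "zpool $*"; }

NEWDISK=c1t3d0                         # placeholder, not a real device name

zpool clear tank                       # reset the accumulated checksum counters
zpool remove tank c2t1d0p2             # drop the errored L2ARC cache partition
zpool replace tank c8t1d0p1 c0t0d0     # swap the ZIL partition for the whole SSD
zpool scrub tank                       # verify clean scrubs before proceeding
zpool replace tank c1t2d0 "$NEWDISK"   # then upgrade the mirror's second disk
```

On a real pool you would drop the `zpool()` wrapper and substitute the actual replacement device for `NEWDISK`, checking `zpool status` between steps.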
The one thing I have not tried is a large file transfer to the server, as I am also dealing with an NFS mount problem which popped up suspiciously close to my most recent patch update.

# zpool status -v tank
  pool: tank
 state: ONLINE
 scrub: scrub completed after 3h26m with 0 errors on Mon Jun 21 01:45:00 2010
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t6d0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c1t5d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c1t4d0  ONLINE       0     0     0
        logs
          c0t0d0    ONLINE       0     0     0

errors: No known data errors

/var/adm/messages is positively overrun with these triplets/quadruplets, not all of which end up as the "fatal" type:

Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci1043,8...@7/d...@1,0 (sd14):
Jun 19 21:43:19 deepthought     Error for Command: write(10)    Error Level: Retryable
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Requested Block: 26721062       Error Block: 26721062
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Vendor: ATA     Serial Number:
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Sense Key: Aborted Command
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       ASC: 0x8 (LUN communication failure), ASCQ: 0x0, FRU: 0x0
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci1043,8...@7/d...@1,0 (sd14):
Jun 19 21:43:19 deepthought     Error for Command: write(10)    Error Level: Retryable
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Requested Block: 26721062       Error Block: 26721062
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Vendor: ATA     Serial Number:
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Sense Key: Aborted Command
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci1043,8...@7/d...@1,0 (sd14):
Jun 19 21:43:19 deepthought     Error for Command: write(10)    Error Level: Fatal
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Requested Block: 26721062       Error Block: 26721062
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Vendor: ATA     Serial Number:
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Sense Key: Aborted Command

In the past this kern.notice ID has come up as "informational" for others, and in my case it _only_ occurred during the initial resilver.

One last point of interest: the new drive is a WD Green WD10EARS, and the old ones are WD Green WD6400AACS (all of which I have tested on another system with the WD read-test utility). I know these drives get their share of ridicule (and occasional praise/satisfaction), but I'd appreciate any thoughts on proceeding with the mirror upgrade. [Backups are a check.]

Justin
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss