After performing the following steps in exact order, I am now seeing CKSUM
errors in my zpool. I have never seen checksum errors in this pool before.

1. Pre-existing running setup: RAIDZ (7D+1P) across 8x 1TB disks, on
Solaris 10 Update 3 x86.
2. Disk 6 (c6t2d0) was dying: 'zpool status' showed read errors, and
device errors appeared in /var/adm/messages.
3. In addition to replacing this disk, I thought I would give myself a
challenge and upgrade to Solaris 10 U5 while also changing my
CPU/motherboard:
 > 3.1 CPU went from an Athlon FX-51 to an Athlon 64 3500+.
 > 3.2 Motherboard went from an Asus SK8N to an Asus A8N-SLI Premium.
 > 3.3 Memory stayed the same: 2GB of ECC DDR (all other components
identical).
 > 3.4 And finally, I replaced the failed Disk 6.
4. The Solaris 10 U5 x86 install went without a problem. The zpool
imported fine (DEGRADED, as expected).
5. 'zpool replace' worked without a problem, and the pool resilvered with
0 read, write, or cksum errors.
6. After the replace, ZFS recommended upgrading the pool from version 3
to version 4 ('zpool upgrade'), which I did (the sequence is sketched
below).
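
For reference, this is roughly the command sequence from steps 4-6
(reconstructed from memory; pool and device names as above):

# import the existing pool after the fresh U5 install
zpool import rzdata

# swap in the new disk 6 and let it resilver
zpool replace rzdata c4t2d0
zpool status rzdata

# move the pool from version 3 to version 4
zpool upgrade rzdata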

This is where the problem starts to appear.

The upgrade itself was fine. However, immediately after the upgrade I ran
a scrub and noticed a very high number of cksum errors on the newly
replaced disk 6 (now c4t2d0; it was c6t2d0 before the reinstall).
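
For completeness, the scrub was started and polled along these lines (in
practice I simply re-ran the two commands by hand):

# kick off the scrub, then check on it periodically
zpool scrub rzdata
date; zpool status -v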

Here is the progress of the scrub; you can see how the CKSUM count on
that disk climbs quickly and steadily:

[/root][root]# date
Fri Oct 10 00:19:16 EST 2008
[/root][root]# zpool status -v
  pool: rzdata
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress, 7.34% done, 6h10m to go
config:

        NAME        STATE     READ WRITE CKSUM
        rzdata      ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
            c3t2d0  ONLINE       0     0     0
            c3t3d0  ONLINE       0     0     0
            c4t0d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
            c4t2d0  ONLINE       0     0   390
            c4t3d0  ONLINE       0     0     0

errors: No known data errors
[/root][root]# date
Fri Oct 10 00:23:12 EST 2008
[/root][root]# zpool status -v
  pool: rzdata
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress, 8.01% done, 6h6m to go
config:

        NAME        STATE     READ WRITE CKSUM
        rzdata      ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
            c3t2d0  ONLINE       0     0     1
            c3t3d0  ONLINE       0     0     0
            c4t0d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     2
            c4t2d0  ONLINE       0     0   768
            c4t3d0  ONLINE       0     0     0
[/root][root]# date
Fri Oct 10 00:29:44 EST 2008
[/root][root]# zpool status -v
  pool: rzdata
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress, 9.88% done, 5h57m to go
config:

        NAME        STATE     READ WRITE CKSUM
        rzdata      ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
            c3t2d0  ONLINE       0     0     2
            c3t3d0  ONLINE       0     0     0
            c4t0d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     2
            c4t2d0  ONLINE       0     0   931
            c4t3d0  ONLINE       0     0     1

It eventually finished with 6.4K CKSUM errors against c4t2d0 and fewer
than 5 errors on each of the remaining disks. I was not (and still am
not) convinced it is a physical hardware problem; my initial thought was
that there is (or was?) a bug in running 'zpool upgrade' against a
mounted, active pool. So, to be thorough, I rebooted the server and
initiated another scrub.
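
In other words (exact invocations from memory):

# reboot, then kick off a fresh scrub; the per-device error counters
# are kept in-core, so they start from zero again after the reboot
init 6
zpool scrub rzdata
zpool status -v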

This was the outcome of that scrub:
[/root][root]# zpool status -v
  pool: rzdata
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed with 0 errors on Mon Oct 13 09:42:41 2008
config:

        NAME        STATE     READ WRITE CKSUM
        rzdata      ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     1
            c3t2d0  ONLINE       0     0     0
            c3t3d0  ONLINE       0     0     1
            c4t0d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
            c4t2d0  ONLINE       0     0    22
            c4t3d0  ONLINE       0     0     2


The next avenue I plan to investigate is running a complete memtest86
pass against the hardware, to make sure the memory isn't occasionally
returning garbage (even though it's ECC).
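
Before that, I will also check whether FMA logged anything that never
made it into the messages file; something along these lines (standard
Solaris 10 diagnostics):

# dump the FMA error-report log; ZFS checksum ereports land here
fmdump -eV | more

# list any faults FMA has actually diagnosed
fmadm faulty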

So this is where I stand. I'd like to ask zfs-discuss whether anyone has
seen any ZIL/replay-style bugs associated with the U3 to U5 upgrade on
x86. Again, I'm confident in my hardware, and /var/adm/messages shows no
warnings or errors.

Thank You