After performing the following steps, in this exact order, I am now seeing CKSUM errors in my zpool. I had never seen any checksum errors in this zpool before.
1. Existing running setup: RAIDZ 7D+1P, 8x 1TB, Solaris 10 Update 3 x86.
2. Disk 6 (c6t2d0) was dying: 'zpool status' showed read errors, and there were device errors in /var/adm/messages.
3. In addition to replacing this disk, I thought I would give myself a challenge and upgrade to Solaris 10 U5 and change my CPU/motherboard.
   3.1 CPU went from an Athlon FX-51 to an AthlonXP 3500+.
   3.2 Motherboard went from an Asus SK8N to an Asus A8N-SLI Premium.
   3.3 Memory stayed the same at 2GB ECC DDR (all other components identical).
   3.4 And finally, I replaced the failed disk 6.
4. The Solaris 10 U5 x86 install went fine without a problem. The zpool imported fine (obviously DEGRADED).
5. 'zpool replace' worked without a problem and the pool resilvered with 0 read, write or cksum errors.
6. After the replace, ZFS recommended I run 'zpool upgrade' to take the pool from version 3 to version 4, which I have done.

This is where the problem starts to appear. The upgrade itself was fine, however immediately afterwards I ran a scrub and noticed a very high number of CKSUM errors on the newly replaced disk 6 (now c4t2d0; before the reinstall it was c6t2d0). Here is the progress of the scrub -- you can see how the CKSUM count on that disk is quickly and constantly increasing (a rough sketch of the command sequence from steps 5-6 follows the status snapshots below):

[/root][root]# date
Fri Oct 10 00:19:16 EST 2008
[root][root]# zpool status -v
  pool: rzdata
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress, 7.34% done, 6h10m to go
config:

        NAME        STATE     READ WRITE CKSUM
        rzdata      ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
            c3t2d0  ONLINE       0     0     0
            c3t3d0  ONLINE       0     0     0
            c4t0d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
            c4t2d0  ONLINE       0     0   390
            c4t3d0  ONLINE       0     0     0

errors: No known data errors

[/root][root]# date
Fri Oct 10 00:23:12 EST 2008
[root][root]# zpool status -v
  pool: rzdata
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress, 8.01% done, 6h6m to go
config:

        NAME        STATE     READ WRITE CKSUM
        rzdata      ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
            c3t2d0  ONLINE       0     0     1
            c3t3d0  ONLINE       0     0     0
            c4t0d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     2
            c4t2d0  ONLINE       0     0   768
            c4t3d0  ONLINE       0     0     0

[/root][root]# date
Fri Oct 10 00:29:44 EST 2008
[/root][root]# zpool status -v
  pool: rzdata
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress, 9.88% done, 5h57m to go
config:

        NAME        STATE     READ WRITE CKSUM
        rzdata      ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
            c3t2d0  ONLINE       0     0     2
            c3t3d0  ONLINE       0     0     0
            c4t0d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     2
            c4t2d0  ONLINE       0     0   931
            c4t3d0  ONLINE       0     0     1

It eventually finished with 6.4K CKSUM errors against c4t2d0 and fewer than 5 errors on average against each of the remaining disks.
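For reference, a rough sketch of the replace/upgrade/scrub sequence from steps 5 and 6 (pool and device names as above; this is a reconstruction of the commands I ran, not an exact transcript, and the exact arguments to 'zpool replace' may have differed):

# 5. Replace the failed disk 6 (old c6t2d0, which enumerated as c4t2d0
#    after the reinstall), then wait for the resilver to finish.
zpool replace rzdata c4t2d0
zpool status -v rzdata       # resilver completed with 0 read/write/cksum errors

# 6. Upgrade the pool's on-disk version as recommended, then scrub.
zpool upgrade rzdata         # pool version 3 -> 4
zpool scrub rzdata
zpool status -v rzdata       # CKSUM on c4t2d0 climbs while the scrub runs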
I was not (and still am not) convinced it's a physical hardware problem; my initial thought was that there is (or was?) a bug in ZFS when 'zpool upgrade' is run against a mounted, running zpool. So, to be pedantic, I rebooted the server and initiated another scrub. This is the outcome of that scrub:

[/root][root]# zpool status -v
  pool: rzdata
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed with 0 errors on Mon Oct 13 09:42:41 2008
config:

        NAME        STATE     READ WRITE CKSUM
        rzdata      ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     1
            c3t2d0  ONLINE       0     0     0
            c3t3d0  ONLINE       0     0     1
            c4t0d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
            c4t2d0  ONLINE       0     0    22
            c4t3d0  ONLINE       0     0     2

The next avenue I plan on investigating is running a complete memtest86 pass against the hardware to ensure the memory isn't occasionally returning garbage (even though it's ECC). A sketch of the other on-box checks I have in mind is in the P.S. below.

So this is where I stand. I'd like to ask zfs-discuss whether anyone has seen any ZIL/replay-style bugs associated with U3/U5 x86. Again, I'm confident in my hardware, and /var/adm/messages is showing no warnings or errors.

Thank You
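P.S. Alongside memtest86, a rough sketch of the on-box checks I'm thinking of running (all stock Solaris 10 commands; 'rzdata' and the device names as above):

# Per-device error counters from the driver's point of view
iostat -En

# FMA error telemetry, in case something was logged outside /var/adm/messages
fmdump -eV | more

# If those stay clean, reset the CKSUM counts and scrub once more
zpool clear rzdata
zpool scrub rzdata
zpool status -v rzdata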