Hi all,

On Thursday 26 November 2009 17:38:42 Cindy Swearingen wrote:
> Did anything about this configuration change before the checksum errors
> occurred?
No, this machine has been running in this configuration for a couple of weeks now.

> The errors on c1t6d0 are severe enough that your spare kicked in.

Yes, and overnight more spares would have kicked in if any had been available:

s13:~# zpool status
  pool: atlashome
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 5h46m with 0 errors on Thu Nov 26 15:55:22 2009
config:

        NAME          STATE     READ WRITE CKSUM
        atlashome     DEGRADED     0     0     0
          raidz1      ONLINE       0     0     0
            c0t0d0    ONLINE       0     0     0
            c1t0d0    ONLINE       0     0     0
            c5t0d0    ONLINE       0     0     0
            c7t0d0    ONLINE       0     0     0
            c8t0d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c0t1d0    ONLINE       0     0     0
            c1t1d0    ONLINE       0     0     0
            c5t1d0    ONLINE       0     0     1
            c6t1d0    ONLINE       0     0     6
            c7t1d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c8t1d0    ONLINE       0     0     0
            c0t2d0    ONLINE       0     0     0
            c1t2d0    ONLINE       0     0     0
            c5t2d0    ONLINE       0     0     3
            c6t2d0    ONLINE       0     0     1
          raidz1      ONLINE       0     0     0
            c7t2d0    ONLINE       0     0     0
            c8t2d0    ONLINE       0     0     1
            c0t3d0    ONLINE       0     0     0
            c1t3d0    ONLINE       0     0     0
            c5t3d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c6t3d0    ONLINE       0     0     0
            c7t3d0    ONLINE       0     0     0
            c8t3d0    ONLINE       0     0     0
            c0t4d0    ONLINE       0     0     0
            c1t4d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c5t4d0    ONLINE       0     0     0
            c7t4d0    ONLINE       0     0     0
            c8t4d0    ONLINE       0     0     0
            c0t5d0    ONLINE       0     0     1
            c1t5d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c5t5d0    ONLINE       0     0     0
            c6t5d0    ONLINE       0     0     0
            c7t5d0    ONLINE       0     0     0
            c8t5d0    ONLINE       0     0     1
            c0t6d0    ONLINE       0     0     0
          raidz1      DEGRADED     0     0     0
            spare     DEGRADED     0     0     0
              c1t6d0  DEGRADED     6     0    17  too many errors
              c8t7d0  ONLINE       0     0     0  130G resilvered
            c5t6d0    ONLINE       0     0     0
            c6t6d0    DEGRADED     0     0    41  too many errors
            c7t6d0    DEGRADED     1     0    14  too many errors
            c8t6d0    ONLINE       0     0     1
          raidz1      ONLINE       0     0     0
            c0t7d0    ONLINE       0     0     0
            c1t7d0    ONLINE       0     0     1
            c5t7d0    ONLINE       0     0     0
            c6t7d0    ONLINE       0     0     0
            c7t7d0    ONLINE       0     0     0
        logs
          c6t4d0      ONLINE       0     0     0
        spares
          c8t7d0      INUSE     currently in use

errors: No known data errors

> You can use the fmdump -eV command to review the disk errors that FMA has
> detected. This command can generate a lot of output but you can see if
> the checksum errors on the disks are transient or if they occur repeatedly.

Hmm, the output does not seem to stop; I stopped it after the file had grown to about 1.3 GB.
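One way to condense it might be something like this (just a rough sketch, assuming the disk-level ereports all carry a vdev_path field as in the samples further down):

s13:~# fmdump -eV | grep vdev_path | sort | uniq -c | sort -rn

That should at least show which devices the checksum ereports keep naming, i.e. whether the errors repeat on a few disks or are spread over the whole pool.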
There seem to be a few different types here:

Nov 04 2009 15:54:08.039456458 ereport.fs.zfs.checksum
nvlist version: 0
        class = ereport.fs.zfs.checksum
        ena = 0x403c56a7d4a00001
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0xea7c0de1586275c7
                vdev = 0xfca535aa8bbc70d1
        (end detector)

        pool = atlashome
        pool_guid = 0xea7c0de1586275c7
        pool_context = 0
        pool_failmode = wait
        vdev_guid = 0xfca535aa8bbc70d1
        vdev_type = spare
        parent_guid = 0x371eb0d63ce91f06
        parent_type = raidz
        zio_err = 0
        zio_offset = 0x9706d7600
        zio_size = 0x8000
        zio_objset = 0x46
        zio_object = 0xfbcc
        zio_level = 0
        zio_blkid = 0x23
        __ttl = 0x1
        __tod = 0x4af19590 0x25a0eca

or

Nov 02 2009 16:55:37.076615439 ereport.fs.zfs.checksum
nvlist version: 0
        class = ereport.fs.zfs.checksum
        ena = 0xa351756c27900c01
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0xea7c0de1586275c7
                vdev = 0x55c360b6c3e946ea
        (end detector)

        pool = atlashome
        pool_guid = 0xea7c0de1586275c7
        pool_context = 0
        pool_failmode = wait
        vdev_guid = 0x55c360b6c3e946ea
        vdev_type = disk
        vdev_path = /dev/dsk/c8t0d0s0
        vdev_devid = id1,s...@sata_____hitachi_hds7250s______krvn67zbh9ey9h/a
        parent_guid = 0x371eb0d63ce91f06
        parent_type = raidz
        zio_err = 0
        zio_offset = 0x1632eee00
        zio_size = 0x400
        zio_objset = 0x28
        zio_object = 0x797549
        zio_level = 0
        zio_blkid = 0x0
        __ttl = 0x1
        __tod = 0x4aef00f9 0x4910f0f

or

Oct 26 2009 15:43:43.973655977 ereport.fs.zfs.zpool
nvlist version: 0
        class = ereport.fs.zfs.zpool
        ena = 0x37f6ca58e400801
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x8f607617c7160c92
        (end detector)

        pool = atlashome
        pool_guid = 0x8f607617c7160c92
        pool_context = 2
        pool_failmode = wait
        __ttl = 0x1
        __tod = 0x4ae5b59f 0x3a08cfa9

> At the very least, I would consider physically replacing c1t6d0.

That's an option; I could then see whether the system repairs more of the errors. Regarding the ereports that name a specific disk: so far only one disk appears in the output.

Richard, I'll try zpool clear as well, but I wanted to wait for some feedback first, since this is the first time we have hit such a large number of errors.

What I find strange is why a single vdev is producing so many errors. I don't think a controller fault can explain it, since these vdevs span multiple controllers, and I have not seen any memory errors (yet) or faulty-CPU messages...

Thanks a lot for the input!

Carsten
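P.S. For completeness, what I have in mind is roughly the following (only a sketch; the device names are taken from the status output above). Either clear the error counters and let a scrub re-check the pool:

    s13:~# zpool clear atlashome
    s13:~# zpool scrub atlashome

or, after physically swapping the disk, resilver onto the replacement:

    s13:~# zpool replace atlashome c1t6d0

The hot spare should detach by itself once that resilver completes; if it does not:

    s13:~# zpool detach atlashome c8t7d0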