Hi all,

On Thursday 26 November 2009 17:38:42 Cindy Swearingen wrote:
> Did anything about this configuration change before the checksum errors
> occurred?
> 

No, this machine has been running in this configuration for a couple of weeks now.

> The errors on c1t6d0 are severe enough that your spare kicked in.
> 
Yes, and overnight more spares would have kicked in, had any been available:

s13:~# zpool status                                         
  pool: atlashome                                           
 state: DEGRADED                                            
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors  
        using 'zpool clear' or replace the device with 'zpool replace'.     
   see: http://www.sun.com/msg/ZFS-8000-9P                                  
 scrub: resilver completed after 5h46m with 0 errors on Thu Nov 26 15:55:22 2009
config:                                                                         

        NAME          STATE     READ WRITE CKSUM
        atlashome     DEGRADED     0     0     0
          raidz1      ONLINE       0     0     0
            c0t0d0    ONLINE       0     0     0
            c1t0d0    ONLINE       0     0     0
            c5t0d0    ONLINE       0     0     0
            c7t0d0    ONLINE       0     0     0
            c8t0d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c0t1d0    ONLINE       0     0     0
            c1t1d0    ONLINE       0     0     0
            c5t1d0    ONLINE       0     0     1
            c6t1d0    ONLINE       0     0     6
            c7t1d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c8t1d0    ONLINE       0     0     0
            c0t2d0    ONLINE       0     0     0
            c1t2d0    ONLINE       0     0     0
            c5t2d0    ONLINE       0     0     3
            c6t2d0    ONLINE       0     0     1
          raidz1      ONLINE       0     0     0
            c7t2d0    ONLINE       0     0     0
            c8t2d0    ONLINE       0     0     1
            c0t3d0    ONLINE       0     0     0
            c1t3d0    ONLINE       0     0     0
            c5t3d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c6t3d0    ONLINE       0     0     0
            c7t3d0    ONLINE       0     0     0
            c8t3d0    ONLINE       0     0     0
            c0t4d0    ONLINE       0     0     0
            c1t4d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c5t4d0    ONLINE       0     0     0
            c7t4d0    ONLINE       0     0     0
            c8t4d0    ONLINE       0     0     0
            c0t5d0    ONLINE       0     0     1
            c1t5d0    ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c5t5d0    ONLINE       0     0     0
            c6t5d0    ONLINE       0     0     0
            c7t5d0    ONLINE       0     0     0
            c8t5d0    ONLINE       0     0     1
            c0t6d0    ONLINE       0     0     0
          raidz1      DEGRADED     0     0     0
            spare     DEGRADED     0     0     0
              c1t6d0  DEGRADED     6     0    17  too many errors
              c8t7d0  ONLINE       0     0     0  130G resilvered
            c5t6d0    ONLINE       0     0     0
            c6t6d0    DEGRADED     0     0    41  too many errors
            c7t6d0    DEGRADED     1     0    14  too many errors
            c8t6d0    ONLINE       0     0     1
          raidz1      ONLINE       0     0     0
            c0t7d0    ONLINE       0     0     0
            c1t7d0    ONLINE       0     0     1
            c5t7d0    ONLINE       0     0     0
            c6t7d0    ONLINE       0     0     0
            c7t7d0    ONLINE       0     0     0
        logs
          c6t4d0      ONLINE       0     0     0
        spares
          c8t7d0      INUSE     currently in use

errors: No known data errors
> You can use the fmdump -eV command to review the disk errors that FMA has
> detected. This command can generate a lot of output but you can see if
> the checksum errors on the disks are transient or if they occur repeatedly.
> 

Hmm, the output does not seem to stop; I killed it after it had written about
1.3 GB. There seem to be a few different types of event in there:

Nov 04 2009 15:54:08.039456458 ereport.fs.zfs.checksum
nvlist version: 0
        class = ereport.fs.zfs.checksum
        ena = 0x403c56a7d4a00001
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0xea7c0de1586275c7
                vdev = 0xfca535aa8bbc70d1
        (end detector)

        pool = atlashome
        pool_guid = 0xea7c0de1586275c7
        pool_context = 0
        pool_failmode = wait
        vdev_guid = 0xfca535aa8bbc70d1
        vdev_type = spare
        parent_guid = 0x371eb0d63ce91f06
        parent_type = raidz
        zio_err = 0
        zio_offset = 0x9706d7600
        zio_size = 0x8000
        zio_objset = 0x46
        zio_object = 0xfbcc
        zio_level = 0
        zio_blkid = 0x23
        __ttl = 0x1
        __tod = 0x4af19590 0x25a0eca

or
Nov 02 2009 16:55:37.076615439 ereport.fs.zfs.checksum
nvlist version: 0                                     
        class = ereport.fs.zfs.checksum               
        ena = 0xa351756c27900c01                      
        detector = (embedded nvlist)                  
        nvlist version: 0                             
                version = 0x0                         
                scheme = zfs                          
                pool = 0xea7c0de1586275c7             
                vdev = 0x55c360b6c3e946ea             
        (end detector)                                

        pool = atlashome
        pool_guid = 0xea7c0de1586275c7
        pool_context = 0              
        pool_failmode = wait          
        vdev_guid = 0x55c360b6c3e946ea
        vdev_type = disk              
        vdev_path = /dev/dsk/c8t0d0s0 
        vdev_devid = id1,s...@sata_____hitachi_hds7250s______krvn67zbh9ey9h/a
        parent_guid = 0x371eb0d63ce91f06
        parent_type = raidz
        zio_err = 0
        zio_offset = 0x1632eee00
        zio_size = 0x400
        zio_objset = 0x28
        zio_object = 0x797549
        zio_level = 0
        zio_blkid = 0x0
        __ttl = 0x1
        __tod = 0x4aef00f9 0x4910f0f

or

Oct 26 2009 15:43:43.973655977 ereport.fs.zfs.zpool
nvlist version: 0
        class = ereport.fs.zfs.zpool
        ena = 0x37f6ca58e400801
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x8f607617c7160c92
        (end detector)

        pool = atlashome
        pool_guid = 0x8f607617c7160c92
        pool_context = 2
        pool_failmode = wait
        __ttl = 0x1
        __tod = 0x4ae5b59f 0x3a08cfa9

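Rather than wading through the full -eV dump, a quick way to see which event
classes dominate is the one-line-per-event form, counting by class (a sketch;
fmdump -e prints a TIME/CLASS header followed by one event per line, with the
class in the last column):

```shell
# Count FMA error events by class. NR > 1 skips the header line;
# the event class is the last whitespace-separated field.
fmdump -e | awk 'NR > 1 { print $NF }' | sort | uniq -c | sort -rn
```

On a box like this one, ereport.fs.zfs.checksum would presumably top the list.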

> At the very least, I would consider physically replacing c1t6d0.
> 

That's an option; I'll see whether the system can then repair more of the
errors. As for errors that name a specific disk, only one disk has been named
in the output so far.

Richard, I'll try 'zpool clear' as well, but I wanted to wait for some feedback
first, since this is the first time we have hit such a large number of errors.

What I find strange is why a single vdev is producing so many errors. A
controller fault should not be possible, since these vdevs span multiple
controllers, and I've seen no memory errors (yet) and no faulty-CPU messages...

Thanks a lot for the input!

Carsten
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
