Thanks for the responses.

Richard,

Yes, zpool status returns an error:

# zpool status -xv
  pool: zpool1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed with 0 errors on Tue Dec  2 10:50:47 2008
config:
        NAME        STATE     READ WRITE CKSUM
        zpool1      ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
        <snip>
          raidz1    ONLINE       0     0     0
            c0t6d0  ONLINE       0     0     0
            c1t6d0  ONLINE       0     0     1
            c5t6d0  ONLINE       0     0     0
            c6t6d0  ONLINE       0     0     0
            c7t6d0  ONLINE       0     0     0
            c8t6d0  ONLINE       0     0     0
errors: No known data errors

So there don't appear to be any data errors, probably because the raidz
redundancy saved the data (and there was much rejoicing).

I wasn't really attempting to kill the canary, just making sure it hadn't
simply fallen asleep. By clearing the error and re-running the scrub, I was
hoping to see whether the error was just transient or a real hardware I/O
issue.
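For reference, the sequence I ran was roughly this (pool/device names as in
the status output above):

# zpool clear zpool1 c1t6d0
# zpool scrub zpool1
# zpool status -v zpool1

i.e. reset the counters on the suspect disk, re-read every block, then see
whether the CKSUM count climbs back up.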

I looked through the logs, but Solaris logs are even worse than Linux logs
for tracking down hardware errors, heh. Nothing appears to be a drive issue
(aside from a couple of entries from pulling a USB CD-ROM off the machine a
couple of weeks ago).
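I'm assuming the FMA telemetry is a better place to look than syslog on
Solaris; something along these lines should show the raw per-device error
reports and any committed diagnoses:

# fmdump -eV | less
# fmadm faulty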

The machine was last updated (to Solaris 10 U5) on Nov 22nd, our last
scheduled maintenance day. Our next is January 24th; hopefully we'll move
to U6 then.

I guess what I'm really wondering is how best to determine whether this is
a hardware problem with the disk that means it needs to be replaced.
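Would checking the per-drive error counters be the right call? I'm assuming
something like this would show whether c1t6d0 has been logging
soft/hard/transport errors (and it prints the serial number, handy for an
RMA):

# iostat -En c1t6d0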

And I noticed the SATA cable comment, but I wasn't going to point it out. :)

Dave

-----Original Message-----
From: richard.ell...@sun.com [mailto:richard.ell...@sun.com] 
Sent: Tuesday, December 16, 2008 8:04 PM
To: Jonathan
Cc: Glaser, David; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Drive Checksum error

Glaser, David wrote:
> Hi all,
> 
> A few weeks ago I was inquiring of the group on how often to do zfs 
> scrubs of pools on our x4500's. Figures that the first time I try 
> to do a monthly scrub of our pools, we get one of the three machines
> to throw an error. On one of the machines, there's one disk that has 
> registered one Checksum error. Sun lists it as an 'unrecoverable I/O 
> error'. Is it really an unrecoverable error? Is the drive really bad
> (i.e. does it warrant a call to Sun for an RMA of the drive?)  Researching
> the error message turned up a way to set the threshold of checksum errors
> tolerated before a fault is declared, but I'd figure that one is too many.

I presume you mean that a "zpool status" shows a data error?
If so, try "zpool status -xv" to see which file(s) are affected.
If ZFS is managing the redundancy, it should be able to recover
the data.

Depending on the drive, disk drive vendors spec 1 unrecoverable error (UER)
for every 1e15 bits read. So it is not really all that unlikely to see one
on a system the size of an X4500, which can hold ~3.8e14 bits.
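
Back of the envelope, assuming the whole box is read end-to-end once per
scrub:

  3.8e14 bits read / 1e15 bits per UER ~= 0.38 expected errors per scrub

or roughly one unrecoverable read every two to three complete scrubs, from
the media spec alone.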

> So, is there a way to see if it is a bad disk, or just zfs being a 
> pain? Should I reset the checksum error counter and re-run the scrub?

Don't kill the canary!  Check the error logs for more details, and make
sure you are up to date on the Marvell SATA controller patches.
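Something like this will show what's installed (I don't have the relevant
Marvell patch IDs handy, so <patch-id> below is a placeholder):

# showrev -p | grep <patch-id>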

Jonathan wrote:
> If you start seeing hundreds of errors, be sure to check things like the
> cable.  I had a SATA cable come loose on a home ZFS fileserver, and a
> scrub was throwing hundreds of errors even though the drive itself was
> fine.  I don't want to think about what could have happened with UFS...

X4500s don't have any SATA cables :-)
  -- richard