On 28.05.12 00:35, Richard Elling wrote:
On May 27, 2012, at 12:52 PM, Stephan Budach wrote:
Hi,
today I issued a scrub on one of my zpools and after some time I
noticed that one of the vdevs became degraded due to some drive
having cksum errors. The spare kicked in and the drive got
resilvered, but why does the spare drive now also show almost the
same number of cksum errors as the degraded drive?
The answer is not available via zpool status. You will need to look at
the FMA diagnosis:
fmadm faulty
more clues can be found in the FMA error reports:
fmdump -eV
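For example, to pull out just the checksum ereports, something like this
should work (a sketch; the class name below is the usual one for ZFS
checksum ereports, but worth confirming on your release):

  # list only ZFS checksum ereports from the error log
  fmdump -e -c ereport.fs.zfs.checksum
  # full nvlist detail for each matching event
  fmdump -eV -c ereport.fs.zfs.checksum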
Thanks - I had taken a look at the FMA diagnosis, but hadn't shared it
in my first post. FMA only shows one instance as of yesterday:
root@solaris11c:~# fmadm faulty |less
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
May 27 10:24:24 f0601f5f-cb8b-67bc-bd63-e71948ea8428 ZFS-8000-GH    Major
Host        : solaris11c
Platform    : SUN-FIRE-X4170-M2-SERVER   Chassis_id : 1046FMM0NH
Product_sn  : 1046FMM0NH

Fault class : fault.fs.zfs.vdev.checksum
Affects     : zfs://pool=obelixData/vdev=52e3ca377dbdbec9
                  faulted but still providing degraded service
Problem in  : zfs://pool=obelixData/vdev=52e3ca377dbdbec9
                  faulted but still providing degraded service

Description : The number of checksum errors associated with a ZFS device
              exceeded acceptable levels.  Refer to
              http://sun.com/msg/ZFS-8000-GH for more information.

Response    : The device has been marked as degraded. An attempt
              will be made to activate a hot spare if available.

Impact      : Fault tolerance of the pool may be compromised.

Action      : Run 'zpool status -x' and replace the bad device.
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Mar 15 16:34:52 5ad04cb0-af03-e84b-cd8a-a07aff7aec2c PCIEX-8000-J5  Major
I take this to be the fault from when the vdev initially became degraded;
no further errors were reported afterwards, while the resilver took place,
so I tend to think that the spare drive is indeed okay.
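If I wanted to double-check that, a time-bounded query against the error
log should do it (a sketch; the -t date format below is assumed, fmdump(1M)
lists the accepted forms):

  # show only error-log events logged since May 27, 2012
  fmdump -e -t 27May12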
root@solaris11c:~# zpool status obelixData
  pool: obelixData
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: resilvered 1,12T in 10h50m with 0 errors on Sun May 27 21:15:32 2012
config:

        NAME                                  STATE     READ WRITE CKSUM
        obelixData                            DEGRADED     0     0     0
          mirror-0                            ONLINE       0     0     0
            c9t2100001378AC02DDd1             ONLINE       0     0     0
            c9t2100001378AC02F4d1             ONLINE       0     0     0
          mirror-1                            ONLINE       0     0     0
            c9t2100001378AC02F4d0             ONLINE       0     0     0
            c9t2100001378AC02DDd0             ONLINE       0     0     0
          mirror-2                            ONLINE       0     0     0
            c9t2100001378AC02DDd2             ONLINE       0     0     0
            c9t2100001378AC02F4d2             ONLINE       0     0     0
          mirror-3                            ONLINE       0     0     0
            c9t2100001378AC02DDd3             ONLINE       0     0     0
            c9t2100001378AC02F4d3             ONLINE       0     0     0
          mirror-4                            ONLINE       0     0     0
            c9t2100001378AC02DDd5             ONLINE       0     0     0
            c9t2100001378AC02F4d5             ONLINE       0     0     0
          mirror-5                            ONLINE       0     0     0
            c9t2100001378AC02DDd4             ONLINE       0     0     0
            c9t2100001378AC02F4d4             ONLINE       0     0     0
          mirror-6                            ONLINE       0     0     0
            c9t2100001378AC02DDd6             ONLINE       0     0     0
            c9t2100001378AC02F4d6             ONLINE       0     0     0
          mirror-7                            ONLINE       0     0     0
            c9t2100001378AC02DDd7             ONLINE       0     0     0
            c9t2100001378AC02F4d7             ONLINE       0     0     0
          mirror-8                            ONLINE       0     0     0
            c9t2100001378AC02DDd8             ONLINE       0     0     0
            c9t2100001378AC02F4d8             ONLINE       0     0     0
          mirror-9                            DEGRADED     0     0     0
            c9t2100001378AC02DDd9             ONLINE       0     0     0
            spare-1                           DEGRADED     0     0    10
              c9t2100001378AC02F4d9           DEGRADED     0     0    22  too many errors
              c9t2100001378AC02BFd1           ONLINE       0     0    23
          mirror-10                           ONLINE       0     0     0
            c9t2100001378AC02DDd10            ONLINE       0     0     0
            c9t2100001378AC02F4d10            ONLINE       0     0     0
          mirror-11                           ONLINE       0     0     0
            c9t2100001378AC02DDd11            ONLINE       0     0     0
            c9t2100001378AC02F4d11            ONLINE       0     0     0
          mirror-12                           ONLINE       0     0     0
            c9t2100001378AC02DDd12            ONLINE       0     0     0
            c9t2100001378AC02F4d12            ONLINE       0     0     0
          mirror-13                           ONLINE       0     0     0
            c9t2100001378AC02DDd13            ONLINE       0     0     0
            c9t2100001378AC02F4d13            ONLINE       0     0     0
          mirror-14                           ONLINE       0     0     0
            c9t2100001378AC02DDd14            ONLINE       0     0     0
            c9t2100001378AC02F4d14            ONLINE       0     0     0
        logs
          mirror-15                           ONLINE       0     0     0
            c9t2100001378AC02D9d0             ONLINE       0     0     0
            c9t2100001378AC02BFd0             ONLINE       0     0     0
        spares
          c9t2100001378AC02BFd1               INUSE     currently in use
What would be the best way to proceed? The spare drive
c9t2100001378AC02BFd1 is tagged as ONLINE, but it shows 23 cksum errors,
while the drive that became degraded only shows 22. Should I first run
another scrub and detach the degraded drive afterwards, or detach the
degraded drive immediately and run the scrub afterwards?
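For reference, the two sequences I have in mind would look roughly like
this (a sketch, assuming the plan is to keep the spare in mirror-9 and
retire the suspect drive):

  # option 1: clear the counters, scrub first, then detach the suspect drive
  zpool clear obelixData
  zpool scrub obelixData
  zpool detach obelixData c9t2100001378AC02F4d9

  # option 2: detach the suspect drive right away, then scrub
  zpool detach obelixData c9t2100001378AC02F4d9
  zpool scrub obelixData

Either way, detaching the original drive should promote the in-use spare
c9t2100001378AC02BFd1 to a permanent member of mirror-9.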
Thanks,
budy