Hi Andy,

On Feb 14, 2012, at 12:41 PM, andy thomas wrote:
> On Tue, 14 Feb 2012, Richard Elling wrote:
>
>> Hi Andy
>>
>> On Feb 14, 2012, at 10:37 AM, andy thomas wrote:
>>
>>> On one of our servers, we have a RAIDz1 ZFS pool called 'maths2'
>>> consisting of 7 x 300 GB disks, which in turn contains a single ZFS
>>> filesystem called 'home'.
>>>
>>> Yesterday, using the 'ls' command to list the directories within this
>>> pool caused the command to hang for a long period, followed by an
>>> 'i/o error' message. 'zpool status -x maths2' reports the pool is
>>> healthy but 'iostat -en' shows a rather different story:
>>>
>>> root@e450:~# iostat -en
>>>           ---- errors ---
>>>   s/w  h/w  trn  tot device
>>>     0    0    0    0 fd0
>>>     0    0    0    0 c2t3d0
>>>     0    0    0    0 c2t0d0
>>>     0    0    0    0 c2t1d0
>>>     0    0    0    0 c5t3d0
>>>     0    0    0    0 c4t0d0
>>>     0    0    0    0 c4t1d0
>>>     0    0    0    0 c2t2d0
>>>     0    0    0    0 c4t2d0
>>>     0    0    0    0 c4t3d0
>>>     0    0    0    0 c5t0d0
>>>     0    0    0    0 c5t1d0
>>>     0    0    0    0 c8t0d0
>>>     0    0    0    0 c8t1d0
>>>     0    0    0    0 c8t2d0
>>>     0  503 1658 2161 c9t0d0
>>>     0 2515 6260 8775 c9t1d0
>>>     0    0    0    0 c8t3d0
>>>     0  492 2024 2516 c9t2d0
>>>     0  444 1810 2254 c9t3d0
>>>     0    0    0    0 c5t2d0
>>>     0    1    0    1 rmt/2
>>>
>>> Obviously it looks like controller c9 or the cabling associated with it
>>> is in trouble (the server is an Enterprise 450 with multiple disk
>>> controllers). On taking the server down and running the 'probe-scsi-all'
>>> command from the OBP, one disk, c9t1d0, was reported as being faulty
>>> (no media present) but the others seemed fine.
>>
>> We see similar symptoms when a misbehaving disk (usually SATA) disrupts
>> the other disks in the same fault zone.
>
> OK, I will replace the disk.
>
>>> After booting back up, I started scrubbing the maths2 pool and for a
>>> long time, only disk c9t1d0 reported it was being repaired. After a few
>>> hours, another disk on this controller reported being repaired:
>>>
>>>         NAME        STATE     READ WRITE CKSUM
>>>         maths2      ONLINE       0     0     0
>>>           raidz1-0  ONLINE       0     0     0
>>>             c5t2d0  ONLINE       0     0     0
>>>             c5t3d0  ONLINE       0     0     0
>>>             c8t3d0  ONLINE       0     0     0
>>>             c9t0d0  ONLINE       0     0     0  21K repaired
>>>             c9t1d0  ONLINE       0     0     0  938K repaired
>>>             c9t2d0  ONLINE       0     0     0
>>>             c9t3d0  ONLINE       0     0     0
>>>
>>> errors: No known data errors
>>>
>>> Now, does this point to a controller/cabling/backplane problem or could
>>> all 4 disks on this controller have been corrupted in some way? The O/S
>>> is Osol snv_134 for SPARC and the server has been up & running for
>>> nearly a year with no problems to date - there are two other RAIDz1
>>> pools on this server but these are working fine.
>>
>> Not likely. More likely the faulty disk is causing issues elsewhere.
>
> It seems odd that 'zpool status' is not reporting a degraded status and
> 'zpool status -x' is still saying "all pools are healthy". This is a
> little worrying as I use remote monitoring to keep an eye on all the
> servers I admin (many of which run Solaris, OpenIndiana and FreeBSD) and
> one thing that is checked every 15 minutes is the pool status using
> 'zpool status -x'. But this seems to result in a false sense of security
> and I could be blissfully unaware that half a pool has dropped out!

The integrity of the pool was not in danger. I'll bet you have a whole
bunch of errors logged to syslog.

>
>> NB, for file and RAID systems that do not use checksums, such
>> corruptions can be catastrophic. Yea ZFS!
>
> Yes indeed!

:-)
 -- richard

--
DTrace Conference, April 3, 2012,
http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422
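[Editorial note: for remote monitoring setups like the one Andy describes, a
cron check that relies on 'zpool status -x' alone can miss exactly this kind
of fault. A minimal sketch follows, pairing 'zpool status -x' with the
per-device error counters from 'iostat -en'. It assumes standard Solaris
iostat/zpool output and mailx; the alert recipient is hypothetical.]

    #!/bin/sh
    # Sketch only: flag trouble even when 'zpool status -x' still says
    # "all pools are healthy", by also checking iostat error counters.

    ALERT="root"                      # hypothetical alert recipient

    STATUS=`zpool status -x`
    if [ "$STATUS" != "all pools are healthy" ]; then
        echo "$STATUS" | mailx -s "zpool status warning on `hostname`" $ALERT
    fi

    # 'iostat -en' prints: s/w h/w trn tot device. Skip the two header
    # lines and report any device whose total error count (column 4) is
    # non-zero.
    ERRS=`iostat -en | awk 'NR > 2 && $4 > 0 { print $5, "total errors:", $4 }'`
    if [ -n "$ERRS" ]; then
        echo "$ERRS" | mailx -s "iostat error counters on `hostname`" $ALERT
    fi

[Run from cron every 15 minutes, a check along these lines would have flagged
the rising error counts on the c9 disks above while the pool itself was still
being reported as healthy.]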