Hi Andy,

On Feb 14, 2012, at 12:41 PM, andy thomas wrote:
> On Tue, 14 Feb 2012, Richard Elling wrote:
>
>> Hi Andy
>>
>> On Feb 14, 2012, at 10:37 AM, andy thomas wrote:
>>
>>> On one of our servers, we have a RAIDz1 ZFS pool called 'maths2'
>>> consisting of 7 x 300 GB disks, which in turn contains a single ZFS
>>> filesystem called 'home'.
>>>
>>> Yesterday, using the 'ls' command to list the directories within this
>>> pool caused the command to hang for a long period, followed by an
>>> 'i/o error' message. 'zpool status -x maths2' reports the pool is
>>> healthy but 'iostat -en' shows a rather different story:
>>>
>>> root@e450:~# iostat -en
>>>           ---- errors ---
>>>   s/w  h/w  trn  tot device
>>>     0    0    0    0 fd0
>>>     0    0    0    0 c2t3d0
>>>     0    0    0    0 c2t0d0
>>>     0    0    0    0 c2t1d0
>>>     0    0    0    0 c5t3d0
>>>     0    0    0    0 c4t0d0
>>>     0    0    0    0 c4t1d0
>>>     0    0    0    0 c2t2d0
>>>     0    0    0    0 c4t2d0
>>>     0    0    0    0 c4t3d0
>>>     0    0    0    0 c5t0d0
>>>     0    0    0    0 c5t1d0
>>>     0    0    0    0 c8t0d0
>>>     0    0    0    0 c8t1d0
>>>     0    0    0    0 c8t2d0
>>>     0  503 1658 2161 c9t0d0
>>>     0 2515 6260 8775 c9t1d0
>>>     0    0    0    0 c8t3d0
>>>     0  492 2024 2516 c9t2d0
>>>     0  444 1810 2254 c9t3d0
>>>     0    0    0    0 c5t2d0
>>>     0    1    0    1 rmt/2
>>>
>>> Obviously it looks like controller c9 or the cabling associated with it
>>> is in trouble (the server is an Enterprise 450 with multiple disk
>>> controllers). On taking the server down and running the 'probe-scsi-all'
>>> command from the OBP, one disk, c9t1d0, was reported as being faulty
>>> (no media present) but the others seemed fine.
>>
>> We see similar symptoms when a misbehaving disk (usually SATA) disrupts
>> the other disks in the same fault zone.
>
> OK, I will replace the disk.
>
>>> After booting back up, I started scrubbing the maths2 pool and for a
>>> long time, only disk c9t1d0 reported it was being repaired. After a few
>>> hours, another disk on this controller reported being repaired:
>>>
>>>         NAME        STATE     READ WRITE CKSUM
>>>         maths2      ONLINE       0     0     0
>>>           raidz1-0  ONLINE       0     0     0
>>>             c5t2d0  ONLINE       0     0     0
>>>             c5t3d0  ONLINE       0     0     0
>>>             c8t3d0  ONLINE       0     0     0
>>>             c9t0d0  ONLINE       0     0     0  21K repaired
>>>             c9t1d0  ONLINE       0     0     0  938K repaired
>>>             c9t2d0  ONLINE       0     0     0
>>>             c9t3d0  ONLINE       0     0     0
>>>
>>> errors: No known data errors
>>>
>>> Now, does this point to a controller/cabling/backplane problem or could
>>> all 4 disks on this controller have been corrupted in some way? The O/S
>>> is Osol snv_134 for SPARC and the server has been up & running for
>>> nearly a year with no problems to date - there are two other RAIDz1
>>> pools on this server but these are working fine.
>>
>> Not likely. More likely the faulty disk is causing issues elsewhere.
>
> It seems odd that 'zpool status' is not reporting a degraded status and
> 'zpool status -x' is still saying "all pools are healthy". This is a
> little worrying as I use remote monitoring to keep an eye on all the
> servers I admin (many of which run Solaris, OpenIndiana and FreeBSD) and
> one thing that is checked every 15 minutes is the pool status using
> 'zpool status -x'. But this seems to result in a false sense of security
> and I could be blissfully unaware that half a pool has dropped out!

The integrity of the pool was not in danger. I'll bet you have a whole
bunch of errors logged to syslog.

>
>> NB, for file and RAID systems that do not use checksums, such
>> corruptions can be catastrophic. Yea ZFS!
>
> Yes indeed!

:-)
 -- richard

--
DTrace Conference, April 3, 2012,
http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422
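[Editorial note: for remote monitoring setups like the one Andy describes, a
cron check that relies on 'zpool status -x' alone can miss exactly this kind
of fault. A minimal sketch follows, pairing 'zpool status -x' with the
per-device error counters from 'iostat -en'. It assumes standard Solaris
iostat/zpool output and mailx; the alert recipient is hypothetical.]

    #!/bin/sh
    # Sketch only: flag trouble even when 'zpool status -x' still says
    # "all pools are healthy", by also checking iostat error counters.

    ALERT="root"                      # hypothetical alert recipient

    STATUS=`zpool status -x`
    if [ "$STATUS" != "all pools are healthy" ]; then
        echo "$STATUS" | mailx -s "zpool status warning on `hostname`" $ALERT
    fi

    # 'iostat -en' prints: s/w h/w trn tot device. Skip the two header
    # lines and report any device whose total error count (column 4) is
    # non-zero.
    ERRS=`iostat -en | awk 'NR > 2 && $4 > 0 { print $5, "total errors:", $4 }'`
    if [ -n "$ERRS" ]; then
        echo "$ERRS" | mailx -s "iostat error counters on `hostname`" $ALERT
    fi

[Run from cron every 15 minutes, a check along these lines would have flagged
the rising error counts on the c9 disks above while the pool itself was still
being reported as healthy.]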