On one of our servers, we have a RAIDz1 ZFS pool called 'maths2' consisting of 7 x 300 GB disks, which in turn contains a single ZFS filesystem called 'home'.
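
For reference, a pool of this shape would typically have been built along these lines (illustrative only; the device names are those shown in the 'zpool status' output further down):

# create a single raidz1 vdev from the seven disks, then one filesystem
zpool create maths2 raidz1 c5t2d0 c5t3d0 c8t3d0 c9t0d0 c9t1d0 c9t2d0 c9t3d0
zfs create maths2/home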

Yesterday, using the 'ls' command to list the directories within this pool caused the command to hang for a long period, followed by an 'I/O error' message. 'zpool status -x maths2' reports the pool is healthy, but 'iostat -en' tells a rather different story:

root@e450:~# iostat -en
  ---- errors ---
  s/w h/w trn tot device
    0   0   0   0 fd0
    0   0   0   0 c2t3d0
    0   0   0   0 c2t0d0
    0   0   0   0 c2t1d0
    0   0   0   0 c5t3d0
    0   0   0   0 c4t0d0
    0   0   0   0 c4t1d0
    0   0   0   0 c2t2d0
    0   0   0   0 c4t2d0
    0   0   0   0 c4t3d0
    0   0   0   0 c5t0d0
    0   0   0   0 c5t1d0
    0   0   0   0 c8t0d0
    0   0   0   0 c8t1d0
    0   0   0   0 c8t2d0
    0 503 1658 2161 c9t0d0
    0 2515 6260 8775 c9t1d0
    0   0   0   0 c8t3d0
    0 492 2024 2516 c9t2d0
    0 444 1810 2254 c9t3d0
    0   0   0   0 c5t2d0
    0   1   0   1 rmt/2

It looks like controller c9, or the cabling associated with it, is in trouble: the error counts for all four c9 disks are concentrated in the 'trn' (transport error) column, and the server is an Enterprise 450 with multiple disk controllers. On taking the server down and running the 'probe-scsi-all' command from the OBP, one disk, c9t1d0, was reported as faulty (no media present), but the others seemed fine.
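
For anyone repeating this: probe-scsi-all is only safe after a reset with auto-boot? disabled, otherwise the machine may try to boot partway through. The usual OBP sequence is roughly:

ok setenv auto-boot? false
ok reset-all
ok probe-scsi-all    \ lists the targets seen on every SCSI controller
ok setenv auto-boot? true
ok boot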

After booting back up, I started scrubbing the maths2 pool and, for a long time, only disk c9t1d0 reported being repaired. After a few hours, another disk on the same controller also reported repairs:

        NAME        STATE     READ WRITE CKSUM
        maths2      ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            c5t2d0  ONLINE       0     0     0
            c5t3d0  ONLINE       0     0     0
            c8t3d0  ONLINE       0     0     0
            c9t0d0  ONLINE       0     0     0  21K repaired
            c9t1d0  ONLINE       0     0     0  938K repaired
            c9t2d0  ONLINE       0     0     0
            c9t3d0  ONLINE       0     0     0

errors: No known data errors
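
(For completeness, the scrub was driven with the standard commands; the per-device 'repaired' figures above come from polling 'zpool status' while the scrub runs:)

# start the scrub
root@e450:~# zpool scrub maths2
# poll progress; -v also lists any files with permanent errors
root@e450:~# zpool status -v maths2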

Now, does this point to a controller/cabling/backplane problem, or could all four disks on this controller have been corrupted in some way? The OS is OpenSolaris snv_134 for SPARC, and the server has been up and running for nearly a year with no problems to date; there are two other RAIDz1 pools on this server and they are working fine.
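
For what it's worth, the FMA error log should record whether these events were logged at the transport level or against the devices themselves; the standard Solaris commands to inspect it are:

# raw error telemetry - transport-class ereports would point at
# the controller or cabling rather than the disks themselves
root@e450:~# fmdump -eV | more
# anything FMA has actually diagnosed as faulted
root@e450:~# fmadm faulty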

Andy

---------------------------------
Andy Thomas,
Time Domain Systems

Tel: +44 (0)7866 556626
Fax: +44 (0)20 8372 2582
http://www.time-domain.co.uk