An update:

Well, things didn't quite turn out as expected.
I decided to follow the path right to the disks for clues.
Digging into the adapter diagnostics with LSIUTIL revealed an adapter link
issue:

Adapter Phy 5:  Link Down
  Invalid DWord Count                                   5,969,575
  Running Disparity Error Count                         5,782,581
  Loss of DWord Synch Count                                     0
  Phy Reset Problem Count                                       0

After replacing cables, I eventually replaced the controller, and then things
really went pear-shaped.
It turns out the backplane, which ran without major issues on the Supermicro
controller, refused to operate with the LSI SAS3081E-R (with latest code): the
card wouldn't initialise, links only ran at 1.5Gb/s, most disks were offline,
etc.
Replacing the backplane (the whole JBOD) fixed the adapter link problems, but
timeouts still occur when scrubbing.
Oh look, the device names moved. They used to start at c4t8d0, but it has
"made it right" all by itself. EYHOBG!

 iostat -X -e -n
s/w h/w trn tot device
  0   0   0   0 c4t0d0
  0   0   0   0 c4t1d0
  0   2   8  10 c4t2d0
  0   3  18  21 c4t3d0
  0   0   0   0 c4t4d0
  0   2  12  14 c4t5d0
  0   1   8   9 c4t6d0
  0   2  15  17 c4t7d0
  0   0   0   0 c4t8d0
  0   0   0   0 c4t9d0
  0   0   0   0 c4t10d0
  0   0   0   0 c4t11d0
  0   0   0   0 c4t12d0
  0   0   0   0 c4t13d0
  0  11  84  95 c4t41d0
  0   8  62  70 c4t42d0
  0  10  72  82 c4t43d0
  0  19 147 166 c4t44d0
  0  12 102 114 c4t45d0
  0  19 145 164 c4t46d0
  0  13 108 121 c4t47d0
  0   7  62  69 c4t48d0
  0  14 113 127 c4t49d0
  0  11  96 107 c4t50d0
  0  11  91 102 c4t51d0
  0   8  64  72 c4t52d0
  0  13 108 121 c4t53d0
  0  11 106 117 c4t54d0
  0  10  82  92 c4t55d0
  0  10  88  98 c4t56d0
  0  12  85  97 c4t57d0
  0   6  38  44 c4t58d0
and zpool status shows:
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
c4t2d0   ONLINE       0     0     1  25.5K repaired
c4t55d0  ONLINE       0     0     4  102K repaired
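
For anyone wanting to rank the disks by error count, a small Python sketch
along these lines could parse the iostat error table above. The function names
are made up for illustration, and the sample holds only a few rows copied from
the table, so treat it as a starting point rather than a finished tool.

```python
# Sample rows copied from the iostat error output above (not the full table).
SAMPLE = """\
s/w h/w trn tot device
  0   0   0   0 c4t0d0
  0   2   8  10 c4t2d0
  0  19 147 166 c4t44d0
  0  10  82  92 c4t55d0
"""


def parse_iostat_errors(text):
    """Return a list of (device, soft, hard, transport, total) tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and any line that is not four counters plus a device.
        if len(fields) != 5 or not all(f.isdigit() for f in fields[:4]):
            continue
        soft, hard, trn, tot = (int(f) for f in fields[:4])
        rows.append((fields[4], soft, hard, trn, tot))
    return rows


def devices_with_errors(rows):
    """Return only the devices with a non-zero total, worst first."""
    bad = [r for r in rows if r[4] > 0]
    return sorted(bad, key=lambda r: r[4], reverse=True)


if __name__ == "__main__":
    for dev, soft, hard, trn, tot in devices_with_errors(parse_iostat_errors(SAMPLE)):
        print(f"{dev:10s} soft={soft} hard={hard} transport={trn} total={tot}")
```

Feeding it the full table would put the c4t41d0 to c4t58d0 range at the top,
since transport errors dominate the totals there.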

I do note that after these errors, nothing shows up in the LSI adapter
diagnostic logs.

Data disks are all new WD10EARS.

If the OpenSolaris and ZFS combination wasn't so robust, this would have ended 
badly.

The next step will be to try different timeout settings on the controller to
see if that helps.

P.S. I have a client with a "suspect", nearly full 20TB zpool to scrub, so
this is a big issue for me. A resilver of a 1TB disk takes up to 40 hours, so
I expect a scrub to take a week (or two), and at present it would probably
result in multiple disk failures.

Mark.
-- 
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
