[zfs-discuss] Apparent SAS HBA failure-- now what?

2010-11-06 Thread Dave Pooser
My setup: A SuperMicro 24-drive chassis with Intel dual-processor
motherboard, three LSI SAS3081E controllers, and 24 SATA 2TB hard drives,
divided into three pools with each pool a single eight-disk RAID-Z2. (Boot
is an SSD connected to motherboard SATA.)

This morning I got a cheerful email from my monitoring script: "Zchecker has
discovered a problem on bigdawg." The full output is below, but I have one
unavailable pool and two degraded pools, with all my problem disks connected
to controller c10. I have multiple spare controllers available.

First question-- is there an easy way to identify which controller is c10?
Second question-- What is the best way to handle replacement (of either the
bad controller or of all three controllers if I can't identify the bad
controller)? I was thinking that I should be able to shut the server down,
remove the controller(s), install the replacement controller(s), check to
see that all the drives are visible, run zpool clear for each pool and then
do another scrub to verify the problem has been resolved. Does that sound
like a good plan?
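
In command terms, the sequence I have in mind after the swap is roughly this (just a
sketch -- the pool names are mine from the status output below, and the format/cfgadm
step is only there to confirm the disks reappeared):

# format < /dev/null     # quick listing of visible disks (or: cfgadm -al)
# zpool clear uberdisk1  # reset the error counters on each pool
# zpool clear uberdisk2
# zpool clear uberdisk3
# zpool scrub uberdisk1  # repeat for uberdisk2 and uberdisk3
# zpool status -x        # should eventually come back 'all pools are healthy'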

===
pool: uberdisk1
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: scrub in progress for 3h7m, 24.08% done, 9h52m to go
config:

        NAME         STATE     READ WRITE CKSUM
        uberdisk1    UNAVAIL     55     0     0  insufficient replicas
          raidz2     UNAVAIL    112     0     0  insufficient replicas
            c9t0d0   ONLINE       0     0     0
            c9t1d0   ONLINE       0     0     0
            c9t2d0   ONLINE       0     0     0
            c10t0d0  UNAVAIL   4330 0            experienced I/O failures
            c10t1d0  REMOVED      0     0     0
            c10t2d0  ONLINE      74     0     0
            c11t1d0  ONLINE       0     0     0
            c11t2d0  ONLINE       0     0     0

errors: 1 data errors, use '-v' for a list

  pool: uberdisk2
 state: DEGRADED
 scrub: scrub in progress for 3h3m, 32.26% done, 6h24m to go
config:

        NAME         STATE     READ WRITE CKSUM
        uberdisk2    DEGRADED     0     0     0
          raidz2     DEGRADED     0     0     0
            c9t3d0   ONLINE       0     0     0
            c9t4d0   ONLINE       0     0     0
            c9t5d0   ONLINE       0     0     0
            c10t3d0  REMOVED      0     0     0
            c10t4d0  REMOVED      0     0     0
            c11t3d0  ONLINE       0     0     0
            c11t4d0  ONLINE       0     0     0
            c11t5d0  ONLINE       0     0     0

errors: No known data errors

  pool: uberdisk3
 state: DEGRADED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: scrub in progress for 2h58m, 31.95% done, 6h19m to go
config:

        NAME         STATE     READ WRITE CKSUM
        uberdisk3    DEGRADED     1     0     0
          raidz2     DEGRADED     4     0     0
            c9t6d0   ONLINE       0     0     0
            c9t7d0   ONLINE       0     0     0
            c10t5d0  ONLINE       5     0     0
            c10t6d0  ONLINE    9894 0
            c10t7d0  REMOVED      0     0     0
            c11t6d0  ONLINE       0     0     0
            c11t7d0  ONLINE       0     0     0
            c11t8d0  ONLINE       0     0     0

errors: 1 data errors, use '-v' for a list

-- 
Dave Pooser, ACSA
Manager of Information Services
Alford Media  http://www.alfordmedia.com


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Apparent SAS HBA failure-- now what?

2010-11-06 Thread Khushil Dep
Can you send the output of iostat -xCzn as well as fmadm faulty, please? Is this
an E2 chassis? Are you using interposers?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Apparent SAS HBA failure-- now what?

2010-11-06 Thread Dave Pooser
On 11/6/10 Nov 6, 1:35 PM, "Khushil Dep"  wrote:

> Is this  an E2 chassis? Are you using interposers?

No, it's an SC846A chassis. There are no interposers or expanders; six
SFF-8087 "iPass" cables go from ports on the HBA to ports on the backplane.

> Can you send output of iostat -xCzn as well as fmadm faulty please?

(please pardon my line wrap)


# iostat -xCzn
                    extended device statistics
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
  255.0   15.9 20667.5  1424.4  0.0  3.0    0.0   11.2   0  35 c9
   34.4    2.3  2837.7   198.5  0.0  0.4    0.0   11.1   0   5 c9t0d0
   34.3    2.3  2837.6   198.5  0.0  0.4    0.0   11.3   0   5 c9t1d0
   34.4    2.3  2837.7   198.5  0.0  0.4    0.0   11.1   0   5 c9t2d0
   35.9    1.9  2918.2   162.1  0.0  0.4    0.0   11.9   0   5 c9t3d0
   35.8    1.9  2918.3   162.1  0.0  0.5    0.0   12.1   0   5 c9t4d0
   35.8    1.9  2918.2   162.1  0.0  0.5    0.0   11.9   0   5 c9t5d0
   22.2    1.7  1703.0   171.3  0.0  0.2    0.0    9.5   0   3 c9t6d0
   22.1    1.7  1696.8   171.2  0.0  0.2    0.0    9.5   0   3 c9t7d0
  239.2   15.8 19217.1  1433.5  0.0  2.8    0.0   10.8   0  32 c10
   34.6    2.3  2837.8   198.5  0.0  0.4    0.0   10.9   0   5 c10t0d0
   34.5    2.3  2837.7   198.5  0.0  0.4    0.0   11.0   0   5 c10t1d0
   34.4    2.3  2837.6   198.5  0.0  0.4    0.0   11.3   0   5 c10t2d0
   34.5    1.9  2800.5   162.1  0.0  0.4    0.0   12.0   0   5 c10t3d0
   34.5    1.9  2800.4   162.1  0.0  0.4    0.0   12.0   0   5 c10t4d0
   22.2    1.7  1703.1   171.3  0.0  0.2    0.0    9.5   0   3 c10t5d0
   22.2    1.7  1697.0   171.2  0.0  0.2    0.0    9.3   0   3 c10t6d0
   22.3    1.7  1703.1   171.3  0.0  0.2    0.0    9.2   0   3 c10t7d0
  243.5   15.5 19527.7  1397.1  0.0  2.8    0.0   10.9   0  32 c11
   34.5    2.3  2837.8   198.5  0.0  0.4    0.0   11.1   0   5 c11t1d0
   34.5    2.3  2837.9   198.5  0.0  0.4    0.0   11.0   0   5 c11t2d0
   35.8    1.9  2918.3   162.1  0.0  0.5    0.0   12.1   0   5 c11t3d0
   35.9    1.9  2918.2   162.1  0.0  0.5    0.0   11.9   0   5 c11t4d0
   36.2    1.9  2918.5   162.1  0.0  0.4    0.0   11.2   0   5 c11t5d0
   22.1    1.7  1696.8   171.2  0.0  0.2    0.0    9.5   0   3 c11t6d0
   22.2    1.7  1703.1   171.3  0.0  0.2    0.0    9.5   0   3 c11t7d0
   22.3    1.7  1697.1   171.2  0.0  0.2    0.0    9.2   0   3 c11t8d0
    0.0    0.0     1.0     0.3  0.0  0.0    0.5    1.4   0   0 c8d0


# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Nov 06 06:33:53 89ea2588-6dd8-4d72-e3fd-c2a4c4a8dda2  ZFS-8000-FD    Major

Fault class : fault.fs.zfs.vdev.io
Affects     : zfs://pool=uberdisk3/vdev=6cdf461a5ecbe703
                  faulted but still in service
Problem in  : zfs://pool=uberdisk3/vdev=6cdf461a5ecbe703
                  faulty

Description : The number of I/O errors associated with a ZFS device exceeded
              acceptable levels.  Refer to http://sun.com/msg/ZFS-8000-FD
              for more information.

Response    : The device has been offlined and marked as faulted.  An attempt
              will be made to activate a hot spare if available.

Impact      : Fault tolerance of the pool may be compromised.

Action      : Run 'zpool status -x' and replace the bad device.

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Nov 06 06:33:25 6ff5d64e-cf64-c2e3-864f-cc59c267c0e8  ZFS-8000-FD    Major

Fault class : fault.fs.zfs.vdev.io
Affects     : zfs://pool=uberdisk1/vdev=655593d0bc77a83d
                  faulted but still in service
Problem in  : zfs://pool=uberdisk1/vdev=655593d0bc77a83d
                  faulty

Description : The number of I/O errors associated with a ZFS device exceeded
              acceptable levels.  Refer to http://sun.com/msg/ZFS-8000-FD
              for more information.

Response    : The device has been offlined and marked as faulted.  An attempt
              will be made to activate a hot spare if available.

Impact      : Fault tolerance of the pool may be compromised.

Action      : Run 'zpool status -x' and replace the bad device.

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Nov 06 06:33:20 2c0236bb-53e2-e271-d6af-a21c2f0976aa  ZFS-8000-FD    Major

Fault class : fault.fs.zfs.vdev.io
Affects     : zfs://pool=uberdisk1/vdev=3b0c0e48668e3bf2
                  faulted and taken out of service
Problem in  : zfs://pool=uberdisk1/vdev=3b0c0e48668e3bf2
                  faulty

Description : The number of I/O errors associated with a 

Re: [zfs-discuss] Apparent SAS HBA failure-- now what?

2010-11-06 Thread Khushil Dep
Sorry, I meant iostat -En; I'm looking for errors.


Re: [zfs-discuss] Apparent SAS HBA failure-- now what?

2010-11-06 Thread Dave Pooser
On 11/6/10 Nov 6, 2:21 PM, "Khushil Dep"  wrote:

> Sorry I meant iostat -En I'm looking for errors

#  iostat -En
c8d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: IMATION-MAC25-0 Revision:  Serial No: 87A0079B1808000 Size: 63.89GB
<63887523840 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 
c9t0d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c9t1d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c9t2d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c9t3d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c9t4d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c9t5d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c9t6d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c9t7d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c11t1d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c11t2d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c11t3d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c11t4d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c11t5d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c11t6d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c11t7d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c11t8d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: Hitachi HDS72202 Revision: A20N Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c10t0d0  Soft Errors: 0 Hard Errors

Re: [zfs-discuss] Apparent SAS HBA failure-- now what?

2010-11-06 Thread Khushil Dep
Similar to what I've seen before: SATA disks in an 846 chassis with hardware
and transport errors, though on that occasion it was an E2 chassis with
interposers. How long has this system been up? Is it production, or can you
take it offline and check that the firmware on all the LSI controllers is up
to date and matching across cards?

Do an 'fmdump -V -u UUID' on those faults and get the serial numbers of the
disks that have failed. Trial and error unless you wrote down which went
where, I'm afraid.
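
For example, roughly this (a sketch only -- that UUID is the first one from your
fmadm output, and exactly which verbose payload field carries the serial can vary
by release; the devid string usually embeds the drive model and serial number):

# fmdump -V -u 89ea2588-6dd8-4d72-e3fd-c2a4c4a8dda2 | egrep -i 'vdev|devid'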

If Hitachi provides a tool like SeaTools from Seagate, run it against a disk
and see if it's really faulty or if the HBA it was connected to is on the
blink.

Restore from backup might be inevitable, unless you're snapshotting and
auto-syncing to another system?
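
(By which I mean something along these lines -- a sketch only; the dataset name
'data', the host 'backuphost' and the pool 'backup' are made up here:)

# zfs snapshot uberdisk1/data@2010-11-06          # periodic snapshot, e.g. from cron
# zfs send -i uberdisk1/data@2010-11-05 uberdisk1/data@2010-11-06 | \
      ssh backuphost zfs recv -F backup/uberdisk1 # incremental sync to another box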


Re: [zfs-discuss] Apparent SAS HBA failure-- now what?

2010-11-06 Thread Dave Pooser
On 11/6/10 Nov 6, 2:35 PM, "Khushil Dep"  wrote:

> Similar to what I've seen before, SATA disks in a 846 chassis with hardware
> and transport errors. Though in that occasion it was an E2 chassis with
> interposers. How long has this system been up? Is it production or can you
> offline and check all firmware on lsi controllers are up to date and match
> each other? 

It's been up for about 6 months. I can offline them.

> Do and fmdump -u UUID - V on those faults and get the serial numbers of disks
> that have failed. Trial and error unless you wrote down which went where I'm
> afraid. 

Here's the thing, though-- I'm really not at all sure it's the disks that
failed. The idea that coincidentally I'm going to have had eight of 24 disks
report major errors, all at the same time (because I scrub weekly and didn't
catch any errors last scrub), all on the same controller-- well, that seems
much less likely than the idea that I just have a bad controller that needs
replacing.
-- 
Dave Pooser, ACSA
Manager of Information Services
Alford Media  http://www.alfordmedia.com


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Apparent SAS HBA failure-- now what?

2010-11-06 Thread Khushil Dep
The fmdump will let you get the serial of one disk and identify the controller
it's on, so you can swap it out and check.
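
Roughly like this (a sketch -- that UUID is the third fault from your fmadm output,
the grep just pulls out the vdev/devid lines, and the symlink target of /dev/cfg/c10
is the PCI device path you then match against a physical slot):

# fmdump -V -u 2c0236bb-53e2-e271-d6af-a21c2f0976aa | egrep -i 'vdev|devid'
# ls -l /dev/cfg/c10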

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [OpenIndiana-discuss] format dumps the core

2010-11-06 Thread Jürgen Keil
> r...@tos-backup:~# pstack /dev/rdsk/core
> core '/dev/rdsk/core' of 1217:  format
> fee62e4a UDiv (4, 0, 8046c80, 80469a0, 8046a30,  8046a50) + 2a
> 08079799 auto_sense (4, 0, 8046c80, 0) + 281
> ...

It seems that one function call is missing in the backtrace between
auto_sense and UDiv, because UDiv does not set up a complete stack frame.

Looking at the source ...
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/format/auto_sense.c#819
... you can get some extra debug output
from format when you specify the "-M" option.

E.g. with a USB flash memory stick and format -eM I get:

# format -eM
Searching for disks...
c11t0d0: attempting auto configuration
Inquiry:
00 80 02 02 1f 00 00 00 53 61 6e 44 69 73 6b 20 SanDisk 
55 33 20 43 6f 6e 74 6f 75 72 20 20 20 20 20 20 U3 Contour  
34 2e 304.0
Product id: U3 Contour  
Capacity: 00 7a 46 90 00 00 02 00 
blocks:  8013456 (0x7a4690)
blksize: 512
disk name:  `r  `
Request sense for command mode sense failed
Sense data:
f0 00 05 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 
Mode sense page 0x3 failed
Request sense for command mode sense failed
Sense data:
f0 00 05 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 
Mode sense page 0x4 failed
Geometry:
pcyl:1956
ncyl:1954
heads:   128
nsects:  32
acyl:2
bcyl:0
rpm: 0
nblocks: 8013457
The current rpm value 0 is invalid, adjusting it to 3600

Geometry after adjusting for capacity:
pcyl:1956
ncyl:1954
heads:   128
nsects:  32
acyl:2
rpm: 3600

Partition 0:   128.00MB   64 cylinders
Partition 1:   128.00MB   64 cylinders
Partition 2: 3.82GB 1956 cylinders
Partition 6: 3.56GB 1825 cylinders
Partition 8: 2.00MB1 cylinders

Inquiry:
00 00 03 02 1f 00 00 02 41 54 41 20 20 20 20 20 ATA 
48 69 74 61 63 68 69 20 48 54 53 37 32 33 32 33 Hitachi HTS72323
43 33 30C30
done

c11t0d0: configured with capacity of 3.82GB
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Apparent SAS HBA failure-- now what?

2010-11-06 Thread McBofh

On  7/11/10 04:27 AM, Dave Pooser wrote:

First question-- is there an easy way to identify which controller is c10?


ls -alrt /dev/cfg/c10

will show you the physical path, which you can then follow


$ ls -lart /dev/cfg/c3
   1 lrwxrwxrwx   1 root root  55 Nov 12  2009 /dev/cfg/c3 -> ../../devices/p...@0,0/pci10de,3...@a/pci1000,3...@0:scsi


you can also make use of fmtopo -V:

# /usr/lib/fm/fmd/fmtopo -V

...

hc://:product-id=Sun-Ultra-40-M2-Workstation:server-id=blinder:chassis-id=0802FMY00N/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=0
  group: protocol   version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=Sun-Ultra-40-M2-Workstation:server-id=blinder:chassis-id=0802FMY00N/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=0
    label             string    PCIE0 Slot
    FRU               fmri      hc://:product-id=Sun-Ultra-40-M2-Workstation:server-id=blinder:chassis-id=0802FMY00N/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0
    ASRU              fmri      dev:p...@0,0/pci10de,3...@a/pci1000,3...@0
  group: authority  version: 1   stability: Private/Private
    product-id        string    Sun-Ultra-40-M2-Workstation
    chassis-id        string    0802FMY00N
    server-id         string    blinder
  group: io         version: 1   stability: Private/Private
    dev               string    /p...@0,0/pci10de,3...@a/pci1000,3...@0
    driver            string    mpt
    module            fmri      mod:///mod-name=mpt/mod-id=57
  group: pci        version: 1   stability: Private/Private
    device-id         string    58
    extended-capabilities  string  pciexdev
    class-code        string    1
    vendor-id         string    1000
    assigned-addresses  uint32[]  [ 2164391952 0 16384 0 256 2197946388 0 2686517248 0 16384 2197946396 0 2686451712 0 65536 ]


note the "label" and "FRU" properties in the protocol group.


McB


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss