[zfs-discuss] Apparent SAS HBA failure-- now what?
My setup: A SuperMicro 24-drive chassis with an Intel dual-processor motherboard, three LSI SAS3081E controllers, and 24 SATA 2TB hard drives, divided into three pools with each pool a single eight-disk RAID-Z2. (Boot is an SSD connected to motherboard SATA.)

This morning I got a cheerful email from my monitoring script: "Zchecker has discovered a problem on bigdawg." The full output is below, but I have one unavailable pool and two degraded pools, with all my problem disks connected to controller c10. I have multiple spare controllers available.

First question-- is there an easy way to identify which controller is c10?

Second question-- what is the best way to handle replacement (of either the bad controller, or of all three controllers if I can't identify the bad one)? I was thinking that I should be able to shut the server down, remove the controller(s), install the replacement controller(s), check that all the drives are visible, run zpool clear for each pool, and then do another scrub to verify the problem has been resolved. Does that sound like a good plan?

===

  pool: uberdisk1
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: scrub in progress for 3h7m, 24.08% done, 9h52m to go
config:

        NAME         STATE     READ WRITE CKSUM
        uberdisk1    UNAVAIL     55     0     0  insufficient replicas
          raidz2     UNAVAIL    112     0     0  insufficient replicas
            c9t0d0   ONLINE       0     0     0
            c9t1d0   ONLINE       0     0     0
            c9t2d0   ONLINE       0     0     0
            c10t0d0  UNAVAIL   4330     0        experienced I/O failures
            c10t1d0  REMOVED      0     0     0
            c10t2d0  ONLINE      74     0     0
            c11t1d0  ONLINE       0     0     0
            c11t2d0  ONLINE       0     0     0

errors: 1 data errors, use '-v' for a list

  pool: uberdisk2
 state: DEGRADED
 scrub: scrub in progress for 3h3m, 32.26% done, 6h24m to go
config:

        NAME         STATE     READ WRITE CKSUM
        uberdisk2    DEGRADED     0     0     0
          raidz2     DEGRADED     0     0     0
            c9t3d0   ONLINE       0     0     0
            c9t4d0   ONLINE       0     0     0
            c9t5d0   ONLINE       0     0     0
            c10t3d0  REMOVED      0     0     0
            c10t4d0  REMOVED      0     0     0
            c11t3d0  ONLINE       0     0     0
            c11t4d0  ONLINE       0     0     0
            c11t5d0  ONLINE       0     0     0

errors: No known data errors

  pool: uberdisk3
 state: DEGRADED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: scrub in progress for 2h58m, 31.95% done, 6h19m to go
config:

        NAME         STATE     READ WRITE CKSUM
        uberdisk3    DEGRADED     1     0     0
          raidz2     DEGRADED     4     0     0
            c9t6d0   ONLINE       0     0     0
            c9t7d0   ONLINE       0     0     0
            c10t5d0  ONLINE       5     0     0
            c10t6d0  ONLINE    9894     0
            c10t7d0  REMOVED      0     0     0
            c11t6d0  ONLINE       0     0     0
            c11t7d0  ONLINE       0     0     0
            c11t8d0  ONLINE       0     0     0

errors: 1 data errors, use '-v' for a list
--
Dave Pooser, ACSA
Manager of Information Services
Alford Media  http://www.alfordmedia.com
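P.S. For reference, the clear-and-scrub sequence I had in mind after the swap is something like the following (pool names as in the output above; I'd repeat the clear/scrub for uberdisk2 and uberdisk3):

# zpool status -x                 <- confirm every drive is visible again after the controller swap
# zpool clear uberdisk1           <- clear the accumulated error counts and faulted states
# zpool scrub uberdisk1           <- re-scrub, then watch 'zpool status' until it finishes clean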
Re: [zfs-discuss] Apparent SAS HBA failure-- now what?
Can you send the output of iostat -xCzn, as well as fmadm faulty, please? Is this an E2 chassis? Are you using interposers?

On 6 Nov 2010 18:28, "Dave Pooser" wrote:
> My setup: A SuperMicro 24-drive chassis with an Intel dual-processor motherboard, three LSI
> SAS3081E controllers, and 24 SATA 2TB hard drives... First question-- is there an easy way to
> identify which controller is c10? Second question-- what is the best way to handle replacement?
> ...
Re: [zfs-discuss] Apparent SAS HBA failure-- now what?
On 11/6/10 Nov 6, 1:35 PM, "Khushil Dep" wrote:
> Is this an E2 chassis? Are you using interposers?

No, it's an SC846A chassis. There are no interposers or expanders; six SFF-8087 "iPass" cables go from ports on the HBAs to ports on the backplane.

> Can you send output of iostat -xCzn as well as fmadm faulty please?

(please pardon my line wrap)

# iostat -xCzn
                      extended device statistics
    r/s    w/s     kr/s    kw/s  wait  actv wsvc_t asvc_t  %w  %b device
  255.0   15.9  20667.5  1424.4   0.0   3.0    0.0   11.2   0  35 c9
   34.4    2.3   2837.7   198.5   0.0   0.4    0.0   11.1   0   5 c9t0d0
   34.3    2.3   2837.6   198.5   0.0   0.4    0.0   11.3   0   5 c9t1d0
   34.4    2.3   2837.7   198.5   0.0   0.4    0.0   11.1   0   5 c9t2d0
   35.9    1.9   2918.2   162.1   0.0   0.4    0.0   11.9   0   5 c9t3d0
   35.8    1.9   2918.3   162.1   0.0   0.5    0.0   12.1   0   5 c9t4d0
   35.8    1.9   2918.2   162.1   0.0   0.5    0.0   11.9   0   5 c9t5d0
   22.2    1.7   1703.0   171.3   0.0   0.2    0.0    9.5   0   3 c9t6d0
   22.1    1.7   1696.8   171.2   0.0   0.2    0.0    9.5   0   3 c9t7d0
  239.2   15.8  19217.1  1433.5   0.0   2.8    0.0   10.8   0  32 c10
   34.6    2.3   2837.8   198.5   0.0   0.4    0.0   10.9   0   5 c10t0d0
   34.5    2.3   2837.7   198.5   0.0   0.4    0.0   11.0   0   5 c10t1d0
   34.4    2.3   2837.6   198.5   0.0   0.4    0.0   11.3   0   5 c10t2d0
   34.5    1.9   2800.5   162.1   0.0   0.4    0.0   12.0   0   5 c10t3d0
   34.5    1.9   2800.4   162.1   0.0   0.4    0.0   12.0   0   5 c10t4d0
   22.2    1.7   1703.1   171.3   0.0   0.2    0.0    9.5   0   3 c10t5d0
   22.2    1.7   1697.0   171.2   0.0   0.2    0.0    9.3   0   3 c10t6d0
   22.3    1.7   1703.1   171.3   0.0   0.2    0.0    9.2   0   3 c10t7d0
  243.5   15.5  19527.7  1397.1   0.0   2.8    0.0   10.9   0  32 c11
   34.5    2.3   2837.8   198.5   0.0   0.4    0.0   11.1   0   5 c11t1d0
   34.5    2.3   2837.9   198.5   0.0   0.4    0.0   11.0   0   5 c11t2d0
   35.8    1.9   2918.3   162.1   0.0   0.5    0.0   12.1   0   5 c11t3d0
   35.9    1.9   2918.2   162.1   0.0   0.5    0.0   11.9   0   5 c11t4d0
   36.2    1.9   2918.5   162.1   0.0   0.4    0.0   11.2   0   5 c11t5d0
   22.1    1.7   1696.8   171.2   0.0   0.2    0.0    9.5   0   3 c11t6d0
   22.2    1.7   1703.1   171.3   0.0   0.2    0.0    9.5   0   3 c11t7d0
   22.3    1.7   1697.1   171.2   0.0   0.2    0.0    9.2   0   3 c11t8d0
    0.0    0.0      1.0     0.3   0.0   0.0    0.5    1.4   0   0 c8d0

# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Nov 06 06:33:53 89ea2588-6dd8-4d72-e3fd-c2a4c4a8dda2  ZFS-8000-FD    Major

Fault class : fault.fs.zfs.vdev.io
Affects     : zfs://pool=uberdisk3/vdev=6cdf461a5ecbe703
                  faulted but still in service
Problem in  : zfs://pool=uberdisk3/vdev=6cdf461a5ecbe703
                  faulty
Description : The number of I/O errors associated with a ZFS device exceeded
              acceptable levels. Refer to http://sun.com/msg/ZFS-8000-FD for
              more information.
Response    : The device has been offlined and marked as faulted. An attempt
              will be made to activate a hot spare if available.
Impact      : Fault tolerance of the pool may be compromised.
Action      : Run 'zpool status -x' and replace the bad device.

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Nov 06 06:33:25 6ff5d64e-cf64-c2e3-864f-cc59c267c0e8  ZFS-8000-FD    Major

Fault class : fault.fs.zfs.vdev.io
Affects     : zfs://pool=uberdisk1/vdev=655593d0bc77a83d
                  faulted but still in service
Problem in  : zfs://pool=uberdisk1/vdev=655593d0bc77a83d
                  faulty
Description : The number of I/O errors associated with a ZFS device exceeded
              acceptable levels. Refer to http://sun.com/msg/ZFS-8000-FD for
              more information.
Response    : The device has been offlined and marked as faulted. An attempt
              will be made to activate a hot spare if available.
Impact      : Fault tolerance of the pool may be compromised.
Action      : Run 'zpool status -x' and replace the bad device.

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Nov 06 06:33:20 2c0236bb-53e2-e271-d6af-a21c2f0976aa  ZFS-8000-FD    Major

Fault class : fault.fs.zfs.vdev.io
Affects     : zfs://pool=uberdisk1/vdev=3b0c0e48668e3bf2
                  faulted and taken out of service
Problem in  : zfs://pool=uberdisk1/vdev=3b0c0e48668e3bf2
                  faulty
Description : The number of I/O errors associated with a ZFS device exceeded
              acceptable levels. Refer to http://sun.com/msg/ZFS-8000-FD for
              more information.
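(Side note for my own records: to tie those vdev= GUIDs in the fmadm output back to actual device names, I believe something like the following should work, assuming zdb -C still dumps the cached vdev tree with a guid and path line for each child disk-- a rough sketch, not verified:)

# zdb -C uberdisk1 | egrep 'guid|path'
# zdb -C uberdisk3 | egrep 'guid|path'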
Re: [zfs-discuss] Apparent SAS HBA failure-- now what?
Sorry, I meant iostat -En -- I'm looking for errors.

On 6 Nov 2010 18:56, "Dave Pooser" wrote:
> No, it's an SC846A chassis. There are no interposers or expanders; six SFF-8087 "iPass" cables
> go from ports on the HBAs to ports on the backplane.
> ...
Re: [zfs-discuss] Apparent SAS HBA failure-- now what?
On 11/6/10 Nov 6, 2:21 PM, "Khushil Dep" wrote:
> Sorry, I meant iostat -En -- I'm looking for errors.

# iostat -En
c8d0     Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
         Model: IMATION-MAC25-0  Revision:  Serial No: 87A0079B1808000  Size: 63.89GB <63887523840 bytes>
         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 Illegal Request: 0
c9t0d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
         Vendor: ATA  Product: Hitachi HDS72202  Revision: A20N  Serial No:  Size: 2000.40GB <2000398934016 bytes>
         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 Illegal Request: 0 Predictive Failure Analysis: 0
c9t1d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
         Vendor: ATA  Product: Hitachi HDS72202  Revision: A20N  Serial No:  Size: 2000.40GB <2000398934016 bytes>
         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 Illegal Request: 0 Predictive Failure Analysis: 0
c9t2d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
         Vendor: ATA  Product: Hitachi HDS72202  Revision: A20N  Serial No:  Size: 2000.40GB <2000398934016 bytes>
         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 Illegal Request: 0 Predictive Failure Analysis: 0
c9t3d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
         Vendor: ATA  Product: Hitachi HDS72202  Revision: A20N  Serial No:  Size: 2000.40GB <2000398934016 bytes>
         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 Illegal Request: 0 Predictive Failure Analysis: 0
c9t4d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
         Vendor: ATA  Product: Hitachi HDS72202  Revision: A20N  Serial No:  Size: 2000.40GB <2000398934016 bytes>
         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 Illegal Request: 0 Predictive Failure Analysis: 0
c9t5d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
         Vendor: ATA  Product: Hitachi HDS72202  Revision: A20N  Serial No:  Size: 2000.40GB <2000398934016 bytes>
         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 Illegal Request: 0 Predictive Failure Analysis: 0
c9t6d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
         Vendor: ATA  Product: Hitachi HDS72202  Revision: A20N  Serial No:  Size: 2000.40GB <2000398934016 bytes>
         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 Illegal Request: 0 Predictive Failure Analysis: 0
c9t7d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
         Vendor: ATA  Product: Hitachi HDS72202  Revision: A20N  Serial No:  Size: 2000.40GB <2000398934016 bytes>
         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 Illegal Request: 0 Predictive Failure Analysis: 0
c11t1d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
         Vendor: ATA  Product: Hitachi HDS72202  Revision: A20N  Serial No:  Size: 2000.40GB <2000398934016 bytes>
         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 Illegal Request: 0 Predictive Failure Analysis: 0
c11t2d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
         Vendor: ATA  Product: Hitachi HDS72202  Revision: A20N  Serial No:  Size: 2000.40GB <2000398934016 bytes>
         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 Illegal Request: 0 Predictive Failure Analysis: 0
c11t3d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
         Vendor: ATA  Product: Hitachi HDS72202  Revision: A20N  Serial No:  Size: 2000.40GB <2000398934016 bytes>
         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 Illegal Request: 0 Predictive Failure Analysis: 0
c11t4d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
         Vendor: ATA  Product: Hitachi HDS72202  Revision: A20N  Serial No:  Size: 2000.40GB <2000398934016 bytes>
         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 Illegal Request: 0 Predictive Failure Analysis: 0
c11t5d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
         Vendor: ATA  Product: Hitachi HDS72202  Revision: A20N  Serial No:  Size: 2000.40GB <2000398934016 bytes>
         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 Illegal Request: 0 Predictive Failure Analysis: 0
c11t6d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
         Vendor: ATA  Product: Hitachi HDS72202  Revision: A20N  Serial No:  Size: 2000.40GB <2000398934016 bytes>
         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 Illegal Request: 0 Predictive Failure Analysis: 0
c11t7d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
         Vendor: ATA  Product: Hitachi HDS72202  Revision: A20N  Serial No:  Size: 2000.40GB <2000398934016 bytes>
         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 Illegal Request: 0 Predictive Failure Analysis: 0
c11t8d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
         Vendor: ATA  Product: Hitachi HDS72202  Revision: A20N  Serial No:  Size: 2000.40GB <2000398934016 bytes>
         Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 Illegal Request: 0 Predictive Failure Analysis: 0
c10t0d0  Soft Errors: 0 Hard Errors
Re: [zfs-discuss] Apparent SAS HBA failure-- now what?
Similar to what I've seen before: SATA disks in an 846 chassis with hardware and transport errors, though on that occasion it was an E2 chassis with interposers.

How long has this system been up? Is it production, or can you take it offline and check that the firmware on all the LSI controllers is up to date and matches across the cards?

Do an fmdump -u UUID -V on those faults and get the serial numbers of the disks that have failed. It's trial and error unless you wrote down which disk went where, I'm afraid. If Hitachi provide a tool like Seagate's SeaTools, run it against a disk and see if it's really faulty or if the HBA it was connected to is on the blink.

Restoring from backup might be inevitable, unless you're snapshotting and auto-syncing to another system?

On 6 Nov 2010 19:25, "Dave Pooser" wrote:
> # iostat -En
> c8d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 ...
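Something along these lines should pull the full detail for each fault (the UUIDs come straight from your fmadm faulty output; the field names in the verbose output vary a bit between builds, so treat this as a rough sketch):

# fmdump -V -u 89ea2588-6dd8-4d72-e3fd-c2a4c4a8dda2
# fmdump -eV | tail -100          <- the error log (-e) usually carries the full device path as well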
Re: [zfs-discuss] Apparent SAS HBA failure-- now what?
On 11/6/10 Nov 6, 2:35 PM, "Khushil Dep" wrote:
> Similar to what I've seen before: SATA disks in an 846 chassis with hardware and transport
> errors, though on that occasion it was an E2 chassis with interposers. How long has this system
> been up? Is it production, or can you take it offline and check that the firmware on all the LSI
> controllers is up to date and matches across the cards?

It's been up for about six months. I can offline them.

> Do an fmdump -u UUID -V on those faults and get the serial numbers of the disks that have
> failed. It's trial and error unless you wrote down which disk went where, I'm afraid.

Here's the thing, though-- I'm really not at all sure it's the disks that failed. The idea that, purely by coincidence, eight of my 24 disks reported major errors at the same time (I scrub weekly and caught no errors on the last scrub), and all on the same controller-- well, that seems much less likely than the idea that I simply have a bad controller that needs replacing.
--
Dave Pooser, ACSA
Manager of Information Services
Alford Media  http://www.alfordmedia.com
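P.S. I will check the controller firmware before I swap anything. I believe the mpt driver logs the HBA firmware revision when it attaches each card, so something along these lines ought to show whether the three cards match-- the grep pattern is a guess on my part, adjust as needed:

# grep -i mpt /var/adm/messages* | grep -i firmware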
Re: [zfs-discuss] Apparent SAS HBA failure-- now what?
The fmdump will let you get the serial of one disk and identify the controller it's on, so you can swap that one out and check.

On 6 Nov 2010 19:45, "Dave Pooser" wrote:
> Here's the thing, though-- I'm really not at all sure it's the disks that failed...
Re: [zfs-discuss] [OpenIndiana-discuss] format dumps the core
> root@tos-backup:~# pstack /dev/rdsk/core
> core '/dev/rdsk/core' of 1217:  format
>  fee62e4a UDiv       (4, 0, 8046c80, 80469a0, 8046a30, 8046a50) + 2a
>  08079799 auto_sense (4, 0, 8046c80, 0) + 281
> ...

It seems that one function call is missing in the back trace between auto_sense and UDiv, because UDiv does not set up a complete stack frame. Looking at the source...

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/format/auto_sense.c#819

...you can get some extra debug output from format when you specify the "-M" option. E.g. with a USB flash memory stick and "format -eM" I get:

# format -eM
Searching for disks...
c11t0d0: attempting auto configuration
Inquiry: 00 80 02 02 1f 00 00 00 53 61 6e 44 69 73 6b 20    SanDisk
         55 33 20 43 6f 6e 74 6f 75 72 20 20 20 20 20 20    U3 Contour
         34 2e 30                                           4.0
Product id: U3 Contour
Capacity: 00 7a 46 90 00 00 02 00
blocks:   8013456 (0x7a4690)
blksize:  512
disk name: `r               `
Request sense for command mode sense failed
Sense data: f0 00 05 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Mode sense page 0x3 failed
Request sense for command mode sense failed
Sense data: f0 00 05 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Mode sense page 0x4 failed
Geometry: pcyl: 1956  ncyl: 1954  heads: 128  nsects: 32  acyl: 2  bcyl: 0  rpm: 0  nblocks: 8013457
The current rpm value 0 is invalid, adjusting it to 3600
Geometry after adjusting for capacity: pcyl: 1956  ncyl: 1954  heads: 128  nsects: 32  acyl: 2  rpm: 3600
Partition 0:  128.00MB    64 cylinders
Partition 1:  128.00MB    64 cylinders
Partition 2:    3.82GB  1956 cylinders
Partition 6:    3.56GB  1825 cylinders
Partition 8:    2.00MB     1 cylinders
Inquiry: 00 00 03 02 1f 00 00 02 41 54 41 20 20 20 20 20    ATA
         48 69 74 61 63 68 69 20 48 54 53 37 32 33 32 33    Hitachi HTS72323
         43 33 30                                           C30
done

c11t0d0: configured with capacity of 3.82GB
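If someone wants a fuller trace out of that core than pstack gives, mdb should also do the job-- a rough sketch, assuming the crashing binary is /usr/sbin/format (adjust paths as needed):

# mdb /usr/sbin/format /dev/rdsk/core
> $C            <- stack trace with frame pointers; may show the frame pstack skipped
> ::quit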
Re: [zfs-discuss] Apparent SAS HBA failure-- now what?
On 7/11/10 04:27 AM, Dave Pooser wrote:
> ...I have one unavailable pool and two degraded pools, with all my problem disks connected to
> controller c10. I have multiple spare controllers available. First question-- is there an easy
> way to identify which controller is c10?

ls -alrt /dev/cfg/c10 will show you the physical path, which you can then follow:

$ ls -lart /dev/cfg/c3
1 lrwxrwxrwx 1 root root 55 Nov 12  2009 /dev/cfg/c3 -> ../../devices/p...@0,0/pci10de,3...@a/pci1000,3...@0:scsi

You can also make use of fmtopo -V:

# /usr/lib/fm/fmd/fmtopo -V
...
hc://:product-id=Sun-Ultra-40-M2-Workstation:server-id=blinder:chassis-id=0802FMY00N/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=0
  group: protocol                        version: 1   stability: Private/Private
    resource      fmri      hc://:product-id=Sun-Ultra-40-M2-Workstation:server-id=blinder:chassis-id=0802FMY00N/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=0
    label         string    PCIE0 Slot
    FRU           fmri      hc://:product-id=Sun-Ultra-40-M2-Workstation:server-id=blinder:chassis-id=0802FMY00N/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0
    ASRU          fmri      dev:p...@0,0/pci10de,3...@a/pci1000,3...@0
  group: authority                       version: 1   stability: Private/Private
    product-id    string    Sun-Ultra-40-M2-Workstation
    chassis-id    string    0802FMY00N
    server-id     string    blinder
  group: io                              version: 1   stability: Private/Private
    dev           string    /p...@0,0/pci10de,3...@a/pci1000,3...@0
    driver        string    mpt
    module        fmri      mod:///mod-name=mpt/mod-id=57
  group: pci                             version: 1   stability: Private/Private
    device-id     string    58
    extended-capabilities   string    pciexdev
    class-code    string    1
    vendor-id     string    1000
    assigned-addresses      uint32[]  [ 2164391952 0 16384 0 256 2197946388 0 2686517248 0 16384 2197946396 0 2686451712 0 65536 ]

Note the "label" and "FRU" properties in the protocol group.

McB
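If you want the mapping for all three HBAs in one pass, a quick loop over the /dev/cfg links does the trick (plain ls, nothing exotic; substitute your actual controller numbers):

$ for c in /dev/cfg/c9 /dev/cfg/c10 /dev/cfg/c11; do ls -l $c; done

Then match the pci segments in each link target against the fmtopo output-- the "label" property tells you which physical slot that HBA sits in.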