Re: PERC2/Si won't failover

2005-01-31 Thread Andrew Kinney
We had the same thing happen on a PE2500 with PERC3/DI, different 
drive configuration.

Text from my tech "diary" of sorts that I keep regarding unique 
problems I run into:

"The system would not boot after all the replacements were there, 
claiming "no boot device." For one reason or another, the RAID 
controller detected the new drive, but rather than just add the drive 
into the container and begin rebuilding, it simply offlined the 
container, which resulted in the "no boot device" message. I suspect 
this was because we did not properly inform the controller that we 
were taking a drive offline before we did so. We did it while the 
machine was turned off. We probably would have had better luck if we 
had booted the system, gone into afacli, prepared the enclosure slot 
for removal, removed the drive, inserted the new drive, and issued 
the proper commands to begin the rebuild if it didn't start 
automatically. The OS afacli is much more robust than the RAID BIOS 
utilities.

The solution to get the drive into the RAID container and get it to 
begin rebuilding was to go into the RAID BIOS and assign the new 
drive as the failover drive. As soon as I did that, the container 
started rebuilding. I exited the utility (which automatically saves 
any settings) and rebooted. The system came back up just fine, like 
normal, and is now happily rebuilding the RAID array. No data was 
lost, and we now have as close to a completely new system as you can 
get short of replacing the entire thing."

It was just one part of a major system overhaul due to a "ghost" in 
the SCSI system that kept offlining our container under near-random, 
non-reproducible conditions, but I suspect a similar procedure may 
help in your instance.

Andrew

On 31 Jan 2005 at 14:49, Kit Gerrits wrote:

> Hey all!
> 
> I have a PowerEdge 2400 with a PERC2/Si and 4x9GB drives, running
> RedHat EL 3.0:
>   Container 0: plain 9GB drive (O/S)
>   Container 1: 3x9GB in RAID5 (data)
> 
> After getting I/O errors (and a strange noise from drive
> 0:3:0), I did the unthinkable: I pulled the drive from the chassis
> without shutting it down. (oops) I have now verified the drive,
> cleaned off the partition and rescanned the bus, but the drive
> won't failover.
> 
> I have set it to failover, but the PERC won't failover the drive, even
> after a (warm) reboot.
> 
> Did I forget anything?
> 
> Thanks in advance,
> 
> Kit Gerrits
> [EMAIL PROTECTED]
> 
> 
> ---
> Debugging info:
> ---
> 
> AFA0> disk list
> Executing: disk list
> 
> B:ID:L  Device Type  Blocks    Bytes/Block  Usage        Shared  Rate
> ------  -----------  --------  -----------  -----------  ------  ----
> 0:00:0  Disk         17783240  512          Initialized  NO      80
> 0:01:0  Disk         17783240  512          Initialized  NO      80
> 0:02:0  Disk         17783240  512          Initialized  NO      80
> 0:03:0  Disk         17783240  512          Initialized  NO      80
> 
> AFA0> container show failover
> Executing: container show failover
> 
> Container  Scsi B:ID:L
> ---------  -----------
>   0        --- No Devices Assigned ---
>   1        0:03:0
> 
> AFA0> container list
> Executing: container list
> Num          Total  Oth Chunk          Scsi   Partition
> Label Type   Size   Ctr Size   Usage   B:ID:L Offset:Size
> ----- ------ ------ --- ------ ------- ------ -------------
>  0    Volume 8.47GB            Open    0:00:0 64.0KB:8.47GB
>  /dev/sda NT
> 
>  1    RAID-5 16.9GB     32KB   Open    0:01:0 64.0KB:8.47GB
>  /dev/sdb DATA                         0:02:0 64.0KB:8.47GB
>                                        ?:??:?  - Missing -
> 
> AFA0> controller show au
> Executing: controller show automatic_failover
> Automatic failover ENABLED
> 
> AFA0> container scrub 1
> Executing: container scrub 1
> Command Error:  (consistency check)
>  operation on the container because one or more of the container's
> partitions failed.
> 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-scsi"
> in the body of a message to [EMAIL PROTECTED] More majordomo
> info at  http://vger.kernel.org/majordomo-info.html
> 
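The `container list` output quoted above shows container 1 with one member reported as "- Missing -", which is the condition that blocks both the scrub and the failover. As a rough illustration only, here is a small Python sketch that scans saved `container list` text for that marker and reports which container numbers are affected. The function name, the regular expression, and the sample text are my own assumptions for illustration (the sample's column alignment is reconstructed from the quoted output); afacli itself provides nothing like this.

```python
import re

# Assumption: a container row starts with the container number, then the
# type (e.g. "Volume", "RAID-5"), then a size such as "16.9GB".
ROW = re.compile(r"^\s*(\d+)\s*([A-Za-z][\w-]*)\s+[\d.]+[KMGT]B")


def containers_with_missing_members(text):
    """Return container numbers that list a '- Missing -' member."""
    degraded = []
    current = None
    for raw in text.splitlines():
        line = raw.lstrip("> ")          # tolerate mail-style quoting
        match = ROW.match(line)
        if match:
            current = int(match.group(1))
        if "Missing" in line and current is not None:
            if current not in degraded:
                degraded.append(current)
    return degraded


# Hypothetical sample, modeled on the output quoted in this thread.
SAMPLE = """\
Num          Total  Oth Chunk          Scsi   Partition
Label Type   Size   Ctr Size   Usage   B:ID:L Offset:Size
----- ------ ------ --- ------ ------- ------ -------------
 0    Volume 8.47GB            Open    0:00:0 64.0KB:8.47GB
 /dev/sda NT

 1    RAID-5 16.9GB     32KB   Open    0:01:0 64.0KB:8.47GB
 /dev/sdb DATA                         0:02:0 64.0KB:8.47GB
                                       ?:??:?  - Missing -
"""

if __name__ == "__main__":
    print(containers_with_missing_members(SAMPLE))
```

A check like this only tells you that a member is missing; it says nothing about why the controller refuses to pull in the assigned failover disk.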


Sincerely,
Andrew Kinney
President and
Chief Technology Officer
Advantagecom Networks, Inc.
http://www.advantagecom.net





Re: aacraid on Dell PowerEdge 1800

2005-03-11 Thread Andrew Kinney
On 11 Mar 2005 at 9:52, Ryan wrote:

> Hello,
>   I've been through several of the newsgroups over the past few days and
> haven't found an exact answer to my question.
> 
>   I'm trying to install Fedora Core 3 on a Dell PowerEdge 1800 server that
> I just purchased, but the version of the aacraid driver for the SATA
> RAID controller changed and I can't install.
> 
>   Just like the newsgroups state, Fedora Core 2 installs just fine.  My
> problem is that I can't afford to run software that is outdated.
> There have been several security issues lately and FC3 seems to have
> fixes for them.
> 
> Any help is greatly appreciated.
> 
> Thank you.
> 
> -Ryan

This is somewhat off-topic, so I'll be brief.  If you can install 
Fedora Core 2 and the only reason you want Fedora Core 3 is newer 
software, "man yum" after installing the 'yum' package (or choose a 
different updater to suit your prefs) will be your friend.  In other 
words, a newer OS isn't the only way to get newer software, 
especially on an OS that has a decent package management system (RPM 
in this instance).  FWIW.

Sincerely,
Andrew Kinney
President and
Chief Technology Officer
Advantagecom Networks, Inc.
http://www.advantagecom.net





Re: aacraid died on kernel 2.4.27

2005-03-11 Thread Andrew Kinney
On 11 Mar 2005 at 20:31, Nic Ferrier wrote:

> The machine I am having trouble with has been running MS Windows for 2
> years.
> 
> I just put linux on it (with no other changes) and we get regular
> (twice daily) catastrophic crashes.
> 
> Can this be a controller problem? I'm not a hardware expert but it
> doesn't sound like one to me.

It can be a controller problem, but it can also be a drive problem, 
cable problem, firmware problem, or a backplane problem.  We had a 
similar instance that was resolved by replacing the drive with a 
different brand, replacing the backplane, replacing the cabling, 
replacing the ROMB, and getting the newest firmware.  Now, drives 
fail gracefully instead of taking the whole container offline.  Who 
knows what the actual cause was, but the problem is fixed and that's 
what I was looking for.

Like Mark S. said, many causes, one symptom.  That Dell trouble 
ticket is going to be the best way to get it solved.  Their Linux 
guys have seen it all and can escalate it to an engineer if they 
haven't.  They're going to ask you for the diagnostic output of 
afacli, so you'll want to get that installed if you haven't already.  
They can also swap in new components for you.

Sincerely,
Andrew Kinney
President and
Chief Technology Officer
Advantagecom Networks, Inc.
http://www.advantagecom.net





PERC3/DI aacraid failed disk detection slow

2005-03-13 Thread Andrew Kinney
[57]: at 5082220 sec
[58]: ID(0:04:0) Cmd[0x28] Fail: Block Range 15544320 : 1557
[59]: at 5082220 sec
[60]: ID(0:04:0) Cmd[0x28] Fail: Block Range 5166793 : 5166794 at
[61]:  5082220 sec
[62]: RAID5 Container 0 Drive 0:4:0 Failure
[63]: ID(0:04:0): Timeout detected on cmd[0x28]
[64]: SCSI Channel[0]: Timeout Detected On 1 Command(s)
[65]: ID(0:04:0) Timeout detected on cmd[0x28]
[66]: SCSI Channel[0]: Timeout Detected On 1 Command(s)
[67]: ID(0:04:0): Timeout detected on cmd[0x28]
[68]: SCSI Channel[0]: Timeout Detected On 1 Command(s)
[69]: ID(0:04:0): Timeout detected on cmd[0x28]
[70]: SCSI Channel[0]: Timeout Detected On 1 Command(s)
[71]: ID(0:04:0): Timeout detected on cmd[0x28]
[72]: SCSI Channel[0]: Timeout Detected On 1 Command(s)
[73]: ID(0:04:0): Timeout detected on cmd[0x28]
[74]: SCSI Channel[0]: Timeout Detected On 1 Command(s)
[75]: ID(0:04:0): Timeout detected on cmd[0x28]
[76]: SCSI Channel[0]: Timeout Detected On 1 Command(s)
[77]: ID(0:04:0) Timeout detected on cmd[0x28]
[78]: SCSI Channel[0]: Timeout Detected On 1 Command(s)
[79]: ID(0:04:0) Cmd[0x28] Fail: Block Range 0 : 0 at 5082308 sec
[80]: 2 can't read mbr dev_t:4
[81]:  <...repeats 1 more times>
[82]: can't read config from slice #[4]
[83]: 2 can't read mbr dev_t:4
[84]: can't read config from slice #[4]
[85]: CT_LogMissingEntry: Log missing entry, container 0, dev 4,
[86]: signature 0x8f950a4d, nvEntry 65
[87]: CtMarkDead: container 0, deadEntry 4, dev 4, signature 0x8f
[88]: 950a4d
[89]: CtMarkDead: container 0, deadEntry 4, dev 4, signature 0x8f
[90]: 950a4d
[91]: CtMarkDead: container 0, deadEntry 4, dev 4, signature 0x8f
[92]: 950a4d
[93]: CtMarkDead: container 0, deadEntry 4, dev 4, signature 0x8f
[94]: 950a4d
[95]: CtMarkDead: container 0, deadEntry 4, dev 4, signature 0x8f
[96]: 950a4d
[97]: RAID5 Failover Container 0 No Failover Assigned
[98]: Drive 0:4:0 returning error
[99]:
[/CODE]

The controller took 88 seconds to determine that the drive had failed 
(5082220 sec to 5082308 sec in the log above).  In other words, it 
took 88 seconds from the time it stopped processing commands from the 
OS until it was ready to continue processing them.  The kernel killed 
the storage at 60 seconds, thus hosing the OS since that was the only 
storage device.  Though the controller came back, the OS had already 
given up and couldn't recover.
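
For anyone who wants to check the arithmetic against their own controller log, here is a minimal Python sketch. It pulls the "at N sec" timestamps out of log text in the format shown above and compares the span with the 60-second figure. The function name and the `KERNEL_TIMEOUT_SEC` constant are assumptions made for illustration; the 60 seconds is simply the number quoted in this message, not something the script reads from the kernel or the driver.

```python
import re

# Assumption: 60 s is the figure quoted in this message for when the
# kernel gave up on the storage; it is not read from the system.
KERNEL_TIMEOUT_SEC = 60


def stall_seconds(log_text):
    """Return the span between the first and last 'at N sec' stamps."""
    # Drop the "[NN]:" entry prefixes so stamps wrapped across two
    # log entries (e.g. "... at" / "[61]:  5082220 sec") still match.
    flat = re.sub(r"^\[\d+\]:", " ", log_text, flags=re.MULTILINE)
    stamps = [int(s) for s in re.findall(r"\bat\s+(\d+)\s+sec\b", flat)]
    if len(stamps) < 2:
        return None
    return max(stamps) - min(stamps)


# A few entries copied from the log above.
SAMPLE_LOG = """\
[57]: at 5082220 sec
[60]: ID(0:04:0) Cmd[0x28] Fail: Block Range 5166793 : 5166794 at
[61]:  5082220 sec
[62]: RAID5 Container 0 Drive 0:4:0 Failure
[79]: ID(0:04:0) Cmd[0x28] Fail: Block Range 0 : 0 at 5082308 sec
"""

if __name__ == "__main__":
    stall = stall_seconds(SAMPLE_LOG)
    print("controller stall:", stall, "sec")
    print("exceeds kernel timeout:", stall > KERNEL_TIMEOUT_SEC)
```

The timestamps are the controller's own counters, so this measures only what the firmware logged, not what the OS observed.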

Am I correct in assessing that the controller's firmware is 
responsible for this extended delay in detecting the failed disk?

Here's the information on our setup:

PERC3/DI on Dell PowerEdge 2500
5 disk U160 RAID5

AFA0> controller details
Executing: controller details
Controller Information
----------------------
             Device Name: AFA0
         Controller Type: PERC 3/Di
             Access Mode: READ-WRITE
Controller Serial Number: Last Six Digits = 4C20D2
         Number of Buses: 2
         Devices per Bus: 15
          Controller CPU: i960 R series
    Controller CPU Speed: 100 Mhz
       Controller Memory: 128 Mbytes
           Battery State: Ok

Component Revisions
-------------------
                     CLI: 2.8-0 (Build #6076)
                     API: 2.8-0 (Build #6076)
         Miniport Driver: 1.1-4 (Build #)
     Controller Software: 2.8-0 (Build #6092)
         Controller BIOS: 2.8-0 (Build #6092)
     Controller Firmware: (Build #6092)
 
Sincerely,
Andrew Kinney
President and
Chief Technology Officer
Advantagecom Networks, Inc.
http://www.advantagecom.net


